Metadata
- Author: openai.com
- Title: Techniques for Training Large Neural Networks
- Reference: https://openai.com/blog/techniques-for-training-large-neural-networks/
- Category: #article
Page Notes
Highlights
- An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs. — Updated on 2022-06-19 11:00:52 — Group: #Personal
- Data parallelism—run different subsets of the batch on different GPUs; Pipeline parallelism—run different layers of the model on different GPUs; Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs; Mixture-of-Experts—process each example by only a fraction of each layer. — Updated on 2022-06-19 11:01:54 — Group: #Personal
- Data Parallel training means copying the same parameters to multiple GPUs (often called “workers”) and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages. — Updated on 2022-06-19 11:02:32 — Group: #Personal
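A minimal sketch of this pattern using PyTorch's DistributedDataParallel; the toy Linear model, batch shapes, and hyperparameters are illustrative assumptions, not taken from the article.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; a launcher such as torchrun sets RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Every worker holds a full copy of the parameters (the memory duplication
    # mentioned above); DDP averages gradients across workers during backward().
    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank processes a different slice of the global batch.
        x = torch.randn(32, 1024, device=f"cuda:{rank}")
        loss = model(x).pow(2).mean()
        loss.backward()          # gradient all-reduce happens inside backward()
        opt.step()
        opt.zero_grad()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, every process works on its own subset of each batch while keeping identical parameters.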
- With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU. — Updated on 2022-06-19 11:03:12 — Group: #Personal
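A toy illustration of the "partition by layer" idea, assuming two GPUs and a made-up two-stage model (not the article's code):

```python
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    """Each GPU stores only the parameters of its own stage."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # only activations cross the GPU boundary

out = TwoStagePipeline()(torch.randn(8, 1024))
```

A naive split like this leaves each GPU idle while the other works (pipeline "bubbles"), which is what the microbatch schedules in the next highlight are designed to reduce.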
- GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternately process forward and backward passes. — Updated on 2022-06-19 11:05:13 — Group: #Personal
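A single-device sketch of the GPipe-style synchronous schedule: the batch is split into microbatches, forward and backward run per microbatch, gradients accumulate, and one optimizer step is applied at the end. This shows only the scheduling and gradient-aggregation logic, not the cross-GPU communication; the helper name and loss are assumptions for illustration.

```python
import torch

def gpipe_style_step(model, optimizer, batch, n_microbatches=4):
    optimizer.zero_grad()
    for micro in batch.chunk(n_microbatches):
        # Forward + backward per microbatch; gradients accumulate in .grad buffers.
        loss = model(micro).pow(2).mean() / n_microbatches
        loss.backward()
    optimizer.step()   # one synchronous update after all microbatches, as in GPipe
```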
- Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. — Updated on 2022-06-19 11:07:00 — Group: #Personal
- With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results. — Updated on 2022-06-19 11:07:06 — Group: #Personal
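A small sketch of the shard-and-combine idea for a single matrix multiply, assuming two GPUs and a column-wise split; the shapes are made up for illustration:

```python
import torch

# Split a 1024 x 4096 weight into two 1024 x 2048 column shards, one per GPU.
full_w = torch.randn(1024, 4096)
shards = [chunk.to(f"cuda:{i}") for i, chunk in enumerate(full_w.chunk(2, dim=1))]

def column_parallel_matmul(x, shards):
    partials = []
    for dev, w_shard in enumerate(shards):
        # Each GPU computes its slice of the output from its own shard...
        partials.append((x.to(f"cuda:{dev}") @ w_shard).to("cuda:0"))
    # ...and the partial results are communicated and concatenated afterwards.
    return torch.cat(partials, dim=-1)

# Matches x @ full_w, but no single GPU ever holds the whole weight.
y = column_parallel_matmul(torch.randn(8, 1024), shards)
```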
- With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights and the network can choose which set to use via a gating mechanism at inference time. — Updated on 2022-06-19 11:07:36 — Group: #Personal
- Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model. — Updated on 2022-06-19 11:07:43 — Group: #Personal
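A toy top-1 mixture-of-experts layer to make the gating idea concrete; in a real system each module in `self.experts` could live on its own GPU. The layer sizes and top-1 routing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # learns which expert to route to
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )

    def forward(self, x):                            # x: (tokens, d_model)
        expert_idx = self.gate(x).argmax(dim=-1)     # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])          # only this fraction of weights runs
        return out

y = TinyMoE()(torch.randn(32, 256))
```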
- Checkpointing (also known as activation recomputation) stores any subset of activations, and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. — Updated on 2022-06-19 11:08:24 — Group: #Personal
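The same trade-off in PyTorch, using `torch.utils.checkpoint` on an assumed toy block:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are not kept; they are recomputed
# just-in-time during the backward pass, trading extra compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```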
- Mixed Precision Training means training models using lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. — Updated on 2022-06-19 11:08:35 — Group: #Personal
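A minimal mixed-precision training loop with PyTorch autocast and gradient scaling; the toy model, shapes, and learning rate are assumptions:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()   # scales the loss so small FP16 gradients do not underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with autocast():                       # matmuls run in FP16, reductions stay in FP32
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(opt)                       # unscales gradients before the update
    scaler.update()
    opt.zero_grad()
```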
- Offloading means temporarily moving unused data to the CPU or among different devices and later reading it back when needed. — Updated on 2022-06-19 11:08:44 — Group: #Personal
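The idea in its simplest hand-rolled PyTorch form: park a layer's weights in CPU RAM while it is idle and bring them back just before use. Libraries such as DeepSpeed's ZeRO-Offload automate this; the helper below is only a sketch.

```python
import torch

layer = torch.nn.Linear(4096, 4096)       # weights live on the CPU between uses

def run_with_offload(layer, x):
    layer.to("cuda")                       # read the weights back right before use
    y = layer(x)
    layer.to("cpu")                        # offload again while the layer is unused
    return y

out = run_with_offload(layer, torch.randn(8, 4096, device="cuda"))
```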
- Memory Efficient Optimizers, such as Adafactor, have been proposed to reduce the memory footprint of the running state maintained by the optimizer. — Updated on 2022-06-19 11:08:58 — Group: #Personal
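A much-simplified sketch of the factored second-moment idea behind Adafactor: for an n x m weight matrix, keep only a length-n row statistic and a length-m column statistic instead of the full n x m accumulator that Adam would store. Real implementations add relative step sizes, update clipping, and other details; the function below is an assumption-laden illustration, not Adafactor itself.

```python
import torch

def factored_rms_update(grad, row_ema, col_ema, beta2=0.999, eps=1e-30):
    """grad: (n, m); row_ema: (n,); col_ema: (m,). Returns the scaled update."""
    sq = grad.pow(2) + eps
    row_ema.mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)   # O(n) state
    col_ema.mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)   # O(m) state
    # Rank-1 reconstruction of the (n, m) second moment from the two vectors.
    v_hat = torch.outer(row_ema, col_ema) / row_ema.mean()
    return grad / v_hat.sqrt()
```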
- Compression also can be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them. — Updated on 2022-06-19 11:09:09 — Group: #Personal
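One concrete way to compress gradients before synchronization, in the spirit of the DALL·E example though not the same scheme: PyTorch's built-in DDP communication hook that casts gradient buckets to FP16 for the all-reduce.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes a process group has already been initialized (see the data-parallel sketch).
model = DDP(torch.nn.Linear(1024, 1024).cuda())

# Gradients are cast to FP16 before the all-reduce, halving communication volume,
# then cast back to full precision before the optimizer step.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```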