Metadata
- Author: openai.com
- Title: Techniques for Training Large Neural Networks
- Reference: https://openai.com/blog/techniques-for-training-large-neural-networks/
- Category: #article
Page Notes
Highlights
- An illustration of various parallelism strategies on a three-layer model. Each color refers to one layer and dashed lines separate different GPUs. — Updated on 2022-06-19 11:00:52 — Group: #Personal
- Data parallelism—run different subsets of the batch on different GPUs; Pipeline parallelism—run different layers of the model on different GPUs; Tensor parallelism—break up the math for a single operation such as a matrix multiplication to be split across GPUs; Mixture-of-Experts—process each example by only a fraction of each layer. — Updated on 2022-06-19 11:01:54 — Group: #Personal
- Data Parallel training means copying the same parameters to multiple GPUs (often called “workers”) and assigning different examples to each to be processed simultaneously. Data parallelism alone still requires that your model fits into a single GPU’s memory, but lets you utilize the compute of many GPUs at the cost of storing many duplicate copies of your parameters. That being said, there are strategies to increase the effective RAM available to your GPU, such as temporarily offloading parameters to CPU memory between usages. — Updated on 2022-06-19 11:02:32 — Group: #Personal
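A minimal sketch of this pattern using PyTorch's DistributedDataParallel; the toy Linear model, batch shapes, and hyperparameters are illustrative assumptions, not taken from the article.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; a launcher such as torchrun sets RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Every worker holds a full copy of the parameters (the memory duplication
    # mentioned above); DDP averages gradients across workers during backward().
    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank processes a different slice of the global batch.
        x = torch.randn(32, 1024, device=f"cuda:{rank}")
        loss = model(x).pow(2).mean()
        loss.backward()          # gradient all-reduce happens inside backward()
        opt.step()
        opt.zero_grad()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, every process works on its own subset of each batch while keeping identical parameters.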
- With Pipeline Parallel training, we partition sequential chunks of the model across GPUs. Each GPU holds only a fraction of parameters, and thus the same model consumes proportionally less memory per GPU. — Updated on 2022-06-19 11:03:12 — Group: #Personal
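A toy illustration of the "partition by layer" idea, assuming two GPUs and a made-up two-stage model (not the article's code):

```python
import torch
import torch.nn as nn

class TwoStagePipeline(nn.Module):
    """Each GPU stores only the parameters of its own stage."""
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))   # only activations cross the GPU boundary

out = TwoStagePipeline()(torch.randn(8, 1024))
```

A naive split like this leaves each GPU idle while the other works (pipeline "bubbles"), which is what the microbatch schedules in the next highlight are designed to reduce.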
- GPipe has each worker process forward and backward passes consecutively and then aggregates gradients from multiple microbatches synchronously at the end. PipeDream instead schedules each worker to alternately process forward and backward passes. — Updated on 2022-06-19 11:05:13 — Group: #Personal
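A single-device sketch of the GPipe-style synchronous schedule: the batch is split into microbatches, forward and backward run per microbatch, gradients accumulate, and one optimizer step is applied at the end. This shows only the scheduling and gradient-aggregation logic, not the cross-GPU communication; the helper name and loss are assumptions for illustration.

```python
import torch

def gpipe_style_step(model, optimizer, batch, n_microbatches=4):
    optimizer.zero_grad()
    for micro in batch.chunk(n_microbatches):
        # Forward + backward per microbatch; gradients accumulate in .grad buffers.
        loss = model(micro).pow(2).mean() / n_microbatches
        loss.backward()
    optimizer.step()   # one synchronous update after all microbatches, as in GPipe
```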
- Pipeline parallelism splits a model “vertically” by layer. It’s also possible to “horizontally” split certain operations within a layer, which is usually called Tensor Parallel training. — Updated on 2022-06-19 11:07:00 — Group: #Personal
- With either strategy, we can slice the weight matrix into even-sized “shards”, host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before later communicating to combine the results. — Updated on 2022-06-19 11:07:06 — Group: #Personal
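A small sketch of the shard-and-combine idea for a single matrix multiply, assuming two GPUs and a column-wise split; the shapes are made up for illustration:

```python
import torch

# Split a 1024 x 4096 weight into two 1024 x 2048 column shards, one per GPU.
full_w = torch.randn(1024, 4096)
shards = [chunk.to(f"cuda:{i}") for i, chunk in enumerate(full_w.chunk(2, dim=1))]

def column_parallel_matmul(x, shards):
    partials = []
    for dev, w_shard in enumerate(shards):
        # Each GPU computes its slice of the output from its own shard...
        partials.append((x.to(f"cuda:{dev}") @ w_shard).to("cuda:0"))
    # ...and the partial results are communicated and concatenated afterwards.
    return torch.cat(partials, dim=-1)

# Matches x @ full_w, but no single GPU ever holds the whole weight.
y = column_parallel_matmul(torch.randn(8, 1024), shards)
```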
- With the Mixture-of-Experts (MoE) approach, only a fraction of the network is used to compute the output for any one input. One example approach is to have many sets of weights and the network can choose which set to use via a gating mechanism at inference time. — Updated on 2022-06-19 11:07:36 — Group: #Personal
- Different experts can be hosted on different GPUs, providing a clear way to scale up the number of GPUs used for a model. — Updated on 2022-06-19 11:07:43 — Group: #Personal
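A toy top-1 mixture-of-experts layer to make the gating idea concrete; in a real system each module in `self.experts` could live on its own GPU. The layer sizes and top-1 routing rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)   # learns which expert to route to
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )

    def forward(self, x):                            # x: (tokens, d_model)
        expert_idx = self.gate(x).argmax(dim=-1)     # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])          # only this fraction of weights runs
        return out

y = TinyMoE()(torch.randn(32, 256))
```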
- Checkpointing (also known as activation recomputation) stores any subset of activations, and recomputes the intermediate ones just-in-time during the backward pass. This saves a lot of memory at the computational cost of at most one additional full forward pass. — Updated on 2022-06-19 11:08:24 — Group: #Personal
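The same trade-off in PyTorch, using `torch.utils.checkpoint` on an assumed toy block:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are not kept; they are recomputed
# just-in-time during the backward pass, trading extra compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```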
- Mixed Precision Training means training models using lower-precision numbers (most commonly FP16). Modern accelerators can reach much higher FLOP counts with lower-precision numbers, and you also save on device RAM. — Updated on 2022-06-19 11:08:35 — Group: #Personal
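A minimal mixed-precision training loop with PyTorch autocast and gradient scaling; the toy model, shapes, and learning rate are assumptions:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler()   # scales the loss so small FP16 gradients do not underflow

for step in range(10):
    x = torch.randn(32, 1024, device="cuda")
    with autocast():                       # matmuls run in FP16, reductions stay in FP32
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(opt)                       # unscales gradients before the update
    scaler.update()
    opt.zero_grad()
```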
- Offloading means temporarily moving unused data to the CPU or among different devices and later reading it back when needed. — Updated on 2022-06-19 11:08:44 — Group: #Personal
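The idea in its simplest hand-rolled PyTorch form: park a layer's weights in CPU RAM while it is idle and bring them back just before use. Libraries such as DeepSpeed's ZeRO-Offload automate this; the helper below is only a sketch.

```python
import torch

layer = torch.nn.Linear(4096, 4096)       # weights live on the CPU between uses

def run_with_offload(layer, x):
    layer.to("cuda")                       # read the weights back right before use
    y = layer(x)
    layer.to("cpu")                        # offload again while the layer is unused
    return y

out = run_with_offload(layer, torch.randn(8, 4096, device="cuda"))
```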
- Memory Efficient Optimizers, such as Adafactor, have been proposed to reduce the memory footprint of the running state maintained by the optimizer. — Updated on 2022-06-19 11:08:58 — Group: #Personal
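A much-simplified sketch of the factored second-moment idea behind Adafactor: for an n x m weight matrix, keep only a length-n row statistic and a length-m column statistic instead of the full n x m accumulator that Adam would store. Real implementations add relative step sizes, update clipping, and other details; the function below is an assumption-laden illustration, not Adafactor itself.

```python
import torch

def factored_rms_update(grad, row_ema, col_ema, beta2=0.999, eps=1e-30):
    """grad: (n, m); row_ema: (n,); col_ema: (m,). Returns the scaled update."""
    sq = grad.pow(2) + eps
    row_ema.mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)   # O(n) state
    col_ema.mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)   # O(m) state
    # Rank-1 reconstruction of the (n, m) second moment from the two vectors.
    v_hat = torch.outer(row_ema, col_ema) / row_ema.mean()
    return grad / v_hat.sqrt()
```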
- Compression also can be used for storing intermediate results in the network. For example, Gist compresses activations that are saved for the backward pass; DALL·E compresses the gradients before synchronizing them. — Updated on 2022-06-19 11:09:09 — Group: #Personal
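One concrete way to compress gradients before synchronization, in the spirit of the DALL·E example though not the same scheme: PyTorch's built-in DDP communication hook that casts gradient buckets to FP16 for the all-reduce.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Assumes a process group has already been initialized (see the data-parallel sketch).
model = DDP(torch.nn.Linear(1024, 1024).cuda())

# Gradients are cast to FP16 before the all-reduce, halving communication volume,
# then cast back to full precision before the optimizer step.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```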