I will be giving a talk at PyData MCR titled "How to train your LLMs?".
The talk will cover parallelism strategies and memory optimisation techniques for scaling the training of large language models. Here's a short summary of the key topics, with a minimal code sketch for each after the list:
- Training on a single GPU: a breakdown of the memory consumed during training by parameters, gradients, optimiser states, and activations, and why each is kept around.
- Activation Recomputation: a memory optimisation that drops selected activations during the forward pass and recomputes them during the backward pass, trading extra compute for lower activation memory.
- Gradient Accumulation: accumulates gradients over several micro-batches before each optimiser step, so the effective batch size grows without extra activation memory.
- Data Parallelism: the first and simplest parallelism strategy. Each GPU holds a full model replica and processes a different slice of the batch, which increases training throughput but does not reduce per-GPU memory.
- ZeRO Sharding: shards parameters, gradients, and optimiser states across GPUs, dramatically lowering per-GPU model-state memory.
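
For the single-GPU memory breakdown, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam (fp16/bf16 weights and gradients plus fp32 master weights, momentum, and variance, roughly 16 bytes of model state per parameter) and ignores activation memory, which depends on batch size, sequence length, and architecture:

```python
def model_state_bytes(num_params: int) -> dict:
    """Rough per-parameter model-state memory for mixed-precision Adam:
    2 B fp16 weights + 2 B fp16 gradients
    + 4 B fp32 master weights + 4 B momentum + 4 B variance = 16 B/param."""
    return {
        "parameters (fp16)": 2 * num_params,
        "gradients (fp16)": 2 * num_params,
        "optimiser states (fp32)": 12 * num_params,
    }

# A 7B-parameter model already needs ~112 GB of model state alone,
# before counting any activation memory.
for name, nbytes in model_state_bytes(7_000_000_000).items():
    print(f"{name}: {nbytes / 1e9:.0f} GB")
```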
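
Activation recomputation can be sketched with PyTorch's `torch.utils.checkpoint`: activations inside the checkpointed block are not stored for backward and are recomputed instead. The `MLPBlock` below is just a stand-in module for illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

class MLPBlock(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return self.net(x)

block = MLPBlock()
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are discarded after the forward
# pass and recomputed during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```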
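
Gradient accumulation is a small change to the training loop. A minimal sketch with a toy model and synthetic data; the loss is divided by the number of micro-batches so the accumulated gradient matches what one large batch would produce:

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
accumulation_steps = 8  # effective batch = micro-batch size * 8

optimizer.zero_grad()
for step in range(64):
    inputs, targets = torch.randn(4, 32), torch.randn(4, 1)  # one micro-batch
    # Scale so the summed gradient matches one batch of 4 * 8 samples.
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```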
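
Data parallelism in PyTorch is commonly done with `DistributedDataParallel`. A minimal sketch, assuming the script is launched with `torchrun --nproc_per_node=N` (which sets `LOCAL_RANK`) and NCCL-capable GPUs are available:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(32, 1).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # every rank holds a full replica
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)

inputs = torch.randn(4, 32).cuda(local_rank)   # each rank sees its own slice of data
targets = torch.randn(4, 1).cuda(local_rank)

loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()  # DDP all-reduces gradients across ranks during backward
optimizer.step()
dist.destroy_process_group()
```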
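
ZeRO-style sharding is implemented by DeepSpeed's ZeRO and by PyTorch's FSDP. A minimal FSDP sketch, again assuming a `torchrun` launch with NCCL GPUs; the default full-sharding strategy shards parameters, gradients, and optimiser states across ranks:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda(local_rank)

# Parameters, gradients, and optimiser states are sharded across ranks;
# full parameters are gathered only transiently for each forward/backward.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-3)

x = torch.randn(8, 1024).cuda(local_rank)
loss = sharded_model(x).sum()
loss.backward()
optimizer.step()
dist.destroy_process_group()
```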
Slides from the talk are available here: