I will be giving a talk at PyData MCR titled "How to train your LLMs?".
The talk will cover parallelism strategies and memory optimisation techniques for scaling the training of large language models. Here's a short summary of the key topics, with a minimal code sketch for each after the list:
- Training on a single GPU: a breakdown of the memory consumed during training by parameters, gradients, optimiser states, and activations, and why each is kept around.
- Activation Recomputation: a memory optimisation that drops selected activations during the forward pass and recomputes them during the backward pass, trading extra compute for lower activation memory.
- Gradient Accumulation: accumulates gradients over several micro-batches before each optimiser step, so the effective batch size grows without extra activation memory.
- Data Parallelism: the first and simplest parallelism strategy. Each GPU holds a full model replica and processes a different slice of the batch, which increases training throughput but does not reduce per-GPU memory.
- ZeRO Sharding: shards parameters, gradients, and optimiser states across GPUs, dramatically lowering per-GPU model-state memory.
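
For the single-GPU memory breakdown, here is a rough back-of-the-envelope sketch. It assumes mixed-precision training with Adam (fp16/bf16 weights and gradients plus fp32 master weights, momentum, and variance, roughly 16 bytes of model state per parameter) and ignores activation memory, which depends on batch size, sequence length, and architecture:

```python
def model_state_bytes(num_params: int) -> dict:
    """Rough per-parameter model-state memory for mixed-precision Adam:
    2 B fp16 weights + 2 B fp16 gradients
    + 4 B fp32 master weights + 4 B momentum + 4 B variance = 16 B/param."""
    return {
        "parameters (fp16)": 2 * num_params,
        "gradients (fp16)": 2 * num_params,
        "optimiser states (fp32)": 12 * num_params,
    }

# A 7B-parameter model already needs ~112 GB of model state alone,
# before counting any activation memory.
for name, nbytes in model_state_bytes(7_000_000_000).items():
    print(f"{name}: {nbytes / 1e9:.0f} GB")
```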
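
Activation recomputation can be sketched with PyTorch's `torch.utils.checkpoint`: activations inside the checkpointed block are not stored for backward and are recomputed instead. The `MLPBlock` below is just a stand-in module for illustration:

```python
import torch
from torch.utils.checkpoint import checkpoint

class MLPBlock(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(), torch.nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return self.net(x)

block = MLPBlock()
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are discarded after the forward
# pass and recomputed during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```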
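
Gradient accumulation is a small change to the training loop. A minimal sketch with a toy model and synthetic data; the loss is divided by the number of micro-batches so the accumulated gradient matches what one large batch would produce:

```python
import torch

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
accumulation_steps = 8  # effective batch = micro-batch size * 8

optimizer.zero_grad()
for step in range(64):
    inputs, targets = torch.randn(4, 32), torch.randn(4, 1)  # one micro-batch
    # Scale so the summed gradient matches one batch of 4 * 8 samples.
    loss = loss_fn(model(inputs), targets) / accumulation_steps
    loss.backward()  # gradients accumulate in .grad across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```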
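
Data parallelism in PyTorch is commonly done with `DistributedDataParallel`. A minimal sketch, assuming the script is launched with `torchrun --nproc_per_node=N` (which sets `LOCAL_RANK`) and NCCL-capable GPUs are available:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(32, 1).cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])  # every rank holds a full replica
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)

inputs = torch.randn(4, 32).cuda(local_rank)   # each rank sees its own slice of data
targets = torch.randn(4, 1).cuda(local_rank)

loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
loss.backward()  # DDP all-reduces gradients across ranks during backward
optimizer.step()
dist.destroy_process_group()
```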
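
ZeRO-style sharding is implemented by DeepSpeed's ZeRO and by PyTorch's FSDP. A minimal FSDP sketch, again assuming a `torchrun` launch with NCCL GPUs; the default full-sharding strategy shards parameters, gradients, and optimiser states across ranks:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda(local_rank)

# Parameters, gradients, and optimiser states are sharded across ranks;
# full parameters are gathered only transiently for each forward/backward.
sharded_model = FSDP(model)
optimizer = torch.optim.AdamW(sharded_model.parameters(), lr=1e-3)

x = torch.randn(8, 1024).cuda(local_rank)
loss = sharded_model(x).sum()
loss.backward()
optimizer.step()
dist.destroy_process_group()
```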
Slides from the talk are available here: