Ultra-scale Playbook - ZeRO Sharding
Notes on training LLMs using sharding strategies
Notes on training LLMs using the data parallelism strategy
Notes on Ultra-scale Playbook - training an LLM on a single GPU
One of the projects I am looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a CodeLLM. To limit the scope, here's a brief outline of the project:

- Mini LLM, maybe a 200M parameter model as a start
- Focusing only on Python as a language

The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for optimal training. For my 200M parameter model, this means ~4B tokens would be ideal. To start, I'll experiment with a smaller slice of 500M tokens. ...
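The token budget above is just simple arithmetic; here is a minimal sketch of it, assuming the commonly cited ~20 tokens-per-parameter rule of thumb (the function and variable names are illustrative, not taken from the repo):

```python
# Rough Chinchilla-style token budget (illustrative sketch, not code from the repo).
# Rule of thumb: compute-optimal training tokens ~ 20 x model parameters.

def chinchilla_token_budget(num_params: int, tokens_per_param: int = 20) -> int:
    """Return an approximate compute-optimal token count for a given model size."""
    return num_params * tokens_per_param

model_params = 200_000_000       # ~200M parameter mini StarCoder2-style model
optimal_tokens = chinchilla_token_budget(model_params)
starter_slice = 500_000_000      # smaller slice to start experimenting with

print(f"Chinchilla-optimal tokens: ~{optimal_tokens / 1e9:.1f}B")  # ~4.0B
print(f"Starter slice: ~{starter_slice / 1e9:.1f}B "
      f"({starter_slice / optimal_tokens:.1%} of optimal)")
```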
It’s time to get the wallets out 💸💸💸💸

Useful Links

- Code GitHub: https://github.com/dudeperf3ct/minicode-llm/tree/main/codellm_pretrain/torch_titan
- Experiment W&B for project: https://wandb.ai/dudeperf3ct/torchtitan

...