Choosing a batch size and provider for LLM training

Notes on choosing appropriate batch size and compute for training LLMs

June 27, 2025 · 4 min · 752 words

Ultra-scale Playbook - ZeRO Sharding

Notes on training LLMs using sharding strategies

June 21, 2025 · 8 min · 1561 words

Ultra-scale Playbook - Data Parallelism

Notes on training LLMs using data parallelism strategy

May 17, 2025 · 5 min · 944 words

Ultra-scale Playbook - Train on a single GPU

Notes on Ultra-scale Playbook - training LLM on a single GPU

April 27, 2025 · 4 min · 803 words

Mini StarCoder2 - Data

One of the projects I am looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a code LLM. To limit the scope, here’s a brief outline of the project: a mini LLM, maybe a 200M parameter model as a start, focusing only on Python as a language. The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for optimal training. For my 200M parameter model, this means ~4B tokens would be ideal. To start, I’ll experiment with a smaller slice of 500M tokens. ...

6 min · 1126 words · dudeperf3ct
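
A minimal sketch of the Chinchilla token-budget arithmetic mentioned in the excerpt above (roughly 20 training tokens per model parameter). The 200M parameter count and the 500M-token starter slice come from the excerpt; the function name and everything else in the snippet are illustrative assumptions, not code from the original project.

```python
def chinchilla_token_budget(num_params: int, tokens_per_param: int = 20) -> int:
    """Approximate Chinchilla-optimal training tokens for a given model size.

    Illustrative helper, not from the original post: budget = params * 20.
    """
    return num_params * tokens_per_param


if __name__ == "__main__":
    params = 200_000_000          # ~200M parameter model from the excerpt
    budget = chinchilla_token_budget(params)
    starter_slice = 500_000_000   # smaller slice to experiment with first

    print(f"Chinchilla-optimal budget: ~{budget / 1e9:.1f}B tokens")        # ~4.0B
    print(f"Starter slice covers {starter_slice / budget:.0%} of that budget")
```

Running this reproduces the numbers in the excerpt: 200M parameters times 20 gives ~4B tokens, and the 500M-token starter slice is about 13% of that budget.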