Choosing a batch size and provider for LLM training
Notes on choosing appropriate batch size and compute for training LLMs
Notes on training LLMs using sharding strategies
Notes on training LLMs using data parallelism strategy
Notes on Ultra-scale Playbook - training LLM on a single GPU
One of the projects I am looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a code LLM. To limit the scope, here's a brief outline of the project:

- Mini LLM, maybe a 200M parameter model as a start
- Focusing only on Python as a language

The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for compute-optimal training. For my 200M parameter model, that works out to ~4B tokens. To start, I'll experiment with a smaller slice of 500M tokens. ...
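As a quick sanity check on that token budget, here is the arithmetic as a small Python sketch. The function and constant names are my own, just illustrating the ~20 tokens-per-parameter rule of thumb from the Chinchilla paper:

```python
# Rough Chinchilla-style token budget: ~20 training tokens per model parameter.
# Names here (TOKENS_PER_PARAM, chinchilla_optimal_tokens) are illustrative, not from any library.

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb


def chinchilla_optimal_tokens(num_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Approximate compute-optimal token count for a model with num_params parameters."""
    return num_params * tokens_per_param


model_params = 200_000_000  # the planned 200M parameter mini model
optimal = chinchilla_optimal_tokens(model_params)
print(f"~{optimal / 1e9:.1f}B tokens")  # ~4.0B tokens; early runs will use a 500M-token slice
```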