Mini StarCoder2 - Data

One of the projects I have been looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a code LLM. To limit the scope, here's a brief outline of the project:

- Mini LLM, maybe a 200M parameter model as a start
- Focusing only on Python as a language

The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for compute-optimal training. For my 200M parameter model, this means ~4B tokens would be ideal. To start, I'll experiment with a smaller slice of 500M tokens. ...
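As a quick sanity check on that arithmetic, here is a minimal sketch of the 20x rule of thumb (the helper function is mine, not from the post):

# A minimal sketch of the Chinchilla rule of thumb: roughly 20 training
# tokens per model parameter. chinchilla_tokens is a hypothetical helper.
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(200_000_000):,} tokens")  # 4,000,000,000, i.e. ~4B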

6 min · 1126 words · dudeperf3ct

Mini StarCoder2 - Pretraining (WIP)

Oh now it’s time to get the wallets out 💸💸💸💸

1 min · 10 words · dudeperf3ct

Mini StarCoder2 - Tokenizer

Now that there is a pretraining dataset containing Python source code in the form of text, the next task is to create a tokenizer specific to code.

Tokenization

Tokenization is the process of converting text into a numerical representation that a model can process. The simplest possible encoding is mapping each character to its ASCII value:

>>> list("hello world".encode('ascii'))
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

ASCII works, but it is limited to 128 symbols. Modern text includes code comments, Unicode identifiers, and emojis. That's where Unicode comes in. ...
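To illustrate that limitation (a minimal sketch of my own, not taken from the post): ASCII rejects anything outside its 128 symbols, while UTF-8 encodes the same text as a variable-length byte sequence.

# Hypothetical example: ASCII fails on non-ASCII text, UTF-8 handles it.
text = "naïve 🚀"
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xef' in position 2 ...

print(list(text.encode("utf-8")))
# [110, 97, 195, 175, 118, 101, 32, 240, 159, 154, 128]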

13 min · 2667 words · dudeperf3ct