Mini StarCoder2 - Data

One of the projects I have been looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a code LLM. To limit the scope, here's a brief outline of the project:

- Mini LLM, maybe a 200M parameter model as a start
- Focusing only on Python as a language

The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for compute-optimal training. For my 200M parameter model, this means ~4B tokens would be ideal. To start, I'll experiment with a smaller slice of 500M tokens. ...
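As a quick sanity check on that arithmetic, here is a minimal sketch of the 20x rule of thumb (the helper function is mine, not from the post):

# A minimal sketch of the Chinchilla rule of thumb: roughly 20 training
# tokens per model parameter. chinchilla_tokens is a hypothetical helper.
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(200_000_000):,} tokens")  # 4,000,000,000, i.e. ~4B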

6 min · 1126 words · dudeperf3ct

Mini StarCoder2 - Pretraining (WIP)

Oh now it’s time to get the wallets out 💸💸💸💸

1 min · 10 words · dudeperf3ct

Mini StarCoder2 - Tokenizer

Now that there is a pretraining dataset containing Python source code in the form of text, the next task is to create a tokenizer specific to code.

Tokenization

Tokenization is the process of converting text into a numerical representation that a model can process. The simplest possible encoding is mapping each character to its ASCII value:

>>> list("hello world".encode('ascii'))
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

ASCII works, but it is limited to 128 symbols. Modern text includes code comments, Unicode identifiers, and emojis. That's where Unicode comes in. ...
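To illustrate that limitation (a minimal sketch of my own, not taken from the post): ASCII rejects anything outside its 128 symbols, while UTF-8 encodes the same text as a variable-length byte sequence.

# Hypothetical example: ASCII fails on non-ASCII text, UTF-8 handles it.
text = "naïve 🚀"
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xef' in position 2 ...

print(list(text.encode("utf-8")))
# [110, 97, 195, 175, 118, 101, 32, 240, 159, 154, 128]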

13 min · 2667 words · dudeperf3ct