A collection of ideas I promised myself I’d learn … someday :)

Mini StarCoder2 - Data

One of the projects I am looking forward to is training an LLM from scratch. For this project, I am creating a mini version of the StarCoder2 model. StarCoder2 is a code LLM. To limit the scope, here’s a brief outline of the project: a mini LLM, maybe a 200M-parameter model as a start, focusing only on Python as a language. The Chinchilla paper suggests using roughly 20x the number of parameters in tokens for optimal training. For my 200M-parameter model, this means ~4B tokens would be ideal. To start, I’ll experiment with a smaller slice of 500M tokens. ...

6 min · 1126 words · dudeperf3ct
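
As a quick sanity check on the token budget mentioned in the excerpt above, here is a minimal sketch of the arithmetic. The function name `chinchilla_tokens` and the variable names are hypothetical, and the 20 tokens-per-parameter ratio is the rough rule of thumb quoted in the excerpt, not an exact constant.

```python
def chinchilla_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Rough token budget: about 20 training tokens per model parameter."""
    return n_params * tokens_per_param


n_params = 200_000_000   # 200M-parameter model from the excerpt
pilot = 500_000_000      # 500M-token starting slice from the excerpt

optimal = chinchilla_tokens(n_params)

print(f"optimal budget: {optimal:,} tokens")         # 4,000,000,000
print(f"pilot covers:   {pilot / optimal:.1%}")      # 12.5%
print(f"pilot ratio:    {pilot / n_params} tokens per parameter")  # 2.5
```

So the 500M-token slice is about one eighth of the budget the rule of thumb suggests, which is reasonable for a first experiment but well short of the ~4B target.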

Mini StarCoder2 - Pretraining (TorchTitan)

It’s time to get the wallets out 💸💸💸💸 Useful links: Code (GitHub): https://github.com/dudeperf3ct/minicode-llm/tree/main/codellm_pretrain/torch_titan Experiments (W&B project): https://wandb.ai/dudeperf3ct/torchtitan ...

17 min · 3568 words · dudeperf3ct

Mini StarCoder2 - Tokenizer

Now that there is a pretraining dataset containing Python source code in the form of text, the next task is to create a tokenizer specific to code. Tokenization is the process of converting text into a numerical representation that a model can process. The simplest possible encoding is mapping each character to its ASCII value: >>> list("hello world".encode('ascii')) [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100] ASCII works, but it is limited to 128 symbols. Modern text includes code comments, Unicode identifiers, and emojis. That’s where Unicode comes in. ...

13 min · 2667 words · dudeperf3ct
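
The ASCII limitation described in the excerpt above can be seen directly in Python. This is a small sketch, not the tokenizer from the post: the helper `try_ascii` and the sample `snippet` are made up for illustration, and UTF-8 bytes are used here only as one common way to get integers out of arbitrary Unicode text.

```python
def try_ascii(s: str):
    """Return the ASCII codes for s, or None if s contains non-ASCII characters."""
    try:
        return list(s.encode("ascii"))
    except UnicodeEncodeError:
        return None


ascii_ids = try_ascii("hello world")
print(ascii_ids)  # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

# A line of Python with a Unicode identifier and an emoji in a comment.
snippet = "π = 3.14  # 🙂"
print(try_ascii(snippet))  # None: ASCII cannot represent π or 🙂

# UTF-8 maps every character to one or more bytes in the range 0..255.
utf8_ids = list(snippet.encode("utf-8"))
print(len(snippet), len(utf8_ids))  # more bytes than characters

# The byte sequence round-trips back to the original text.
assert bytes(utf8_ids).decode("utf-8") == snippet
```

Every value in `utf8_ids` fits in a single byte, so a vocabulary of 256 base symbols is enough to cover any input text, at the cost of longer sequences for non-ASCII characters.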

Zed Extension for Camouflage

I have been using the Zed editor, especially its Vim mode, for quite some time. VSCode was my previous editor. The experience on VSCode with VSCodeVim was laggy and slow. I’ll blame my machine for that, rather than my not-so-great Vim skills. I quite like the Camouflage VSCode extension by zeybek. This extension masks the values of sensitive environment variables in the .env file. No more “oops” moments. This seems like a perfect opportunity to learn Rust: create a similar extension for the Zed editor. ...

2 min · 410 words · dudeperf3ct
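
To make the idea of masking .env values concrete, here is a minimal sketch in Python. It is not how Camouflage or the Zed extension works: the regex `ENV_LINE`, the function `mask_env`, and the fixed eight-character mask are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: optional "export", a variable name, "=", then the value.
ENV_LINE = re.compile(r"^(\s*(?:export\s+)?[A-Za-z_][A-Za-z0-9_]*\s*=\s*)(.*)$")


def mask_env(text: str, mask: str = "*") -> str:
    """Replace each non-empty value in .env-style text with a fixed-width mask."""
    masked = []
    for line in text.splitlines():
        match = ENV_LINE.match(line)
        if match and match.group(2):
            # Keep "KEY=" visible, hide the value behind eight mask characters.
            masked.append(match.group(1) + mask * 8)
        else:
            # Comments, blank lines, and empty values are left untouched.
            masked.append(line)
    return "\n".join(masked)


sample = "API_KEY=secret\n# comment\nDEBUG="
print(mask_env(sample))
```

A fixed-width mask avoids leaking the length of the secret. A real editor extension would hide the text only at display time rather than rewrite the file contents.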