Mini StarCoder2 - Tokenizer

Now that we have a pretraining dataset of Python source code in text form, the next task is to build a tokenizer tailored to code.

## Tokenization

Tokenization is the process of converting text into a numerical representation that a model can process. The simplest possible encoding maps each character to its ASCII value:

```python
>>> list("hello world".encode('ascii'))
[104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```

ASCII works, but it is limited to 128 symbols. Modern text includes code comments, Unicode identifiers, and emojis. That's where Unicode comes in. ...
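To make the ASCII limitation concrete, here is a minimal sketch (the example string is my own, not from the post) showing that non-ASCII characters cannot be ASCII-encoded, while UTF-8 maps every Unicode code point to one to four bytes:

```python
# Example string with an accented letter and an emoji (hypothetical example).
text = "café 🚀"

# ASCII covers only 128 symbols, so encoding this string fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    print("ASCII cannot encode:", text[e.start])  # the first offending character

# UTF-8 encodes each code point as 1-4 bytes: 'é' becomes two bytes,
# the rocket emoji becomes four.
utf8_bytes = list(text.encode("utf-8"))
print(utf8_bytes)
```

This byte-level view is the same starting point byte-pair-encoding tokenizers build on: every string reduces to a sequence of values in 0-255, which BPE then merges into larger tokens.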

13 min · 2667 words · dudeperf3ct