Now that we have a pretraining dataset containing Python source code as plain text, the next task is to build a tokenizer tailored to that code.
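To make the idea concrete, here is a minimal, self-contained sketch of the core algorithm behind most code tokenizers, byte-pair encoding (BPE): start from characters, then repeatedly merge the most frequent adjacent pair. The tiny corpus and the `train_bpe`/`encode` helpers are hypothetical illustrations, not a production tokenizer (real pipelines typically use a byte-level BPE implementation from a library such as Hugging Face `tokenizers`).

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of text strings."""
    # Represent each whitespace-separated word as a tuple of symbols,
    # starting at the character level.
    words = Counter()
    for text in corpus:
        for word in text.split():
            words[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word with the chosen pair merged into one symbol.
        rewritten = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            rewritten[tuple(out)] += freq
        words = rewritten
    return merges

def encode(word, merges):
    """Tokenize a single word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Hypothetical miniature "dataset" standing in for the real corpus.
corpus = [
    "def add(a, b): return a + b",
    "def sub(a, b): return a - b",
]
merges = train_bpe(corpus, num_merges=10)
```

Because `def` and `return` recur across the corpus, their character pairs are merged early, so frequent code keywords end up as single tokens; this is the property that makes a code-specific tokenizer more compact than a general-purpose one on source files.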