Architecture choices
At ~30M parameters, LLM-V1.0 (a small-scale transformer language model for local and Colab training) is intentionally trainable on student hardware. Eight transformer layers (stacked self-attention and feed-forward blocks, the core repeating unit in GPT-style models) with eight heads and 512-dimensional embeddings balance capacity against overfitting risk on modest corpora.
An 8,192-token BPE vocabulary (byte-pair encoding, which compresses text into subword units for open-vocabulary modeling) keeps embedding tables manageable while covering the technical tokens I care about.
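As a rough sanity check on the ~30M figure, the parameter count can be sketched from the stated sizes. The layer count, head count, embedding width, and vocabulary size come from the text above; everything else is an assumption made for illustration: a GPT-2-style block with biases, a 4x feed-forward expansion, learned positional embeddings over a hypothetical 512-token context, and an output head tied to the token embedding. The names `ModelConfig` and `count_parameters` are hypothetical, not part of LLM-V1.0.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Values stated in the text.
    n_layers: int = 8
    n_heads: int = 8
    d_model: int = 512
    vocab_size: int = 8192
    # Assumptions for illustration only (not stated in the text).
    mlp_ratio: int = 4            # feed-forward hidden size = 4 * d_model
    context_length: int = 512     # learned positional embeddings
    tie_embeddings: bool = True   # output head reuses the token embedding


def count_parameters(cfg: ModelConfig) -> dict:
    """Count parameters of a GPT-2-style decoder under the assumptions above."""
    d = cfg.d_model
    d_ff = cfg.mlp_ratio * d

    token_embedding = cfg.vocab_size * d
    positional_embedding = cfg.context_length * d

    # Q, K, V and output projections, each d x d plus a bias vector.
    attention = 4 * (d * d + d)
    # Two linear layers: d -> d_ff and d_ff -> d, each with a bias.
    feed_forward = (d * d_ff + d_ff) + (d_ff * d + d)
    # Two LayerNorms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d
    per_layer = attention + feed_forward + layer_norms

    final_norm = 2 * d
    output_head = 0 if cfg.tie_embeddings else cfg.vocab_size * d

    total = (token_embedding + positional_embedding
             + cfg.n_layers * per_layer + final_norm + output_head)
    return {
        "token_embedding": token_embedding,
        "per_layer": per_layer,
        "head_dim": d // cfg.n_heads,
        "total": total,
    }


if __name__ == "__main__":
    for name, value in count_parameters(ModelConfig()).items():
        print(f"{name}: {value:,}")
```

Under these assumptions the total comes to 29,676,544, which is consistent with the ~30M figure. The token embedding table accounts for 4,194,304 of those parameters, roughly 14% of the model. If the output head were untied instead, the same sketch gives about 33.9M.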