2-Layer Transformer Implemented in 6502 Assembly on Commodore 64
A hand-written 6502 assembly implementation runs genuine transformer inference on original Commodore 64 hardware, using roughly 25,000 int8 weights and a working attention mechanism.
A 2-layer decoder-only transformer with 25,000 int8 parameters runs on an unmodified 1 MHz Commodore 64 at roughly 60 seconds per token according to project documentation. The implementation uses hand-written 6502/6510 assembly for multi-head causal self-attention, softmax, RMSNorm, and a 20-token context window. (https://github.com/gizmo64k/soulplayer-c64)
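The repository attributes RMSNorm, softmax, and attention to hand-written 6502 assembly, which implies integer-only arithmetic throughout. As a host-side illustration (not the project's actual routine), an RMSNorm over int8 activations can be kept fully integral with an int32 accumulator and an integer square root; the exact fixed-point conventions on the C64 are assumptions here:

```python
import math

import numpy as np

def rmsnorm_int8(x_q, weight_q):
    """Integer-only RMSNorm over int8 activations.

    A sketch of the idea, not the project's 6502 routine: all arithmetic
    stays integral (int32 accumulate, integer sqrt), which is what a 1 MHz
    6502 without floating-point hardware would need.
    """
    x = x_q.astype(np.int32)
    mean_sq = int(np.sum(x * x)) // x.size      # mean of squares
    rms = max(math.isqrt(mean_sq), 1)           # integer sqrt, avoid div-by-0
    y = (x * weight_q.astype(np.int32)) // rms  # scale by learned gain
    return np.clip(y, -128, 127).astype(np.int8)
```

On the real hardware each of these steps would be unrolled into 8-bit multiply and shift loops; the NumPy version only demonstrates that no floating-point operation is required.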
The model uses 4 attention heads of 8 dimensions each, 32-dimensional embeddings, and 64-unit FFN layers with per-tensor shift-scaling quantization. It is trained with quantization-aware training over a BPE tokenizer limited to a 128-token vocabulary, with checkpoints selected by int8 output quality. A 14-bit shift on attention scores provides enough dynamic range for a 128-entry exp lookup table, preventing the uniform attention weights described in the repository tests. This aligns with integer-only inference methods in llama.cpp (https://github.com/ggerganov/llama.cpp) and the transformer architecture of Vaswani et al. (https://arxiv.org/abs/1706.03762).
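The two quantization ideas above can be sketched in host-side Python. Per-tensor shift scaling restricts scales to powers of two so dequantization is a bit shift, and the exp lookup indexes distances from the row maximum. The table layout, `FRAC_BITS`/`LUT_BITS` constants, and shift amounts below are an illustrative reconstruction, not the repository's exact routine:

```python
import numpy as np

def shift_quantize(w, max_shift=15):
    """Per-tensor shift scaling: pick the largest power-of-two scale 2**s
    such that round(w * 2**s) still fits in int8. Dequantize as q / 2**s."""
    s = 0
    while s < max_shift and np.max(np.abs(np.round(w * 2.0 ** (s + 1)))) <= 127:
        s += 1
    q = np.clip(np.round(w * 2.0 ** s), -128, 127).astype(np.int8)
    return q, s

FRAC_BITS = 14  # fixed-point headroom on attention scores, per the repo
LUT_BITS = 4    # assumed: distances in 1/16 steps, so the table spans [0, 8)
EXP_LUT = np.round(2 ** 15 * np.exp(-np.arange(128) / 2.0 ** LUT_BITS)).astype(np.int64)

def softmax_lut(scores_fx):
    """Integer-only softmax over fixed-point scores (FRAC_BITS fractional
    bits) using a 128-entry exp table. Without the extra fractional bits,
    nearby scores would collapse to the same index and yield near-uniform
    attention weights -- the failure mode the repository tests describe."""
    d = np.max(scores_fx) - scores_fx               # distance from the row max
    idx = np.clip(d >> (FRAC_BITS - LUT_BITS), 0, 127)
    e = EXP_LUT[idx]
    return e / np.sum(e)                            # normalized attention weights
```

For example, scores of 0, -1, and -2 encoded in 14-bit fixed point come out close to the float softmax values of about 0.665, 0.245, and 0.090.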
Project materials detail the end-to-end pipeline from corpus formatting to floppy-disk build, but omit direct comparison to earlier 6502 neural-network demos from the 1980s and to recent microcontroller LLM ports. The supplied emotional-support corpus and the side-by-side float-versus-int8 inference logs, recorded every 500 epochs during training, confirm output fidelity between the host Python simulation and the target hardware.
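A fidelity check of that kind reduces to dequantizing the int8 logits and comparing them against the float reference at a fixed epoch interval. The function name, log format, and scale convention below are illustrative, not the repository's actual code:

```python
import numpy as np

def log_fidelity(float_logits, int8_logits, scale, epoch, every=500):
    """Compare float logits against dequantized int8 logits, in the spirit
    of the project's every-500-epoch side-by-side training logs."""
    if epoch % every != 0:
        return None
    deq = int8_logits.astype(np.float32) / scale          # dequantize: q / scale
    max_err = float(np.max(np.abs(float_logits - deq)))   # worst logit drift
    argmax_match = int(np.argmax(float_logits)) == int(np.argmax(deq))
    print(f"epoch {epoch}: max|float - int8/scale| = {max_err:.4f}, "
          f"argmax match = {argmax_match}")
    return max_err, argmax_match
```

Tracking the argmax as well as the raw error matters for checkpoint selection: a quantized model can drift numerically yet still emit the same tokens, which is the "int8 output quality" criterion the project uses.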
AXIOM: A hand-optimized int8 transformer on 1 MHz 6502 hardware shows that quantization and assembly-level optimization can bring modern decoder-only models to 1982-era computers for hands-on experimentation.
Sources (3)
- [1] Primary Source (https://github.com/gizmo64k/soulplayer-c64)
- [2] llama.cpp (https://github.com/ggerganov/llama.cpp)
- [3] Attention Is All You Need (https://arxiv.org/abs/1706.03762)