
Journal

Daily notes


08-31-2025 ‣ Worked on quantization-aware training (QAT) and LoRA on GPT-2, exploring training stability and performance across bit-widths. Cyclic precision training notably improved the lower-precision settings (likely by encouraging convergence to wider minima), with a 5-bit model outperforming the 8-bit baseline. A rough sketch of the idea is below. Repo
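
A minimal sketch of cyclic precision training on top of QAT, assuming symmetric per-tensor fake quantization with a straight-through estimator. The schedule, bit-width range, and layer sizes are illustrative placeholders, not the exact setup from these experiments.

```python
import math
import torch
import torch.nn as nn


def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Forward pass uses the quantized values; backward pass sees the identity.
    return x + (q - x).detach()


def cyclic_bits(step: int, period: int, low: int = 3, high: int = 8) -> int:
    """Cosine schedule that cycles the bit-width between `low` and `high`."""
    t = (step % period) / period
    return round(low + 0.5 * (high - low) * (1 - math.cos(2 * math.pi * t)))


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized at a bit-width set per step."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.bits = 8

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, self.bits), self.bias)


# Toy training loop: update the bit-width of every quantized layer each step.
model = nn.Sequential(QuantLinear(16, 32), nn.GELU(), QuantLinear(32, 16))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(200):
    bits = cyclic_bits(step, period=50)
    for m in model.modules():
        if isinstance(m, QuantLinear):
            m.bits = bits
    x = torch.randn(8, 16)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```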
08-15-2025 ‣ Spent time revisiting transformer math; clarified that feed-forward layers operate on each token independently while attention introduces cross-token dependencies, and that multi-head attention improves expressiveness by learning multiple representations of token similarity.
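
A minimal sketch of that point, using toy dimensions: perturbing a single token changes only that position's output under the feed-forward block, but every position's output under attention, since each attention output is a weighted sum over all tokens (with each head learning its own similarity pattern).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, seq_len = 32, 4, 6

ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)
x_perturbed = x.clone()
x_perturbed[0, 0] += 1.0  # change only the first token

# Feed-forward: only the perturbed position changes, because it is applied per token.
diff_ffn = (ffn(x) - ffn(x_perturbed)).abs().sum(dim=-1)
print("FFN positions affected:", (diff_ffn > 1e-6).nonzero(as_tuple=True)[1].tolist())

# Attention: every position changes, because each output attends over all tokens.
out, _ = attn(x, x, x)
out_p, _ = attn(x_perturbed, x_perturbed, x_perturbed)
diff_attn = (out - out_p).abs().sum(dim=-1)
print("Attention positions affected:", (diff_attn > 1e-6).nonzero(as_tuple=True)[1].tolist())
```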