- FeedForward: The Transformer's Other Half Beyond Attention
A deep dive into the FeedForward network and how RMSNorm, RoPE, Attention, and FeedForward assemble into a complete Transformer Block.
16 min read
- Understanding Attention: From Q, K, V to Multi-Head
A deep dive into Attention, the Transformer's core engine: grasp Q, K, V via a database-query analogy, master Multi-Head, and clear up Softmax vs RMSNorm.
13 min read
- RoPE: From Permutation Invariance to Multi-Frequency
A deep dive into RoPE (Rotary Position Embedding), the standard position encoding for modern LLMs: the math, the engineering, and floating-point precision.
12 min read
- Why Transformers Need Normalization: Gradients to RMSNorm
A deep dive into why deep neural networks need normalization, and how RMSNorm became standard in modern LLMs.
9 min read