Greg Buzzard

Transformer Attention: Interactive Notes

A series of interactive explanations of the Transformer architecture, built for math and ECE graduate students with a linear algebra background. Each notebook introduces one layer of the mechanism, using small hand-traceable examples with live visualizations.

Standard neural architectures differ in how they let parts of the input influence each other. A CNN limits interaction to spatial neighbors through a fixed kernel; an MLP may process tokens independently or mix them via fully connected layers, but in either case the pattern of interaction is determined by fixed weights — the same for every input. Attention makes the interaction pattern itself a function of the data: each token queries every other token, and the weight of that connection depends on how well their learned query and key vectors align — recomputed fresh for every new input. This content-dependent cross-token interaction leads to an O(N²) cost in the sequence length N: since we cannot know in advance which token pairs will matter, all pairs must be evaluated. These notes trace that idea from a bare dot-product similarity matrix through multi-head attention, the transformer block, vision transformers (ViT), and the SWIN architecture — which recovers efficiency by restricting attention to local windows.
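The core computation described above can be sketched in a few lines of NumPy. This is a minimal single-head example, not code from the notebooks themselves; the weight matrices `W_q`, `W_k`, `W_v` and all shapes are illustrative assumptions. Note that `scores` is an N×N matrix — every token is compared against every other token, which is exactly where the O(N²) cost appears.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (illustrative sketch).

    X: (N, d) token embeddings; W_q, W_k, W_v: learned projections.
    Returns an (N, d) matrix of attention-mixed values.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (N, N): all token pairs -> O(N^2)
    # Row-wise softmax turns similarity scores into attention weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # each output is a weighted mix of values

# Small hand-traceable example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Because the weights `w` are computed from the input `X` itself, changing even one token changes the entire N×N interaction pattern — unlike a CNN kernel or MLP weight matrix, which is fixed after training.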