Greg Buzzard

Transformer Attention: Interactive Notes

A series of interactive explanations of the Transformer architecture, built for math and ECE graduate students with a linear algebra background. Each notebook introduces one layer of the mechanism, using small hand-traceable examples with live visualizations.

Standard neural architectures differ in how they let parts of the input influence each other. A CNN limits interaction to spatial neighbors through a fixed kernel; an MLP may process tokens independently or mix them via fully connected layers, but in either case the pattern of interaction is determined by fixed weights — the same for every input. Attention makes the interaction pattern itself a function of the data: each token queries every other token, and the weight of that connection depends on how well their learned query and key vectors align — recomputed fresh for every new input. This content-dependent cross-token interaction leads to an O(N²) cost in the sequence length N: since we cannot know in advance which token pairs will matter, all pairs must be evaluated. These notes trace that idea from a bare dot-product similarity matrix through multi-head attention, the transformer block, vision transformers (ViT), and the SWIN architecture — which recovers efficiency by restricting attention to local windows.
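The core computation described above can be sketched in a few lines of NumPy. This is a minimal single-head example, not code from the notebooks themselves; the weight matrices `W_q`, `W_k`, `W_v` and all shapes are illustrative assumptions. Note that `scores` is an N×N matrix — every token is compared against every other token, which is exactly where the O(N²) cost appears.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention (illustrative sketch).

    X: (N, d) token embeddings; W_q, W_k, W_v: learned projections.
    Returns an (N, d) matrix of attention-mixed values.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (N, N): all token pairs -> O(N^2)
    # Row-wise softmax turns similarity scores into attention weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                               # each output is a weighted mix of values

# Small hand-traceable example: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Because the weights `w` are computed from the input `X` itself, changing even one token changes the entire N×N interaction pattern — unlike a CNN kernel or MLP weight matrix, which is fixed after training.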