A synthesis of Notebooks 1–6. Every named computation in a transformer is classified as per-token, global mixing, or local mixing — the taxonomy that explains both the power and the cost of attention.
Summary: Every operation in a transformer falls into one of three categories that differ in how they relate tokens to one another: per-token operations process each token independently; global mixing lets every token interact with every other; local mixing restricts interaction to a fixed-size window.
Why the distinction matters. In a standard MLP or CNN, the pattern of inter-neuron interaction is fixed by the learned weights and is the same for every input. Attention is different: the weight that token $i$ places on token $j$ is recomputed from the current embeddings of both tokens on every forward pass. This content-dependent mixing is what gives transformers their expressive power — but it comes at the cost of evaluating all $N^2$ query–key pairs. The three categories make this trade-off explicit: per-token operations are cheap and local; global mixing is powerful but expensive; local mixing (introduced by SWIN) recovers efficiency by accepting a spatial inductive bias.
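The contrast can be made concrete with a small numpy sketch (all names and dimensions are illustrative): the learned projections $W_Q$, $W_K$ are fixed after training, yet the attention weights they produce change with every input, because they are recomputed from the current embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq = rng.standard_normal((d, d))  # learned, fixed after training
Wk = rng.standard_normal((d, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(X):
    # A[i, j] = weight token i places on token j, recomputed from X
    Q, K = X @ Wq, X @ Wk
    return softmax(Q @ K.T / np.sqrt(d))

X1 = rng.standard_normal((3, d))  # three tokens
X2 = rng.standard_normal((3, d))  # a different input, same learned weights

A1, A2 = attention_weights(X1), attention_weights(X2)
print(np.allclose(A1, A2))             # False: mixing is content-dependent
print(np.allclose(A1.sum(axis=1), 1))  # True: each row is a distribution
```

The same $W_Q$ and $W_K$ yield a different mixing pattern for each input, which is exactly the property a fixed-weight MLP or CNN lacks.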
Where interaction lives. Token interaction is a property of the attention sub-block alone. All other named operations — projections, LayerNorm, FFN, patch merging — are per-token: they transform each token's representation without consulting other tokens. This means that increasing model depth (more full transformer blocks) adds per-token capacity throughout each new block, but adds mixing capacity only through the additional attention sub-blocks.
Three architectures, one framework. The standard transformer (Notebooks 1–3) uses global mixing at every block. ViT (Notebook 4) applies the same idea to image patches, including a special [CLS] token that aggregates global information. SWIN (Notebooks 5–6) substitutes local mixing (W-MSA and SW-MSA) for global mixing, making attention cost linear in $N$ and enabling a feature hierarchy through patch merging. All three use identical per-token operations surrounding the attention sub-block.
Summary: The three mixing steps — $QK^T$, softmax, and weighted sum $AV$ — all scale as $O(N^2)$; SWIN's windowed counterparts reduce this to $O(NW^2)$, a 16× reduction at $N=784$ with $W=7$.
The $O(N^2)$ cost comes from $QK^T \in \mathbb{R}^{N \times N}$: each of the $N$ query vectors must be dotted against all $N$ key vectors, producing $N^2$ scores per head. At $N=784$ (ViT-S/8 on a 224×224 image with patch size 8) that is $784^2 \approx 614{,}000$ dot products per head per block. SWIN with $W=7$ ($W^2 = 49$ tokens per window) requires only $784 \times 49 = 38{,}416$ — a 16× reduction for the same sequence. The chart shows how this gap grows with $N$.
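The arithmetic above can be reproduced directly (a sketch; the counts are query–key dot products per head per block):

```python
N = 784   # tokens: a 224x224 image with patch size 8 gives a 28x28 grid
W = 7     # SWIN window side; W * W = 49 tokens per window

full_cost = N * N             # every query against every key
windowed_cost = N * (W * W)   # every query against its own window only

print(full_cost)                  # 614656
print(windowed_cost)              # 38416
print(full_cost / windowed_cost)  # 16.0, i.e. N / W**2
```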
Chart (log–log axes): full attention scales with slope 2 (quadratic); windowed attention with slope 1 (linear). Dotted vertical line marks $N=784$.
Where the parameters live. The mixing operations ($QK^T$, softmax, $AV$) are parameter-free: they evaluate scores that the learned projections produce, but carry no weights of their own. Nearly all the learned weight matrices reside in the per-token operations — the $Q$/$K$/$V$/$W_O$ projections, FFN layers, and patch embedding. Token interaction is determined by the six operations in the table below; of these, only SWIN's relative position bias carries learned parameters of its own (a small per-head table).
| Operation | Type | Cost (order) | Introduced |
|---|---|---|---|
| Attention scores $QK^T$ | Global mixing | $O(N^2 d_k)$ | NB1 |
| Softmax (normalize scores) | Global mixing | $O(N^2)$ | NB1 |
| Weighted sum $AV$ | Global mixing | $O(N^2 d_v)$ | NB1 |
| W-MSA (window self-attention) | Local mixing | $O(N W^2 d_k)$ | NB5 |
| SW-MSA (shifted window attention) | Local mixing | $O(N W^2 d_k)$ | NB5 |
| Relative position bias (SWIN) | Local mixing | $O(W^2)$ table ($(2W-1)^2$ entries per head); $O(N W^2)$ applied | NB6 |
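One way to see where the $O(N W^2)$ entries in the table come from is to partition the token grid into windows (a numpy sketch with illustrative shapes): full attention builds one $N \times N$ score matrix, while W-MSA builds $N/W^2$ independent $W^2 \times W^2$ matrices.

```python
import numpy as np

H = Wd = 28   # 28x28 patch grid -> N = 784 tokens
W = 7         # window side
d = 8         # toy head dimension
rng = np.random.default_rng(0)
x = rng.standard_normal((H, Wd, d))

# Partition the grid into non-overlapping W x W windows:
# (H, Wd, d) -> (H//W, W, Wd//W, W, d) -> (num_windows, W*W, d)
windows = (x.reshape(H // W, W, Wd // W, W, d)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, W * W, d))
print(windows.shape)  # (16, 49, 8): 16 windows of 49 tokens each

# Score matrices: one 49x49 per window instead of one 784x784
scores = windows @ windows.transpose(0, 2, 1)
print(scores.shape)   # (16, 49, 49)
print(scores.size)    # 38416 total entries = N * W**2
```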
Summary: A ViT block and a SWIN block have identical structure — per-token operations surrounding a single attention sub-block — and differ in only one place: ViT uses full self-attention (global mixing) while SWIN uses W-MSA or SW-MSA (local mixing).
The diagrams below trace data through one complete transformer block for each architecture. Every box is color-coded by operation type. The attention sub-block (row 3 in each column) is the only box that differs between ViT and SWIN. All surrounding operations are identical in structure, though they may have different learned weights.
Legend: blue = per-token; red = global mixing, $O(N^2)$; green = local mixing, $O(N \cdot W^2)$.
Pre-norm structure. In both architectures, LayerNorm is applied to the input of each sub-block (pre-norm formulation), so the residual path carries the unnormalized stream. Concretely: $X_1 = X + \mathrm{Attn}(\mathrm{LN}_1(X))$ and $X_2 = X_1 + \mathrm{FFN}(\mathrm{LN}_2(X_1))$. The two residual adds ("+ Residual" boxes) are per-token elementwise sums — no interaction, no parameters.
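The pre-norm wiring can be sketched in a few lines of numpy (a toy sketch: `attn` and `ffn` are weightless stand-ins for the real sub-blocks, and LayerNorm omits its learned scale and shift):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8
X = rng.standard_normal((N, d))

def layer_norm(Z, eps=1e-5):
    # Per-token: each row is normalized independently
    mu = Z.mean(axis=-1, keepdims=True)
    var = Z.var(axis=-1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def attn(Z):
    # Stand-in self-attention: softmax(ZZ^T / sqrt(d)) Z, no projections
    S = Z @ Z.T / np.sqrt(d)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S) / np.exp(S).sum(axis=-1, keepdims=True)
    return A @ Z

def ffn(Z):
    # Stand-in per-token FFN: elementwise nonlinearity, no weights
    return np.maximum(Z, 0.0)

# Pre-norm: LayerNorm feeds each sub-block, while the residual path
# carries the unnormalized stream.
X1 = X + attn(layer_norm(X))    # X1 = X + Attn(LN1(X))
X2 = X1 + ffn(layer_norm(X1))   # X2 = X1 + FFN(LN2(X1))
print(X2.shape)  # (4, 8)
```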
SWIN block pairs. SWIN always runs W-MSA and SW-MSA together as a pair: the regular-window pass establishes a fixed partition; the shifted-window pass bridges the boundaries it created. The diagram above shows one half of a pair. A full SWIN block pair applies the entire diagram twice — once with W-MSA in the attention slot, once with SW-MSA — giving every patch the opportunity to attend across window boundaries. See Notebook 5 for the detailed illustration.
Summary: The residual stream is the running token representation that flows through all blocks; every block reads from it, computes a delta (via mixing then per-token elaboration), and writes the delta back, so that each block's contribution is additive and persistent.
What the stream carries. Before the first block, the stream holds the patch embeddings plus positional encoding — an $N \times d$ matrix. Each block updates it in two sequential steps: $$X_1 = X + \Delta A(X), \qquad X_2 = X_1 + \Delta F(X_1),$$ where $X$ is the stream state entering the block, $\Delta A(X)$ is the attention output (which reads $X$ to form queries, keys, and values), and $\Delta F(X_1)$ is the FFN output applied to the post-attention state. The functional arguments make the dependence explicit: neither delta is a fixed vector independent of the current stream — each is computed from the state it receives. No block overwrites the stream; each only adds to it. This additive structure allows gradients to flow directly from the output back to the input embedding.
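The additive structure can be checked numerically: the block's output is exactly the incoming stream plus the two contributions, never a replacement of it (a toy numpy sketch; `delta_attn` and `delta_ffn` are illustrative stand-ins for $\Delta A$ and $\Delta F$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 4
X = rng.standard_normal((N, d))  # stream state entering the block

def delta_attn(Z):
    # Stand-in for the attention output: mixes rows (uniform average here)
    return Z.mean(axis=0, keepdims=True).repeat(len(Z), axis=0) - 0.5 * Z

def delta_ffn(Z):
    # Stand-in for the FFN output: per-token nonlinearity
    return np.tanh(Z)

dA = delta_attn(X)   # reads the current stream X
X1 = X + dA
dF = delta_ffn(X1)   # reads the post-attention stream X1
X2 = X1 + dF

# The block never overwrites the stream: the final state decomposes
# exactly into the input plus the two deltas.
print(np.allclose(X2, X + dA + dF))  # True
```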
Mixing writes context; per-token elaborates it. Only $\Delta A(X)$ can introduce information from other token positions — it is the sole mechanism by which token $i$ learns about token $j$. $\Delta F(X_1)$ operates on each row of the current stream independently, applying a nonlinear transformation to elaborate and re-encode the representation, but it cannot move information between positions. This division of labor is the central organizing principle of the transformer block and is visible in the color pattern of Section 3: one red (or green) box surrounded by blue boxes.
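The division of labor has a simple numerical signature: a per-token operation commutes with any reordering of the tokens, while a mixing operation does not (a numpy sketch; `ffn` and `mix` are illustrative stand-ins, with cumulative sum playing the role of a mixing op):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
perm = np.array([2, 0, 3, 1])  # an arbitrary reordering of the tokens

def ffn(Z):
    # Per-token stand-in: acts on each row alone
    return np.tanh(Z) + Z

def mix(Z):
    # Mixing stand-in: row i depends on rows 0..i
    return Z.cumsum(axis=0)

# Per-token ops commute with reordering the tokens:
print(np.allclose(ffn(X)[perm], ffn(X[perm])))  # True
# Mixing ops do not: row i's output depends on which rows it sees
print(np.allclose(mix(X)[perm], mix(X[perm])))  # False
```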
Detailed treatment. Notebook 3, Section 3 traces the residual stream numerically through a concrete block, showing the stream panels $X \to X_1 \to X_2$ and the contribution panels $\Delta A$, $\Delta F$ for a four-token, $d=4$ example.