Notebook 7: Operations at a Glance

A synthesis of Notebooks 1–6. Every named computation in a transformer is classified as per-token, global mixing, or local mixing — the taxonomy that explains both the power and the cost of attention.


1. Three Kinds of Computation

Summary: Every operation in a transformer falls into one of three categories that differ in how they relate tokens to one another: per-token operations process each token independently; global mixing lets every token interact with every other; local mixing restricts interaction to a fixed-size window.

Per-token: Each token is processed independently. Cost scales as $O(N)$ or $O(Nd)$. No information moves between positions.
Global mixing: Every token attends to every other token. Cost scales as $O(N^2)$. This is the bottleneck for long sequences.
Local mixing: Tokens attend only within a window of $W \times W$ tokens. Cost scales as $O(N W^2)$, linear in $N$ for fixed $W$.

Why the distinction matters. In a standard MLP or CNN, the pattern of inter-neuron interaction is fixed by the learned weights and is the same for every input. Attention is different: the weight that token $i$ places on token $j$ is recomputed from the current embeddings of both tokens on every forward pass. This content-dependent mixing is what gives transformers their expressive power — but it comes at the cost of evaluating all $N^2$ query–key pairs. The three categories make this trade-off explicit: per-token operations are cheap and local; global mixing is powerful but expensive; local mixing (introduced by SWIN) recovers efficiency by accepting a spatial inductive bias.
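This content dependence is easy to see directly. A minimal NumPy sketch (random matrices stand in for learned projections): the $N \times N$ attention matrix is recomputed from whatever embeddings arrive, so two different inputs produce two different mixing patterns, whereas an MLP's weights would be identical for both.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Illustrative projection weights (random here; learned in practice).
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attention_weights(X):
    # Content-dependent mixing: the N x N weight matrix is recomputed
    # from the current token embeddings on every forward pass.
    Q, K = X @ Wq, X @ Wk
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    return A / A.sum(axis=-1, keepdims=True)

X1 = rng.normal(size=(3, d))
X2 = rng.normal(size=(3, d))
A1, A2 = attention_weights(X1), attention_weights(X2)
assert np.allclose(A1.sum(axis=-1), 1.0)  # each row is a distribution
assert not np.allclose(A1, A2)            # mixing pattern differs per input
```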

Where interaction lives. Token interaction is a property of the attention sub-block alone. All other named operations — projections, LayerNorm, FFN, patch merging — are per-token: they transform each token's representation without consulting other tokens. Increasing model depth (more full transformer blocks) therefore adds per-token capacity throughout, but adds mixing only through the additional attention sub-blocks.

Three architectures, one framework. The standard transformer (Notebooks 1–3) uses global mixing at every block. ViT (Notebook 4) applies the same idea to image patches, including a special [CLS] token that aggregates global information. SWIN (Notebooks 5–6) substitutes local mixing (W-MSA and SW-MSA) for global mixing, making attention cost linear in $N$ and enabling a feature hierarchy through patch merging. All three use identical per-token operations surrounding the attention sub-block.


2. Cost and the Mixing Operations

Summary: The three mixing steps — $QK^T$, softmax, and weighted sum $AV$ — all scale as $O(N^2)$; SWIN's windowed counterparts reduce this to $O(NW^2)$, a 16× reduction at $N=784$ with $W=7$.

The $O(N^2)$ cost comes from $QK^T \in \mathbb{R}^{N \times N}$: each of the $N$ query vectors must be dotted against all $N$ key vectors, producing $N^2$ scores per head. At $N=784$ (ViT-S/8 on a 224×224 image with patch size 8) that is $784^2 \approx 614{,}000$ dot products per head per block. SWIN with $W=7$ ($W^2 = 49$ tokens per window) requires only $784 \times 49 = 38{,}416$ — a 16× reduction for the same sequence. The chart shows how this gap grows with $N$.
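A quick arithmetic check of the counts quoted above for $N=784$, $W=7$:

```python
# Score counts for full vs. windowed attention at N = 784, W = 7.
N, W = 784, 7
full = N * N                            # QK^T scores in full self-attention
windowed = (N // W**2) * W**2 * W**2    # windows x (W^2 queries x W^2 keys)
print(full, windowed, full // windowed) # 614656 38416 16
```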

[Figure: log–log plot of attention cost vs. $N$. Full attention has slope 2 (quadratic); windowed attention has slope 1 (linear). Dotted line marks $N=784$.]

Where the parameters live. The mixing operations ($QK^T$, softmax, $AV$) are parameter-free: they evaluate scores that the learned projections produce, but carry no weights of their own. Nearly all the learned weight matrices reside in the per-token operations — the $Q$/$K$/$V$/$W_O$ projections, FFN layers, and patch embedding. Token interaction is entirely determined by the six operations in the table below, with no dedicated parameters.

| Operation | Type | Cost (order) | Introduced |
| --- | --- | --- | --- |
| Attention scores $QK^T$ | Global mixing | $O(N^2 d_k)$ | NB1 |
| Softmax (normalize scores) | Global mixing | $O(N^2)$ | NB1 |
| Weighted sum $AV$ | Global mixing | $O(N^2 d_v)$ | NB1 |
| W-MSA (window self-attention) | Local mixing | $O(N W^2 d_k)$ | NB5 |
| SW-MSA (shifted window attention) | Local mixing | $O(N W^2 d_k)$ | NB5 |
| Relative position bias (SWIN) | Local mixing | $O(W^4)$ table; $O(N W^2)$ applied | NB6 |
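The "parameters live per-token" claim can be made concrete with a rough tally for one block. This is a sketch under assumptions: embedding width $d=384$ (ViT-S), FFN expansion 4×, biases and SWIN's relative-position-bias table omitted.

```python
# Rough per-block parameter tally (sketch: d = 384 assumed, biases omitted).
d, ffn_mult = 384, 4
per_token_params = {
    "Wq": d * d, "Wk": d * d, "Wv": d * d, "Wo": d * d,  # projections
    "ffn_in": d * ffn_mult * d,    # d -> 4d expansion
    "ffn_out": ffn_mult * d * d,   # 4d -> d contraction
}
mixing_params = {"QK^T": 0, "softmax": 0, "AV": 0}  # parameter-free
print(sum(per_token_params.values()))  # 1769472
print(sum(mixing_params.values()))     # 0
```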

3. A Transformer Block in Full

Summary: A ViT block and a SWIN block have identical structure — per-token operations surrounding a single attention sub-block — and differ in only one place: ViT uses full self-attention (global mixing) while SWIN uses W-MSA or SW-MSA (local mixing).

The diagrams below trace data through one complete transformer block for each architecture, with every box color-coded by operation type. The attention sub-block is the only box that differs between ViT and SWIN. All surrounding operations are identical in structure, though they may have different learned weights.

[Figure: one transformer block, ViT (left) vs. SWIN (right). Both columns read top to bottom:

Input X → LayerNorm₁ → Q, K, V projections → attention sub-block → Output proj. W_O → + Residual → LayerNorm₂ → FFN → + Residual → Output X

Attention sub-block (the only difference): ViT uses full self-attention (QKᵀ / softmax / AV); SWIN uses W-MSA or SW-MSA (windowed attention).

Legend: blue = per-token; red = global mixing, O(N²); green = local mixing, O(N·W²).]

Pre-norm structure. In both architectures, LayerNorm is applied to the input of each sub-block (pre-norm formulation), so the residual path carries the unnormalized stream. Concretely: $X_1 = X + \mathrm{Attn}(\mathrm{LN}_1(X))$ and $X_2 = X_1 + \mathrm{FFN}(\mathrm{LN}_2(X_1))$. The two residual adds ("+ Residual" boxes) are per-token elementwise sums — no interaction, no parameters.
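The pre-norm update can be sketched end to end in NumPy. This is a minimal single-head sketch with random stand-in weights, unparameterized LayerNorm, and ReLU standing in for the GELU that ViT and SWIN actually use.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 8

def layer_norm(X, eps=1e-5):
    # Per-token normalization over the feature dimension (no learned scale/shift).
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def attn(X, Wq, Wk, Wv, Wo):
    # Single-head self-attention: the only step that mixes token positions.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return (A @ V) @ Wo

def ffn(X, W1, W2):
    return np.maximum(X @ W1, 0) @ W2  # ReLU here; ViT/SWIN use GELU

# Illustrative random weights (learned in practice).
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
W1 = rng.normal(size=(d, 4 * d)) * 0.1
W2 = rng.normal(size=(4 * d, d)) * 0.1

X = rng.normal(size=(N, d))
X1 = X + attn(layer_norm(X), Wq, Wk, Wv, Wo)   # X1 = X + Attn(LN1(X))
X2 = X1 + ffn(layer_norm(X1), W1, W2)          # X2 = X1 + FFN(LN2(X1))
```

Note that the residual adds keep the unnormalized stream on the skip path, exactly as in the equations above.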

SWIN block pairs. SWIN always runs W-MSA and SW-MSA together as a pair: the regular-window pass establishes a fixed partition; the shifted-window pass bridges the boundaries it created. The diagram above shows one half of a pair. A full SWIN block pair applies the entire diagram twice — once with W-MSA in the attention slot, once with SW-MSA — giving every patch the opportunity to attend across window boundaries. See Notebook 5 for the detailed illustration.
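The regular and shifted partitions can be sketched with a reshape and a cyclic roll. This is a sketch only: the real SW-MSA additionally masks attention between patches that were not neighbors before the shift, which is omitted here.

```python
import numpy as np

def window_partition(x, W):
    # x: (H, W_img, C) feature map -> (num_windows, W*W, C) token groups.
    H, Wi, C = x.shape
    x = x.reshape(H // W, W, Wi // W, W, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, W * W, C)

H = Wi = 14
C, W = 3, 7
x = np.arange(H * Wi * C, dtype=float).reshape(H, Wi, C)

s = W // 2  # SWIN shifts the window grid by floor(W/2)
regular = window_partition(x, W)                                   # W-MSA
shifted = window_partition(np.roll(x, shift=(-s, -s), axis=(0, 1)), W)  # SW-MSA
print(regular.shape)  # (4, 49, 3): four 7x7 windows of 49 tokens each
```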


4. The Residual Stream

Summary: The residual stream is the running token representation that flows through all blocks; every block reads from it, computes a delta (via mixing then per-token elaboration), and writes the delta back, so that each block's contribution is additive and persistent.

What the stream carries. Before the first block, the stream holds the patch embeddings plus positional encoding — an $N \times d$ matrix. Each block updates it in two sequential steps: $$X_1 = X + \Delta A(X), \qquad X_2 = X_1 + \Delta F(X_1),$$ where $X$ is the stream state entering the block, $\Delta A(X)$ is the attention output (which reads $X$ to form queries, keys, and values), and $\Delta F(X_1)$ is the FFN output applied to the post-attention state. The functional arguments make the dependence explicit: neither delta is a fixed vector independent of the current stream — each is computed from the state it receives. No block overwrites the stream; each only adds to it. This additive structure allows gradients to flow directly from the output back to the input embedding.
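The additive structure can be verified directly: after any number of blocks, the stream equals the input plus the sum of every delta written to it. The toy deltas below are stand-ins for attention and FFN outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, L = 4, 8, 3

def delta(X):
    # Toy stand-in for an attention or FFN contribution: reads the
    # current stream state and returns an update computed from it.
    return 0.1 * np.tanh(X)

X = rng.normal(size=(N, d))
stream, deltas = X.copy(), []
for _ in range(2 * L):          # one attention delta + one FFN delta per block
    dX = delta(stream)
    deltas.append(dX)
    stream = stream + dX        # blocks add to the stream, never overwrite it

# The final stream is the input plus every delta ever written.
assert np.allclose(stream, X + sum(deltas))
```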

Mixing writes context; per-token elaborates it. Only $\Delta A(X)$ can introduce information from other token positions — it is the sole mechanism by which token $i$ learns about token $j$. $\Delta F(X_1)$ operates on each row of the current stream independently, applying a nonlinear transformation to elaborate and re-encode the representation, but it cannot move information between positions. This division of labor is the central organizing principle of the transformer block and is visible in the color pattern of Section 3: one red (or green) box surrounded by blue boxes.
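The division of labor can be tested by perturbing one token and watching where the change shows up (toy random weights, single head): the FFN output at token 0 is untouched by a change at token 3, while the attention output at token 0 is not.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 4, 6
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # toy FFN weights
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def ffn(X):
    # Per-token: applied to each row independently.
    return np.maximum(X @ W1, 0) @ W2

def attn(X):
    # Mixing: every output row depends on all input rows.
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ (X @ Wv)

X = rng.normal(size=(N, d))
Xp = X.copy()
Xp[3] += 1.0  # perturb only token 3

# FFN cannot move the perturbation to token 0; attention can.
assert np.allclose(ffn(X)[0], ffn(Xp)[0])
assert not np.allclose(attn(X)[0], attn(Xp)[0])
```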

Detailed treatment. Notebook 3, Section 3 traces the residual stream numerically through a concrete block, showing the stream panels $X \to X_1 \to X_2$ and the contribution panels $\Delta A$, $\Delta F$ for a four-token, $d=4$ example.


Key Takeaways

  1. Every transformer operation is one of three kinds: per-token (independent, $O(Nd)$), global mixing (all pairs, $O(N^2)$), or local mixing (windowed, $O(NW^2)$). All token-to-token information transfer is confined to the attention sub-block.
  2. The $O(N^2)$ bottleneck comes entirely from the $QK^T$ matrix: $N$ queries must be evaluated against $N$ keys, and this cannot be avoided in full self-attention. At $N=784$ (ViT-S/8 on 224×224) this means over 600,000 dot products per head per block.
  3. SWIN replaces global attention with windowed W-MSA and SW-MSA, reducing cost from $O(N^2)$ to $O(N W^2)$. For fixed $W=7$, this scales linearly with $N$ — not just a constant-factor improvement but a qualitatively different regime that enables high-resolution processing.
  4. Per-token operations (projections, LayerNorm, FFN, patch merging) hold nearly all the learned parameters but perform no mixing. Mixing capacity is determined solely by the number and type of attention sub-blocks.
  5. The residual stream is the thread connecting all operations: each block reads from the accumulated stream and adds two deltas — one from mixing (attention) and one from per-token elaboration (FFN). This additive structure means every layer's contribution is persistent and gradients can flow unimpeded to early layers.

References

  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention Is All You Need," NeurIPS, 2017. arXiv:1706.03762
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale," ICLR, 2021. arXiv:2010.11929
  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," ICCV, 2021. arXiv:2103.14030