Notebook 4: Vision Transformer (ViT)

Applies the Transformer from Notebooks 1–3 directly to images by converting an image into a sequence of patch embeddings. Uses an 8×8 toy image ($d_{\text{model}}=8$, 4×4 patches in a 2×2 grid) for the mechanism, then shows real attention maps from a pretrained DINO model.


1. From Pixels to Tokens

Summary: A Transformer expects a sequence of vectors. ViT creates that sequence by splitting the image into non-overlapping patches, flattening each patch into a vector, and projecting to $d_{\text{model}}$ dimensions.

Toy 8×8 grayscale image. Bold lines mark the four 4×4 patch boundaries.

Patch splitting. The image is divided into a grid of non-overlapping $P \times P$ patches. For an $H \times W$ image this gives $\tfrac{HW}{P^2}$ patches. In the toy example, $H=W=8$, $P=4$ gives $4$ patches arranged in a $2\times 2$ grid, labeled $p_{00}, p_{01}, p_{10}, p_{11}$ in row-major order. In practice, ViT typically uses $P=16$ or $P=32$ on $224\times 224$ images, giving $196$ or $49$ tokens.

Flattening and projection. Each $P\times P$ patch (here $4\times 4 = 16$ pixels) is vectorized (flattened) into a $P^2$-dimensional vector, then linearly projected to $d_{\text{model}}$ dimensions: $$e_i = \mathrm{flatten}(\text{patch}_i)\, W_{\text{proj}} + b, \qquad W_{\text{proj}} \in \mathbb{R}^{P^2 \times d_{\text{model}}}.$$ The four resulting embeddings form a $4 \times 8$ matrix — identical in format to the $4 \times d_{\text{model}}$ token matrix $X$ from Notebooks 1–3. From this point, the image sequence is handled exactly like a text sequence.
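The split-flatten-project pipeline can be sketched in a few lines of numpy. The seed 42 and scale 0.3 come from the caption above, but the exact embedding values depend on the random number generator used, so treat the numbers (not the shapes) as illustrative:

```python
import numpy as np

def patchify(img, P):
    """Split an H x W image into non-overlapping P x P patches,
    flattened into rows of an (H*W/P^2, P^2) matrix, row-major order."""
    H, W = img.shape
    rows = []
    for r in range(0, H, P):
        for c in range(0, W, P):
            rows.append(img[r:r+P, c:c+P].reshape(-1))
    return np.stack(rows)

d_model, P = 8, 4
img = np.zeros((8, 8))
# toy image: four constant patches, as in the figure above
img[:4, :4], img[:4, 4:], img[4:, :4], img[4:, 4:] = 0.9, 0.4, 0.6, 0.1

rng = np.random.default_rng(42)                         # illustrative RNG choice
W_proj = rng.normal(scale=0.3, size=(P * P, d_model))   # P^2 x d_model
E = patchify(img, P) @ W_proj                           # 4 x 8 patch embeddings
print(E.shape)  # (4, 8)
```

Each row of `E` corresponds to one patch, in the same row-major order $p_{00}, p_{01}, p_{10}, p_{11}$ used throughout.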

Patch embeddings $E$ before positional encoding ($W_{\text{proj}}$ seed 42, scale 0.3). Rows are patches; columns are the 8 embedding dimensions.

What the projection does. Because all pixels within each patch are identical in this toy image, the four flattened vectors are all constant — they differ only in their constant value ($0.9$, $0.4$, $0.6$, $0.1$). Multiplication by the random $W_{\text{proj}}$ mixes these into distinct 8-dimensional embeddings (shown above), one per patch, preserving the contrast between patches while projecting into the model's representation space. In a real model, $W_{\text{proj}}$ is learned jointly with all other parameters.


2. Positional Encoding for 2D Grids

Summary: Attention is permutation-invariant (Notebook 2). For images, position has two-dimensional structure — row and column — which a 2D positional encoding must capture.

2D sinusoidal PE for the four patches. Dims 0–3 encode row index; dims 4–7 encode column index (dashed separator). Rows $p_{00}$ and $p_{01}$ (both at $r=0$) have identical dims 0–3; patches $p_{00}$ and $p_{10}$ (both at $c=0$) have identical dims 4–7.

Why positional encoding is needed. As shown in Notebook 2, the attention mechanism treats its input as a set of tokens: shuffling rows of $X$ produces shuffled but otherwise identical attention outputs. Without positional encoding, a model cannot distinguish $p_{00}$ (top-left) from $p_{11}$ (bottom-right). For images, this is particularly harmful: nearby patches share visual context, and the 2D layout is essential structure.

2D sinusoidal encoding. The standard 1D sinusoidal PE from Notebook 2 assigns each sequence position $t$ a $d_{\text{model}}$-dimensional vector. To encode a 2D grid position $(r, c)$, we split $d_{\text{model}}$ in half: the first $d_{\text{model}}/2$ dimensions carry the 1D sinusoidal encoding of the row index $r$, and the last $d_{\text{model}}/2$ carry that of the column index $c$: $$\mathrm{PE}(r,c) = \bigl[\,\mathrm{PE}_{1\mathrm{D}}(r)\;\big|\;\mathrm{PE}_{1\mathrm{D}}(c)\,\bigr] \in \mathbb{R}^{d_{\text{model}}}.$$ Patches in the same row share the same row-PE block (dims $0$–$3$); patches in the same column share the same column-PE block (dims $4$–$7$). The PE is added to the patch embedding before the first transformer block: $\hat{e}_i = e_i + \mathrm{PE}(r_i, c_i)$.
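A minimal sketch of this construction, assuming the standard sinusoidal convention (base $10000$, interleaved sin/cos) for the 1D building block:

```python
import numpy as np

def pe_1d(t, d):
    """Standard 1D sinusoidal PE of dimension d for scalar position t."""
    pe = np.zeros(d)
    freqs = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
    pe[0::2] = np.sin(t * freqs)
    pe[1::2] = np.cos(t * freqs)
    return pe

def pe_2d(r, c, d_model):
    """Concatenate row and column encodings: first half rows, second half cols."""
    h = d_model // 2
    return np.concatenate([pe_1d(r, h), pe_1d(c, h)])

# the four patches of the 2x2 grid, row-major
PE = np.stack([pe_2d(r, c, 8) for r in range(2) for c in range(2)])
```

Rows 0 and 1 ($p_{00}$, $p_{01}$, both at $r=0$) share dims 0–3; rows 0 and 2 ($p_{00}$, $p_{10}$, both at $c=0$) share dims 4–7, matching the figure above.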

In practice, many ViT variants use learned 2D positional embeddings rather than fixed sinusoidal ones. The principle is the same: add a position-dependent vector to each patch embedding before feeding the sequence into the transformer.

PE dot-product similarity $\mathrm{PE}(r_i,c_i) \cdot \mathrm{PE}(r_j,c_j)$. Patches sharing a row or column are more similar than diagonal neighbors.

Positional encoding preserves spatial adjacency. One useful property to check: does the PE assign more similar vectors to nearby patches? If it does, attention heads can learn to weight nearby context more heavily simply by preserving the PE structure in their $W_Q, W_K$ projections. The dot-product similarity matrix above shows that it does: the diagonal is $4.0$ (self-similarity, since each PE vector has norm $2$); patches that differ in exactly one grid coordinate ($p_{00}$–$p_{01}$ or $p_{00}$–$p_{10}$) have similarity $3.54$; the diagonal neighbor ($p_{00}$–$p_{11}$, differing in both row and column) has similarity $3.08$. Adjacency in the grid corresponds to closeness in PE space.
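The similarity values quoted above can be checked directly. This sketch rebuilds the PE under the same standard sinusoidal convention and computes the full dot-product matrix:

```python
import numpy as np

def pe_1d(t, d):
    """Standard 1D sinusoidal PE of dimension d for scalar position t."""
    pe = np.zeros(d)
    freqs = 1.0 / (10000 ** (2 * np.arange(d // 2) / d))
    pe[0::2] = np.sin(t * freqs)
    pe[1::2] = np.cos(t * freqs)
    return pe

def pe_2d(r, c, d=8):
    return np.concatenate([pe_1d(r, d // 2), pe_1d(c, d // 2)])

PE = np.stack([pe_2d(r, c) for r in range(2) for c in range(2)])
S = PE @ PE.T           # 4 x 4 dot-product similarity matrix
print(np.round(S, 2))   # diagonal 4.0; one shared coordinate 3.54; none 3.08
```

The same-row and same-column pairs tie at $\approx 3.54$ because the shared coordinate contributes a full block dot product of $2$, while the differing coordinate contributes $\cos(1) + \cos(0.01) \approx 1.54$.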


3. Application: Image Classification

Summary: Sections 1–2 convert an image to a token sequence. For classification, that sequence must be reduced to a single prediction. ViT does this by prepending a special $[\mathrm{CLS}]$ token that is trained to accumulate a global summary of all patches.

Full token sequence fed into the transformer: $[\mathrm{CLS}]$ (row 0, all zeros at initialization) followed by the four positionally-encoded patch embeddings. Dashed line separates the special token from the patch tokens.

From sequence to prediction. As described so far, the output of $N$ transformer blocks would be an $N_{\text{patches}} \times d_{\text{model}}$ matrix: one row per token. For classification we need a single fixed-size vector to pass to a linear classifier. One approach would be to pool the patch outputs (average or max over rows); ViT instead reserves one slot — the $[\mathrm{CLS}]$ token — whose sole purpose is to accumulate a global summary via attention. This gives the model a learnable aggregation strategy rather than a fixed one.

The $[\mathrm{CLS}]$ token: notation and origin. The notation $[\mathrm{CLS}]$ comes from BERT (Devlin et al. 2018), which introduced this token for sentence-level classification in NLP. The square brackets signal that it is a special marker token — not a word from the input vocabulary — and "CLS" abbreviates "classification." ViT adopted the same convention. The token is a learned parameter vector initialized here to zeros; it carries no patch content and receives no positional encoding. It is prepended to the patch sequence before the first transformer block, increasing the sequence length from $N_{\text{patches}}$ to $N_{\text{patches}} + 1$.
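Prepending the token is a one-line concatenation. Here a dummy matrix stands in for the positionally-encoded patch embeddings; in a real model `cls` would be a learned `nn.Parameter` rather than fixed zeros:

```python
import numpy as np

d_model = 8
E_pe = np.ones((4, d_model))             # stand-in for PE'd patch embeddings
cls = np.zeros((1, d_model))             # [CLS]: zeros here, learned in practice
X = np.concatenate([cls, E_pe], axis=0)  # sequence length 4 -> 5
print(X.shape)  # (5, 8)
```

Row 0 is the $[\mathrm{CLS}]$ slot; rows 1–4 are the patch tokens, matching the figure above.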

ViT pipeline diagram

How $[\mathrm{CLS}]$ aggregates information. Inside each transformer block, $[\mathrm{CLS}]$ participates in attention like any other token: it can attend to all patches (querying patch content) and be attended to by all patches (contributing to their outputs). Over $N$ blocks it accumulates a representation shaped by whatever attention patterns are useful for the downstream task. With random weights, $[\mathrm{CLS}]$ attends to all tokens but not usefully; the selective, task-specific attention patterns emerge only during training.

Classification head. After all transformer blocks, only the $[\mathrm{CLS}]$ output (row $0$ of the final token matrix) is passed to the classification head — a single linear layer $[\mathrm{CLS}]_{\text{out}} \, W_{\text{head}} + b_{\text{head}}$ that maps the $d_{\text{model}}$-dimensional vector to class logits. The patch outputs are discarded at inference time.
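A sketch of the head, with random stand-ins for the final-block output and the learned weights; the class count of 3 is a made-up example:

```python
import numpy as np

rng = np.random.default_rng(0)           # illustrative weights, not trained ones
d_model, n_classes = 8, 3
X_out = rng.normal(size=(5, d_model))    # stand-in for final output: [CLS] + 4 patches
W_head = rng.normal(size=(d_model, n_classes))
b_head = np.zeros(n_classes)

logits = X_out[0] @ W_head + b_head      # only row 0 ([CLS]) reaches the head
pred = int(np.argmax(logits))
```

Rows 1–4 of `X_out` never enter this computation, which is exactly the "patch outputs are discarded" step described above.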

Is $[\mathrm{CLS}]$ specific to classification? Yes. For dense prediction tasks (object detection, semantic segmentation) the entire patch token sequence is used, not just the CLS slot. The detection transformer DETR replaces $[\mathrm{CLS}]$ with a set of learned query tokens, one per candidate object. More recently, DINOv2 added register tokens — extra learned tokens, similar in spirit to $[\mathrm{CLS}]$, that absorb artifacts caused by certain patches accumulating disproportionate attention. The $[\mathrm{CLS}]$ pattern is thus one solution to the aggregation problem, well-matched to single-label classification but not the only option.


4. Attention Maps on Images

Summary: With trained weights, each attention head develops a distinct focus. DINO-trained ViTs are known for especially interpretable attention: heads spontaneously learn to attend to semantically meaningful regions without any class labels during training.

DINO: self-supervised visual attention. DINO (Self-DIstillation with NO labels, Caron et al. 2021) trains a ViT using a self-supervised objective — no class labels, only the constraint that different crops of the same image produce consistent representations. Despite this, the last-block $[\mathrm{CLS}]$ attention heads develop semantically meaningful foci: one head may attend to the foreground object, another to the background, another to edges. This spontaneous structure reflects what is predictively useful about the image, not any supervised signal.

Reading the attention maps. Each map shows the attention weight from $[\mathrm{CLS}]$ to each of the $28 \times 28 = 784$ patch positions in the final transformer block. Bright regions are patches the model queries most when forming its summary representation. Different heads specialize in different aspects of the image.

Quadratic cost and the path to Swin. Full attention over $N$ tokens requires computing an $N \times N$ matrix — $O(N^2)$ in both time and memory. For the DINO model above, each block attends over $784+1 = 785$ tokens, giving a $785 \times 785 \approx 616{,}000$-entry matrix per head per block. Scaling to higher-resolution images (say $448\times 448$ with $8\times 8$ patches gives $3136$ patches) quickly makes this cost prohibitive. The Swin Transformer (Notebook 5) addresses this by restricting attention to local windows, reducing the cost to $O(N)$ while preserving the ability to aggregate global context across windows through a shifted-window scheme.
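The token counts above follow from a short calculation — patch grid squared, plus one for $[\mathrm{CLS}]$:

```python
def attn_entries(n_patches):
    """Entries in one full attention matrix over n_patches tokens plus [CLS]."""
    n = n_patches + 1
    return n * n

# DINO example from the text: a 28 x 28 patch grid
print(attn_entries(28 * 28))            # 785 * 785 = 616,225
# higher-resolution case: 448 x 448 image, 8 x 8 patches -> 56 x 56 grid
print(attn_entries((448 // 8) ** 2))    # a ~16x larger matrix
```

Halving the patch size quadruples the token count and thus multiplies the attention cost by roughly sixteen.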


Key Takeaways

  1. ViT applies an unmodified Transformer to image patches. Splitting an image into $P\times P$ patches and projecting each to $d_{\text{model}}$ dimensions produces a token sequence in exactly the format the Transformer already expects. No change to the attention mechanism is needed.
  2. 2D positional encoding extends the 1D sinusoidal PE to grid coordinates. The patch position $(r,c)$ is encoded by concatenating independent row and column PEs. Without it, the model cannot distinguish patches by location; with it, nearby patches have more similar PE vectors, which attention can exploit.
  3. The $[\mathrm{CLS}]$ token is a learnable aggregation slot. Prepended to the patch sequence, it accumulates a global summary through attention over all $N$ blocks. For classification its output is passed to a linear head; for other tasks (detection, segmentation) different strategies are used.
  4. Trained attention maps can be semantically interpretable. In DINO-trained ViTs, $[\mathrm{CLS}]$ attention heads spontaneously segment foreground, background, and object parts — without supervision. This interpretability is not guaranteed by architecture; it emerges from the self-supervised training objective.
  5. Full attention is $O(N^2)$ in the number of tokens. For large images or fine patch sizes this becomes expensive. The Swin Transformer (Notebook 5) restricts attention to local windows, recovering $O(N)$ complexity while still building global context through a shifted-window scheme.

References

  • A. Dosovitskiy et al., "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale," ICLR, 2021. arXiv:2010.11929
  • M. Caron et al., "Emerging Properties in Self-Supervised Vision Transformers," ICCV, 2021. arXiv:2104.14294