Applies the Transformer from Notebooks 1–3 directly to images by converting an image into a sequence of patch embeddings. Uses an 8×8 toy image ($d_{\text{model}}=8$, 4×4 patches in a 2×2 grid) for the mechanism, then shows real attention maps from a pretrained DINO model.
Summary: A Transformer expects a sequence of vectors. ViT creates that sequence by splitting the image into non-overlapping patches, flattening each patch into a vector, and projecting to $d_{\text{model}}$ dimensions.
Toy 8×8 grayscale image. Bold lines mark the four 4×4 patch boundaries.
Patch splitting. The image is divided into a grid of non-overlapping $P \times P$ patches. For an $H \times W$ image this gives $\tfrac{HW}{P^2}$ patches. In the toy example, $H=W=8$, $P=4$ gives $4$ patches arranged in a $2\times 2$ grid, labeled $p_{00}, p_{01}, p_{10}, p_{11}$ in row-major order. In practice, ViT typically uses $P=16$ or $P=32$ on $224\times 224$ images, giving $196$ or $49$ tokens.
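The split can be sketched with a single reshape-and-transpose in numpy (a minimal version for the toy image; the block values 0.9, 0.4, 0.6, 0.1 match the patches described below):

```python
import numpy as np

# Toy 8x8 grayscale image built from four constant 4x4 blocks.
P = 4  # patch size
image = np.block([[np.full((P, P), 0.9), np.full((P, P), 0.4)],
                  [np.full((P, P), 0.6), np.full((P, P), 0.1)]])

# Split the H x W image into non-overlapping P x P patches, row-major order.
H, W = image.shape
patches = (image.reshape(H // P, P, W // P, P)
                .transpose(0, 2, 1, 3)   # (grid rows, grid cols, P, P)
                .reshape(-1, P, P))      # (num_patches, P, P)

print(patches.shape)  # (4, 4, 4): HW / P^2 = 4 patches
```

The same reshape pattern works for any $H$, $W$, $P$ with $P$ dividing both dimensions.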
Flattening and projection. Each $P\times P$ patch (here $4\times 4 = 16$ pixels) is vectorized (flattened) into a $P^2$-dimensional vector, then linearly projected to $d_{\text{model}}$ dimensions: $$e_i = \mathrm{flatten}(\text{patch}_i)\, W_{\text{proj}} + b, \qquad W_{\text{proj}} \in \mathbb{R}^{P^2 \times d_{\text{model}}}.$$ The four resulting embeddings form a $4 \times 8$ matrix — identical in format to the $4 \times d_{\text{model}}$ token matrix $X$ from Notebooks 1–3. From this point, the image sequence is handled exactly like a text sequence.
Patch embeddings $E$ before positional encoding ($W_{\text{proj}}$ seed 42, scale 0.3). Rows are patches; columns are the 8 embedding dimensions.
What the projection does. Because all pixels within each patch are identical in this toy image, the four flattened vectors are all constant — they differ only in their constant value ($0.9$, $0.4$, $0.6$, $0.1$). Multiplication by the random $W_{\text{proj}}$ mixes these into distinct 8-dimensional embeddings (shown above), one per patch, preserving the contrast between patches while projecting into the model's representation space. In a real model, $W_{\text{proj}}$ is learned jointly with all other parameters.
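A minimal numpy sketch of the flatten-and-project step. Seed 42 and scale 0.3 follow the figure caption, but whether the numbers reproduce the figure exactly depends on the original RNG convention, so treat specific values as illustrative:

```python
import numpy as np

P, d_model = 4, 8
rng = np.random.default_rng(42)  # seed 42, scale 0.3 as in the figure caption

# Four constant patches, flattened: each is a 16-dim vector of one value.
values = [0.9, 0.4, 0.6, 0.1]
flat = np.stack([np.full(P * P, v) for v in values])   # (4, 16)

# Learned jointly with the model in practice; random here.
W_proj = rng.normal(scale=0.3, size=(P * P, d_model))  # (16, 8)
b = np.zeros(d_model)

E = flat @ W_proj + b   # (4, 8): one d_model-dim embedding per patch
```

Because each toy patch is constant, every row of `E` is its pixel value times the column sums of $W_{\text{proj}}$, so the relative contrast between patches survives the projection.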
Summary: Attention is permutation-invariant (Notebook 2). For images, position has two-dimensional structure — row and column — which a 2D positional encoding must capture.
2D sinusoidal PE for the four patches. Dims 0–3 encode row index; dims 4–7 encode column index (dashed separator). Rows $p_{00}$ and $p_{01}$ (both at $r=0$) have identical dims 0–3; patches $p_{00}$ and $p_{10}$ (both at $c=0$) have identical dims 4–7.
Why positional encoding is needed. As shown in Notebook 2, the attention mechanism treats its input as a set of tokens: shuffling rows of $X$ produces shuffled but otherwise identical attention outputs. Without positional encoding, a model cannot distinguish $p_{00}$ (top-left) from $p_{11}$ (bottom-right). For images, this is particularly harmful: nearby patches share visual context, and the 2D layout is essential structure.
2D sinusoidal encoding. The standard 1D sinusoidal PE from Notebook 2 assigns each sequence position $t$ a $d_{\text{model}}$-dimensional vector. To encode a 2D grid position $(r, c)$, we split $d_{\text{model}}$ in half: the first $d_{\text{model}}/2$ dimensions carry the 1D sinusoidal encoding of the row index $r$, and the last $d_{\text{model}}/2$ carry that of the column index $c$: $$\mathrm{PE}(r,c) = \bigl[\,\mathrm{PE}_{1\mathrm{D}}(r)\;\big|\;\mathrm{PE}_{1\mathrm{D}}(c)\,\bigr] \in \mathbb{R}^{d_{\text{model}}}.$$ Patches in the same row share the same row-PE block (dims $0$–$3$); patches in the same column share the same column-PE block (dims $4$–$7$). The PE is added to the patch embedding before the first transformer block: $\hat{e}_i = e_i + \mathrm{PE}(r_i, c_i)$.
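The split-and-concatenate construction can be sketched directly (a minimal numpy version; `pe_1d` follows the standard sinusoidal formula from Notebook 2):

```python
import numpy as np

def pe_1d(t, d):
    """Standard 1D sinusoidal encoding of position t in d dimensions."""
    i = np.arange(d // 2)
    freq = 1.0 / 10000 ** (2 * i / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(t * freq)   # even dims: sine
    pe[1::2] = np.cos(t * freq)   # odd dims: cosine
    return pe

def pe_2d(r, c, d_model):
    """Row encoding in the first half, column encoding in the second half."""
    half = d_model // 2
    return np.concatenate([pe_1d(r, half), pe_1d(c, half)])

# The four patches of the 2x2 grid, row-major order.
PE = np.stack([pe_2d(r, c, 8) for r in range(2) for c in range(2)])
print(PE.shape)  # (4, 8)
```

Rows 0 and 1 ($p_{00}$, $p_{01}$, same row index) agree on dims 0–3; rows 0 and 2 ($p_{00}$, $p_{10}$, same column index) agree on dims 4–7, matching the figure.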
In practice, many ViT variants use learned 2D positional embeddings rather than fixed sinusoidal ones. The principle is the same: add a position-dependent vector to each patch embedding before feeding the sequence into the transformer.
PE dot-product similarity $\mathrm{PE}(r_i,c_i) \cdot \mathrm{PE}(r_j,c_j)$. Patches sharing a row or column are more similar than diagonal neighbors.
Positional encoding preserves spatial adjacency. One useful property to check: does the PE assign more similar vectors to nearby patches? If it does, attention heads can learn to weight nearby context more heavily simply by preserving the PE structure in their $W_Q, W_K$ projections. The dot-product similarity matrix above shows that it does: the diagonal is $4.0$ (self-similarity, since each PE vector has norm $2$); patches that differ in exactly one grid coordinate ($p_{00}$–$p_{01}$ or $p_{00}$–$p_{10}$) have similarity $3.54$; the diagonal neighbor ($p_{00}$–$p_{11}$, differing in both row and column) has similarity $3.08$. Adjacency in the grid corresponds to closeness in PE space.
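The similarity values quoted above can be verified numerically (same PE construction as before, in compact form):

```python
import numpy as np

def pe_1d(t, d):
    # Interleaved sin/cos pairs, standard sinusoidal frequencies.
    i = np.arange(d // 2)
    freq = 1.0 / 10000 ** (2 * i / d)
    return np.stack([np.sin(t * freq), np.cos(t * freq)], axis=-1).ravel()

# 2D PE: row encoding in dims 0-3, column encoding in dims 4-7.
PE = np.stack([np.concatenate([pe_1d(r, 4), pe_1d(c, 4)])
               for r in range(2) for c in range(2)])

S = PE @ PE.T  # pairwise dot-product similarity
print(np.round(S, 2))
# Diagonal 4.0; same-row/same-column pairs ~3.54; diagonal pair ~3.08
```

Each 1D half has squared norm $2$ (one $\sin^2 + \cos^2$ pair per frequency), which is where the self-similarity of $4.0$ comes from.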
Summary: Sections 1–2 convert an image to a token sequence. For classification, that sequence must be reduced to a single prediction. ViT does this by prepending a special $[\mathrm{CLS}]$ token that is trained to accumulate a global summary of all patches.
Full token sequence fed into the transformer: $[\mathrm{CLS}]$ (row 0, all zeros at initialization) followed by the four positionally-encoded patch embeddings. Dashed line separates the special token from the patch tokens.
From sequence to prediction. As described so far, the output of $N$ transformer blocks would be an $N_{\text{patches}} \times d_{\text{model}}$ matrix: one row per token. For classification we need a single fixed-size vector to pass to a linear classifier. One approach would be to pool the patch outputs (average or max over rows); ViT instead reserves one slot — the $[\mathrm{CLS}]$ token — whose sole purpose is to accumulate a global summary via attention. This gives the model a learnable aggregation strategy rather than a fixed one.
The $[\mathrm{CLS}]$ token: notation and origin. The notation $[\mathrm{CLS}]$ comes from BERT (Devlin et al. 2018), which introduced this token for sentence-level classification in NLP. The square brackets signal that it is a special marker token — not a word from the input vocabulary — and "CLS" abbreviates "classification." ViT adopted the same convention. The token is a learned parameter vector initialized here to zeros; it carries no patch content and receives no positional encoding. It is prepended to the patch sequence before the first transformer block, increasing the sequence length from $N_{\text{patches}}$ to $N_{\text{patches}} + 1$.
How $[\mathrm{CLS}]$ aggregates information. Inside each transformer block, $[\mathrm{CLS}]$ participates in attention like any other token: it can attend to all patches (querying patch content) and be attended to by all patches (contributing to their outputs). Over $N$ blocks it accumulates a representation shaped by whatever attention patterns are useful for the downstream task. With random weights, $[\mathrm{CLS}]$ attends to all tokens but not usefully; the selective, task-specific attention patterns emerge only during training.
Classification head. After all transformer blocks, only the $[\mathrm{CLS}]$ output (row $0$ of the final token matrix) is passed to the classification head — a single linear layer $[\mathrm{CLS}]_{\text{out}} \, W_{\text{head}} + b_{\text{head}}$ that maps the $d_{\text{model}}$-dimensional vector to class logits. The patch outputs are discarded at inference time.
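A shape-level sketch of the $[\mathrm{CLS}]$ mechanics in numpy; the transformer blocks themselves are elided, and `n_classes` is an illustrative value:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_patches, n_classes = 8, 4, 3  # toy sizes

X = rng.normal(size=(n_patches, d_model))  # positionally-encoded patch embeddings
cls = np.zeros((1, d_model))               # [CLS]: learned in practice, zeros here
tokens = np.concatenate([cls, X], axis=0)  # (n_patches + 1, d_model)

# ... tokens would pass through N transformer blocks here ...
out = tokens                               # stand-in for the blocks' output

# The head reads only row 0 ([CLS]); patch outputs are discarded.
W_head = rng.normal(size=(d_model, n_classes))
b_head = np.zeros(n_classes)
logits = out[0] @ W_head + b_head          # (n_classes,)
```

Only the sequence-length bump from $N_{\text{patches}}$ to $N_{\text{patches}}+1$ and the row-0 readout matter here; everything in between is the unchanged transformer from Notebooks 1–3.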
Is $[\mathrm{CLS}]$ specific to classification? Yes. For dense prediction tasks (object detection, semantic segmentation) the entire patch token sequence is used, not just the CLS slot. The detection transformer DETR replaces $[\mathrm{CLS}]$ with a set of learned query tokens, one per candidate object. More recently, DINOv2 added register tokens — extra learned tokens similar in spirit to $[\mathrm{CLS}]$ that absorb artifacts caused by certain patches accumulating disproportionate attention. The $[\mathrm{CLS}]$ pattern is thus one solution to the aggregation problem, well-matched to single-label classification but not the only option.
Summary: With trained weights, each attention head develops a distinct focus. DINO-trained ViTs are known for especially interpretable attention: heads spontaneously learn to attend to semantically meaningful regions without any class labels during training.
DINO: self-supervised visual attention. DINO (Self-DIstillation with NO labels, Caron et al. 2021) trains a ViT using a self-supervised objective — no class labels, only the constraint that different crops of the same image produce consistent representations. Despite this, the last-block $[\mathrm{CLS}]$ attention heads develop semantically meaningful foci: one head may attend to the foreground object, another to the background, another to edges. This spontaneous structure reflects what is predictively useful about the image, not any supervised signal.
Reading the attention maps. Each map shows the attention weight from $[\mathrm{CLS}]$ to each of the $28 \times 28 = 784$ patch positions in the final transformer block. Bright regions are patches the model queries most when forming its summary representation. Different heads specialize in different aspects of the image.
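Given the final block's attention tensor, each map is just one row of it. A sketch assuming a tensor of shape `(n_heads, N+1, N+1)` with token 0 as $[\mathrm{CLS}]$; random weights stand in for a real model's attention here:

```python
import numpy as np

# Stand-in final-block attention: (n_heads, N+1, N+1), rows softmax-normalised.
n_heads, grid = 6, 28
N = grid * grid  # 784 patch tokens, plus one [CLS] token
scores = np.random.default_rng(1).normal(size=(n_heads, N + 1, N + 1))
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

# Attention FROM [CLS] (query row 0) TO each patch token (columns 1..N),
# reshaped to the 28x28 patch grid: one map per head.
cls_attn = attn[:, 0, 1:].reshape(n_heads, grid, grid)
```

With a trained DINO model, plotting each head's `cls_attn` slice produces the maps shown above; with random weights the maps are featureless, illustrating the point that selective attention is learned.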
Quadratic cost and the path to Swin. Full attention over $N$ tokens requires computing an $N \times N$ matrix — $O(N^2)$ in both time and memory. For the DINO model above, each block attends over $784+1 = 785$ tokens, giving a $785 \times 785 \approx 616{,}000$-entry matrix per head per block. Scaling to higher-resolution images (say $448\times 448$ with $8\times 8$ patches gives $3136$ patches) makes this cost quickly prohibitive. The Swin Transformer (Notebook 5) addresses this by restricting attention to local windows, reducing the cost to $O(N)$ for a fixed window size while preserving the ability to aggregate global context across windows through a shifted-window scheme.
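The arithmetic can be checked with a back-of-envelope comparison; a 49-token window corresponds to the $7\times 7$ windows commonly used in Swin-style models:

```python
# Full attention: every token attends to every token -> N^2 entries.
def full_attn_entries(n_tokens):
    return n_tokens * n_tokens

# Windowed attention: N tokens in disjoint windows of M tokens each
# -> (N/M) windows, each with an M x M matrix -> N * M entries, O(N).
def windowed_attn_entries(n_tokens, window):
    return (n_tokens // window) * window * window

n = 785                        # 28*28 patches + [CLS] for the DINO model above
print(full_attn_entries(n))    # 616225 ~ 616,000 entries per head per block

n_hi = 3136                    # 448x448 image with 8x8 patches
print(full_attn_entries(n_hi))           # 9834496: ~16x the cost
print(windowed_attn_entries(n_hi, 49))   # 153664: 7x7-token windows
```

Doubling the resolution quadruples $N$ and hence multiplies the full-attention cost by 16, while the windowed cost only quadruples.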