Extends single-head attention to multiple independent heads, then adds positional encoding so the model can distinguish token order. Uses $d_{\text{model}} = 4$, $h = 2$ heads, $d_k = 2$ dimensions per head.
A single softmax produces one probability distribution per query — one perspective on the sequence. Two heads, each attending in its own subspace, produce two complementary views simultaneously.
All three attention maps (the single-head map and the two per-head maps) use $W_Q = W_K = I$ and temperature $\tau = \sqrt{d_k} = \sqrt{2}$.
Setup for this notebook. In Notebook 1 we used $d_{\text{model}} = 3$-dimensional embeddings. Multi-head attention requires $d_{\text{model}}$ divisible by the number of heads $h$. Here we upgrade to $d_{\text{model}} = 4$ with $h = 2$ heads, giving each head a $d_k = 2$-dimensional subspace to work in. Our embedding matrix $X$ is designed so that dimensions $d_0, d_1$ encode one set of features (distinguishing "cat" from "sat") while $d_2, d_3$ encode another (distinguishing "The" from "down"). The gold line in $X$ marks the boundary between the two head subspaces.
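To make the setup concrete, here is a minimal NumPy sketch. The embedding values below are illustrative stand-ins (the notebook's actual $X$ is not reproduced here); only the structure matters: $d_0, d_1$ separate "cat" from "sat", and $d_2, d_3$ separate "The" from "down".

```python
import numpy as np

tokens = ["The", "cat", "sat", "down"]
d_model, h = 4, 2
d_k = d_model // h                      # 2 dimensions per head
assert d_model % h == 0                 # d_model must be divisible by h

# Hypothetical embeddings with the structure described above:
#   columns d0, d1 -> cat/sat features, columns d2, d3 -> The/down features.
X = np.array([
    [0.1, 0.0, 1.0, 0.2],               # "The"
    [1.0, 0.2, 0.0, 0.1],               # "cat"
    [0.8, 0.4, 0.1, 0.0],               # "sat"
    [0.0, 0.1, 0.9, 0.3],               # "down"
])
```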
One softmax = one perspective. The single-head map (left) computes $\text{softmax}(X X^\top / \tau)$, blending all four dimensions into one attention pattern. It correctly identifies that "The" and "down" are related and that "cat" and "sat" are related — but it can only express one weighting over keys per query.
Two heads, two perspectives. Head 1 (center) attends exclusively using $d_0, d_1$ — the cat/sat feature space. Head 2 (right) attends using $d_2, d_3$ — the The/down feature space. The patterns differ: Head 1 sees the cat–sat similarity more strongly; Head 2 sees the The–down similarity. Both views are computed simultaneously and concatenated into the output, giving the model richer information than any single head could provide.
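A small sketch of both views, reusing the illustrative $X$ from the setup above with $W_Q = W_K = I$ and $\tau = \sqrt{2}$; each per-head map is just attention restricted to one 2-dimensional subspace.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax (numerically stabilized)."""
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

# Illustrative embeddings (rows = The, cat, sat, down), as in the setup sketch.
X = np.array([[0.1, 0.0, 1.0, 0.2],
              [1.0, 0.2, 0.0, 0.1],
              [0.8, 0.4, 0.1, 0.0],
              [0.0, 0.1, 0.9, 0.3]])
tau = np.sqrt(2)                        # tau = sqrt(d_k)

A_single = softmax_rows(X @ X.T / tau)                 # one view, all 4 dims
A_head1  = softmax_rows(X[:, :2] @ X[:, :2].T / tau)   # d0, d1: cat/sat view
A_head2  = softmax_rows(X[:, 2:] @ X[:, 2:].T / tau)   # d2, d3: The/down view
```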
Each head is assigned its own 2-dimensional subspace of the projected embedding space. The projection, split, per-head attention, and concatenation steps are shown below.
[Figure: the projection → split → per-head attention → concatenation pipeline, with attention panels for Head 1 and Head 2.]
Multi-head attention: Split → Attend → Concat. The computation has five steps:

1. Project the embeddings: $Q = X W_Q$, $K = X W_K$, $V = X W_V$.
2. Split each projection into $h$ heads of $d_k = d_{\text{model}}/h$ dimensions.
3. Attend within each head: $A_i = \text{softmax}(Q_i K_i^\top / \sqrt{d_k})$, output $A_i V_i$.
4. Concatenate the $h$ head outputs back to $d_{\text{model}}$ dimensions per token.
5. Apply a final output projection ($W_O$ in the standard formulation) to mix information across heads.
Why not just run $h$ independent single-head layers? The key is that the projection step happens before the split: $W_Q$, $W_K$, $W_V$ can rotate and mix all $d_{\text{model}}$ dimensions before partitioning. This means each head can attend in a learned combination of features, not just a fixed subset of the original embedding dimensions.
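A minimal NumPy sketch of the full pipeline, assuming the standard formulation in which a final $W_O$ mixes the concatenated heads; with identity weights it reduces to the per-head maps above.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project -> split -> attend per head -> concat -> output-project."""
    n, d_model = X.shape
    d_k = d_model // h
    # 1. Project: the weights can mix all d_model dims before the split.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    head_outputs = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        # 2. Split: each head works in its own d_k-dimensional slice.
        Qi, Ki, Vi = Q[:, sl], K[:, sl], V[:, sl]
        # 3. Attend: scaled dot-product attention inside the head.
        Ai = softmax_rows(Qi @ Ki.T / np.sqrt(d_k))
        head_outputs.append(Ai @ Vi)
    # 4. Concat: back to d_model dimensions per token.
    concat = np.concatenate(head_outputs, axis=-1)
    # 5. Output projection.
    return concat @ W_O

I4 = np.eye(4)
X = np.array([[0.1, 0.0, 1.0, 0.2],    # illustrative embeddings
              [1.0, 0.2, 0.0, 0.1],    # (rows = The, cat, sat, down)
              [0.8, 0.4, 0.1, 0.0],
              [0.0, 0.1, 0.9, 0.3]])
out = multi_head_attention(X, I4, I4, I4, I4, h=2)
```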
Different $W_Q$, $W_K$ projections produce different, complementary attention patterns in the two heads. Select a preset and adjust temperature to explore.
[Interactive figure: attention heatmaps for Head 1 and Head 2 under the selected preset and temperature.]
Heads in action. With the Identity preset ($W_Q = W_K = I$), the partition into heads is also a partition into feature subspaces: Head 1 attends using $d_0, d_1$ (the cat/sat features) and Head 2 attends using $d_2, d_3$ (the The/down features), reproducing the two complementary views shown earlier.
The other presets show what different $W_K$ matrices do: because the projection mixes dimensions before the split, changing $W_K$ changes which token pairs each head considers similar.
The temperature slider controls sharpness across both heads simultaneously. The default $\tau = \sqrt{d_k} = \sqrt{2} \approx 1.41$ is the standard scaling.
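A quick way to see the effect of $\tau$, using made-up pre-softmax scores for a single query:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.1])      # illustrative scores for one query

for tau in (0.5, np.sqrt(2), 5.0):
    print(f"tau = {tau:4.2f} ->", np.round(softmax(scores / tau), 3))
# Small tau -> sharp, nearly one-hot weights; large tau -> nearly uniform weights.
```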
Attention is permutation-invariant: reordering tokens just permutes rows and columns of the attention matrix. Sinusoidal positional encoding breaks this symmetry by adding a position-dependent signal to each embedding before attention is computed.
Permuting tokens ("sat, The, down, cat") just permutes rows and columns of the attention matrix — the attention values themselves are unchanged.
Attention computes $\text{softmax}(QK^\top / \tau)$ using only dot products between token vectors. If you reorder the tokens, $Q$ and $K$ get the same rows in a different order, so $QK^\top$ gets rows and columns permuted in the same way. After row-wise softmax, the result is the attention matrix with rows and columns permuted identically — no attention value changes, only which token is in which row/column.
This means that without positional information, a transformer cannot distinguish "The cat sat down" from "sat The down cat". Every permutation of the input produces the same set of attention weights, just shuffled.
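This is easy to verify numerically: permute the rows of any $X$ with a permutation matrix $P$, and the attention matrix comes back as $P A P^\top$.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))            # any embeddings work for this check
tau = np.sqrt(2)

perm = [2, 0, 3, 1]                    # "sat, The, down, cat"
P = np.eye(4)[perm]                    # permutation matrix

A      = softmax_rows(X @ X.T / tau)
A_perm = softmax_rows((P @ X) @ (P @ X).T / tau)

# Reordering tokens only permutes rows and columns of the attention matrix:
assert np.allclose(A_perm, P @ A @ P.T)
```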
Each position gets a unique multi-scale fingerprint. Adding PE to X makes position information available to the attention mechanism.
[Figure: heatmaps of $X$, PE, and $X + \text{PE}$ on a shared color scale; row labels for PE and $X + \text{PE}$ include the position index.]
The fix is to add a position-dependent vector $\text{PE}(pos)$ to each token's embedding before computing attention:
$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad \text{PE}(pos, 2i{+}1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$The PE values are entirely determined by this formula — not random, and not learned. The frequencies decrease geometrically: $d_0$ and $d_1$ oscillate at frequency 1 (period $2\pi \approx 6.3$ positions), while $d_2$ and $d_3$ oscillate at frequency $1/100$ (period $\approx 628$ positions). This multi-scale structure gives every position a unique fingerprint across a wide range of sequence lengths.
In our small example with only 4 tokens, you can see this scale separation directly in the PE heatmap: $d_0$ and $d_1$ swing through nearly half a cycle (the cosine in $d_1$ falls from $+1$ at position 0 to $\approx -1$ at position 3), while $d_2$ and $d_3$ barely move (changing by only $\approx 0.01$ per step — they need hundreds of positions to complete one cycle). The fast dimensions are what primarily distinguish positions 0–3 here; the slow dimensions become informative only for much longer sequences.
After adding PE, nearby tokens have more similar position vectors (their PE components are close on the sine curves), which shifts which token pairs attract. "The" in position 0 and "cat" in position 1 are now closer in the full $X + \text{PE}$ space than they were in the raw embedding space, so attention partially reflects proximity.
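A sketch of the whole fix, again using the illustrative embeddings from the earlier sketches: compute the sinusoidal PE table, add it to $X$, and compare attention before and after.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def sinusoidal_pe(n_positions, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same)."""
    PE = np.zeros((n_positions, d_model))
    for pos in range(n_positions):
        for i in range(d_model // 2):
            angle = pos / 10000 ** (2 * i / d_model)
            PE[pos, 2 * i]     = np.sin(angle)
            PE[pos, 2 * i + 1] = np.cos(angle)
    return PE

X = np.array([[0.1, 0.0, 1.0, 0.2],    # illustrative embeddings
              [1.0, 0.2, 0.0, 0.1],    # (rows = The, cat, sat, down)
              [0.8, 0.4, 0.1, 0.0],
              [0.0, 0.1, 0.9, 0.3]])
tau = np.sqrt(2)

PE = sinusoidal_pe(4, 4)
print(np.round(PE, 3))                 # d0, d1 swing quickly; d2, d3 barely move

A_raw = softmax_rows(X @ X.T / tau)                    # position-blind attention
A_pe  = softmax_rows((X + PE) @ (X + PE).T / tau)      # position-aware attention
```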
Sinusoidal PE encodes absolute positions, but attention primarily needs to know how far apart two tokens are. RoPE is an alternative that encodes position by rotating query and key vectors, so the attention score depends only on the relative offset $m - n$, not on the absolute positions of either token.
The limitation of sinusoidal PE. Sinusoidal PE gives every position a unique fingerprint, but the encoding is absolute. When PE is added to the embeddings, the dot product $(x_m + \text{PE}_m)^\top(x_n + \text{PE}_n)$ expands into four cross-terms that mix content and position in ways that are hard to disentangle. More fundamentally, the model must learn from data that positions 3 and 4 are nearby while 3 and 100 are far apart — the relative gap $m - n$ is not directly visible in the attention score. The model must infer it from the absolute position embeddings it saw during training, which makes generalization to longer sequences fragile.
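Writing out the expansion makes the entanglement explicit:

$$(x_m + \text{PE}_m)^\top (x_n + \text{PE}_n) = x_m^\top x_n + x_m^\top \text{PE}_n + \text{PE}_m^\top x_n + \text{PE}_m^\top \text{PE}_n$$

Only the first term is pure content and only the last is pure position; the middle two mix the two, and none of them exposes the offset $m - n$ directly.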
RoPE: rotating instead of adding. RoPE (Su et al., 2021) encodes position differently: instead of adding a position vector to each embedding, it rotates the query and key vectors by a position-dependent angle before computing the dot product. In our $d_k = 2$ case (one 2D plane), the rotation matrix for position $m$ is:
$$R(m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$The attention score between query at position $m$ and key at position $n$ becomes:
$$(R(m)\,\mathbf{q})^\top (R(n)\,\mathbf{k}) = \mathbf{q}^\top R(m)^\top R(n)\,\mathbf{k} = \mathbf{q}^\top R(n-m)\,\mathbf{k}$$The rotation matrices combine ($R(m)^\top = R(-m)$), and only the offset between the two positions survives. Relative position is built directly into the dot product — no learning required, no cross-terms, and no dependence on the absolute positions of either token.
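A short numerical check of this property, using the same $\mathbf{q} = [1, 0.3]$ and $\theta = 0.5$ as the figure below and an arbitrary illustrative key vector:

```python
import numpy as np

def R(m, theta=0.5):
    """2D rotation by angle m * theta (the d_k = 2 RoPE rotation)."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([[c, -s], [s, c]])

q = np.array([1.0, 0.3])               # query vector from the figure
k = np.array([0.5, 0.8])               # arbitrary illustrative key vector

# The score depends only on the offset m - n, not on m and n separately:
score_a = (R(3) @ q) @ (R(1) @ k)      # m = 3, n = 1  (offset 2)
score_b = (R(7) @ q) @ (R(5) @ k)      # m = 7, n = 5  (offset 2)
assert np.isclose(score_a, score_b)
```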
The left figure shows the same query vector $\mathbf{q} = [1,\, 0.3]$ rotated to each of the four token positions ($\theta = 0.5$). Each position gets a distinct direction in 2D. The right figure plots the pre-softmax attention score as a function of relative position $m - n$: it is a pure function of the offset, the same regardless of where in the sequence the query and key actually sit. With sinusoidal PE, by contrast, the score would depend on both $m$ and $n$ separately.
In practice, $d_k > 2$, so the rotation is applied independently in each consecutive pair of dimensions (multiple 2D planes), each with its own frequency $\theta$. This multi-scale structure gives RoPE the same frequency diversity as sinusoidal PE while preserving the relative-position property. RoPE is now the dominant positional encoding in modern large language models, including LLaMA, Mistral, and GPT-NeoX.
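A sketch of that multi-plane version, assuming the paper's frequency schedule $\theta_i = 10000^{-2i/d}$; `rope` here is a hypothetical helper for illustration, not a library function.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions of x by its own
    position-dependent angle, one 2D plane per pair (Su et al., 2021)."""
    d = x.shape[-1]
    out = np.empty_like(x, dtype=float)
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)          # per-plane rotation angle
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i]     = c * x0 - s * x1
        out[2 * i + 1] = s * x0 + c * x1
    return out

# The relative-position property carries over to d_k = 4 (two planes):
rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(rope(q, 3) @ rope(k, 1), rope(q, 13) @ rope(k, 11))
```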