Notebook 2: Multi-Head Attention & Positional Encoding

Extends single-head attention to multiple independent heads, then adds positional encoding so the model can distinguish token order. Uses $d_{\text{model}} = 4$, $h = 2$ heads, $d_k = 2$ dimensions per head.


1. The Limits of One Attention Head

A single softmax produces one probability distribution per query — one perspective on the sequence. Two heads, each attending in its own subspace, produce two complementary views simultaneously.

Figure: single-head attention map (left) alongside the Head 1 (center) and Head 2 (right) attention maps.

All three use $W_Q = W_K = I$ and temperature $\tau = \sqrt{d_k} = \sqrt{2}$.

Setup for this notebook. In Notebook 1 we used $d_{\text{model}} = 3$-dimensional embeddings. Multi-head attention requires $d_{\text{model}}$ divisible by the number of heads $h$. Here we upgrade to $d_{\text{model}} = 4$ with $h = 2$ heads, giving each head a $d_k = 2$-dimensional subspace to work in. Our embedding matrix $X$ is designed so that dimensions $d_0, d_1$ carry the features shared by "cat" and "sat", while $d_2, d_3$ carry the features shared by "The" and "down". The gold line in $X$ marks the boundary between the two head subspaces.

One softmax = one perspective. The single-head map (left) computes $\text{softmax}(X X^\top / \tau)$, blending all four dimensions into one attention pattern. It correctly identifies that "The" and "down" are related and that "cat" and "sat" are related — but it can only express one weighting over keys per query.

Two heads, two perspectives. Head 1 (center) attends exclusively using $d_0, d_1$ — the cat/sat feature space. Head 2 (right) attends using $d_2, d_3$ — the The/down feature space. The patterns differ: Head 1 sees the cat–sat similarity more strongly; Head 2 sees the The–down similarity. Both views are computed simultaneously and concatenated into the output, giving the model richer information than any single head could provide.
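As a concrete sketch of this contrast, the snippet below builds a small NumPy embedding matrix with the structure described above (the specific numbers are made up for illustration, not the notebook's stored values) and compares the single softmax against the two per-subspace softmaxes.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "down"]
X = np.array([
    [0.1, 0.0, 1.0, 0.8],   # "The"  : strong in d2, d3
    [1.0, 0.9, 0.1, 0.0],   # "cat"  : strong in d0, d1
    [0.9, 1.0, 0.0, 0.1],   # "sat"  : strong in d0, d1
    [0.0, 0.1, 0.8, 1.0],   # "down" : strong in d2, d3
])

d_k = 2
tau = np.sqrt(d_k)

single = softmax(X @ X.T / tau)                 # one perspective over all 4 dims
head1  = softmax(X[:, :2] @ X[:, :2].T / tau)   # cat/sat subspace only
head2  = softmax(X[:, 2:] @ X[:, 2:].T / tau)   # The/down subspace only

print(np.round(single, 2))   # one blended pattern
print(np.round(head1, 2))    # cat <-> sat attention dominates
print(np.round(head2, 2))    # The <-> down attention dominates
```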


2. Multi-Head: Split → Attend → Concat

Each head is assigned its own 2-dimensional subspace of the projected embedding space. The projection, split, per-head attention, and concatenation steps are shown below.

Figure: the multi-head attention pipeline (Split → Attend → Concat), with per-head attention panels for Head 1 and Head 2.

Multi-head attention: Split → Attend → Concat. The computation has five steps:

  1. Project. Multiply $X$ by $W_Q$, $W_K$, $W_V$ (all $d_{\text{model}} \times d_{\text{model}}$) to get full projected matrices $Q$, $K$, $V$. With $W_Q = I$ above, $Q = X$ exactly.
  2. Assign heads. Partition $Q$, $K$, $V$ column-wise into $h$ non-overlapping subspaces of dimension $d_k = d_{\text{model}} / h$ each. Head $i$ receives columns $[i \cdot d_k,\; (i{+}1) \cdot d_k)$. The gold vertical line marks this boundary.
  3. Attend (per head). Each head independently computes $\text{softmax}(Q_{h_i} K_{h_i}^\top / \sqrt{d_k})$, producing its own attention weights and output.
  4. Concatenate. Stack the $h$ output matrices column-wise back into a single $n \times d_{\text{model}}$ matrix.
  5. Project again. Multiply by $W_O$ to mix information across head outputs. We omit $W_O$ in this notebook to keep the focus on the splitting mechanism.

Why not just run $h$ independent single-head layers? The key is that the projection step happens before the split: $W_Q$, $W_K$, $W_V$ can rotate and mix all $d_{\text{model}}$ dimensions before partitioning. This means each head can attend in a learned combination of features, not just a fixed subset of the original embedding dimensions.
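A minimal end-to-end sketch of the five steps, reusing the illustrative X and softmax from the Section 1 snippet (with identity projections this reproduces the per-head patterns above; $W_O$ is omitted, as in the notebook):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, n_heads):
    n, d_model = X.shape
    d_k = d_model // n_heads            # requires d_model divisible by n_heads

    # 1. Project the full embeddings (d_model x d_model weight matrices).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V

    head_outputs = []
    for i in range(n_heads):
        # 2. Assign heads: head i gets columns [i*d_k, (i+1)*d_k).
        cols = slice(i * d_k, (i + 1) * d_k)
        Qi, Ki, Vi = Q[:, cols], K[:, cols], V[:, cols]

        # 3. Attend independently inside the head's subspace.
        weights = softmax(Qi @ Ki.T / np.sqrt(d_k))
        head_outputs.append(weights @ Vi)

    # 4. Concatenate back to an n x d_model matrix.
    # 5. The final W_O projection is omitted here, as in the notebook.
    return np.concatenate(head_outputs, axis=1)

I4 = np.eye(X.shape[1])
out = multi_head_attention(X, I4, I4, I4, n_heads=2)
print(np.round(out, 2))
```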


3. Heads in Action

Different $W_Q$, $W_K$ projections produce different, complementary attention patterns in the two heads. Select a preset and adjust temperature to explore.

Interactive figure: Head 1 and Head 2 attention maps, with a preset selector and a temperature slider (default $\tau \approx 1.41$).

Heads in action. With the Identity preset ($W_Q = W_K = I$), the partition into heads is also a partition into feature subspaces:

  • Head 1 uses $d_0, d_1$ of both queries and keys. In our embeddings, dims $d_0, d_1$ encode the cat–sat axis, so Head 1 produces high attention between "cat" and "sat".
  • Head 2 uses $d_2, d_3$. Dims $d_2, d_3$ encode the The–down axis, producing high attention between "The" and "down".

The other presets show what different $W_K$ matrices do:

  • Swap K blocks: Head 1's queries (built from dims 0–1) are matched against keys built from dims 2–3, i.e. Head 2's feature space, creating cross-feature attention. The patterns change completely relative to the natural split.
  • Head 2 flipped: Negating Head 2's key subspace inverts all dot products. Tokens that attracted in Head 2 now repel, and vice versa. Head 1 is unchanged.
  • Random: random projections produce attention patterns unrelated to the raw embedding similarity. In a real trained model, $W_Q$ and $W_K$ are likewise not tied to raw similarity; they are learned to produce whatever patterns are useful for the task.

The temperature slider controls sharpness across both heads simultaneously. The default $\tau = \sqrt{d_k} = \sqrt{2} \approx 1.41$ is the standard scaling.
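The presets can be thought of as different $W_K$ matrices. The constructions below are illustrative guesses at how "Swap K blocks" and "Head 2 flipped" could be built (they are not the notebook's stored matrices); either can be passed straight into the multi_head_attention sketch from Section 2.

```python
import numpy as np

d_model = 4
I4 = np.eye(d_model)

# "Swap K blocks": head 1's key columns are filled from input dims 2-3 and vice
# versa, so head 1's queries (dims 0-1) get matched against the other feature space.
W_K_swap = np.zeros((d_model, d_model))
W_K_swap[2:, :2] = np.eye(2)   # input dims 2-3 -> key columns 0-1 (head 1's keys)
W_K_swap[:2, 2:] = np.eye(2)   # input dims 0-1 -> key columns 2-3 (head 2's keys)

# "Head 2 flipped": negate head 2's key subspace, inverting all of its dot products.
W_K_flip = I4.copy()
W_K_flip[2:, 2:] *= -1

out_swap = multi_head_attention(X, I4, W_K_swap, I4, n_heads=2)
out_flip = multi_head_attention(X, I4, W_K_flip, I4, n_heads=2)
```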


4. Positional Encoding

Attention is permutation-invariant: reordering tokens just permutes rows and columns of the attention matrix. Sinusoidal positional encoding breaks this symmetry by adding a position-dependent signal to each embedding before attention is computed.

Permutation invariance

Permuting tokens ("sat, The, down, cat") just permutes rows and columns of the attention matrix — the attention values themselves are unchanged.

Attention computes $\text{softmax}(QK^\top / \tau)$ using only dot products between token vectors. If you reorder the tokens, $Q$ and $K$ get the same rows in a different order, so $QK^\top$ gets rows and columns permuted in the same way. After row-wise softmax, the result is the attention matrix with rows and columns permuted identically — no attention value changes, only which token is in which row/column.

This means that without positional information, a transformer cannot distinguish "The cat sat down" from "sat The down cat". Every permutation of the input produces the same set of attention weights, just shuffled.
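A quick numerical check of this, reusing the illustrative X and softmax from the Section 1 sketch: permuting the token order permutes the rows and columns of the attention matrix and changes nothing else.

```python
import numpy as np

perm = np.array([2, 0, 3, 1])      # reorder to "sat, The, down, cat"
tau = np.sqrt(2)

A      = softmax(X @ X.T / tau)               # original order
A_perm = softmax(X[perm] @ X[perm].T / tau)   # permuted order

# The permuted result is exactly the original with rows AND columns permuted.
assert np.allclose(A_perm, A[perm][:, perm])
```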

Sinusoidal positional encoding

Figure: heatmaps of $X$, PE, and $X + \text{PE}$ on a shared color scale (row labels for PE and $X + \text{PE}$ include the position index). Each position gets a unique multi-scale fingerprint; adding PE to $X$ makes position information available to the attention mechanism.

The fix is to add a position-dependent vector $\text{PE}(pos)$ to each token's embedding before computing attention:

$$\text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad \text{PE}(pos, 2i{+}1) = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

The PE values are entirely determined by this formula — not random, and not learned. The frequencies decrease geometrically: $d_0$ and $d_1$ oscillate at frequency 1 (period $2\pi \approx 6.3$ positions), while $d_2$ and $d_3$ oscillate at frequency $1/100$ (period $\approx 628$ positions). This multi-scale structure gives every position a unique fingerprint across a wide range of sequence lengths.

In our small example with only 4 tokens, you can see this scale separation directly in the PE heatmap: $d_0$ and $d_1$ sweep through nearly half a cycle ($d_1 = \cos(pos)$ runs from $+1$ at position 0 down to about $-1$ at position 3), while $d_2$ and $d_3$ barely move (changing by at most $\approx 0.01$ per step; they need hundreds of positions to complete one cycle). The fast dimensions are what primarily distinguish positions 0–3 here; the slow dimensions become informative only for much longer sequences.
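The encoding is computed directly from the formula; a small sketch for our 4-position, 4-dimensional case:

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(n_positions)[:, None]          # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]           # shape (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))     # one frequency per sin/cos pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dims: cosine
    return pe

PE = sinusoidal_pe(4, 4)
print(np.round(PE, 3))
# d0, d1 (frequency 1) change a lot across positions 0-3;
# d2, d3 (frequency 1/100) barely move.
```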

Effect on attention

After adding PE, nearby tokens have more similar position vectors (their PE components are close on the sine curves), which shifts which token pairs attract. "The" in position 0 and "cat" in position 1 are now closer in the full $X + \text{PE}$ space than they were in the raw embedding space, so attention partially reflects proximity.
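Continuing the sketch (with the same illustrative X, softmax, and sinusoidal_pe as above), the comparison looks like this; with these made-up embeddings, the weight that "The" puts on its neighbour "cat" goes up once PE is added:

```python
import numpy as np

tau = np.sqrt(2)

A_raw = softmax(X @ X.T / tau)           # content-only attention
X_pe  = X + sinusoidal_pe(*X.shape)      # add the positional fingerprint
A_pe  = softmax(X_pe @ X_pe.T / tau)     # content + position

# How much "The" (row 0) attends to the adjacent "cat" (column 1), before vs. after.
print(round(A_raw[0, 1], 3), "->", round(A_pe[0, 1], 3))
```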


5. Rotary Positional Encoding (RoPE)

Sinusoidal PE encodes absolute positions, but attention primarily needs to know how far apart two tokens are. RoPE is an alternative that encodes position by rotating query and key vectors, so the attention score depends only on the relative offset $m - n$, not on the absolute positions of either token.

The limitation of sinusoidal PE. Sinusoidal PE gives every position a unique fingerprint, but the encoding is absolute. When PE is added to the embeddings, the dot product $(x_m + \text{PE}_m)^\top(x_n + \text{PE}_n)$ expands into four cross-terms that mix content and position in ways that are hard to disentangle. More fundamentally, the model must learn from data that positions 3 and 4 are nearby while 3 and 100 are far apart — the relative gap $m - n$ is not directly visible in the attention score. The model must infer it from the absolute position embeddings it saw during training, which makes generalization to longer sequences fragile.

RoPE: rotating instead of adding. RoPE (Su et al., 2021) encodes position differently: instead of adding a position vector to each embedding, it rotates the query and key vectors by a position-dependent angle before computing the dot product. In our $d_k = 2$ case (one 2D plane), the rotation matrix for position $m$ is:

$$R(m) = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix}$$

The attention score between query at position $m$ and key at position $n$ becomes:

$$(R(m)\,\mathbf{q})^\top (R(n)\,\mathbf{k}) = \mathbf{q}^\top R(m)^\top R(n)\,\mathbf{k} = \mathbf{q}^\top R(n-m)\,\mathbf{k}$$

The rotation matrices combine, and only the relative offset survives: the score depends on $m - n$ alone. Relative position is built directly into the dot product, with no learning required, no cross-terms, and no dependence on the absolute position of either token.
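This identity is easy to verify numerically. A small sketch for the 2D case, with $\theta = 0.5$ as in the figure and arbitrary example vectors for $\mathbf{q}$ and $\mathbf{k}$:

```python
import numpy as np

def rot(angle):
    """2D rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5
q = np.array([1.0, 0.3])
k = np.array([0.4, 0.9])

def rope_score(m, n):
    """Pre-softmax score between a query at position m and a key at position n."""
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# Same offset m - n = 2 at different absolute positions: identical scores.
print(rope_score(2, 0), rope_score(5, 3), rope_score(10, 8))
```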

The left figure shows the same query vector $\mathbf{q} = [1,\, 0.3]$ rotated to each of the four token positions ($\theta = 0.5$). Each position gets a distinct direction in 2D. The right figure plots the pre-softmax attention score as a function of relative position $m - n$: it is a pure function of the offset, the same regardless of where in the sequence the query and key actually sit. With sinusoidal PE, by contrast, the score would depend on both $m$ and $n$ separately.

In practice, $d_k > 2$, so the rotation is applied independently in each consecutive pair of dimensions (multiple 2D planes), each with its own frequency $\theta$. This multi-scale structure gives RoPE the same frequency diversity as sinusoidal PE while preserving the relative-position property. RoPE is now the dominant positional encoding in modern large language models, including LLaMA, Mistral, and GPT-NeoX.
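For $d_k > 2$, one common formulation (a sketch under the standard base-10000 frequency schedule; real implementations differ in how they pair up dimensions) rotates each consecutive pair of dimensions in its own plane:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of dims of x by a position-dependent angle."""
    d_k = x.shape[-1]                          # must be even
    j = np.arange(d_k // 2)
    theta = base ** (-2.0 * j / d_k)           # one frequency per 2D plane
    angle = pos * theta
    c, s = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s      # 2D rotation in each plane
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

q = np.array([1.0, 0.3, -0.2, 0.7])
k = np.array([0.4, 0.9, 0.5, -0.1])
# Scores again depend only on the offset (both pairs below have offset 2):
print(rope(q, 7) @ rope(k, 5), rope(q, 3) @ rope(k, 1))
```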


Key Takeaways

  1. One softmax = one perspective. A single attention head can only produce one probability distribution per query. Multi-head attention runs $h$ independent heads in parallel, each free to attend to different aspects of the input (Section 1).
  2. Splitting is not approximation. The weight matrices $W_Q$, $W_K$, $W_V$ project the full $d_{\text{model}}$-dimensional embedding before splitting into per-head subspaces. Each head operates in a learned 2D subspace, not a fixed subset of raw features (Section 2).
  3. Different heads learn different relationships. In a trained model, one head might track syntactic dependencies while another tracks semantic similarity. The preset experiments show how $W_K$ alone can produce orthogonal attention patterns from the same input (Section 3).
  4. Attention is permutation-invariant without PE. Reordering tokens just permutes rows and columns of the attention matrix — no information about order is available to the model. Positional encoding breaks this symmetry by adding position-dependent signals to the embeddings (Section 4).
  5. RoPE is an alternative to sinusoidal PE that encodes relative position directly. Sinusoidal PE adds absolute position fingerprints to embeddings; the model must then learn what "nearby" means. RoPE instead rotates $Q$ and $K$ by position-dependent angles, making the attention score a function of $m - n$ only — relative position is built into the dot product with no cross-terms (Section 5).