Notebook 1: Attention from Scratch

Builds up scaled dot-product attention step by step using small matrices you can trace by hand.
Target audience: math / ECE grad students with a linear algebra background.


1. Dot-Product Similarity

Summary: We compute pairwise dot products between 4 token embeddings to measure how similar they are.


Words, tokens, and embeddings. In a transformer, input text is split into tokens — usually words or subword pieces. Each token is represented by an embedding: a numerical vector that encodes its meaning. In our example, the words "The", "cat", "sat", "down" are the tokens, and each row of $X$ is that token's 3-dimensional embedding. Real models use hundreds or thousands of dimensions, but the mechanics are identical.

Dot-product similarity. Each token's embedding is a row in $X$ and a column in $X^\top$. Entry $(i, j)$ of the similarity matrix is the dot product of row $i$ of $X$ with column $j$ of $X^\top$:

$$(\text{X} \cdot \text{X}^\top)_{ij} = \mathbf{x}_i \cdot \mathbf{x}_j = \|\mathbf{x}_i\| \, \|\mathbf{x}_j\| \cos\theta_{ij}$$

You can trace this visually: pick a row of $X$ on the left and the corresponding column of $X^\top$, dot them together, and you get the corresponding entry in the similarity matrix. The white lines help you track which row feeds into each entry.

Observe: In every row except "down," the diagonal entry is the largest: a token's dot product with itself is $\|\mathbf{x}_i\|^2$, which usually beats its match with any other token. The "down" row is the exception: "The" and "down" are highly correlated, but "The" has the larger magnitude, so ("The", "down") gives a larger value than ("down", "down"). Also, "cat" and "sat" have high mutual similarity (1.47) because their embeddings both have large values in $d_1$ and $d_2$, pointing in similar directions. "The" and "sat" are slightly anti-correlated (−0.10) because "The" points along $d_0$ while "sat" points away from it.

In attention, we use this to ask: "How relevant is each key to this query?"
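
If you want to check the arithmetic outside the notebook, here is a minimal NumPy sketch. The embedding values are illustrative stand-ins (not the notebook's exact numbers), chosen so the qualitative pattern described above — "cat"/"sat" aligned, "The"/"down" correlated with unequal magnitudes — comes out the same.

```python
import numpy as np

tokens = ["The", "cat", "sat", "down"]

# Stand-in 3-D embeddings (not the notebook's exact values):
# rows are tokens, columns are dimensions d0, d1, d2.
X = np.array([
    [ 1.2, 0.1, 0.0],   # "The": points along d0
    [ 0.2, 0.9, 0.8],   # "cat": large in d1 and d2
    [-0.1, 0.8, 0.9],   # "sat": large in d1 and d2, slightly against d0
    [ 0.8, 0.0, 0.1],   # "down": aligned with "The" but smaller in magnitude
])

S = X @ X.T              # S[i, j] = x_i . x_j, a symmetric 4x4 matrix
print(np.round(S, 2))    # diagonal entries are the squared norms ||x_i||^2
```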


2. From Similarity to Attention Weights

Summary: We apply softmax to convert raw dot products into probability distributions, with a temperature parameter that controls sharpness.


From similarity to attention weights. Raw dot products can be any real number, but we need a probability distribution over keys for each query. The softmax function does this:

$$\text{attention}_{ij} = \text{softmax}\!\left(\frac{\mathbf{s}_i}{\tau}\right)_{\!j} = \frac{e^{\, s_{ij}/\tau}}{\displaystyle\sum_k e^{\, s_{ik}/\tau}}$$

where $\mathbf{s}_i$ is row $i$ of the raw similarity matrix, $s_{ij}$ is its $j$-th entry, and $\tau$ is the temperature. This maps each row from $\mathbb{R}^n$ onto the probability simplex $\Delta^{n-1}$ (all entries non-negative, summing to 1).

Queries and keys. The axes on the attention heatmap are labeled "Queries" (rows) and "Keys" (columns). Think of the query as the token asking "who should I pay attention to?" and the keys as the candidates being considered. Each row of the attention matrix answers that question for one query token: it's a probability distribution over all the keys, saying how much each key matters to that query. We'll formalize this with separate $Q$/$K$/$V$ projections in Section 3, but the row-vs-column structure is already visible here.

Temperature controls how peaked the distribution is:

  • Low temperature ($\tau \to 0$) → attention concentrates on the single most similar key (approaches one-hot). The model becomes very selective.
  • High temperature ($\tau \to \infty$) → attention spreads uniformly across all keys (approaches $1/n$). The model treats all keys equally.
  • $\tau = 1$ is the "unscaled" baseline shown by default. In transformers, the standard scaling uses $\tau = \sqrt{d_k}$, which for our 3D embeddings is $\sqrt{3} \approx 1.73$.

Try dragging the slider: at $\tau = 0.1$ the attention map is nearly binary; at $\tau = 5.0$ it's nearly uniform.
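
The same experiment works in code. Here is a minimal sketch, assuming the similarity matrix `S` from the Section 1 sketch; `softmax_rows` is a helper name introduced here, not notebook code.

```python
import numpy as np

def softmax_rows(S, tau=1.0):
    """Row-wise softmax with temperature tau. Shifting by the row max
    improves numerical stability and cancels in the ratio."""
    Z = np.exp((S - S.max(axis=1, keepdims=True)) / tau)
    return Z / Z.sum(axis=1, keepdims=True)

# S is the 4x4 similarity matrix from the Section 1 sketch.
for tau in (0.1, 1.0, np.sqrt(3), 5.0):
    print(f"tau = {tau:.2f}")
    print(np.round(softmax_rows(S, tau), 2))   # each row sums to 1
# At tau = 0.10 each row is nearly one-hot; at tau = 5.00 every entry
# approaches 1/n = 0.25.
```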

Why isn't the attention matrix symmetric? The similarity matrix $X X^\top$ is symmetric ($s_{ij} = s_{ji}$), but the attention matrix is not, because softmax normalizes each row independently. Entry $\text{attention}_{ij}$ depends not just on $s_{ij}$ but on all the scores in row $i$. Different rows have different normalization constants, so even though $s_{ij} = s_{ji}$, we generally have $\text{attention}_{ij} \neq \text{attention}_{ji}$.

This asymmetry is a fundamental feature of attention: each token gets its own perspective on the sequence. Note that the specific patterns in this example have no semantic significance — they're an artifact of our hand-chosen embeddings in a 3-dimensional space. In a trained model, the embeddings and separate $Q$/$K$ projections are learned so that attention patterns capture meaningful relationships.
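
You can verify the asymmetry directly (reusing `S` and `softmax_rows` from the sketches above):

```python
A = softmax_rows(S, tau=1.0)
print(np.allclose(S, S.T))    # True:  raw similarities are symmetric
print(np.allclose(A, A.T))    # False: per-row normalization breaks symmetry
```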


3. Self-Attention: Q, K, V Projections

Summary: In Sections 1–2, each token's embedding served as both the question and the answer. Self-attention introduces learned projections so that tokens can present different information depending on their role — asking, being searched, or providing content.

[Figure: self-attention computational flow diagram]

Self-attention with $Q$, $K$, $V$ projections. This is called self-attention because queries, keys, and values all come from the same input:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

In cross-attention (which appears in encoder–decoder models), $K$ and $V$ come from a different sequence — but the mechanics are otherwise identical.

Why separate projections? In Section 1, the raw embedding $X$ served as both the question and the answer: we computed $X X^\top$ directly. This forces each token to use the same vector for "What am I looking for?" and "What do I advertise to others?" Separate projections decouple these roles:

  • $Q$ (query): "What am I looking for?"
  • $K$ (key): "What do I advertise to others?"
  • $V$ (value): "What information do I provide when attended to?"

The same token can present a completely different face depending on which role it's playing.

Reading the diagram. Follow the flow from left to right:

  1. The input $X$ branches three ways through learned weight matrices $W_Q$, $W_K$, $W_V$ to produce $Q$, $K$, $V$.
  2. $Q$ and $K^\top$ multiply — $Q$ on the left, $K^\top$ below — to produce raw scores, which are divided by $\sqrt{d_k}$ (the same temperature scaling from Section 2, with $\tau = \sqrt{d_k}$).
  3. Softmax converts each row of scores into a probability distribution: the attention weights.
  4. The attention weights multiply $V$, producing the output. Each row of the output is a weighted average of $V$'s rows, where the weights come from how well that token's query matched each key.

What the output means. Token $i$'s output vector isn't its own value — it's a blend of everyone's values, weighted by relevance. A token that strongly attends to one neighbor will have an output close to that neighbor's value vector. A token that spreads attention uniformly gets something like the average of all value vectors. This is how information flows between positions in a transformer.
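
Putting the four steps together, here is a minimal sketch, reusing `X` and `softmax_rows` from the earlier sketches. The projection weights are arbitrary random values for illustration, not the notebook's presets.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
d_model = d_k = 3

# Arbitrary illustrative weights; a trained model learns these.
W_Q = rng.normal(scale=0.5, size=(d_model, d_k))
W_K = rng.normal(scale=0.5, size=(d_model, d_k))
W_V = rng.normal(scale=0.5, size=(d_model, d_k))

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # 1. three projections of X
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # 2. scaled dot products
    A = softmax_rows(scores)                  # 3. row-wise softmax
    return A @ V, A                           # 4. outputs blend V's rows

out, A = self_attention(X, W_Q, W_K, W_V)     # X from the Section 1 sketch
print(np.round(A, 2))     # attention weights, each row summing to 1
print(np.round(out, 2))   # row i: weighted average of the value vectors
```

Cross-attention (the Section 3 aside) is the same function with $Q$ projected from one sequence and $K$, $V$ from another; self-attention is the special case where all three come from the same $X$.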


4. What Projections Buy You

Summary: Different $W_Q$ and $W_K$ projections can create fundamentally different attention patterns from the same input. Compare raw attention ($X X^\top$) with projected attention ($Q K^\top$) under several preset projections.


What projections buy you. In Section 1, each token had a single embedding that determined all of its relationships. Projections break this constraint:

  • Identity: $W_Q = W_K = I$ reproduces raw similarity exactly — projections don't have to change anything.
  • Dimension focus: Zeroing out dimensions lets the model attend based on specific features. Focusing on $d_0$ makes "The" and "down" the dominant pair, completely changing the pattern from raw similarity where "cat"–"sat" dominated.
  • Flipped similarity: $W_K = -I$ inverts all relationships. Tokens that were similar now repel. After softmax, each token attends most to its least similar neighbor.
  • Random learned: A trained model discovers $W_Q$, $W_K$ that create useful patterns for the task — patterns that may bear no resemblance to raw similarity.

The key takeaway: the $\otimes$ symbols in the Section 3 diagram aren't just mechanical steps. They're where the model learns what to pay attention to, independently of what information to extract (which is $W_V$'s job).
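
To see two of the presets concretely, here is a sketch reusing `X` and `softmax_rows` from above, with the standard $\tau = \sqrt{d_k}$ scaling; it compares the identity projection against $W_K = -I$:

```python
d = X.shape[1]
tau = np.sqrt(d)                  # the standard sqrt(d_k) scaling

A_identity = softmax_rows((X @ np.eye(d)) @ (X @ np.eye(d)).T, tau)
A_flipped  = softmax_rows((X @ np.eye(d)) @ (X @ -np.eye(d)).T, tau)

print(np.round(A_identity, 2))    # matches scaled raw similarity exactly
print(np.round(A_flipped, 2))     # all scores negated before the softmax
# Each token's strongest key under the flipped projection is its weakest
# under the identity projection:
print(A_identity.argmax(axis=1), A_flipped.argmax(axis=1))
```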


Key Takeaways

  1. Similarity is a dot product. The dot product $\mathbf{x}_i \cdot \mathbf{x}_j = \|\mathbf{x}_i\|\,\|\mathbf{x}_j\|\cos\theta_{ij}$ measures how aligned two embeddings are. Arranging all pairwise dot products gives the similarity matrix $XX^\top$ (Section 1).
  2. Softmax turns scores into attention weights. Each row of raw scores maps onto the probability simplex via softmax, producing a distribution over keys for each query. Temperature $\tau$ controls sharpness: low $\tau$ concentrates attention, high $\tau$ spreads it uniformly, and $\tau = \sqrt{d_k}$ is the standard scaling that keeps gradients healthy in high dimensions (Section 2).
  3. Self-attention decouples roles with $Q$, $K$, $V$ projections. Rather than using the same embedding as both question and answer, learned weight matrices $W_Q$, $W_K$, $W_V$ let each token present different faces depending on its role — what to look for, what to advertise, and what to provide (Section 3).
  4. Projections are where the model learns what to attend to. Different $W_Q$, $W_K$ can focus on specific features, flip similarity, or create patterns impossible with raw dot products. The output for each token is a weighted blend of all value vectors, with weights determined by query–key compatibility (Section 4).
  5. The entire operation is differentiable — just matrix multiplies and softmax — which is why transformers can be trained end-to-end with gradient descent.
