Notebook 1: Attention from Scratch

Builds up scaled dot-product attention step by step using small matrices you can trace by hand.
Target audience: math / ECE grad students with a linear algebra background.


1. Dot-Product Similarity

We compute pairwise dot products between 4 token embeddings to measure how similar they are.

Hover over any cell to see its value.

Words, tokens, and embeddings. In a transformer, input text is split into tokens — usually words or subword pieces. Each token is represented by an embedding: a numerical vector that encodes its meaning. In our example, the words "The", "cat", "sat", "down" are the tokens, and each row of $X$ is that token's 3-dimensional embedding. Real models use hundreds or thousands of dimensions, but the mechanics are identical.

Dot-product similarity. Each token's embedding is a row in $X$ and a column in $X^\top$. Entry $(i, j)$ of the similarity matrix is the dot product of row $i$ of $X$ with column $j$ of $X^\top$:

$$(X X^\top)_{ij} = \mathbf{x}_i \cdot \mathbf{x}_j = \|\mathbf{x}_i\| \, \|\mathbf{x}_j\| \cos\theta_{ij}$$

You can trace this visually: pick a row of $X$ on the left and the corresponding column of $X^\top$, dot them together, and you get the corresponding entry in the similarity matrix. The white lines help you track which row feeds into each entry.
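
As a concrete reference, here is a minimal NumPy sketch of the same computation. The embeddings below are illustrative placeholders chosen for this sketch, not the notebook's actual $X$, so the numbers will differ from the heatmap.

```python
import numpy as np

# Illustrative 3-D embeddings for the four tokens
# (placeholder values, not the notebook's actual X).
X = np.array([
    [ 1.0, 0.1, 0.0],   # "The"
    [ 0.2, 1.0, 0.8],   # "cat"
    [-0.1, 0.8, 1.0],   # "sat"
    [ 0.7, 0.3, 0.2],   # "down"
])

# Entry (i, j) is the dot product of embedding i with embedding j.
similarity = X @ X.T    # shape (4, 4), symmetric

print(similarity)
# The diagonal holds each token's squared norm, so it tends to dominate.
```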

Observe: The diagonal is largest — each token is most similar to itself. "cat" and "sat" have high mutual similarity (1.47) because their embeddings both have large values in $d_1$ and $d_2$, pointing in similar directions. "The" and "sat" are slightly anti-correlated (−0.10) because "The" points along $d_0$ while "sat" points away from it.

In attention, we use this to ask: "How relevant is each key to this query?"


2. From Similarity to Attention Weights

We apply softmax to convert raw dot products into probability distributions, with a temperature parameter that controls sharpness.


From similarity to attention weights. Raw dot products can be any real number, but we need a probability distribution over keys for each query. The softmax function does this:

$$\text{attention}_{ij} = \text{softmax}\!\left(\frac{\mathbf{s}_i}{\tau}\right)_{j} = \frac{e^{\, s_{ij}/\tau}}{\displaystyle\sum_k e^{\, s_{ik}/\tau}}$$

where $\mathbf{s}_i$ is row $i$ of the similarity matrix, $s_{ij}$ is its $j$-th entry, and $\tau$ is the temperature. Softmax is applied to each row independently, mapping it from $\mathbb{R}^n$ onto the probability simplex $\Delta^{n-1}$ (all entries non-negative, summing to 1).
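
A row-wise implementation might look like the following sketch; `temperature_softmax` is a helper name introduced here, and it reuses the placeholder $X$ from the Section 1 sketch.

```python
import numpy as np

def temperature_softmax(scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Softmax each row of `scores` after dividing by the temperature tau."""
    z = scores / tau
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

attention = temperature_softmax(X @ X.T, tau=1.0)
print(attention.sum(axis=1))   # each row sums to 1: a distribution over keys
```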

Queries and keys. The axes on the attention heatmap are labeled "Queries" (rows) and "Keys" (columns). Think of the query as the token asking "who should I pay attention to?" and the keys as the candidates being considered. Each row of the attention matrix answers that question for one query token: it's a probability distribution over all the keys, saying how much each key matters to that query. We'll formalize this with separate $Q$/$K$/$V$ projections in Section 3, but the row-vs-column structure is already visible here.

Temperature controls how peaked the distribution is:

  • Low temperature ($\tau \to 0$) → attention concentrates on the single most similar key (approaches one-hot). The model becomes very selective.
  • High temperature ($\tau \to \infty$) → attention spreads uniformly across all keys (approaches $1/n$). The model treats all keys equally.
  • $\tau = 1$ is the "unscaled" baseline shown by default. In transformers, the standard scaling uses $\tau = \sqrt{d_k}$, which for our 3D embeddings is $\sqrt{3} \approx 1.73$.

Try dragging the slider: at $\tau = 0.1$ the attention map is nearly binary; at $\tau = 5.0$ it's nearly uniform.
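
The same sweep can be reproduced numerically with the sketch above (again using the placeholder embeddings, so the exact values will differ from the interactive heatmap):

```python
for tau in [0.1, 1.0, np.sqrt(3), 5.0]:
    A = temperature_softmax(X @ X.T, tau)
    print(f"tau = {tau:.2f}")
    print(A.round(2))
# Lower tau concentrates each row on its single largest score;
# higher tau pushes every row toward the uniform value 1/4.
```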

Why isn't the attention matrix symmetric? The similarity matrix $X X^\top$ is symmetric ($s_{ij} = s_{ji}$), but the attention matrix is not, because softmax normalizes each row independently. Entry $\text{attention}_{ij}$ depends not just on $s_{ij}$ but on all the scores in row $i$. Different rows have different normalization constants, so even though $s_{ij} = s_{ji}$, we generally have $\text{attention}_{ij} \neq \text{attention}_{ji}$.
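
A quick numerical check of this, reusing the sketches above:

```python
S = X @ X.T
A = temperature_softmax(S)

print(np.allclose(S, S.T))   # True: the raw similarities are symmetric
print(np.allclose(A, A.T))   # False: row-wise normalization breaks the symmetry
```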

This asymmetry is a fundamental feature of attention: each token gets its own perspective on the sequence. Note that the specific patterns here have no semantic significance — they're an artifact of our hand-chosen embeddings in a 3-dimensional space. In a trained model, the embeddings and separate $Q$/$K$ projections are learned so that attention patterns capture meaningful relationships.


3. Self-Attention: Q, K, V Projections

In Sections 1–2, each token's embedding served as both the question and the answer. Self-attention introduces learned projections so that tokens can present different information depending on their role — asking, being searched, or providing content.

Self-attention computational flow diagram

Self-attention with $Q$, $K$, $V$ projections. This is called self-attention because queries, keys, and values all come from the same input:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

In cross-attention (which appears in encoder–decoder models), $K$ and $V$ come from a different sequence — but the mechanics are otherwise identical.

Why separate projections? In Section 1, the raw embedding $X$ served as both the question and the answer: we computed $X X^\top$ directly. This forces each token to use the same vector for "What am I looking for?" and "What do I advertise to others?" Separate projections decouple these roles:

  • $Q$ (query): "What am I looking for?"
  • $K$ (key): "What do I advertise to others?"
  • $V$ (value): "What information do I provide when attended to?"

The same token can present a completely different face depending on which role it's playing.

Reading the diagram. Follow the flow from left to right:

  1. The input $X$ branches three ways through learned weight matrices $W_Q$, $W_K$, $W_V$ to produce $Q$, $K$, $V$.
  2. $Q$ and $K^\top$ multiply ($Q$ on the left, $K^\top$ below in the diagram) to produce raw scores, which are divided by $\sqrt{d_k}$ (the same temperature scaling from Section 2, with $\tau = \sqrt{d_k}$).
  3. Softmax converts each row of scores into a probability distribution: the attention weights.
  4. The attention weights multiply $V$, producing the output. Each row of the output is a weighted average of $V$'s rows, where the weights come from how well that token's query matched each key.

What the output means. Token $i$'s output vector isn't its own value — it's a blend of everyone's values, weighted by relevance. A token that strongly attends to one neighbor will have an output close to that neighbor's value vector. A token that spreads attention uniformly gets something like the average of all value vectors. This is how information flows between positions in a transformer.
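
Putting the whole flow together, here is a minimal single-head self-attention sketch. It reuses `temperature_softmax` and the placeholder $X$ from earlier; the weight matrices are random stand-ins for learned parameters, not the notebook's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = d_k = 3   # embedding width and projection width (illustrative)

# Random stand-ins for the learned projection matrices.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

def self_attention(X, W_Q, W_K, W_V):
    # 1. Project the input three ways.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # 2-3. Raw scores Q K^T, softmaxed row-wise with tau = sqrt(d_k).
    weights = temperature_softmax(Q @ K.T, tau=np.sqrt(K.shape[-1]))
    # 4. Each output row is a weighted average of V's rows.
    return weights @ V, weights

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape)   # (4, 3): one blended value vector per token
```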


4. What Projections Buy You

Different $W_Q$ and $W_K$ projections can create fundamentally different attention patterns from the same input. Compare raw attention ($X X^\top$) with projected attention ($Q K^\top$) under several preset projections.


What projections buy you. In Section 1, each token had a single embedding that determined all of its relationships. Projections break this constraint:

  • Identity: $W_Q = W_K = I$ reproduces raw similarity exactly — projections don't have to change anything.
  • Dimension focus: Zeroing out dimensions lets the model attend based on specific features. Focusing on $d_0$ makes "The" and "down" the dominant pair, completely changing the pattern from raw similarity where "cat"–"sat" dominated.
  • Flipped similarity: $W_K = -I$ inverts all relationships. Tokens that were similar now repel. After softmax, each token attends most to its least similar neighbor.
  • Random learned: A trained model discovers $W_Q$, $W_K$ that create useful patterns for the task — patterns that may bear no resemblance to raw similarity.

The key takeaway: the $\otimes$ symbols in the Section 3 diagram aren't just mechanical steps. They're where the model learns what to pay attention to, independently of what information to extract (which is $W_V$'s job).
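
The identity and flipped presets are easy to reproduce with the earlier sketches (again using the placeholder $X$, so the exact patterns will differ from the notebook's):

```python
I3 = np.eye(3)
tau = np.sqrt(3)   # the sqrt(d_k) scaling

raw      = temperature_softmax(X @ X.T, tau)                    # no projections
identity = temperature_softmax((X @ I3) @ (X @ I3).T, tau)      # W_Q = W_K = I
flipped  = temperature_softmax((X @ I3) @ (X @ (-I3)).T, tau)   # W_Q = I, W_K = -I

print(np.allclose(raw, identity))   # True: identity projections change nothing
print(flipped.argmax(axis=1))       # each row now peaks on its least similar key
```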


Key Takeaways

  1. Similarity is a dot product. The dot product $\mathbf{x}_i \cdot \mathbf{x}_j = \|\mathbf{x}_i\|\,\|\mathbf{x}_j\|\cos\theta_{ij}$ measures how aligned two embeddings are. Arranging all pairwise dot products gives the similarity matrix $XX^\top$ (Section 1).
  2. Softmax turns scores into attention weights. Each row of raw scores maps onto the probability simplex via softmax, producing a distribution over keys for each query. Temperature $\tau$ controls sharpness: low $\tau$ concentrates attention, high $\tau$ spreads it uniformly, and $\tau = \sqrt{d_k}$ is the standard scaling that keeps gradients healthy in high dimensions (Section 2).
  3. Self-attention decouples roles with $Q$, $K$, $V$ projections. Rather than using the same embedding as both question and answer, learned weight matrices $W_Q$, $W_K$, $W_V$ let each token present different faces depending on its role — what to look for, what to advertise, and what to provide (Section 3).
  4. Projections are where the model learns what to attend to. Different $W_Q$, $W_K$ can focus on specific features, flip similarity, or create patterns impossible with raw dot products. The output for each token is a weighted blend of all value vectors, with weights determined by query–key compatibility (Section 4).
  5. The entire operation is differentiable — just matrix multiplies and softmax — which is why transformers can be trained end-to-end with gradient descent.