Builds up scaled dot-product attention step by step using small matrices you can trace by hand.
Target audience: math / ECE grad students with a linear algebra background.
We compute pairwise dot products between 4 token embeddings to measure how similar they are.
Hover over any cell to see its value.
Words, tokens, and embeddings. In a transformer, input text is split into tokens — usually words or subword pieces. Each token is represented by an embedding: a numerical vector that encodes its meaning. In our example, the words "The", "cat", "sat", "down" are the tokens, and each row of $X$ is that token's 3-dimensional embedding. Real models use hundreds or thousands of dimensions, but the mechanics are identical.
Dot-product similarity. Each token's embedding is a row in $X$ and a column in $X^\top$. Entry $(i, j)$ of the similarity matrix is the dot product of row $i$ of $X$ with column $j$ of $X^\top$:
$$(X X^\top)_{ij} = \mathbf{x}_i \cdot \mathbf{x}_j = \|\mathbf{x}_i\| \, \|\mathbf{x}_j\| \cos\theta_{ij}$$You can trace this visually: pick a row of $X$ on the left and the matching column of $X^\top$, dot them together, and you get that entry of the similarity matrix. The white lines help you track which row feeds into each entry.
Observe: The diagonal is largest — each token is most similar to itself. "cat" and "sat" have high mutual similarity (1.47) because their embeddings both have large values in $d_1$ and $d_2$, pointing in similar directions. "The" and "sat" are slightly anti-correlated (−0.10) because "The" points along $d_0$ while "sat" points away from it.
In attention, we use this to ask: "How relevant is each key to this query?"
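If you want to trace the same computation away from the figure, here is a minimal NumPy sketch. The embedding values are illustrative placeholders chosen to mimic the qualitative pattern described above, not the exact numbers in the figure.

```python
import numpy as np

# Four tokens, each a row of X with a 3-dimensional embedding.
# Illustrative values only (not the exact ones used in the figure).
tokens = ["The", "cat", "sat", "down"]
X = np.array([
    [ 1.0,  0.1,  0.0],   # "The"  : points mostly along d_0
    [ 0.2,  1.0,  0.9],   # "cat"  : large d_1 and d_2
    [-0.1,  0.9,  1.0],   # "sat"  : also large d_1 and d_2
    [ 0.5, -0.2,  1.0],   # "down"
])

# Pairwise similarities: entry (i, j) is the dot product x_i . x_j.
S = X @ X.T
print(np.round(S, 2))   # symmetric 4x4 matrix; the diagonal holds ||x_i||^2
```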
We apply softmax to convert raw dot products into probability distributions, with a temperature parameter that controls sharpness.
From similarity to attention weights. Raw dot products can be any real number, but we need a probability distribution over keys for each query. The softmax function does this:
$$\text{attention}_{ij} = \left[\text{softmax}\!\left(\frac{\mathbf{s}_i}{\tau}\right)\right]_j = \frac{e^{\, s_{ij}/\tau}}{\displaystyle\sum_k e^{\, s_{ik}/\tau}}$$where $\mathbf{s}_i$ is row $i$ of the raw similarity matrix and $\tau$ is the temperature. Softmax acts on each row as a whole, mapping it from $\mathbb{R}^n$ onto the probability simplex $\Delta^{n-1}$ (all entries non-negative, summing to 1).
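As a minimal sketch (reusing the illustrative embeddings from before), the row-wise softmax takes a few lines of NumPy. Subtracting each row's maximum before exponentiating is the standard numerical-stability trick and doesn't change the result, since softmax is invariant to shifting a whole row.

```python
import numpy as np

def softmax_rows(S, tau=1.0):
    """Temperature-scaled softmax applied independently to each row of S."""
    Z = S / tau
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability (softmax is shift-invariant)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# Same illustrative embeddings as in the previous sketch.
X = np.array([[1.0, 0.1, 0.0], [0.2, 1.0, 0.9], [-0.1, 0.9, 1.0], [0.5, -0.2, 1.0]])
S = X @ X.T

A = softmax_rows(S, tau=1.0)
print(A.sum(axis=1))   # [1. 1. 1. 1.] -- each row is a probability distribution over keys
```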
Queries and keys. The axes on the attention heatmap are labeled "Queries" (rows) and "Keys" (columns). Think of the query as the token asking "who should I pay attention to?" and the keys as the candidates being considered. Each row of the attention matrix answers that question for one query token: it's a probability distribution over all the keys, saying how much each key matters to that query. We'll formalize this with separate $Q$/$K$/$V$ projections in Section 3, but the row-vs-column structure is already visible here.
Temperature controls how peaked the distribution is: a small $\tau$ amplifies the differences between scores, concentrating weight on the largest ones, while a large $\tau$ shrinks the differences and flattens each row toward uniform.
Try dragging the slider: at $\tau = 0.1$ the attention map is nearly binary; at $\tau = 5.0$ it's nearly uniform.
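The same sweep in code, with the illustrative embeddings again: at a small $\tau$ the weight piles onto the largest scores, and at a large $\tau$ each row approaches the uniform distribution.

```python
import numpy as np

X = np.array([[1.0, 0.1, 0.0], [0.2, 1.0, 0.9], [-0.1, 0.9, 1.0], [0.5, -0.2, 1.0]])
S = X @ X.T

for tau in (0.1, 1.0, 5.0):
    Z = S / tau
    Z = Z - Z.max(axis=1, keepdims=True)
    A = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    print(f"tau={tau}:", np.round(A[0], 3))   # attention row for "The"
# tau=0.1 -> nearly one-hot; tau=5.0 -> nearly uniform (~0.25 per key)
```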
Why isn't the attention matrix symmetric? The similarity matrix $X X^\top$ is symmetric ($s_{ij} = s_{ji}$), but the attention matrix is not, because softmax normalizes each row independently. Entry $\text{attention}_{ij}$ depends not just on $s_{ij}$ but on all the scores in row $i$. Different rows have different normalization constants, so even though $s_{ij} = s_{ji}$, we generally have $\text{attention}_{ij} \neq \text{attention}_{ji}$.
This asymmetry is a fundamental feature of attention: each token gets its own perspective on the sequence. Note that the specific patterns here have no semantic significance — they're an artifact of our hand-chosen embeddings in a 3-dimensional space. In a trained model, the embeddings and separate $Q$/$K$ projections are learned so that attention patterns capture meaningful relationships.
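A quick numerical check of the asymmetry, with the same illustrative embeddings: the similarity matrix equals its transpose, but the row-normalized attention matrix does not.

```python
import numpy as np

X = np.array([[1.0, 0.1, 0.0], [0.2, 1.0, 0.9], [-0.1, 0.9, 1.0], [0.5, -0.2, 1.0]])
S = X @ X.T

Z = S - S.max(axis=1, keepdims=True)                    # tau = 1
A = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)    # row-wise softmax

print(np.allclose(S, S.T))   # True:  s_ij = s_ji
print(np.allclose(A, A.T))   # False: each row has its own normalization constant
```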
In Sections 1–2, each token's embedding served as both the question and the answer. Self-attention introduces learned projections so that tokens can present different information depending on their role — asking, being searched, or providing content.
Self-attention with $Q$, $K$, $V$ projections. This is called self-attention because queries, keys, and values all come from the same input:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$In cross-attention (which appears in encoder–decoder models), $K$ and $V$ come from a different sequence — but the mechanics are otherwise identical.
Why separate projections? In Section 1, the raw embedding $X$ served as both the question and the answer: we computed $X X^\top$ directly. This forces each token to use the same vector for "What am I looking for?" and "What do I advertise to others?" Separate projections decouple these roles: $W_Q$ shapes the question a token asks, $W_K$ shapes what it advertises to the other tokens, and $W_V$ shapes the content it hands over when it is attended to.
The same token can present a completely different face depending on which role it's playing.
Reading the diagram. Follow the flow from left to right: $X$ is multiplied by $W_Q$, $W_K$, and $W_V$ to produce $Q$, $K$, and $V$; the scores $Q K^\top$ are scaled and passed through a row-wise softmax to give the attention weights; and those weights mix the rows of $V$ into each token's output.
What the output means. Token $i$'s output vector isn't its own value — it's a blend of everyone's values, weighted by relevance. A token that strongly attends to one neighbor will have an output close to that neighbor's value vector. A token that spreads attention uniformly gets something like the average of all value vectors. This is how information flows between positions in a transformer.
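Here is the whole pipeline as a sketch. Random matrices stand in for the learned projections, and the scores are divided by $\sqrt{d_k}$, the standard scaling that gives scaled dot-product attention its name (it plays the same role as the temperature slider above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Input: 4 tokens with 3-dimensional embeddings (illustrative values, as before).
X = np.array([[1.0, 0.1, 0.0], [0.2, 1.0, 0.9], [-0.1, 0.9, 1.0], [0.5, -0.2, 1.0]])
n, d_model = X.shape
d_k = 3                                     # projection dimension

# Learned in a real model; random placeholders here.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # each token gets a query, a key, and a value

scores = Q @ K.T / np.sqrt(d_k)             # "how relevant is key j to query i?"
scores = scores - scores.max(axis=1, keepdims=True)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax

output = A @ V                              # each output row is a relevance-weighted blend of values
print(output.shape)                         # (4, 3): one mixed vector per token
```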
Different $W_Q$ and $W_K$ projections can create fundamentally different attention patterns from the same input. Compare raw attention ($X X^\top$) with projected attention ($Q K^\top$) under several preset projections.
What projections buy you. In Section 1, each token had a single embedding that determined all of its relationships. Projections break this constraint: with learned $W_Q$ and $W_K$, the same input $X$ can produce fundamentally different attention patterns, because what a token looks for is no longer tied to its raw similarity with the other embeddings.
The key takeaway: the $\otimes$ symbols in the Section 3 diagram aren't just mechanical steps. They're where the model learns what to pay attention to, independently of what information to extract (which is $W_V$'s job).
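To reproduce the comparison in code, compute both the raw pattern $\text{softmax}(X X^\top)$ and the projected pattern $\text{softmax}(Q K^\top / \sqrt{d_k})$ for a couple of projection pairs. Random matrices stand in for the hand-designed presets; the point is only that different $W_Q$, $W_K$ give different patterns from the same $X$.

```python
import numpy as np

def attention_weights(scores):
    """Row-wise softmax of a score matrix."""
    Z = scores - scores.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.1, 0.0], [0.2, 1.0, 0.9], [-0.1, 0.9, 1.0], [0.5, -0.2, 1.0]])
d_k = X.shape[1]

print(np.round(attention_weights(X @ X.T), 2))          # raw attention: fixed by the embeddings

rng = np.random.default_rng(1)
for trial in range(2):                                   # two different projection pairs, same X
    W_Q = rng.normal(size=(d_k, d_k))
    W_K = rng.normal(size=(d_k, d_k))
    proj = attention_weights((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k))
    print(np.round(proj, 2))                             # a different pattern each time
```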