Assembles multi-head attention and a feed-forward network into the repeating unit of every transformer, and shows how information flows through the block as a residual stream. Uses $d_{\text{model}} = 4$, $d_{\text{ff}} = 8$ (2× expansion), $h = 2$ heads, identity attention weights $W_Q = W_K = W_V = W_O = I$.
Summary: Three standard components — layer normalization, residual connections, and a feed-forward network — appear in every transformer block. This section briefly reviews each; the next section examines what is distinctive about their combination.
Layer Normalization. LayerNorm normalizes each token's embedding vector to zero mean and unit variance, independently for each token. Taking $x$ to be one row vector in $X$, we normalize using the sample mean and standard deviation of the components of $x$: $$\mathrm{LN}(x) = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{1}{d}\sum_i x_i, \quad \sigma = \sqrt{\frac{1}{d}\sum_i (x_i{-}\mu)^2 + \varepsilon}.$$ LayerNorm operates on one row of $X$ at a time and does not mix information across tokens. Learnable per-dimension scale $\gamma$ and shift $\beta$ are omitted here (set to $\gamma = 1$, $\beta = 0$). The role of LayerNorm is training stability: deep networks with residual connections can accumulate large or shrinking activations; LayerNorm keeps the input to each component in a predictable range regardless of depth.
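The formula above can be checked in a few lines of NumPy. This is a minimal sketch of per-token LayerNorm (with $\gamma = 1$, $\beta = 0$ as in this section); the example matrix is illustrative, not the demo's $X$:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    # Normalize each row (token) independently: zero mean, unit variance.
    # No mixing across rows, matching the per-token definition above.
    mu = X.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((X - mu) ** 2).mean(axis=-1, keepdims=True) + eps)
    return (X - mu) / sigma

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.5, -0.5, 2.0, 0.0]])
Y = layer_norm(X)
# Each row of Y now has zero mean and (up to eps) unit variance.
```

Because the mean and variance are computed along the last axis only, the same code works unchanged for a batch of sequences.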
Residual connections. Rather than directly replacing $x$ with an output $f(x)$, a residual block computes $x + f(x)$. At initialization, with $f$ near zero (small random weights), the block is close to the identity — the input passes through nearly unchanged. This makes it easy to train deep stacks: gradients flow back through the skip connection directly, bypassing the residual block entirely, so the network can always fall back to "do nothing here."
Feed-forward network (FFN). The FFN re-maps each token's representation through a higher-dimensional intermediate space, allowing nonlinear feature combinations. Using the same weights for every token, it operates on one row of $X$ at a time and cannot mix information across tokens: $$\mathrm{FFN}(x) = \mathrm{ReLU}(x\,W_1 + b_1)\,W_2 + b_2.$$ $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ expands to a wider hidden layer, then $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ compresses back. The typical expansion factor is $4\times$ ($d_{\text{ff}} = 4\,d_{\text{model}}$), though variants use $\tfrac{8}{3}\times$, $8\times$, or gated architectures (SwiGLU, GeGLU). Here we use $2\times$ expansion ($d_{\text{ff}} = 8$) to keep matrices hand-traceable while preserving the expand–activate–compress structure. The FFN is the only part of the standard transformer block that applies a learned nonlinearity to the features within a single token's representation. (In the ML literature this is called a position-wise FFN: the same weights are used at every position, with no mixing across positions.)
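The expand–activate–compress structure is easy to express directly. A sketch of the position-wise FFN with the demo's shapes ($d_{\text{model}} = 4$, $d_{\text{ff}} = 8$); the weights here come from an arbitrary seed for illustration, not the demo's actual values:

```python
import numpy as np

d_model, d_ff = 4, 8            # 2x expansion, as in this demo
rng = np.random.default_rng(0)  # illustrative weights, not the demo's
W1 = rng.normal(size=(d_model, d_ff)) * 0.5   # expand
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.5   # compress
b2 = np.zeros(d_model)

def ffn(X):
    # Position-wise: the same weights apply to every row (token) of X.
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

X = rng.normal(size=(3, d_model))  # 3 tokens
out = ffn(X)

# No cross-token mixing: applying the FFN one row at a time
# gives exactly the same result as applying it to the whole matrix.
row_wise = np.stack([ffn(x) for x in X])
```

The final comparison is the point of "position-wise": each token's output depends only on that token's own row.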
Summary: Attention mixes information across tokens; the FFN transforms each token independently. These two complementary operations, each wrapped in a residual connection, form one complete transformer block.
Step through each stage to trace data through the block. The colorscale is fixed to the maximum $|$value$|$ across all stages for comparison.
Two complementary operations.
Pre-norm formulation. In the modern convention (used in GPT, LLaMA, and most current architectures), LayerNorm is applied to the input of each component before the computation. Two separate LayerNorm instances ($\mathrm{LN}_1$ and $\mathrm{LN}_2$, each with independent parameters) normalize before attention and before the FFN respectively: $$X_1 = X + \mathrm{Attn}\!\left(\mathrm{LN}_1(X)\right), \qquad X_2 = X_1 + \mathrm{FFN}\!\left(\mathrm{LN}_2(X_1)\right).$$ Notice that $\mathrm{LN}_1(X)$ is the input to the attention computation, but the residual adds back the original $X$. Hence, the residual structure is really applied to $\mathrm{Attn}\circ\mathrm{LN}_1$. The original 2017 paper used post-norm (LayerNorm after the residual); pre-norm is the modern standard.
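The two pre-norm equations translate directly into code. This sketch uses a single-head attention with identity $W_Q = W_K = W_V$ as a simplified stand-in for the demo's two-head setup, and illustrative FFN weights (the seed and 0.5 scaling follow the demo's description, though the exact generator may differ):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(X.var(axis=-1, keepdims=True) + eps)

def attention(X):
    # Single-head self-attention with identity weight matrices:
    # scores are scaled dot products of the rows of X with themselves.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def ffn(X, W1, b1, W2, b2):
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(42)
d_model, d_ff, n_tokens = 4, 8, 4
W1 = rng.normal(size=(d_model, d_ff)) * 0.5
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.5
b2 = np.zeros(d_model)

X = rng.normal(size=(n_tokens, d_model))
X1 = X + attention(layer_norm(X))               # X1 = X + Attn(LN1(X))
X2 = X1 + ffn(layer_norm(X1), W1, b1, W2, b2)   # X2 = X1 + FFN(LN2(X1))
```

Note that `layer_norm` feeds the sub-block while the residual add uses the un-normalized stream, exactly as in the equations: LN is inside the composition, not on the skip path.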
This demo. We use $W_Q = W_K = W_V = W_O = I$, so the multi-head attention step is exactly the identity-weight column partition from Notebook 2: Head 1 attends in the $d_0, d_1$ subspace and Head 2 in $d_2, d_3$. FFN weights are random ($\mathtt{seed}=42$, scaled by $0.5$). The goal is to show the structure of the computation, not a trained result.
Summary: The Attention and FFN sub-blocks each add a perturbation to a shared stream. The full block output $X_2 = X + \Delta A + \Delta F$ is decomposed into three contributions in the figures below.
Residual stream: $X$ → $X_1 = X + \Delta A$ → $X_2 = X_1 + \Delta F$
Contributions (separate colorscale from stream row above): $\Delta A = \mathrm{Attn}(\mathrm{LN}_1(X))$ and $\Delta F = \mathrm{FFN}(\mathrm{LN}_2(X_1))$
The residual stream. One way to think about the steps in a transformer block is as a stream of representations. Each sub-block reads from the stream (via LN), computes a correction, and adds it back. The stream starts as $X$; attention contributes $\Delta A$; the FFN contributes $\Delta F$. The output is simply $$X_2 = X + \Delta A + \Delta F.$$ No contribution is ever erased: each update accumulates additively in the stream, and later layers process the growing sum — they see the combined total, not the individual contributions separately. The top row shows the stream at three snapshots; the middle row shows what each sub-block contributes. The two rows use separate colorscales: the stream scale spans $X$ through $X_2$ (which grows as contributions accumulate); the contribution scale spans $\Delta A$ and $\Delta F$ on the same range for direct comparison.
Perturbations, not replacements. The bottom row of the figure is a norm chart showing per-token $\ell^2$ norms of $X$ (blue), $\Delta A$ (orange), and $\Delta F$ (green). In this demo (identity attention, random FFN) the deltas are comparable in size to $X$. Notice that $\Delta A$ is nearly the same magnitude for all four tokens, while $\Delta F$ varies considerably — attention combines information from all the tokens, while the FFN reacts to each token's own content separately. In a trained model, weights are typically initialized to produce small corrections, and training adjusts them so that each sub-block contributes something useful. The additive structure means any sub-block can learn to "do nothing" by driving its output toward zero.
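The additive bookkeeping and the per-token norms in the bottom row can be reproduced in a few lines. The arrays below are random stand-ins for $X$, $\Delta A$, and $\Delta F$, not the demo's actual values:

```python
import numpy as np

rng = np.random.default_rng(7)   # arbitrary seed; stand-in values only
X  = rng.normal(size=(4, 4))     # 4 tokens, d_model = 4
dA = rng.normal(size=(4, 4))     # attention contribution
dF = rng.normal(size=(4, 4))     # FFN contribution

X1 = X + dA
X2 = X1 + dF
# The stream is an exact sum: nothing is erased.
assert np.allclose(X2, X + dA + dF)

# Per-token l2 norms, as in the norm chart (one value per token).
norms = {name: np.linalg.norm(M, axis=-1)
         for name, M in [("X", X), ("dA", dA), ("dF", dF)]}
```

Taking the norm along `axis=-1` gives one number per row, i.e. one per token, which is what the bottom row of the figure plots.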
Extending the stream. A full transformer stacks $N$ identical Attn/FFN pairs. The $X_2$ output of one block becomes the input $X$ of the next, so the residual stream accumulates contributions from every pair: after $N$ blocks, the output is the original input plus $2N$ corrections ($N$ attention deltas and $N$ FFN deltas). Depth adds capacity not by transforming representations wholesale but by allowing more correction passes: each block can route information differently and transform each token further given the context built up by earlier blocks.
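Stacking is just iterating the two residual adds. A schematic sketch, with toy placeholder functions standing in for trained attention and FFN sub-blocks (the stand-ins are assumptions for illustration only):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(X.var(axis=-1, keepdims=True) + eps)

def transformer_stack(X, attn, ffn, n_blocks):
    # Each block adds two corrections to the shared residual stream.
    deltas = []
    for _ in range(n_blocks):
        dA = attn(layer_norm(X)); X = X + dA; deltas.append(dA)
        dF = ffn(layer_norm(X));  X = X + dF; deltas.append(dF)
    return X, deltas

rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 4))

# Toy stand-ins: a token-mixing "attention" and a per-token "FFN".
toy_attn = lambda Z: 0.1 * np.broadcast_to(Z.mean(axis=0), Z.shape)
W = rng.normal(size=(4, 4)) * 0.1
toy_ffn = lambda Z: np.maximum(Z, 0) @ W

X_out, deltas = transformer_stack(X0, toy_attn, toy_ffn, n_blocks=3)
# The output equals the input plus all 2N accumulated corrections.
assert np.allclose(X_out, X0 + sum(deltas))
```

The final assertion makes the bookkeeping explicit: depth only ever adds terms to the sum, which is why later blocks see the combined total rather than any one contribution.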