Notebook 3: The Transformer Block

Assembles multi-head attention and a feed-forward network into the repeating unit of every transformer, and shows how information flows through the block as a residual stream. Uses $d_{\text{model}} = 4$, $d_{\text{ff}} = 8$ (2× expansion), $h = 2$ heads, identity attention weights $W_Q = W_K = W_V = W_O = I$.


1. Building Blocks

Summary: Three standard components — layer normalization, residual connections, and a feed-forward network — appear in every transformer block. This section briefly reviews each; the next section examines what is distinctive about their combination.

Layer Normalization. LayerNorm normalizes each token's embedding vector to zero mean and unit variance, independently for each token. Taking $x$ to be one row vector in $X$, we normalize using the sample mean and standard deviation of the components of $x$: $$\mathrm{LN}(x) = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{1}{d}\sum_i x_i, \quad \sigma = \sqrt{\frac{1}{d}\sum_i (x_i{-}\mu)^2 + \varepsilon}.$$ LayerNorm operates on one row of $X$ at a time and does not mix information across tokens. Learnable per-dimension scale $\gamma$ and shift $\beta$ are omitted here (set to $\gamma = 1$, $\beta = 0$). The role of LayerNorm is training stability: deep networks with residual connections can accumulate large or shrinking activations; LayerNorm keeps the input to each component in a predictable range regardless of depth.
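The formula above can be sketched directly in NumPy (the $\varepsilon$ value here is a common default, not specified in the text):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance.

    Learnable gamma/beta are omitted (gamma = 1, beta = 0), as in the text.
    Operates row-wise: no information mixes across tokens.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma
```

Because the mean and variance are taken over the last axis, each token's vector is normalized independently of every other token.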

Residual connections. Rather than directly replacing $x$ with an output $f(x)$, a residual block computes $x + f(x)$. At initialization, with $f$ near zero (small random weights), the block is close to the identity — the input passes through nearly unchanged. This makes it easy to train deep stacks: gradients flow back through the skip connection directly, bypassing the residual block entirely, so the network can always fall back to "do nothing here."
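The near-identity behavior at initialization is easy to check numerically (the scale $10^{-4}$ is an illustrative stand-in for "small random weights"):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4)) * 1e-4   # small init => f(x) close to zero

def f(x):
    # stand-in residual branch: a single small linear map
    return x @ W

out = X + f(X)   # residual block: input plus a tiny correction
```

With `W` this small, `out` is numerically indistinguishable from `X`: the block starts as (almost) the identity, and training only has to learn the correction.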

Feed-forward network (FFN). The FFN re-maps each token's representation through a higher-dimensional intermediate space, allowing nonlinear feature combinations. Using the same weights for every token, it operates on one row of $X$ at a time and cannot mix information across tokens: $$\mathrm{FFN}(x) = \mathrm{ReLU}(x\,W_1 + b_1)\,W_2 + b_2.$$ $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ expands to a wider hidden layer, then $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ compresses back. The typical expansion factor is $4\times$ ($d_{\text{ff}} = 4\,d_{\text{model}}$), though variants use $\tfrac{8}{3}\times$, $8\times$, or gated architectures (SwiGLU, GeGLU). Here we use $2\times$ expansion ($d_{\text{ff}} = 8$) to keep matrices hand-traceable while preserving the expand–activate–compress structure. The FFN is the only part of the standard transformer block that applies a learned nonlinearity to the features within a single token's representation. (In the ML literature this is called a position-wise FFN: the same weights are used at every position, with no mixing across positions.)
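A minimal sketch of the position-wise FFN with the notebook's shapes ($d_{\text{model}} = 4$, $d_{\text{ff}} = 8$); the weights here use an illustrative seed, not the demo's $\mathtt{seed}=42$:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, compress to d_model.

    The same weights are applied to every row, so rows (tokens) never mix.
    """
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)      # illustrative seed (demo uses seed=42)
d_model, d_ff = 4, 8
W1 = rng.normal(size=(d_model, d_ff)) * 0.5
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.5
b2 = np.zeros(d_model)

X = rng.normal(size=(4, d_model))   # 4 tokens
Y = ffn(X, W1, b1, W2, b2)
```

Feeding a single row through the FFN gives exactly the corresponding row of `Y`, confirming that the computation is per-token.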


2. The Transformer Block

Summary: Attention mixes information across tokens; the FFN transforms each token independently. These two complementary operations, each wrapped in a residual connection, form one complete transformer block.

[Interactive figure: transformer block data-flow diagram. Step through each stage to trace data through the block; the colorscale is fixed to the max |value| across all stages for comparison.]

Two complementary operations.

  • From Notebook 1, we know that attention is a weighted sum: each output token is a linear combination of the value vectors of all input tokens. This mixes information across positions but applies no nonlinearity within a token.
  • The FFN, by contrast, applies a nonlinear transformation to each token independently — it processes each row of the matrix in isolation and cannot combine information from different positions.
  • Together they cover complementary ground: attention routes and aggregates context across the sequence; the FFN re-encodes each token's representation nonlinearly.

Pre-norm formulation. In the modern convention (used in GPT, LLaMA, and most current architectures), LayerNorm is applied to the input of each component before the computation. Two separate LayerNorm instances ($\mathrm{LN}_1$ and $\mathrm{LN}_2$, each with independent parameters) normalize before attention and before the FFN respectively: $$X_1 = X + \mathrm{Attn}\!\left(\mathrm{LN}_1(X)\right), \qquad X_2 = X_1 + \mathrm{FFN}\!\left(\mathrm{LN}_2(X_1)\right).$$ Notice that $\mathrm{LN}_1(X)$ is the input to the attention computation, but the residual adds back the original $X$. Hence, the residual structure is really applied to $\mathrm{Attn}\circ\mathrm{LN}_1$. The original 2017 paper used post-norm (LayerNorm after the residual); pre-norm is the modern standard.
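The two pre-norm equations can be assembled into a compact NumPy sketch, using the demo's identity-weight two-head attention (each head works on its own column slice) and illustrative FFN weights. Since $\gamma$ and $\beta$ are omitted, $\mathrm{LN}_1$ and $\mathrm{LN}_2$ reduce to the same parameterless function here:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def identity_mha(X, n_heads=2):
    """Multi-head attention with W_Q = W_K = W_V = W_O = I:
    each head attends within its own column slice of X."""
    d_h = X.shape[-1] // n_heads
    outs = []
    for h in range(n_heads):
        Xh = X[:, h * d_h:(h + 1) * d_h]
        A = softmax(Xh @ Xh.T / np.sqrt(d_h))   # per-head attention weights
        outs.append(A @ Xh)
    return np.concatenate(outs, axis=-1)

def ffn(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def block(X, params):
    """One pre-norm transformer block: X -> X1 -> X2."""
    W1, b1, W2, b2 = params
    X1 = X + identity_mha(layer_norm(X))            # X1 = X + Attn(LN1(X))
    X2 = X1 + ffn(layer_norm(X1), W1, b1, W2, b2)   # X2 = X1 + FFN(LN2(X1))
    return X2
```

Note the key structural point from the text: `layer_norm` is applied only to the input of each sub-computation, while the residual adds back the un-normalized stream.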

This demo. We use $W_Q = W_K = W_V = W_O = I$, so the multi-head attention step is exactly the identity-weight column partition from Notebook 2: Head 1 attends in the $d_0, d_1$ subspace and Head 2 in $d_2, d_3$. FFN weights are random ($\mathtt{seed}=42$, scaled by $0.5$). The goal is to show the structure of the computation, not a trained result.


3. The Residual Stream

Summary: The Attention and FFN sub-blocks each add a perturbation to a shared stream. The full block output $X_2 = X + \Delta A + \Delta F$ is decomposed into three contributions in the figures below.

[Figure: residual stream $X \;\to\; X_1 = X + \Delta A \;\to\; X_2 = X_1 + \Delta F$. Contributions $\Delta A = \mathrm{Attn}(\mathrm{LN}_1(X))$ and $\Delta F = \mathrm{FFN}(\mathrm{LN}_2(X_1))$ are shown on a separate colorscale from the stream row above.]

The residual stream. One way to think about a transformer block is as maintaining a stream of representations. Each sub-block reads from the stream (via LN), computes a correction, and adds it back. The stream starts as $X$; attention contributes $\Delta A$; the FFN contributes $\Delta F$. The output is simply $$X_2 = X + \Delta A + \Delta F.$$ No contribution is ever erased: each update accumulates additively in the stream, and later layers process the growing sum — they see the combined total, not the individual contributions separately. The top row shows the stream at three snapshots; the middle row shows what each sub-block contributes. The two rows use separate colorscales: the stream scale spans $X$ through $X_2$ (which grows as contributions accumulate); the contribution scale spans $\Delta A$ and $\Delta F$ on the same range for direct comparison.

Perturbations, not replacements. The bottom row of the figure is a norm chart showing per-token $\ell^2$ norms of $X$ (blue), $\Delta A$ (orange), and $\Delta F$ (green). In this demo (identity attention, random FFN) the deltas are comparable in size to $X$. Notice that $\Delta A$ is nearly the same magnitude for all four tokens, while $\Delta F$ varies considerably — attention combines information from all the tokens, while the FFN reacts to each token's own content separately. In a trained model, weights are typically initialized to produce small corrections, and training adjusts them so that each sub-block contributes something useful. The additive structure means any sub-block can learn to "do nothing" by driving its output toward zero.

Extending the stream. A full transformer stacks $N$ structurally identical Attn/FFN pairs, each with its own weights. The $X_2$ output of one pair becomes the input $X$ of the next. The residual stream accumulates contributions from every pair: after $N$ pairs, the output is the original input plus all the accumulated corrections, $X_{\text{out}} = X + \sum_{\ell=1}^{N} \left(\Delta A_\ell + \Delta F_\ell\right)$. Depth adds capacity not by transforming representations wholesale but by allowing more correction passes — each block can route information differently and transform each token further given the context built up by earlier blocks.
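The accumulation across a stack can be sketched abstractly. The updates below are placeholder linear maps standing in for real Attn and FFN sub-blocks; the point is only the bookkeeping — the final stream equals the input plus the sum of every correction:

```python
import numpy as np

def run_stream(X, updates):
    """Apply sub-block updates in order; each reads the current stream
    and adds its correction. Returns the final stream and all deltas."""
    deltas = []
    for f in updates:
        d = f(X)          # correction computed from the current stream
        deltas.append(d)
        X = X + d         # accumulate additively; nothing is erased
    return X, deltas

rng = np.random.default_rng(0)
X0 = rng.normal(size=(4, 4))
# stand-ins for N = 3 blocks (one Attn + one FFN update each: 6 updates)
Ws = [rng.normal(size=(4, 4)) * 0.1 for _ in range(6)]
updates = [lambda X, W=W: X @ W for W in Ws]

X_out, deltas = run_stream(X0, updates)
```

Each correction depends on the stream as it stands when that sub-block runs, so later updates see (and build on) everything contributed before them.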


Key Takeaways

  1. Attention and FFN are complementary. Attention mixes information across tokens (a weighted sum over all tokens); the FFN transforms each token independently using a learned nonlinearity. Together they handle both inter-token routing and per-token nonlinear transformation.
  2. The block output is an additive sum: $X_2 = X + \Delta A + \Delta F$. Each sub-block adds one correction to the residual stream; no contribution is ever erased. Later blocks process the accumulated sum of all prior updates — not the individual contributions separately.
  3. Pre-norm stabilizes training without disrupting the stream. $X_1 = X + \mathrm{Attn}(\mathrm{LN}_1(X))$ normalizes the Attention input but adds back to the un-normalized residual. This keeps activations in a predictable range as the stack grows deeper.
  4. Residual connections make the correction from each sub-block essentially optional. With small initialized weights, $f \approx 0$ and each block is near the identity. Training adjusts the corrections; any sub-block can learn to pass information through unchanged by driving its output toward zero.
  5. Stacking $N$ blocks gives $N$ rounds of "mix, then transform locally." Each block refines the residual stream: one attention pass aggregates context across positions, one FFN pass re-encodes each token in light of that context. Depth adds capacity by allowing more such refinement passes.

References

  • A. Vaswani et al., "Attention Is All You Need," NeurIPS, 2017. arXiv:1706.03762
  • J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer Normalization," arXiv, 2016. arXiv:1607.06450