Notebook 5: SWIN Transformer I

Extends the Vision Transformer of Notebook 4 to larger images by restricting attention to local spatial windows. Uses a 32×32 toy image with 4×4 patches, giving an 8×8 patch grid (64 patches, $d=6$, window size $W=4$, four windows of 16 patches each). Introduces windowed attention, shifted windows, and the cyclic-shift trick. Hierarchy, patch merging, and relative position bias are covered in Notebook 6.


1. Terminology

Summary: SWIN inherits the vocabulary of ViT and adds a new spatial grouping concept. This section pins down three terms — patch, patch grid, and window — that recur throughout the notebook.

Patch and token. Recall from Notebook 4 that a patch is a small rectangular tile of pixels extracted from the input image. Each patch is flattened and linearly projected to produce a $d$-dimensional embedding vector, which enters the transformer as one element of the input sequence. That embedding vector is the token for that patch. The two words are used interchangeably in the literature — you will see "patch token," "image token," and simply "patch" or "token" all referring to the same object. This notebook follows the same convention: patch and token mean the same thing; patch token is used when the context calls for extra clarity.

Patch grid. After extracting all patches, we can label each one by its spatial location $(r, c)$ — row and column in the original image. This two-dimensional arrangement is the patch grid, with dimensions $H_{\mathrm{grid}} \times W_{\mathrm{grid}}$ (rows × columns of patches). For example, a $48\times32$ image with $4\times4$ patches gives a $12\times8$ patch grid: 12 patch rows, 8 patch columns, 96 tokens in total. The patch grid is a spatial index, not a learned structure; it simply records where each token came from.
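The bookkeeping above can be sketched in a few lines of numpy (a self-contained toy sketch; the dimensions follow the $48\times32$ example, and all variable names are illustrative):

```python
import numpy as np

# 48x32 image with 4x4 patches -> 12x8 patch grid, 96 tokens.
H, W_img, P = 48, 32, 4
image = np.arange(H * W_img, dtype=np.float32).reshape(H, W_img)

# Cut into non-overlapping PxP patches, indexed by grid position (r, c).
H_grid, W_grid = H // P, W_img // P                           # 12, 8
patches = image.reshape(H_grid, P, W_grid, P).swapaxes(1, 2)  # (12, 8, 4, 4)
tokens = patches.reshape(H_grid * W_grid, P * P)              # (96, 16)

# patches[r, c] is the tile whose top-left pixel is image[r*P, c*P];
# flattened token i came from grid position (i // W_grid, i % W_grid).
print(tokens.shape)   # (96, 16)
```

In a real model each 16-pixel patch vector would then be linearly projected to a $d$-dimensional embedding; here we only build the spatial index.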

Window (SWIN sense). A window in SWIN is a contiguous $W\times W$ block of patch tokens on the patch grid — a local spatial neighborhood used as the scope of one attention computation. For example, with $W=4$ the top-left window contains the 16 tokens at grid positions $(r,c)$ with $r,c\in\{0,1,2,3\}$. This is different from the signal-processing sense of a sliding window, which moves continuously (or in unit steps) over a 1-D or 2-D signal to compute local statistics. SWIN windows do not slide or overlap: the grid is partitioned into $\frac{H_{\mathrm{grid}}}{W}\times\frac{W_{\mathrm{grid}}}{W}$ non-overlapping windows that tile it exactly, so each patch belongs to exactly one window at any given transformer block. Cross-window interaction is introduced by alternating the window partition between blocks (Section 4), not by making windows overlap.


2. The Quadratic Cost Problem

Summary: Full self-attention requires an $N\times N$ matrix for $N$ tokens. For high-resolution images $N$ can reach thousands, making global attention prohibitively expensive. SWIN replaces it with local window attention at cost $O(N \cdot W^2)$ — linear in $N$.

The cost of full attention. Notebook 4 showed that a 224×224 image with $8\times8$ patches produces $N = 784$ patch tokens, requiring a $784\times 784 \approx 614{,}000$-entry attention matrix per head per block. Scaling to finer patches quickly becomes impractical: with $4\times4$ patches the same image gives $N = 3136$ tokens and $N^2 \approx 9.8\text{ million}$ entries, just for one attention matrix.

Window attention: same machinery, smaller scope. SWIN's key insight is to run the exact same scaled dot-product attention from Notebooks 1–3 — $Q, K, V$ projections, softmax, weighted sum — but only within local $W\times W$ windows of the patch grid, rather than across the entire sequence. For $\frac{N}{W^2}$ windows each containing $W^2$ tokens, the total attention cost is $$\frac{N}{W^2} \times (W^2)^2 = N \cdot W^2,$$ which scales linearly in $N$. With $W=7$ and $N=3136$, this is $3136 \times 49 \approx 154{,}000$ entries — a $64\times$ reduction from the $9.8$ million required by full attention. The attention mechanism itself is unchanged; only its scope is restricted.
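The cost comparison is plain arithmetic; this snippet reproduces the numbers quoted above:

```python
# Attention-matrix entries for a 224x224 image with 4x4 patches.
N = (224 // 4) ** 2                          # 3136 tokens
W = 7                                        # window size in patches

full_entries = N * N                         # one global N x N matrix
window_entries = (N // W**2) * (W**2) ** 2   # = N * W**2

print(full_entries)                    # 9834496  (~9.8 million)
print(window_entries)                  # 153664   (~154,000)
print(full_entries // window_entries)  # 64x reduction
```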


3. Window Partitioning

Summary: The patch grid is divided into non-overlapping $W\times W$ windows. Each window runs its own independent multi-head self-attention (W-MSA); patch tokens attend only to other tokens in the same window.

The toy example. Our 32×32 image is divided into 4×4-pixel patches, giving an 8×8 patch grid of 64 tokens. With window size $W=4$, the grid partitions into a 2×2 arrangement of four windows, each containing $W^2 = 16$ tokens. The attention computation within each window is a $16\times16$ matrix — identical to the attention blocks of Notebooks 1–3, just applied to 16 tokens instead of 4.

Window selector: Window 0 (top-left) · Window 1 (top-right) · Window 2 (bottom-left) · Window 3 (bottom-right)

Regular window partition: 4 windows of $4\times4 = 16$ patches each. Thick lines mark window boundaries. Intensity within each color encodes the mean patch brightness (darker = dimmer patch). Each window's patches use a distinct pair of spatial Fourier modes, giving each window a genuinely different attention structure.

Attention weight matrix within the selected window (16×16). Tokens are labeled by local position (row, col) within the window.

Independent attention per window. Each of the four windows runs the same W-MSA computation: project to $Q,K,V$, compute $16\times16$ attention scores, softmax, weighted sum of values. The four computations are independent — they can run in parallel, and a patch in window $0$ has no direct interaction with any patch in windows 1, 2, or 3. In practice, all windows in a layer share the same $W_Q, W_K, W_V$ projection weights — they are layer parameters, not window parameters. Toggle the window selector above to verify: the four attention matrices have visibly different structure because each window's patches carry different spatial frequency content (encoded via distinct Fourier modes in this toy example), even though every window uses the same projections. The W-MSA abbreviation (Window Multi-head Self-Attention) is used throughout the SWIN paper and hereafter.
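A minimal numpy sketch of W-MSA on the toy $8\times8$ grid (random data and randomly initialized projections; the real implementation uses batched torch tensors, learned weights, and multiple heads, and the helper names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Hg = Wg = 8                  # 8x8 patch grid
d, W = 6, 4                  # embedding dim and window size from the toy example

x = rng.standard_normal((Hg, Wg, d))          # one token per grid position

def window_partition(x, W):
    """Split an (Hg, Wg, d) grid into (num_windows, W*W, d) windows."""
    Hg, Wg, d = x.shape
    x = x.reshape(Hg // W, W, Wg // W, W, d).swapaxes(1, 2)
    return x.reshape(-1, W * W, d)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared layer parameters: every window uses the same projections.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

windows = window_partition(x, W)              # (4, 16, 6): 4 windows of 16 tokens
q, k, v = windows @ Wq, windows @ Wk, windows @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (4, 16, 16) per window
out = attn @ v                                # (4, 16, 6) attended tokens
```

The four $16\times16$ attention matrices are computed in one batched operation and never mix tokens across windows.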


4. The Locality Problem and Shifted Windows

Summary: Fixed-location windows prevent attention between patches in different windows, regardless of depth. SWIN restores cross-window information flow by shifting the window grid by $(W/2,\,W/2)$ patches in alternating transformer blocks.

The isolation problem. With the regular window partition, patches at the boundary between two windows are spatially adjacent but can never interact directly — not even after many transformer blocks, because each block applies the same fixed partition. In the 8×8 toy grid with $W=4$: the patch at grid position $(3,3)$ (bottom-right of window $0$) is immediately adjacent to the patch at $(3,4)$ (bottom-left of window 1), yet they never attend to each other.

Block $\ell$: W-MSA
Regular partition — 4 complete windows, labeled $0$–$3$. Patches at adjacent window boundaries never interact.

Block $\ell+1$: SW-MSA
Shifted by $(2,2)$ — labels show the new (shifted) window each region belongs to. Colors show the original regular window each patch came from.

Shifted-window attention (SW-MSA). SWIN alternates between two window configurations in successive transformer blocks. Block $\ell$ uses the regular partition (W-MSA). Block $\ell+1$ shifts the window grid by $(\lfloor W/2\rfloor,\, \lfloor W/2\rfloor) = (2, 2)$ patches before partitioning (SW-MSA). In the right panel, the large number in each region is its shifted window index: notice that the four corner regions all land in shifted window $0$, the four edge strips in windows $1$ and $2$, and the centre block in window $3$. Each shifted window therefore contains patches from multiple original windows, giving them the opportunity to attend to one another. After two blocks — one regular, one shifted — every patch has had the opportunity to interact with patches up to $2W$ positions away in each direction.
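The shifted window index of each patch can be computed directly. This numpy sketch follows the $(+2,+2)$ shift convention used in this notebook (illustrative code, not the official implementation) and checks that the boundary-adjacent pair from the isolation example now shares a window:

```python
import numpy as np

Hg = Wg = 8                         # 8x8 patch grid
W, s = 4, 2                         # window size, shift s = W // 2

r, c = np.meshgrid(np.arange(Hg), np.arange(Wg), indexing="ij")

# Block l: regular partition.
regular = (r // W) * (Wg // W) + (c // W)
# Block l+1: shift the grid by (s, s) with wrap-around, then partition.
shifted = (((r + s) % Hg) // W) * (Wg // W) + ((c + s) % Wg) // W

# The adjacent pair (3,3)/(3,4) was split across regular windows 0 and 1,
# but both land in shifted window 3.
print(regular[3, 3], regular[3, 4])   # 0 1
print(shifted[3, 3], shifted[3, 4])   # 3 3
```

The corner patches (e.g. $(0,0)$ and $(7,7)$) all get shifted index $0$, matching the figure: the four corner regions merge into one shifted window through the wrap-around.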

Staggered brick analogy. A brick wall is strongest when each course offsets by half a brick width from the layer below, so each brick bridges the joint in the previous course. SWIN's regular/shifted alternation works the same way: the shifted layer bridges exactly the boundaries that isolated patches in the regular layer.

Boundary strips. The boundary strips (rows/cols $0$–$1$ and $6$–$7$) create partial windows at the image edges; the next section explains how SWIN handles these efficiently.


5. The Cyclic Shift

Summary: Naively shifting the window grid creates irregular partial windows at the image boundary. SWIN handles this by cyclically rolling the patch grid before windowing, so all windows remain a uniform $W\times W$ size, then masking out attention between spatially non-adjacent patches within artificial boundary windows.

Three panels: shifted grid (before roll) → after cyclic roll (uniform windows) → inverse roll (spatial layout restored).

Colors show the original regular window each patch came from. After rolling, window boundaries (thick lines) are uniform 4×4. The boundary windows now contain patches that are not spatially adjacent (e.g., orange to the left of blue) — these non-adjacent pairs are blocked from attending to each other by the attention mask.

The boundary problem. A naive shift of $(2,2)$ on an 8×8 grid moves rows 6–7 and cols 6–7 outside the grid boundary, creating smaller fragments: strips of $2\times4$, $4\times2$, and $2\times2$ patches that do not fill a complete $4\times4$ window. Processing windows of different sizes requires branching code and breaks efficient batched computation.

Cyclic roll (torch.roll). Instead of discarding the boundary strips, SWIN wraps them around to the opposite edge — treating the patch grid as if it lives on a torus. Patches that fall off the right edge reappear on the left; patches that fall off the bottom reappear at the top. After rolling, the window grid is again perfectly regular: all windows are exactly $W\times W$ ($W^2 = 16$ patches), and the same batched code path handles every block.

Attention masking. The price of the roll is that some windows now contain patches that are spatially non-adjacent in the original image. For example, after rolling the toy 8×8 grid by $(2,2)$, the top-left window contains patches from all four original quadrants. These should not attend to one another — doing so would create spurious long-range connections through the torus seam. SWIN adds a large negative constant ($-10^4$ in practice; we use $-10$ in the toy example) to the attention scores for non-adjacent pairs before the softmax. After softmax, those entries become effectively $0$. The masking logic itself is mechanical index bookkeeping; the key idea is simply the roll. See Liu et al. (2021), Appendix A, for the full derivation.
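A sketch of the mask construction (numpy's `np.roll` stands in for `torch.roll`; the $-10^4$ constant follows the text, and labeling patches by their original quadrant is one simple way to build the mask, not necessarily the paper's exact bookkeeping):

```python
import numpy as np

Hg = Wg = 8
W, s = 4, 2

# Label every patch by its original regular-window index, then roll the
# label map exactly as the token grid is rolled.
r, c = np.meshgrid(np.arange(Hg), np.arange(Wg), indexing="ij")
labels = (r // W) * (Wg // W) + (c // W)
rolled = np.roll(labels, shift=(s, s), axis=(0, 1))

# Window 0 of the rolled grid: the top-left 4x4 block, rasterized to 16 ids.
win0 = rolled[:W, :W].reshape(-1)      # contains ids from all four quadrants

# A pair may attend only if both patches come from the same original window.
mask = np.where(win0[:, None] == win0[None, :], 0.0, -1e4)   # (16, 16)
# Adding `mask` to the raw scores before softmax drives masked entries to ~0.
```

For interior windows of a large image, `win0`-style label blocks are constant, so the mask is all zeros and has no effect — matching the note above.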

Left: Window $0$ after cyclic roll. Colors show each patch's original window; numbers are the row-major (rasterized) index used on the mask axes.
Right: Attention mask. Colored strips along the top and left show each index's origin. Dark = masked (non-adjacent pair); light = valid. None of the four quadrants in Window 0 are spatially adjacent, so all cross-quadrant attention is blocked. Other boundary windows (e.g., Window 1) straddle only a horizontal or vertical seam and allow partial cross-quadrant attention.
Note: For a large image, most windows after the cyclic roll remain interior windows whose sub-regions are all spatially adjacent, so their masks allow all attention pairs — the mask is all-valid and has no effect.

Raw attention scores $QK^\top/\!\sqrt{d}$ before masking. All 16×16 entries have meaningful values.

Attention weights after masking + softmax. The mask zeros out cross-quadrant entries; each token attends only within its own sub-region.

After attention. Once attention is computed in the rolled coordinate frame, the inverse roll (shift back by $(-2,-2) \bmod 8$) restores each patch to its correct spatial position. From the outside, the SW-MSA block takes a token sequence, applies rolled windowed attention with masking, and returns a sequence in the original order — exactly what the next block expects.
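The roll/inverse-roll round trip is easy to check (a toy sketch on an integer grid standing in for the token tensor):

```python
import numpy as np

Hg = Wg = 8
s = 2
tokens = np.arange(Hg * Wg).reshape(Hg, Wg)   # stand-in token grid

rolled = np.roll(tokens, shift=(s, s), axis=(0, 1))      # before SW-MSA
restored = np.roll(rolled, shift=(-s, -s), axis=(0, 1))  # after attention

assert (restored == tokens).all()   # original spatial order exactly restored
```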


Key Takeaways

  1. Window attention reduces complexity from $O(N^2)$ to $O(N\cdot W^2)$. Restricting each attention computation to a local $W\times W$ window replaces one large $N\times N$ matrix with many small $W^2\times W^2$ matrices. Total entries scale linearly in $N$, making SWIN practical for high-resolution images where full ViT attention would be prohibitive.
  2. Shifted windows (SW-MSA) restore cross-window information flow. By offsetting the window grid by $(W/2, W/2)$ in alternating blocks, patches that were isolated in one block can interact in the next. A W-MSA/SW-MSA block pair gives every patch access to a $2W\times2W$ neighborhood across two steps, without any increase in per-block complexity.
  3. The cyclic shift is an engineering trick that makes SW-MSA computationally uniform. Rolling the patch grid before windowing keeps all windows at exactly $W\times W$; attention masking then blocks spurious interactions between spatially non-adjacent patches in boundary windows. The key idea is the shift; the roll and mask are implementation details.
  4. Projection weights $W_Q, W_K, W_V$ are shared across all windows in a layer. The four windows at any given block use identical projections; their visibly different attention patterns arise from different spatial content in each window, not from different weight matrices. Window-specific behavior is entirely data-driven.
  5. W-MSA and SW-MSA always appear as a matched pair. A single W-MSA block alone leaves window boundaries fixed; adding SW-MSA immediately after bridges those exact boundaries, giving every patch the opportunity to interact with patches up to $2W$ steps away in each direction. Notebook 6 shows how these pairs compose into stages to build a multi-scale feature hierarchy.

References

  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows," ICCV, 2021. arXiv:2103.14030