Attention Is All You Need

The 2017 paper that launched the Transformer architecture and changed AI forever. Here's every concept, explained so clearly you'll never forget it.

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin — NIPS 2017


01 The Problem Before Transformers

Before 2017, if you wanted a neural network to handle sequences (text, speech, time-series), you used Recurrent Neural Networks (RNNs), specifically LSTMs or GRUs. They had a fatal flaw.

🧠 Analogy — The Assembly Line Problem

Imagine translating a book by passing it through workers on an assembly line. Worker 1 reads word 1, writes a note, and passes it to Worker 2. Worker 2 reads word 2 and the note, writes a new note, passes it on. By Worker 500, the note about word 1 is mostly forgotten. That's an RNN.

Now imagine instead that every worker can see every word of the book simultaneously, and each worker specializes in a different type of pattern. That's a Transformer.

Two Core Problems with RNNs

🐌

Sequential = Slow

RNNs process one token at a time. You MUST finish step t before starting step t+1. You can't parallelize this. Training on long sequences takes forever.

🧠💨

Long-Range Forgetting

Information from early tokens must survive through every hidden state. Over long sequences, early information degrades — the vanishing gradient problem.

RNNs process tokens sequentially (slow, forgetful): each hidden state must wait for all prior steps. Transformers process all positions in parallel (fast, all-seeing).
💡 Key Insight

The paper's breakthrough: you don't need recurrence at all. Replace it entirely with attention β€” a mechanism that lets every token directly look at every other token. This is parallel, fast, and doesn't forget.

02 The Transformer Architecture

The Transformer follows the classic encoder-decoder pattern, but replaces all recurrence with self-attention and feed-forward layers.

🧠 Analogy — The Conference Interpreter

Encoder = A team of analysts who read the entire source document and create a rich, interconnected understanding of it. Each analyst specializes in different aspects (syntax, meaning, context).

Decoder = A translator who writes the output word by word, but at each word can consult all the analysts AND review what they've written so far.

The Full Pipeline

1

Input Embedding

Convert each input token (word/subword) into a 512-dimensional vector. Multiply by √512 ≈ 22.6 to scale up.

2

Add Positional Encoding

Since there's no recurrence, inject position info using sine/cosine waves. "The" at position 0 gets a different signal than "the" at position 50.

3

Encoder Stack (×6 layers)

Each layer: Multi-Head Self-Attention → Add & Normalize → Feed-Forward Network → Add & Normalize. Every token attends to every other token.

4

Decoder Stack (×6 layers)

Each layer: Masked Self-Attention → Add & Norm → Cross-Attention (attend to encoder output) → Add & Norm → FFN → Add & Norm.

5

Linear + Softmax

Project decoder output to vocabulary size, apply softmax to get probability of each possible next token.

Key Hyperparameters (Base Model)

Parameter | Value | What It Means
d_model | 512 | Dimension of every vector flowing through the model
N (layers) | 6 | Both encoder and decoder have 6 stacked layers
h (heads) | 8 | 8 parallel attention heads per layer
d_k = d_v | 64 | Each head works with 64-dim keys/values (512 ÷ 8)
d_ff | 2048 | Inner dimension of the feed-forward network
Total params | 65M | Base model; the "big" model has 213M parameters

Residual Connections & Layer Norm

Every sub-layer uses a residual connection: the sub-layer's output is added back to its input, then layer-normalized. This is critical — it lets gradients flow straight through during backpropagation and stabilizes training.

Output = LayerNorm( x + SubLayer(x) )
Residual connection + layer normalization pattern
🧠 Analogy

Think of residual connections like a highway bypass. The sub-layer can learn "what to add" to the existing representation rather than having to rebuild it from scratch. If the sub-layer learns nothing useful, the original signal passes through unharmed.
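The pattern is only a couple of lines of code. Here's a minimal NumPy sketch (the learned gain and bias of a real layer norm are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector to zero mean, unit variance (gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    """Output = LayerNorm(x + SubLayer(x)) -- the pattern around every sub-layer."""
    return layer_norm(x + sublayer(x))

# If the sub-layer learns nothing useful (outputs zeros), the original
# signal still passes through the "highway bypass":
x = np.random.randn(3, 512)
out = residual_block(x, lambda v: np.zeros_like(v))
# out is just layer_norm(x): the input survives unchanged (up to normalization)
```
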

03 Scaled Dot-Product Attention

This is the heart of the paper. Understand this, and you understand Transformers.

The Q, K, V Framework

🧠 Analogy — The Library Search

Imagine a library with a card catalog:

Query (Q) = Your search question: "I need information about cats"

Key (K) = The label on each book's catalog card: "Feline Biology", "Dog Training", "Cat Behavior"

Value (V) = The actual content of each book

You compare your Query to each Key (how relevant is this book?), get a relevance score, then read a weighted mix of the Values (books), paying most attention to the most relevant ones.

The Formula Step by Step

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
Equation 1 — Scaled Dot-Product Attention
1

Compute Dot Products: QKᵀ

For each query, compute the dot product with every key. This gives a raw "similarity score." If Q and K point in similar directions → high score → these tokens are relevant to each other.

2

Scale by √d_k

Divide by √64 = 8. Why? Without scaling, when d_k is large, dot products become huge, pushing softmax into regions with near-zero gradients. Scaling keeps values in a well-behaved range.

3

Apply Softmax

Convert scores to probabilities (0 to 1, summing to 1). High scores → high attention weight. This is the "attention distribution" — how much each position pays attention to every other.

4

Multiply by V

Weighted sum of value vectors. Positions with high attention weight contribute more to the output. The result: a context-aware representation of each token.

🔬 Interactive: See Attention in Action

Click a word to see which other words it would attend to. Brightness = attention weight.

Why "Scaled" β€” A Numerical Example

Suppose d_k = 64. If q and k components are random with mean 0, variance 1:

The dot product q·k = Σᵢ qᵢkᵢ has variance = d_k = 64, i.e. standard deviation 8.

So typical dot products range from roughly -16 to +16 (a couple of standard deviations).

softmax([16, 0, -16]) ≈ [1.0, 0.0, 0.0] — completely saturated! Gradients vanish.

After scaling: divide by 8 → values range ≈ -2 to +2.

softmax([2, 0, -2]) ≈ [0.87, 0.12, 0.02] — still peaked, but with smooth, learnable gradients.
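You can check the saturation effect directly. A small NumPy demo of both cases (the softmax helper is a standard numerically stable implementation, not code from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

unscaled = softmax(np.array([16.0, 0.0, -16.0]))
scaled = softmax(np.array([2.0, 0.0, -2.0]))

# unscaled puts essentially all probability mass on the first entry
# (saturated: gradients for the other positions vanish);
# scaled keeps a spread-out distribution with usable gradients
```
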

Concrete Numerical Walkthrough

Let's trace attention for a tiny example with d_k = 4 and 3 tokens: "I love cats"

# Suppose after linear projection we get:
Q = [[1, 0, 1, 0],   # query for "I"
     [0, 1, 1, 0],   # query for "love"
     [1, 1, 0, 1]]   # query for "cats"

K = [[1, 1, 0, 0],   # key for "I"
     [0, 1, 1, 1],   # key for "love"
     [1, 0, 1, 0]]   # key for "cats"

# Step 1: QK^T (each query dotted with each key)
scores = [[1, 1, 2],   # "I" scores vs [I, love, cats]
          [1, 2, 1],   # "love" scores vs [I, love, cats]
          [2, 2, 1]]   # "cats" scores vs [I, love, cats]

# Step 2: Scale by sqrt(d_k) = sqrt(4) = 2
scaled = [[0.5, 0.5, 1.0],
          [0.5, 1.0, 0.5],
          [1.0, 1.0, 0.5]]

# Step 3: Softmax (each row sums to 1)
weights = [[0.27, 0.27, 0.45],  # "I" attends most to "cats"
           [0.27, 0.45, 0.27],  # "love" attends most to itself
           [0.38, 0.38, 0.23]]  # "cats" attends most to "I" and "love"

# Step 4: Multiply by V -> weighted sum of value vectors
# output[0] = 0.27*V["I"] + 0.27*V["love"] + 0.45*V["cats"]
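The same four steps as runnable NumPy, reusing the Q and K above; V here is an arbitrary toy matrix chosen just for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # steps 1 + 2
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)              # step 3: row-wise softmax
    return weights @ V, weights                              # step 4

Q = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]], dtype=float)
K = np.array([[1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 1, 0]], dtype=float)
V = np.eye(3, 4)  # toy values: one distinct vector per token

out, w = scaled_dot_product_attention(Q, K, V)
# each row of w is a probability distribution over the 3 tokens
```
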

04 Multi-Head Attention

One attention "head" can only focus on one type of relationship at a time. The solution: run multiple attention heads in parallel, each learning a different pattern.

🧠 Analogy — The Expert Panel

Imagine reading the sentence "The animal didn't cross the street because it was too tired."

Head 1 (Grammar Expert): "it" → looks at "animal" (pronoun resolution)

Head 2 (Semantics Expert): "tired" → looks at "animal" (who is tired?)

Head 3 (Syntax Expert): "cross" → looks at "street" (verb-object)

Head 4 (Negation Expert): "didn't" → looks at "cross" (what's negated?)

Each head extracts a different type of relationship. Combined, they capture a rich understanding.

The Mechanics

MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Multi-Head Attention formula

Dimension Split Walkthrough

# d_model = 512, h = 8 heads
# Each head gets: d_k = d_v = 512 / 8 = 64 dimensions

Input: x ∈ ℝ^(seq_len × 512)

# For each head i:
Q_i = x · W_i^Q                    # (seq_len × 512) · (512 × 64) → seq_len × 64
K_i = x · W_i^K                    # same dimensions
V_i = x · W_i^V                    # same dimensions
head_i = Attention(Q_i, K_i, V_i)  # → seq_len × 64

# Concatenate all 8 heads:
Concat = [head₁; head₂; ...; head₈]  # → seq_len × 512

# Final projection:
Output = Concat · W^O              # (seq_len × 512) · (512 × 512) → seq_len × 512

# Total cost ≈ same as single-head attention with the full d_model!
💡 Why This is Clever

By splitting into 8 heads of 64 dimensions each (instead of 1 head of 512 dimensions), you get 8 different learned attention patterns at roughly the same computational cost as a single head. It's like getting 8 experts for the price of 1.
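A minimal NumPy sketch of the split-project-attend-concat pipeline. The weights are random and untrained; the shapes are the point:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, h, seq_len = 512, 8, 10
d_k = d_model // h  # 64 dims per head

x = rng.standard_normal((seq_len, d_model))
# One projection matrix per head for Q, K, V, plus the output projection W_O
W_Q = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_K = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_V = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
W_O = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

heads = []
for i in range(h):
    Q, K, V = x @ W_Q[i], x @ W_K[i], x @ W_V[i]   # each: seq_len x 64
    w = softmax(Q @ K.T / np.sqrt(d_k))            # this head's attention pattern
    heads.append(w @ V)                            # seq_len x 64

out = np.concatenate(heads, axis=-1) @ W_O         # back to seq_len x 512
```

Each head sees the full input but projects it into its own 64-dim subspace, so each can learn a different attention pattern.
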

Three Ways Attention Is Used

1. Encoder Self-Attention

Every input token attends to every other input token. Q, K, V all come from the encoder's previous layer. Learns relationships within the source sentence.

2. Masked Decoder Self-Attention

Each output token attends to previous output tokens only. Future tokens are masked (set to -∞ before softmax). Preserves autoregressive property.

3. Encoder-Decoder Cross-Attention

Queries come from decoder, Keys and Values from encoder output. This is how the decoder "reads" the source — the bridge between input and output.
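The masking in decoder self-attention (case 2 above) can be sketched directly: set future positions to -∞ before the softmax, and their attention weights become exactly zero. A minimal NumPy illustration with dummy all-zero scores:

```python
import numpy as np

seq_len = 5
scores = np.zeros((seq_len, seq_len))  # stand-in for QK^T / sqrt(d_k)

# Mask future positions: position i may only attend to positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # exp(-inf) = 0
weights = e / e.sum(axis=-1, keepdims=True)

# Row 0 attends only to itself; row 4 attends uniformly to positions 0-4
```
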

🔬 Interactive: Multi-Head Attention Patterns

Toggle between different attention heads to see how each learns different patterns.

05 Positional Encoding

Without recurrence, the Transformer has no notion of order. "The cat sat on the mat" and "mat the on sat cat The" would look identical! Positional encodings fix this.

🧠 Analogy

Imagine you're at a concert with assigned seating. The positional encoding is like your seat number — it tells the model where each word "sits" in the sequence. But instead of a single number, it's a unique pattern of sine waves, like a musical chord unique to each position.

The Formula

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Sinusoidal positional encoding

What this means: For each position in the sequence, you create a 512-dimensional vector. Even dimensions (0, 2, 4...) use sine; odd dimensions (1, 3, 5...) use cosine. Each dimension has a different frequency, ranging from very fast oscillations to very slow ones.

Why sine/cosine? Because for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos). This means the model can easily learn to attend to "3 positions back" or "5 positions ahead" — relative position is baked into the math.
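A direct NumPy implementation of the two formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); 2i+1 uses cos."""
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dims: 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 512)
# Every position gets a distinct 512-dim pattern;
# dimension 0 oscillates fastest, the last dimensions barely move
```
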

🔬 Interactive: Positional Encoding Heatmap

Each row is a position (0–49). Each column is a dimension. Color shows the PE value. Notice: low dimensions oscillate fast, high dimensions oscillate slowly.


06 Feed-Forward Network & Training

Position-wise FFN

After attention, each token's representation passes through a simple 2-layer neural network, applied independently to each position:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Two linear transforms with ReLU activation
🧠 Analogy

Attention is like a meeting where everyone shares information. The FFN is like each person going back to their desk and thinking about what they heard — processing it independently. The input is 512 → 2048 → 512 dimensions (expand, process, compress).
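As code, the FFN is just two matrix multiplies with a ReLU in between. A sketch with random (untrained) weights to show the 512 → 2048 → 512 shape flow:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((10, d_model))  # 10 token representations
out = ffn(x)                            # same shape: (10, 512)
```

Because the same weights are applied row by row, each position is processed independently, exactly the "back at their desk" step in the analogy.
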

Training Setup

📊

Data

EN-DE: 4.5M sentence pairs with a shared ~37K-token BPE vocabulary. EN-FR: 36M sentences, 32K word-piece vocabulary.

⚡

Hardware

8 NVIDIA P100 GPUs. Base model: 12 hours (100K steps). Big model: 3.5 days (300K steps).

📈

Optimizer

Adam with β₁ = 0.9, β₂ = 0.98. Custom learning rate schedule with 4000 warmup steps.

🛡️

Regularization

Dropout (P = 0.1) on sub-layers and embeddings. Label smoothing (ε = 0.1) — hurts perplexity but helps BLEU.

The Learning Rate Schedule

One of the paper's subtle innovations — a "warmup then decay" schedule:

lr = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5))
Learning rate increases linearly for 4000 steps, then decays as 1/√step
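The schedule is a one-liner in Python. The warmup default of 4000 matches the paper; the `max(step, 1)` guard is an assumption added here to avoid division by zero at step 0:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """Linear warmup for `warmup` steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # guard: the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The two branches of min() cross exactly at step == warmup:
# before it, the rate climbs linearly; after it, the rate decays as 1/sqrt(step)
```
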

🔬 Learning Rate Schedule Visualization

Warmup phase (0–4K steps) followed by inverse-sqrt decay

Label Smoothing

Instead of training the model to output probability 1.0 for the correct token and 0.0 for all others, label smoothing distributes a small amount (ε = 0.1) of probability mass across all tokens.

# Without label smoothing (hard targets):
target = [0, 0, 0, 1.0, 0, 0, ...]   # 100% on correct token

# With label smoothing ε = 0.1:
target = [0.0001, 0.0001, 0.0001, 0.9, 0.0001, ...]
# 90% on correct token, 10% spread across the vocabulary
# Makes the model less overconfident → better generalization
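A sketch of building one smoothed target vector. Note this is one common formulation (splitting ε evenly over the other vocab_size - 1 tokens); implementations differ in exactly how the smoothed mass is distributed:

```python
import numpy as np

def smooth_targets(correct_idx, vocab_size, eps=0.1):
    """Keep 1 - eps on the correct token; spread eps over the other tokens."""
    t = np.full(vocab_size, eps / (vocab_size - 1))
    t[correct_idx] = 1.0 - eps
    return t

t = smooth_targets(3, vocab_size=10)
# t[3] is 0.9; the remaining 0.1 is spread over the other 9 entries
```
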

07 Results — Crushing the Competition

Machine Translation BLEU Scores

📊 BLEU Score Comparison

💡 Why These Results Were Shocking

The Transformer didn't just beat previous models — it did so at a fraction of the training cost. The big model achieved 28.4 BLEU on EN-DE (2+ BLEU above the previous best ensemble) while using only 2.3×10¹⁹ FLOPs. The previous best ensemble used 1.8×10²⁰ FLOPs — nearly 8× more compute.

Ablation Study β€” What Matters?

Number of Heads

1 head: 24.9 BLEU. 8 heads: 25.8 (best). 32 heads: 25.4 (too many hurts). Sweet spot is 8.

Key Dimension d_k

Reducing d_k hurts quality. The compatibility function needs enough capacity to determine relevance.

Model Size

Bigger is better: d_model = 1024 beats 512. d_ff = 4096 beats 2048. More params = more capacity.

Dropout

Critical for preventing overfitting. Without dropout: 24.6 BLEU. With P=0.1: 25.8 BLEU.

English Constituency Parsing

To prove generality, they applied the Transformer to parsing — converting sentences into tree structures. With just a 4-layer model and minimal tuning, it achieved 92.7 F1 (in the semi-supervised setting), competitive with specialized parsers trained specifically for this task. This showed the architecture wasn't just a translation trick.

08 Why Self-Attention Wins

The paper compares self-attention to recurrent and convolutional layers on three criteria:

Layer Type | Complexity per Layer | Sequential Ops | Max Path Length
Self-Attention | O(n² · d) | O(1) | O(1) ✓ best
Recurrent (RNN) | O(n · d²) | O(n) ✗ worst | O(n)
Convolutional | O(k · n · d²) | O(1) | O(log_k(n))

Max Path Length = O(1)

Any token can directly attend to any other token in a single layer. No need to pass through intermediate steps. "The" at position 1 can directly connect to "mat" at position 100.

Sequential Ops = O(1)

All attention computations happen in parallel (matrix multiplications). No waiting. An RNN needs O(n) sequential steps — for a 1000-token sequence, that's 1000 steps that cannot be parallelized.

Interpretable

Attention weights are visible and meaningful. You can literally see what the model is "looking at," as shown in the paper's attention visualizations (Figures 3-5).

💡 The Trade-off

Self-attention complexity is O(n² · d) — quadratic in sequence length. For very long sequences (n > d), this is actually more expensive than an RNN's O(n · d²). This is why later work focused on sub-quadratic attention (Longformer, Linformer) or on making exact attention much faster in hardware (FlashAttention, which is still quadratic). But for typical sentence lengths in translation (n < 100), this isn't an issue.

09 The Legacy — Why This Paper Changed Everything

The Transformer architecture didn't just improve machine translation. It became the foundation for virtually all of modern AI.

BERT (2018)

Encoder-only Transformer. Pre-train on masked language modeling, fine-tune for any NLP task. Revolutionized NLP.

Encoder Only

GPT Series (2018–)

Decoder-only Transformer. Autoregressive language models that became ChatGPT, GPT-4, and the LLM revolution.

Decoder Only

Vision Transformer (2020)

Applied attention to image patches. Showed Transformers work for vision too, not just text.

Vision

AlphaFold 2 (2020)

Used attention for protein structure prediction. Won CASP14 by a landslide. Transformers for biology.

Biology

DALL-E, Stable Diffusion

Text-to-image models using Transformer components. Attention bridges language and vision.

Multimodal

Claude, Gemini, etc.

Modern LLMs are all Transformer-based. The architecture from this 2017 paper powers today's AI revolution.

LLMs

The Complete Mental Model

Here's the complete Transformer in one mental picture:

Input tokens
      ↓
[Embedding + Positional Encoding]
      ↓
┌─────────────── ENCODER (×6) ──────────────┐
│  Multi-Head Self-Attention                │
│      ↓  (+ residual, + layer norm)        │
│  Feed-Forward Network (512→2048→512)      │
│      ↓  (+ residual, + layer norm)        │
└───────────────────────────────────────────┘
      ↓  (encoder output: rich representations)
      ↓
┌─────────────── DECODER (×6) ──────────────┐
│  Masked Multi-Head Self-Attention         │
│      ↓  (+ residual, + layer norm)        │
│  Multi-Head Cross-Attention (Q = decoder, │
│                 K, V = encoder output)    │
│      ↓  (+ residual, + layer norm)        │
│  Feed-Forward Network                     │
│      ↓  (+ residual, + layer norm)        │
└───────────────────────────────────────────┘
      ↓
[Linear → Softmax → Next Token Probabilities]