The 2017 paper that launched the Transformer architecture and changed AI forever. Here's every concept, explained so clearly you'll never forget it.
Before 2017, if you wanted a neural network to handle sequences (text, speech, time-series), you used Recurrent Neural Networks (RNNs), specifically LSTMs or GRUs. They had a fatal flaw.
Imagine translating a book by passing it through workers on an assembly line. Worker 1 reads word 1, writes a note, and passes it to Worker 2. Worker 2 reads word 2 and the note, writes a new note, passes it on. By Worker 500, the note about word 1 is mostly forgotten. That's an RNN.
Now imagine instead that every worker can see every word of the book simultaneously, and each worker specializes in a different type of pattern. That's a Transformer.
RNNs process one token at a time. You MUST finish step t before starting step t+1. You can't parallelize this. Training on long sequences takes forever.
Information from early tokens must survive through every hidden state. Over long sequences, early information degrades: the vanishing gradient problem.
The paper's breakthrough: you don't need recurrence at all. Replace it entirely with attention, a mechanism that lets every token directly look at every other token. This is parallel, fast, and doesn't forget.
The Transformer follows the classic encoder-decoder pattern, but replaces all recurrence with self-attention and feed-forward layers.
Encoder = A team of analysts who read the entire source document and create a rich, interconnected understanding of it. Each analyst specializes in different aspects (syntax, meaning, context).
Decoder = A translator who writes the output word by word, but at each word can consult all the analysts AND review what they've written so far.
Convert each input token (word/subword) into a 512-dimensional vector. Multiply by √512 ≈ 22.6 to scale up.
Since there's no recurrence, inject position info using sine/cosine waves. "The" at position 0 gets a different signal than "the" at position 50.
Each layer: Multi-Head Self-Attention → Add & Normalize → Feed-Forward Network → Add & Normalize. Every token attends to every other token.
Each layer: Masked Self-Attention → Add & Norm → Cross-Attention (attend to encoder output) → Add & Norm → FFN → Add & Norm.
Project decoder output to vocabulary size, apply softmax to get probability of each possible next token.
| Parameter | Value | What It Means |
|---|---|---|
| dmodel | 512 | Dimension of every vector flowing through the model |
| N (layers) | 6 | Both encoder and decoder have 6 stacked layers |
| h (heads) | 8 | 8 parallel attention heads per layer |
| dk = dv | 64 | Each head works with 64-dim keys/values (512 ÷ 8) |
| dff | 2048 | Inner dimension of the feed-forward network |
| Total params | 65M | Base model. "Big" model: 213M parameters |
Every sub-layer uses a residual connection: the sub-layer's output is added back to its input, then layer-normalized. This is critical: it lets gradients flow straight through during backpropagation and stabilizes training.
Think of residual connections like a highway bypass. The sub-layer can learn "what to add" to the existing representation rather than having to rebuild it from scratch. If the sub-layer learns nothing useful, the original signal passes through unharmed.
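Here's a minimal NumPy sketch of the pattern, using the paper's post-norm ordering LayerNorm(x + Sublayer(x)); the learned gain/bias of layer norm is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # Post-norm, as in the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                         # 10 tokens, d_model = 512
out = residual_block(x, lambda h: np.zeros_like(h))  # a sub-layer that learns nothing...
# ...still passes the (normalized) original signal through unharmed.
```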
This is the heart of the paper. Understand this, and you understand Transformers.
Imagine a library with a card catalog:
Query (Q) = Your search question: "I need information about cats"
Key (K) = The label on each book's catalog card: "Feline Biology", "Dog Training", "Cat Behavior"
Value (V) = The actual content of each book
You compare your Query to each Key (how relevant is this book?), get a relevance score, then read a weighted mix of the Values (books), paying most attention to the most relevant ones.
For each query, compute the dot product with every key. This gives a raw "similarity score." If Q and K point in similar directions → high score → these tokens are relevant to each other.
Divide by √64 = 8. Why? Without scaling, when dk is large, dot products become huge, pushing softmax into regions with near-zero gradients. Scaling keeps values in a well-behaved range.
Convert scores to probabilities (0 to 1, summing to 1). High scores → high attention weight. This is the "attention distribution": how much each position pays attention to every other.
Weighted sum of value vectors. Positions with high attention weight contribute more to the output. The result: a context-aware representation of each token.
Suppose dk = 64. If q and k components are random with mean 0, variance 1:
The dot product q·k = Σᵢ qᵢkᵢ has variance dk = 64, so a standard deviation of √64 = 8.
So typical dot products range from roughly -16 to +16 (a few standard deviations).
softmax([16, 0, -16]) ≈ [1.0, 0.0, 0.0]: completely saturated! Gradients vanish.
After scaling (divide by 8), values range from roughly -2 to +2.
softmax([2, 0, -2]) ≈ [0.87, 0.12, 0.02]: smooth gradients, learnable.
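You can verify both claims in a few lines of NumPy (a quick numerical check, not production code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
dk = 64
# Dot products of random unit-variance vectors: std grows like sqrt(dk) = 8.
dots = [rng.standard_normal(dk) @ rng.standard_normal(dk) for _ in range(10_000)]
print(np.std(dots))                               # ~8

print(softmax(np.array([16.0, 0.0, -16.0])))      # ~[1, 0, 0]: saturated
print(softmax(np.array([2.0, 0.0, -2.0])))        # ~[0.87, 0.12, 0.02]: smooth
```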
Let's trace attention for a tiny example with dk=4 and 3 tokens: "I love cats"
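Here's that trace as a minimal NumPy sketch; the Q, K, V rows are made-up illustrative numbers, not learned weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

dk = 4
# One 4-dim query/key/value row per token: "I", "love", "cats".
Q = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.]])
K = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 0., 1.]])
V = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 10., 11., 12.]])

scores  = Q @ K.T / np.sqrt(dk)   # steps 1-2: dot products, scaled by sqrt(4) = 2
weights = softmax(scores)         # step 3: each row is one token's attention distribution
output  = weights @ V             # step 4: weighted sum of values
print(weights.round(2))           # rows sum to 1
print(output.round(2))            # a context-aware 4-dim vector per token
```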
One attention "head" can only focus on one type of relationship at a time. The solution: run multiple attention heads in parallel, each learning a different pattern.
Imagine reading the sentence "The animal didn't cross the street because it was too tired."
Head 1 (Grammar Expert): "it" → looks at "animal" (pronoun resolution)
Head 2 (Semantics Expert): "tired" → looks at "animal" (who is tired?)
Head 3 (Syntax Expert): "cross" → looks at "street" (verb-object)
Head 4 (Negation Expert): "didn't" → looks at "cross" (what's negated?)
Each head extracts a different type of relationship. Combined, they capture a rich understanding.
By splitting into 8 heads of 64 dimensions each (instead of 1 head of 512 dimensions), you get 8 different learned attention patterns at roughly the same computational cost as a single head. It's like getting 8 experts for the price of 1.
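A sketch of the splitting trick in NumPy (in the real model, each head also gets its own learned Q/K/V projections, which this omits):

```python
import numpy as np

def split_heads(x, h):
    # (tokens, d_model) -> (heads, tokens, d_model // h)
    t, d = x.shape
    return x.reshape(t, h, d // h).transpose(1, 0, 2)

x = np.random.randn(10, 512)      # 10 tokens, d_model = 512
heads = split_heads(x, 8)
print(heads.shape)                # (8, 10, 64): 8 parallel 64-dim attention problems
# Each head runs scaled dot-product attention on its 64-dim slice; the 8 outputs
# are concatenated back to 512 dims and passed through a final learned projection.
```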
Every input token attends to every other input token. Q, K, V all come from the encoder's previous layer. Learns relationships within the source sentence.
Each output token attends to previous output tokens only. Future tokens are masked (set to -∞ before softmax, as sketched below). Preserves the autoregressive property.
Queries come from the decoder, Keys and Values from the encoder output. This is how the decoder "reads" the source: the bridge between input and output.
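Here's how that mask works in NumPy: future positions get -∞ before softmax, so they receive exactly zero attention weight (a sketch, not the paper's code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 5
scores = np.random.randn(n, n)                  # raw decoder self-attention scores
mask = np.triu(np.full((n, n), -np.inf), k=1)   # -inf strictly above the diagonal
weights = softmax(scores + mask)
print(weights.round(2))   # lower-triangular: token i attends only to tokens 0..i
```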
Without recurrence, the Transformer has no notion of order. "The cat sat on the mat" and "mat the on sat cat The" would look identical! Positional encodings fix this.
Imagine you're at a concert with assigned seating. The positional encoding is like your seat number β it tells the model where each word "sits" in the sequence. But instead of a single number, it's a unique pattern of sine waves, like a musical chord unique to each position.
What this means: For each position in the sequence, you create a 512-dimensional vector. Even dimensions (0, 2, 4...) use sine; odd dimensions (1, 3, 5...) use cosine. Each dimension has a different frequency, ranging from very fast oscillations to very slow ones.
Why sine/cosine? Because for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos). This means the model can easily learn to attend to "3 positions back" or "5 positions ahead": relative position is baked into the math.
Each row is a position (0–49). Each column is a dimension. Color shows the PE value. Notice: low dimensions oscillate fast, high dimensions oscillate slowly.
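Here's the encoding in a few lines of NumPy, following the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1, as a column
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angle = pos / 10000 ** (two_i / d_model)     # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dims: cosine
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)    # (50, 512): a unique 512-dim "chord" for each position
```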
After attention, each token's representation passes through a simple 2-layer neural network, applied independently to each position:
Attention is like a meeting where everyone shares information. The FFN is like each person going back to their desk and thinking about what they heard, processing it independently. The dimensions go 512 → 2048 → 512 (expand, process, compress).
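A minimal sketch of the position-wise FFN in NumPy, with random weights standing in for learned ones:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # The same two-layer network applied to every token independently:
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))   # 10 tokens
print(ffn(x, W1, b1, W2, b2).shape)      # (10, 512): expand to 2048, compress back
```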
EN-DE: 4.5M sentence pairs, ~37K-token shared BPE vocabulary. EN-FR: 36M sentence pairs, 32K word-piece vocabulary.
8 NVIDIA P100 GPUs. Base model: 12 hours (100K steps). Big model: 3.5 days (300K steps).
Adam with β₁=0.9, β₂=0.98. Custom learning rate schedule with 4000 warmup steps.
Dropout (P=0.1) on sub-layers and embeddings. Label smoothing (ε=0.1): hurts perplexity but helps BLEU.
One of the paper's subtle innovations: a "warmup then decay" schedule.
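The paper's formula, as a one-liner (step counts from 1):

```python
def lrate(step, d_model=512, warmup=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    # linear warmup for the first `warmup` steps, then inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

print(lrate(1), lrate(4000), lrate(100_000))  # rises to its peak at step 4000, then decays
```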
Instead of training the model to output probability 1.0 for the correct token and 0.0 for all others, label smoothing distributes a small amount (ε=0.1) of probability mass across all tokens.
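In code, one common formulation looks like this (a sketch; the paper's exact variant may differ in details such as how padding tokens are handled):

```python
import numpy as np

def smooth_targets(correct_idx, vocab_size, eps=0.1):
    # Spread eps uniformly over the vocabulary; keep the rest on the correct token.
    t = np.full(vocab_size, eps / vocab_size)
    t[correct_idx] += 1.0 - eps
    return t

print(smooth_targets(2, vocab_size=5))
# [0.02 0.02 0.92 0.02 0.02] instead of the one-hot [0, 0, 1, 0, 0]
```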
The Transformer didn't just beat previous models; it did so at a fraction of the training cost. The big model achieved 28.4 BLEU on EN-DE (2+ BLEU above the previous best ensemble) while using only 2.3×10¹⁹ FLOPs. The previous best ensemble used 1.8×10²⁰ FLOPs, nearly 8× more compute.
1 head: 24.9 BLEU. 8 heads: 25.8 (best). 32 heads: 25.4 (too many hurts). Sweet spot is 8.
Reducing dk hurts quality. The compatibility function needs enough capacity to determine relevance.
Bigger is better: dmodel=1024 beats 512. dff=4096 beats 2048. More params = more capacity.
Critical for preventing overfitting. Without dropout: 24.6 BLEU. With P=0.1: 25.8 BLEU.
To prove generality, they applied the Transformer to parsing β converting sentences into tree structures. With just a 4-layer model and minimal tuning, it achieved 92.7 F1, competitive with specialized parsers trained specifically for this task. This showed the architecture wasn't just a translation trick.
The paper compares self-attention to recurrent and convolutional layers on three criteria:
| Layer Type | Complexity/Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) (best) |
| Recurrent (RNN) | O(n · d²) | O(n) (worst) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k n) |
Any token can directly attend to any other token in a single layer. No need to pass through intermediate steps. "The" at position 1 can directly connect to "mat" at position 100.
All attention computations happen in parallel (matrix multiplications). No waiting. RNNs need O(n) sequential steps: for a 1000-token sequence, that's 1000× more sequential operations.
Attention weights are visible and meaningful. You can literally see what the model is "looking at," as shown in the paper's attention visualizations (Figures 3-5).
Self-attention complexity is O(n² · d): quadratic in sequence length. For very long sequences (n > d), this is actually more expensive than an RNN's O(n · d²). This is why later work focused on taming attention's cost: Longformer and Linformer make it sub-quadratic, while FlashAttention keeps it exact but memory-efficient. But for typical sentence lengths in translation (n < 100), this isn't an issue.
The Transformer architecture didn't just improve machine translation. It became the foundation for virtually all of modern AI.
BERT (Encoder Only): Encoder-only Transformer. Pre-train on masked language modeling, fine-tune for any NLP task. Revolutionized NLP.
GPT (Decoder Only): Decoder-only Transformer. Autoregressive language models that became ChatGPT, GPT-4, and the LLM revolution.
ViT (Vision): Applied attention to image patches. Showed Transformers work for vision too, not just text.
AlphaFold (Biology): Used attention for protein structure prediction. Won CASP14 by a landslide. Transformers for biology.
Multimodal: Text-to-image models using Transformer components. Attention bridges language and vision.
LLMs: Modern LLMs are all Transformer-based. The architecture from this 2017 paper powers today's AI revolution.
Here's the complete Transformer in one mental picture: