Attention Heatmap

Hover a row to see where that token attends

About

Attention lets each token “look at” every other token. Each token produces a Query (what I’m looking for), a Key (what I offer), and a Value (what I contribute).

Raw scores (Q·Kᵀ) are divided by √d_k to keep gradients stable, then softmax turns them into weights that sum to 1 per row. Multi-head attention runs several heads in parallel, each learning different relationships.

What Is the Attention Mechanism?

The attention mechanism is the core innovation behind transformers and modern large language models like GPT and BERT. At its heart, attention solves a fundamental problem in sequence processing: how does a model decide which parts of the input are relevant to each other? Instead of processing tokens one at a time from left to right, as recurrent neural networks (RNNs) do, attention allows every element in a sequence to look at every other element simultaneously and determine what information is most relevant.

This parallel computation is a large part of what makes transformers effective. A word at the end of a sentence can directly attend to a word at the beginning without information having to pass through dozens of intermediate steps. This ability to capture long-range dependencies in a single operation reshaped natural language processing and has since been adopted in computer vision, speech recognition, protein structure prediction, and many other domains.

Queries, Keys, and Values

The attention mechanism operates through three learned linear projections that transform each input token into three distinct vectors: a Query (Q), a Key (K), and a Value (V). The intuition is straightforward. The Query represents "what am I looking for?" -- it encodes what information the current token needs from the rest of the sequence. The Key represents "what do I contain?" -- it advertises what kind of information each token offers. The Value represents "what information do I carry?" -- it holds the actual content that gets passed along when a token is attended to.

These three projections are learned during training, which means the model discovers for itself what constitutes a useful query, what makes a good key, and what value information is worth propagating. This learned decomposition is one of the reasons attention is so flexible: the same mechanism can learn to track syntactic relationships, semantic similarity, positional patterns, or entirely abstract features depending on the task.
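As a rough illustration of the three projections, the sketch below multiplies a small matrix of token embeddings by three weight matrices. The sizes, the numbers, and the names `X`, `W_q`, `W_k`, `W_v`, and `matmul` are all made up for this example; in a real model the weights are learned rather than written by hand, and a tensor library would replace the plain lists used here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical input: 3 tokens, each with a 4-dimensional embedding.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hand-picked 4x2 projection weights (assumed values, not learned ones).
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

Q = matmul(X, W_q)  # one query row per token: what it is looking for
K = matmul(X, W_k)  # one key row per token: what it offers
V = matmul(X, W_v)  # one value row per token: what it contributes

print("Q =", Q)  # [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
print("K =", K)  # [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]
print("V =", V)  # [[2.0, 1.0], [0.0, 2.0], [2.0, 2.0]]
```

The same input row produces three different vectors because each projection has its own weights, which is what lets a token ask for one thing and advertise another.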

How Attention Scores Are Computed

Step 1: Dot Products

The raw attention score between token i and token j is computed as the dot product of token i's Query vector and token j's Key vector: Q_i · K_j. A high score means the query and key are well-aligned in the learned projection space, indicating that token i should "attend" strongly to token j. A low or negative score means token j is less relevant to token i. For a sequence of n tokens this produces an n × n matrix of scores in which every token is compared against every token, itself included.
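The score matrix can be sketched in a few lines. The query and key values below are invented for illustration, and `score_matrix` is a hypothetical helper name, not part of any library.

```python
def score_matrix(Q, K):
    """Return scores where scores[i][j] is the dot product Q_i . K_j."""
    return [[sum(q * k for q, k in zip(q_i, k_j)) for k_j in K] for q_i in Q]

# Made-up queries and keys for 3 tokens, 2 dimensions each.
Q = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
K = [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]

scores = score_matrix(Q, K)
for row in scores:
    print(row)
# [0.0, 8.0, 4.0]
# [8.0, 0.0, 8.0]
# [4.0, 8.0, 8.0]
```

Row i holds token i's scores against every key. The matrix is not symmetric in general: token 0 scores 8 against token 1 here because its query aligns with that key, which says nothing about how token 1's query aligns with token 0's key in a model where the two projections differ.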

Step 2: Scaling

The raw dot-product scores are divided by the square root of the key dimension, √d_k. This scaled dot-product attention is essential for stable training. Without scaling, when the dimensionality of the key vectors is large, the dot products tend to grow in magnitude, pushing the subsequent softmax function into regions where its gradients are extremely small. This "saturated softmax" problem would make learning painfully slow or stall entirely. The reason for choosing √d_k specifically: if the components of the query and key vectors are roughly independent with zero mean and unit variance, their dot product has variance d_k, so dividing by √d_k brings the variance of the scores back to about 1 regardless of dimension size.
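The effect of the scaling factor can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of any trained model; `dot_product_variance` is a hypothetical helper written for this example.

```python
import math
import random

def dot_product_variance(d_k, scale, samples=2000, seed=0):
    """Estimate the variance of q . k for random q, k with d_k components."""
    rng = random.Random(seed)
    divisor = math.sqrt(d_k) if scale else 1.0
    values = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        values.append(sum(a * b for a, b in zip(q, k)) / divisor)
    mean = sum(values) / samples
    return sum((v - mean) ** 2 for v in values) / samples

for d_k in (4, 16, 64):
    raw = dot_product_variance(d_k, scale=False)
    scaled = dot_product_variance(d_k, scale=True)
    print(f"d_k={d_k:3d}  raw variance ~ {raw:6.1f}  scaled variance ~ {scaled:4.2f}")
```

Under these assumptions the raw variance should come out close to d_k and the scaled variance close to 1, which is the behavior the scaling step is meant to guarantee.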

Step 3: Softmax Normalization

After scaling, the softmax function is applied to each row of the score matrix. Softmax converts each row of scaled scores into a probability distribution that sums to 1. This means each token distributes its total attention budget across all tokens in the sequence, itself included. Tokens with high scores receive more attention weight, while tokens with much lower scores receive very little. The resulting attention weights are then used to compute a weighted sum of the Value vectors, producing the final output for each position.
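Putting the three steps together gives softmax(Q·Kᵀ / √d_k)·V. The following is a minimal sketch of that computation on plain Python lists; the function names and the toy inputs are assumptions made for this example, and a production implementation would use a tensor library instead.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(row)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns (outputs, weights)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q_i in Q:
        # Steps 1 and 2: dot products with every key, divided by sqrt(d_k).
        scaled = [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
        # Step 3: softmax over the row, then a weighted sum of the values.
        w = softmax(scaled)
        weights.append(w)
        outputs.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs: 2 tokens, 2 dimensions (made-up numbers).
Q = [[0.0, 0.0], [2.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [3.0, 2.0]]

outputs, weights = attention(Q, K, V)
print(weights[0])  # [0.5, 0.5]: a zero query scores every key equally
print(outputs[0])  # [2.0, 1.0]: the plain average of the two value rows
print(weights[1])  # more weight on token 0, whose key matches the query
```

The first token's all-zero query is a useful sanity check: equal scores produce uniform weights, so its output is simply the mean of the Value vectors.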

Multi-Head Attention

Rather than computing a single attention function, transformers run multiple attention heads in parallel, each with its own learned Q, K, and V projection matrices. This is called multi-head attention, and it is one of the key design choices that makes transformers so effective. Each head can learn to focus on different types of relationships: one head might capture syntactic dependencies like subject-verb agreement, another might learn semantic similarity between words, and yet another might track positional or structural patterns.

The outputs of all heads are concatenated and passed through a final linear projection to produce the layer's output. This design gives the model multiple "representation subspaces" to work with simultaneously. In the standard design each head works in a smaller subspace (the model dimension divided by the number of heads), so the total computational cost is similar to that of a single head operating at full dimension. Studies of trained models have found that some heads do specialize, and analyzing which head attends to what has become a common tool for interpretability research.
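The head-then-concatenate structure can be sketched as follows. Everything here is illustrative: the two hand-written heads simply select different halves of a 4-dimensional embedding, each head reuses one matrix for its query, key, and value projections to keep the example short, and the output projection `W_o` is an identity matrix. A trained model would learn separate weights for all of these.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q times K transposed
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """Run each head on its own projections, concatenate, then project."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs token by token.
    concatenated = [[x for h in head_outputs for x in h[i]] for i in range(len(X))]
    return matmul(concatenated, W_o)

# Toy setup: model dimension 4, two heads of dimension 2 each.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]
W_o = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

X = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0]]
result = multi_head_attention(X, heads, W_o)
print(len(result), "tokens x", len(result[0]), "dimensions")  # 2 tokens x 4 dimensions
```

Because each head sees only a 2-dimensional slice, the two heads compute different attention weights from the same input, and the concatenated output returns to the model dimension of 4.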

Why Attention Transformed NLP

Before the transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need," most sequence models relied on recurrent operations, which process tokens one after another, or on convolutions, which see only a local window of tokens at each layer. RNNs and LSTMs struggled with long-range dependencies because information had to pass through many time steps, suffering from vanishing or exploding gradients along the way. Attention removes this bottleneck: every position in the sequence is directly accessible to every other position in a single computational step.

This architectural change unlocked massive parallelism during training (since the attention outputs for all positions can be computed at once rather than one step after another), enabled models to scale to billions of parameters, and produced the foundation for breakthrough models including BERT, GPT, T5, and their successors. Today, attention-based transformers are the dominant architecture in NLP and are increasingly used across other areas of machine learning, powering everything from chatbots and code generators to image synthesis and scientific discovery.

Explore the heatmap above to see how different tokens attend to each other. Change the sentence, adjust the number of heads, and step through the computation stages to build intuition for how the attention mechanism works in practice.
