From Bottleneck to Retrieval
The viewer will understand why attention was introduced and how text becomes vector inputs that attention can work with.
Attention Mechanisms, Deeper. Attention was introduced so that a model can focus on the most relevant tokens, and it works on text that has first been turned into vectors. By the end, you'll know why attention matters, how text becomes vectors, and how scores shape focus.
Start with the bottleneck. If you compress an entire prefix into one fixed-size state, later prediction has to rely on whatever survived that compression. Attention changes that. For the current token, the model can look back and select the prior representations that matter most.
So what is the first thing to predict here? Not a label, but the effect of selective access. When the model needs to resolve a pronoun, track a topic, or follow a local dependency, it does not reread everything equally. It computes a learned retrieval over the visible sequence and pulls the relevant parts forward.
That is the core shift. The model is no longer forced to store all useful history in one bottlenecked vector. It can distribute information across tokens, then recover it on demand. The representation at each position becomes a place where context is gathered, not just compressed.
And that matters because the next prediction is often decided by a small subset of earlier tokens. Attention gives the model a way to identify those tokens, route information from them, and largely ignore the rest. So the question is not whether history exists. It is which parts of history should be active right now.
Before attention can do anything, the text has to become vectors. You start with token IDs, which are discrete indices into the vocabulary. Those IDs are not yet meaning-bearing in a geometric sense; they are just addresses. Then an embedding lookup maps each ID to a continuous vector. Now the model has numbers it can compare, project, and combine. Attention works on that vector space, so this step is what turns raw text into something the mechanism can operate on.
At this point there is still no explicit syntax engine and no built-in reference resolution. The model has positions and embeddings, and that is enough to begin learning relationships from data. So if you were predicting what attention sees first, the answer is not words. It is vectors arranged by token position.
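To make the path from text to vectors concrete, here is a minimal sketch in plain Python. The vocabulary, the whitespace tokenizer, and the numbers in the embedding table are all hypothetical and chosen only for illustration; in a trained model the table holds learned parameters and the tokenizer is usually subword-based.

```python
# Hypothetical toy vocabulary: each word maps to a discrete token ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "it": 3}

# Hypothetical embedding table: one row per token ID, dimension 3.
# The values are made up; a real model learns them during training.
embedding_table = [
    [0.1, 0.0, 0.2],  # ID 0, "the"
    [0.9, 0.3, 0.1],  # ID 1, "cat"
    [0.2, 0.8, 0.5],  # ID 2, "sat"
    [0.7, 0.2, 0.0],  # ID 3, "it"
]


def tokenize(text):
    """Map whitespace-separated words to token IDs (just addresses)."""
    return [vocab[word] for word in text.split()]


def embed(token_ids):
    """Look up one continuous vector per token ID, keeping position order."""
    return [embedding_table[token_id] for token_id in token_ids]


token_ids = tokenize("the cat sat")
vectors = embed(token_ids)

print(token_ids)  # discrete indices
print(vectors)    # vectors arranged by token position
```

The IDs carry no geometry on their own. Only after the lookup does the model have rows of numbers it can compare and project, which is the input attention operates on.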
How Attention Computes
The viewer will understand the core attention mechanism: queries, keys, values, and the scaled dot-product that turns similarity into weighted retrieval.
Now we move into the mechanism itself. Self-attention begins by projecting each token representation into three learned views. One projection becomes the query, one becomes the key, and one becomes the value. The query is what this position is looking for. The key is what each visible position offers for matching. The value is the content that will actually be retrieved if the match is strong enough.
Keep those roles separate, because that separation is the whole point. If you ask what the current position is looking for, the query answers that by expressing the current need. If you ask which earlier token is relevant, the keys answer that by exposing matchable features. And if you ask what gets carried forward, the values answer that by holding the information that will be mixed into the output.
This is not a single comparison between whole tokens. It is a learned interaction between three projections of every token. That lets the model decide, position by position, whether a relationship is useful for the current update or should be ignored.
So the practical question is: can you identify the three parts in a running attention step? You should be able to point to the query as the request, the key as the match surface, and the value as the payload. That is the selective routing structure attention uses.
Once queries and keys exist, the computation becomes concrete. You take a query vector and compare it with each visible key vector using a dot product. Larger similarity means stronger evidence that this position should matter for the current update. But raw dot products can grow too large, especially in higher dimensions. So the score is scaled before normalization, typically by dividing by the square root of the key dimension. That keeps the numbers in a range where the normalization does not saturate and training stays stable.
Then the scores are normalized with a softmax into a distribution. Now the model has weights that sum to one over the visible tokens, and those weights become a differentiable routing rule.
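The compare, scale, and normalize steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production implementation: the projection matrices `W_q` and `W_k` and the token vectors are hypothetical hand-picked numbers, and the scale factor is the square root of the key dimension.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn scores into non-negative weights that sum to one."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Compare one query with every visible key, scale, then normalize."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    return softmax(scores)


# Hypothetical projection matrices (learned in a real model).
W_q = [[1.0, 0.5], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.5, 1.0]]

# Hypothetical token vectors for three visible positions.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

query = matvec(W_q, token_vectors[-1])           # request from the current position
keys = [matvec(W_k, x) for x in token_vectors]   # match surface of every position
weights = attention_weights(query, keys)

print(weights)
```

Positions whose keys line up with the query get the larger weights, and the weights always sum to one, so they act as a soft routing rule and not a hard selection.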
The output is not a hard choice; it is a soft allocation of influence across the sequence. So if you were asked to predict the effect of a stronger query-key match, the answer is a larger weight in the final mix. If the match is weak, that position contributes little. The mechanism is simple in structure: compare, scale, normalize, then route information by weight.
And because the whole path is differentiable, training can tune the projections that produce those scores. The model is not hand-coded to attend to a specific token. It learns which similarities should count, and by how much, from the prediction objective itself.
Take one sentence and follow one token through a single attention step. Suppose the current token is a pronoun that needs its antecedent. The query from that position scores the earlier tokens, and the highest weight lands on the noun it refers to.
Now the update happens. The model does not copy the noun token itself. It forms a weighted sum of the value vectors from the visible positions. The output at the pronoun position becomes a contextual vector that carries the retrieved information forward.
That is the key result to notice: the token representation changes because it has looked back. The same surface token can produce different vectors in different sentences, because the values mixed into it are different each time. Context is not stored in the token alone. It is assembled at the moment of attention.
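The pronoun walk-through can be traced with a small sketch of one full attention step. All vectors here are hypothetical and hand-picked so that the pronoun's query matches the noun's key; in a real model they come from learned projections. The two contexts share the same query and keys and differ only in the value stored at the noun position.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn scores into non-negative weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One attention step for one position: weights, then weighted sum of values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical query from the pronoun position.
pronoun_query = [4.0, 0.0]

# Visible positions: [noun, verb, pronoun]. The noun's key matches the query.
keys = [[4.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Same pronoun, two sentences: only the noun's value vector differs.
values_a = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
values_b = [[0.0, 1.0], [0.0, 0.0], [0.5, 0.5]]

weights_a, output_a = attend(pronoun_query, keys, values_a)
weights_b, output_b = attend(pronoun_query, keys, values_b)

print(weights_a)  # most of the weight lands on the noun position
print(output_a)   # contextual vector in sentence A
print(output_b)   # a different contextual vector in sentence B
```

The weights are identical in both runs, yet the outputs differ, because the output is a mix of values and not a copy of any token. That is the sense in which context is assembled at the moment of attention.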