Attention Mechanisms, Deeper

From Bottleneck to Retrieval

by Certisured

Certisured is an Edtech delivering high impact career transition courses and placements on advanced frontier technologies like AI, Data Science & Engineering

www.certisured.com

The viewer will understand why attention was introduced and how text becomes vector inputs that attention can work with.

4 episodes

Attention Mechanisms, Deeper — full transcript

From Bottleneck to Retrieval

The viewer will understand why attention was introduced and how text becomes vector inputs that attention can work with.

Attention Mechanisms, Deeper: attention was introduced to let models focus on the most relevant tokens, and it works on text that has first been turned into vectors. By the end, you'll know why attention matters, how text becomes vectors, and how scores shape focus.

Start with the bottleneck. If you compress an entire prefix into one fixed-size state, later prediction has to rely on whatever survived that compression. Attention changes that. For the current token, the model can look back and select the prior representations that matter most.

So what is the first thing to predict here? Not a label, but the effect of selective access. When the model needs to resolve a pronoun, recall a topic, or track a local dependency, it does not reread everything equally. It computes a learned retrieval over the visible sequence and pulls only the relevant parts forward.

That is the core shift. The model is no longer forced to store all useful history in one bottlenecked vector. It can distribute information across tokens, then recover it on demand. The representation at each position becomes a place where context is gathered, not just compressed.

And that matters because the next prediction is often decided by a small subset of earlier tokens. Attention gives the model a way to identify those tokens, route information from them, and ignore the rest. So the question is not whether history exists. It is which parts of history should be active right now.

Before attention can do anything, the text has to become vectors. You start with token IDs, which are discrete indices into the vocabulary. Those IDs are not yet meaning-bearing in a geometric sense; they are just addresses. Then an embedding lookup maps each ID to a continuous vector. Now the model has numbers it can compare, project, and combine. Attention works on that vector space, so this step is what turns raw text into something the mechanism can operate on.

At this point there is still no explicit syntax engine and no built-in reference resolution. The model has positions and embeddings, and that is enough to begin learning relationships from data. So if you were predicting what attention sees first, the answer is not words. It is vectors arranged by token position.
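To make the ID-to-vector step concrete, here is a minimal sketch in plain Python. The vocabulary, the width `d_model`, and the random table values are all assumptions made up for illustration; in a trained model the table entries are learned, and a tensor library would normally do the lookup.

```python
import random

# Hypothetical toy vocabulary: each token string maps to an integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "it": 3}
d_model = 4  # embedding width, chosen arbitrarily for this sketch

# Embedding table: one row of d_model numbers per vocabulary entry.
# Random here; in a trained model these values are learned.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
    for _ in range(len(vocab))
]

def embed(tokens):
    """Map token strings to IDs, then look up one vector per ID."""
    token_ids = [vocab[t] for t in tokens]               # discrete addresses
    vectors = [embedding_table[i] for i in token_ids]    # continuous vectors
    return token_ids, vectors

token_ids, vectors = embed(["the", "cat", "sat"])
print(token_ids)                      # [0, 1, 2]
print(len(vectors), len(vectors[0]))  # 3 4
```

The list `vectors` is ordered by token position, which is what attention receives: one vector per position, not words.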

How Attention Computes

The viewer will understand the core attention mechanism: queries, keys, values, and the scaled dot-product that turns similarity into weighted retrieval.

Now we move into the mechanism itself. Self-attention begins by projecting each token representation into three learned views. One projection becomes the query, one becomes the key, and one becomes the value. The query is what this position is looking for. The key is what each visible position offers for matching. The value is the content that will actually be retrieved if the match is strong enough.

Keep those roles separate, because that separation is the whole point. If you ask what to predict next, the query answers that by expressing the current need. If you ask which earlier token is relevant, the keys answer that by exposing matchable features. And if you ask what gets carried forward, the values answer that by holding the information that will be mixed into the output.

This is not a single comparison between whole tokens. It is a learned interaction between three projections of every token. That lets the model decide, position by position, whether a relationship is useful for the current update or should be ignored.

So the practical question is: can you identify the three parts in a running attention step? You should be able to point to the query as the request, the key as the match surface, and the value as the payload. That is the selective routing structure attention uses.

Once queries and keys exist, the computation becomes concrete. You take a query vector and compare it with each visible key vector using a dot product. Larger similarity means stronger evidence that this position should matter for the current update.

But raw dot products can grow too large, especially in higher dimensions. So each score is scaled before normalization, by dividing it by the square root of the key dimension. That keeps the numbers in a range where training stays stable and the comparisons remain usable across layers and sequence lengths.

Then a softmax normalizes the scores into a distribution. Now the model has weights that sum to one over the visible tokens, and those weights become a differentiable routing rule. The output is not a hard choice; it is a soft allocation of influence across the sequence.

So if you were asked to predict the effect of a stronger query-key match, the answer is a larger weight in the final mix. If the match is weak, that position contributes little. The mechanism is simple in structure: compare, scale, normalize, then route information by weight.

And because the whole path is differentiable, training can tune the projections that produce those scores. The model is not hand-coded to attend to a specific token. It learns which similarities should count, and by how much, from the prediction objective itself.

Take one sentence and follow one token through a single attention step. Suppose the current token is a pronoun that needs its antecedent. The query from that position scores the earlier tokens, and the highest weight lands on the noun it refers to.

Now the update happens. The model does not copy the noun token itself. It forms a weighted sum of the value vectors from the visible positions. The output at the pronoun position becomes a contextual vector that carries the retrieved information forward.

That is the key result to notice: the token representation changes because it has looked back. The same surface token can produce different vectors in different sentences, because the values mixed into it are different each time. Context is not stored in the token alone. It is assembled at the moment of attention.
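The compare, scale, normalize, route sequence can be sketched for a single query in plain Python. This is an illustration under stated assumptions, not a reference implementation: the query, keys, and values below are hand-picked numbers standing in for already-projected vectors, and the learned projections themselves are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over visible positions."""
    d_k = len(query)
    # Compare and scale: dot product with each key, divided by sqrt(d_k).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    # Normalize: softmax gives one weight per visible position.
    weights = softmax(scores)
    # Route: the output is the weighted sum of the value vectors.
    d_v = len(values[0])
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(d_v)
    ]
    return weights, output

# Hypothetical numbers: position 1 plays the antecedent noun, and its
# key is chosen to align strongly with the pronoun's query.
query = [1.0, 0.0]
keys = [[0.0, 1.0], [4.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

weights, output = attend(query, keys, values)
print(weights)  # the largest weight is at index 1
print(output)   # a blend of all three value vectors, dominated by values[1]
```

Because the values here are one-hot, the output vector simply equals the weights, which makes the soft allocation easy to read: every position contributes, but the strong match contributes most.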

Attention Inside Transformers

The viewer will understand how self-attention is used in transformer layers, why masking matters in decoders, and how multiple heads and depth expand representational power.

Inside a Transformer, self-attention is not a standalone trick. It sits inside a block with feed-forward sublayers, residual connections, and normalization. The attention step gathers context; the rest of the block transforms and stabilizes that representation for deeper processing.

In decoder-only models, there is one more constraint: causal masking. A position can attend only to itself and earlier tokens, never to future ones. That preserves autoregressive prediction, because the model must build the next token from what is already available, not from information it has not seen yet.

So if you apply this to a new situation, ask which positions are visible once the mask is applied. The answer determines the entire routing pattern. Attention is powerful, but the architecture decides what information is legally in scope.

Now the single-head picture expands. With multi-head attention, the model runs several attention operations in parallel, each with its own learned projections. One head may track local reference, another may track long-range dependency, and another may focus on structural cues.

The important point is that the heads do not compete for a single interpretation. They each produce their own weighted retrieval, and those outputs are combined. That gives the model multiple views of the same token sequence at the same time.

Then depth matters. A later layer receives contextual vectors already shaped by earlier attention steps. It can attend again, now over richer representations, and compose simpler relations into more abstract ones. That is how the system moves from token-to-token matching to layered structure.

So what should you predict when the model has more heads and more layers? Not just more attention, but more kinds of attention and more stages of refinement. Parallel heads broaden the search. Depth composes the results into higher-order features.

If you identify the components here, you have the full stack: projections per head, per-head routing, concatenation or combination, then repeated layers with normalization and residual paths. The architecture is not one attention event. It is a sequence of attention events building on each other.
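Causal masking and multiple heads can be sketched together in plain Python. Everything below is an assumption for illustration: the sizes, the random weights standing in for learned projections, and the helper names `causal_head` and `multi_head`. The mask is applied by scoring only the visible prefix, which gives the same weights as setting future scores to negative infinity before the softmax. The output projection, feed-forward sublayer, residual paths, and normalization are omitted.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def causal_head(xs, w_q, w_k, w_v):
    """One attention head where position i sees only positions 0..i."""
    qs = [matvec(w_q, x) for x in xs]
    ks = [matvec(w_k, x) for x in xs]
    vs = [matvec(w_v, x) for x in xs]
    d_k = len(qs[0])
    outputs, all_weights = [], []
    for i, q in enumerate(qs):
        # Causal mask: score only the visible prefix, positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out = [
            sum(w * vs[j][c] for j, w in enumerate(weights))
            for c in range(len(vs[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row has one weight per position.
        all_weights.append(weights + [0.0] * (len(xs) - i - 1))
    return outputs, all_weights

def multi_head(xs, heads):
    """Run each head independently, then concatenate per position."""
    per_head = [causal_head(xs, *params)[0] for params in heads]
    return [
        [v for head_out in per_head for v in head_out[i]]
        for i in range(len(xs))
    ]

rng = random.Random(0)
d_model, d_head, n_heads, seq_len = 4, 2, 2, 3  # arbitrary toy sizes

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

xs = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(seq_len)]
# Each head gets its own (w_q, w_k, w_v); random stand-ins for learned weights.
heads = [
    tuple(rand_matrix(d_head, d_model) for _ in range(3))
    for _ in range(n_heads)
]

_, weights = causal_head(xs, *heads[0])
combined = multi_head(xs, heads)
print(weights[0])                       # [1.0, 0.0, 0.0]
print(len(combined), len(combined[0]))  # 3 4
```

The first position can see only itself, so its whole weight lands there. Stacking layers would mean feeding `combined`, after the omitted sublayers, back in as the next layer's `xs`.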

Attention Mechanisms, Deeper

From Bottleneck to Retrieval

How Attention Computes

Attention Inside Transformers

What Learning Tunes
