2025-08-30 01:45:00
vinithavn.medium.com
What is attention?
In any autoregressive model, the prediction of future tokens is based on some preceding context. However, not all the tokens in this context contribute equally to the prediction, because some are more relevant than others. The attention mechanism addresses this by letting the model selectively concentrate on the important context words while generating each output token. Consider a popular example used to explain the attention mechanism.
“The animal didn’t cross the street because it was too tired”.
In this sentence, the pronoun “it” could refer to either “animal” or “street”. Attention helps the model to associate “it” with “animal” rather than “street” by weighing the relative importance of each word. This helps the model to understand the relationships between words and capture the contextual meaning in various NLP tasks.
How is attention calculated?
There are various types of attention mechanisms today, beginning with Multi-Head Attention (MHA), which was introduced in the seminal Transformer paper, “Attention Is All You Need”. More recently, advanced variants like Multi-Head Latent Attention (MLA) have been employed in popular models like DeepSeek. This blog aims to cover the fundamentals of each attention mechanism, including the core ideas, advantages, and limitations.
Key Concepts in Attention Mechanisms
Before diving into specific types of attention, we need to understand some fundamental concepts that underpin all the various attention mechanisms.
The main idea behind the attention mechanism is to dynamically weigh and focus on the relevant parts of the input. Attention is used in both the encoding and decoding stages, but in this blog we will discuss it from a decoder’s point of view.
During each generation step, we need to compute the attention weights, which give us a better contextual representation for the next word prediction. At its core, attention operates through three fundamental components (queries, keys, and values) that work together with attention scores to create a flexible, context-aware vector representation.
- Query (Q): The query is a vector that represents the current token for which the model wants to compute attention.
- Key (K): Keys are vectors that represent the elements in the context against which the query is compared, to determine the relevance.
- Attention Scores: These are computed using Query and Key vectors to determine the amount of attention to be paid to each context token.
- Value (V): Values are the vectors that represent the actual contextual information. After calculating the attention scores using Query and Key vectors, these scores are applied to the Value vectors to get the final context vector.
- KV Caching: The key and value vectors of previous tokens do not change as new tokens are generated, so there is no need to recompute them at every step. KV caching stores the keys and values from previous steps and reuses them, which speeds up decoding in autoregressive models. The Query vector, however, is not cached, since it is computed for the current token only and is not needed again in later steps.
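The components listed above can be sketched for a single query in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attend`, the random inputs, and the scaling by the square root of the vector dimension (taken from the standard scaled dot-product formulation) are assumptions that are not spelled out in this post.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    """Single-query attention.

    query: (d,), keys: (n, d), values: (n, d_v) for n context tokens.
    """
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)  # one attention score per context token
    weights = softmax(scores)           # non-negative weights that sum to 1
    context = weights @ values          # weighted sum of the value vectors
    return weights, context

# Hypothetical toy inputs: 5 context tokens with 8-dimensional vectors.
rng = np.random.default_rng(0)
n, d = 5, 8
keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
query = rng.normal(size=d)

weights, context = attend(query, keys, values)
```

Here `weights` holds how much attention the current token pays to each context token, and `context` is the resulting context vector.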
To understand how each of these vectors and scores is calculated, you can refer to this blog.
The high-level concepts remain consistent across all types of attention mechanisms. However, the key difference lies in how efficiently each of them executes the attention process without compromising on performance. Innovations focus on computational speed, reducing memory usage, improving scalability across longer sequences, etc.
Now, let’s dive into each of these techniques.
Multi-Head Attention (MHA)
In multi-head attention, computing the attention weights for the i-th token starts with a query vector for that token. This query is compared with the key vectors of all the preceding tokens, and each comparison yields an attention score. The scores are normalized into weights, which are then used to form a weighted sum of the corresponding value vectors.
This process is repeated in parallel across multiple attention “heads”. Each head has its own query, key, and value projections, with which it computes the relationships between the words. The final output context vector is the concatenation of the outputs from all the attention heads.
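As a rough sketch of the description above, the following NumPy code runs causal multi-head attention over a short sequence. The function name, the weight matrices `w_q`, `w_k`, `w_v`, the toy sizes, and the scaling by the square root of the head dimension are all illustrative assumptions. Real implementations typically also apply a learned output projection after the concatenation, which is omitted here to stay close to the text.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """x: (n, d_model); w_q, w_k, w_v: (d_model, d_model)."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Per-head scores between every query and every key: (num_heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)

    # Causal mask: token i may only attend to tokens 0..i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    weights = softmax(scores)   # each row sums to 1
    heads = weights @ v         # (num_heads, n, d_head)

    # Concatenate the head outputs back into (n, d_model).
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat, weights

# Hypothetical toy sizes: 4 tokens, model width 8, 2 heads.
rng = np.random.default_rng(0)
n, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = multi_head_attention(x, w_q, w_k, w_v, num_heads)
```

Each head works on its own slice of the projected vectors, so the heads can be computed independently and in parallel.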
Now, this seems straightforward. However, as the context grows, the number of Key and Value vectors grows with it, because these vectors need to be calculated and stored for all the context tokens. For a sequence of length n, each query vector is compared against up to n key vectors and then combines up to n value vectors, so processing the whole sequence takes on the order of n² comparisons. Computation therefore grows quadratically with sequence length, and so does memory if the full matrix of attention scores is stored.
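A quick count makes the quadratic growth concrete. The helper below is purely illustrative and assumes a causal setup in which token i is compared with itself and all tokens before it.

```python
def causal_score_count(n):
    """Query-key comparisons for one head when token i attends to tokens 1..i."""
    return n * (n + 1) // 2

for n in (1_000, 2_000, 4_000):
    print(n, causal_score_count(n))
# 1000 500500
# 2000 2001000
# 4000 8002000
```

Doubling the sequence length roughly quadruples the number of comparisons.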
A KV cache reduces the computation during inference by avoiding the recomputation of keys and values, but it pays for this with memory: the cache grows linearly with sequence length, because it must hold the keys and values of all the preceding tokens, for every layer and head. The KV cache removes redundant computation, but it does not reduce the fundamental cost of attending to all the previous tokens.
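The trade-off can be sketched with a toy single-head decoder step. The `KVCache` class, `decode_step`, `kv_cache_bytes`, and all sizes below are hypothetical and chosen only for illustration; production systems preallocate and batch the cache rather than stacking arrays one row at a time.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Hypothetical minimal cache for a single attention head."""
    def __init__(self, d_head):
        self.keys = np.empty((0, d_head))
        self.values = np.empty((0, d_head))

    def append(self, k, v):
        # One new row per generated token: the cache grows linearly.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(x_t, w_q, w_k, w_v, cache):
    q = x_t @ w_q                          # query: current token only, not cached
    cache.append(x_t @ w_k, x_t @ w_v)     # compute K and V once, then reuse
    scores = cache.keys @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.values  # still attends to every cached token

def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_value):
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value

rng = np.random.default_rng(0)
d_model, d_head, n = 16, 8, 6
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
tokens = rng.normal(size=(n, d_model))

cache = KVCache(d_head)
outputs = [decode_step(x_t, w_q, w_k, w_v, cache) for x_t in tokens]

# Reference: recompute all keys and values from scratch for the last token.
q_last = tokens[-1] @ w_q
ref = softmax((tokens @ w_k) @ q_last / np.sqrt(d_head)) @ (tokens @ w_v)
```

The cached result in `outputs[-1]` matches the from-scratch `ref`, while each step only projects one new token. As a size estimate under an assumed configuration of 32 layers, 32 heads, head dimension 128, and 2-byte values, `kv_cache_bytes(32, 32, 128, 4096, 2)` comes to 2 GiB for a single 4096-token sequence.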
Models using MHA – BERT, RoBERTa, T5, etc.