Within Self attention

How Attention Resolves Ambiguous Words

Attention weights help the same word lean toward different nearby or distant clues when its meaning depends on context.

On this page

  • Why the same token can mean different things
  • How query key scores become context weights
  • The bank example and other meaning shifts
Preview for How Attention Resolves Ambiguous Words

Introduction

When a word has multiple possible meanings, attention weights help a Transformer decide which meaning fits the current context. A word such as “bank” could refer to a financial institution or the side of a river. Rather than assigning one fixed meaning to the token, the model compares it with other tokens in the sentence and gives different amounts of attention to different contextual clues. The resulting attention weights act like a relevance map, indicating which surrounding words should influence the interpretation most strongly. This is one of the key ways self-attention turns a generic token into a context-sensitive representation. [arXiv+2ApX Machine Learning]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

Word Context illustration 1 In the broader mechanism of self-attention, every token can connect directly to nearby or distant words. For ambiguous words, the important question is not whether a connection exists, but which connections receive the greatest weight. Those weights help the model select the context that resolves uncertainty and supports the correct meaning. [arXiv]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

Why the Same Token Can Mean Different Things

Many words are polysemous: they carry several related or unrelated meanings. Humans resolve this naturally through context. Modern language models do something similar by creating contextual representations rather than relying on a single stored definition for each word. Research on contextualised embeddings, including BERT-based systems, shows that the same word can occupy different regions of the model’s representation space depending on surrounding context, allowing distinct meanings to emerge from usage rather than from a fixed dictionary entry. [arXiv+2Pair Code]arxiv.orgDoes BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized EmbeddingsSeptember 23, 2019…Published: September 23, 2019

Consider these sentences:

  • “She deposited money at the bank.”
  • “They sat on the bank of the river.”

The token “bank” begins as the same vocabulary item in both cases. What changes is the context available around it. Words such as “deposited” and “money” provide evidence for one meaning, while “river” provides evidence for another. Self-attention allows the model to measure the relevance of these clues directly. [IBM]ibm.comWhat is an attention mechanism?An attention mechanism is a machine learning technique that directs deep learning models, like transfor…

The result is that the representation of “bank” after attention is no longer a generic word embedding. It becomes a context-dependent representation that reflects the specific meaning needed in that sentence. [arXiv]arxiv.orgDoes BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized EmbeddingsSeptember 23, 2019…Published: September 23, 2019

How Query-Key Scores Become Context Weights

The mechanism begins when the ambiguous token generates a query vector. Every token in the sentence also provides a key vector and a value vector. The model compares the query with each key using a similarity calculation based on scaled dot products. Larger similarity scores indicate that a token may contain information useful for interpreting the current word. [arXiv+2Dive into Deep Learning]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

The scores are then passed through a softmax function. This converts the raw scores into attention weights that sum to one. Tokens with higher relevance receive larger weights, while less relevant tokens receive smaller ones. The model then forms a weighted combination of the value vectors, allowing highly relevant context to contribute more strongly to the final representation. [DEV Community+3arXiv+3Medium]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

A simplified interpretation is:

  1. The ambiguous word asks, “What information do I need?”
  2. Other words advertise, “Here is the information I contain.”
  3. Similarity scores estimate relevance.
  4. Attention weights determine influence.
  5. The weighted information is combined into a new context-aware representation.

Because every token can be considered simultaneously, the model is not restricted to immediate neighbours. A useful clue several words away can receive more weight than a nearby but irrelevant word. [arXiv+2PixelBank]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

Word Context illustration 2

The Bank Example and Other Meaning Shifts

The “bank” example illustrates how attention can favour different clues in different contexts.

In “She deposited money at the bank”, attention may assign substantial weight to words such as “deposited” and “money”. These tokens strongly signal the financial sense. Their value vectors contribute heavily to the updated representation of “bank”. [arXiv]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

In “They sat on the bank of the river”, attention can shift towards “river”. Even though the token is identical, the attention pattern changes because the query-key similarities change. The resulting representation moves towards the geographical meaning instead of the financial one. [ApX Machine Learning]apxml.comAp X Machine Learning Scaled Dot-Product Attention Mechanism Scaled Dot-Product Attention is the core calculation mechanism for self-atteIt computes how much each element in a sequence should attend to every …Read more

The same process helps with many other ambiguities:

  • “Bat” can mean an animal or sports equipment.
  • “Crane” can refer to a bird or a machine.
  • “Pitch” can describe a musical tone, a sports field, or a sales presentation.

In each case, attention weights help identify which surrounding words contain the strongest evidence for the intended interpretation. [arXiv]arxiv.orgDoes BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized EmbeddingsSeptember 23, 2019…Published: September 23, 2019

Why Attention Is Helpful but Not the Whole Story

A common simplification is that attention alone performs word-sense disambiguation. In practice, the situation is more subtle. Research examining attention during machine translation found that attention does not always place most of its weight directly on the context words that humans might expect. Instead, important contextual information can already be encoded inside the hidden representations produced by earlier layers. [arXiv]arxiv.orgAn Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine TranslationOctober 17, 2018…Published: October 17, 2018

Studies of BERT and related models also show that some attention heads appear to specialise in linguistic relationships such as syntax, reference tracking, or structural dependencies. Different heads can focus on different aspects of the sentence, providing multiple perspectives that together support interpretation. [arXiv]arxiv.orgarXiv What Does BERT Look At? An Analysis of BERT's AttentionWhat Does BERT Look At? An Analysis of BERT's AttentionJune 11, 2019…Published: June 11, 2019

This means that attention weights are best understood as part of a larger system. They help route information from relevant context, but the model’s ability to resolve ambiguity also depends on the contextual representations learned throughout training. [arXiv]arxiv.orgAn Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine TranslationOctober 17, 2018…Published: October 17, 2018

What Changes After Attention Has Done Its Work

The most important outcome is that the model no longer treats a word as having one fixed meaning. After attention combines information from relevant context, the token becomes a contextualised representation tailored to its current usage. Two occurrences of the same word can therefore end up with very different internal representations even though they share the same vocabulary entry. [arXiv+2Pair Code]arxiv.orgDoes BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized EmbeddingsSeptember 23, 2019…Published: September 23, 2019

From the perspective of understanding artificial intelligence, this is one of the central strengths of self-attention. Attention weights allow a model to identify which contextual clues matter most for an ambiguous word and to use those clues when constructing meaning. Instead of relying on rigid definitions, the model continually reshapes word representations according to context, making language understanding far more flexible and precise. [arXiv+2IBM]arxiv.orgAttention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a…

Word Context illustration 3

Amazon book picks

Further Reading

Books and field guides related to How Attention Resolves Ambiguous Words. Use these as the next step if you want deeper reading beyond the article.

eBay marketplace picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Using USA

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/html/1706.03762v7
    Source snippet

    Attention Is All You NeedWe call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries a...

  2. Source: ibm.com
    Link: https://www.ibm.com/think/topics/attention-mechanism
    Source snippet

    What is an attention mechanism?An attention mechanism is a [machine learning]({{ 'machine-learning/' | relative_url }}) technique that directs [deep learning]({{ 'deep-learning/' | relative_url }})...

  3. Source: pixelbank.dev
    Link: https://pixelbank.dev/concepts/attention
    Source snippet

    Attention Is All You Need — TransformersSelf-attention is the core operation of the Transformer. It allows every token in a sequ...

  4. Source: arxiv.org
    Link: https://arxiv.org/abs/1909.10430
    Source snippet

    Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized EmbeddingsSeptember 23, 2019...

    Published: September 23, 2019

  5. Source: medium.com
    Link: https://medium.com/%40saraswatp/understanding-scaled-dot-product-attention-in-transformer-models-5fe02b0f150c
    Source snippet

    ghts to compute a weighted sum of the values.Read more...

  6. Source: dev.to
    Title: decodingattention is all you need 2eog
    Link: https://dev.to/kaustubhyerkade/decodingattention-is-all-you-need-2eog
    Source snippet

    DEV CommunityDecoding the "Attention Is All You Need"......27 May 2025 — Scaled Dot-Product Attention For each token, attention scores ar...

    Published: May 2025

  7. Source: arxiv.org
    Link: https://arxiv.org/abs/1810.07595
    Source snippet

    An Analysis of Attention Mechanisms: The Case of Word Sense Disambiguation in Neural Machine TranslationOctober 17, 2018...

    Published: October 17, 2018

  8. Source: arxiv.org
    Title: arXiv What Does BERT Look At? An Analysis of BERT’s Attention
    Link: https://arxiv.org/abs/1906.04341
    Source snippet

    What Does BERT Look At? An Analysis of BERT's AttentionJune 11, 2019...

    Published: June 11, 2019

  9. Source: arxiv.org
    Title: arXiv Do Attention Heads in BERT Track Syntactic Dependencies?
    Link: https://arxiv.org/abs/1911.12246

  10. Source: jvgd.medium.com
    Title: scaled dot product attention explained 3fb250f1acd1
    Link: https://jvgd.medium.com/scaled-dot-product-attention-explained-3fb250f1acd1
    Source snippet

    This key idea of this attention mechanism is to...Read more...

  11. Source: medium.com
    Title: The paper explicitly motivates this scaling. Step 3: softmax.Read more
    Link: https://medium.com/%40adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144f
    Source snippet

    Attention Is All You Need, explained like you're smart and...Because dot products grow with dimension; scaling keeps softmax gradients h...

  12. Source: kumarshivam-66534.medium.com
    Link: https://kumarshivam-66534.medium.com/mastering-attention-mechanisms-how-transformers-understand-context-in-[generative-ai
    Source snippet

    Attention Mechanisms: How Transformers...Words are converted into numerical vectors that capture semantic meaning. These vectors can com...

  13. Source: medium.com
    Link: https://medium.com/%40kavierim/transformers-from-scratch-part-2-scaled-dot-product-attention-6c0634ce79af
    Source snippet

    uld attend to every other element.Read more...

  14. Source: medium.com
    Link: https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db
    Source snippet

    Understanding BERT Transformer: Attention isn't all you needBERT uses 12 separate attention mechanism for each layer. In other cases, a g...

  15. Source: arxiv.org
    Link: https://arxiv.org/html/2312.00680v1
    Source snippet

    Contextualized Word Senses: From Attention to...The neural architectures of language models are becoming increasingly complex, especiall...

  16. Source: apxml.com
    Link: https://apxml.com/courses/introduction-to-transformer-models/chapter-2-self-attention-[multi-head
    Source snippet

    It computes how much each element in a sequence should attend to every...Read more...

  17. Source: pair-code.github.io
    Link: https://pair-code.github.io/interpretability/context-atlas/blogpost/
    Source snippet

    Language, Context, and Geometry in Neural NetworksThe crisp clusters seen in the visualizations above suggest that BERT may create simple...

  18. Source: d2l.ai
    Title: Dive into Deep Learning11.3
    Link: https://d2l.ai/chapter_attention-mechanisms-and-transformers/attention-scoring-functions.html
    Source snippet

    11.3. Attention Scoring FunctionsTo ensure that the variance of the dot product still remains 1 regardless of vector length, we use the s...

  19. Source: jaketae.github.io
    Link: https://jaketae.github.io/study/transformer/
    Source snippet

    Attention is All You Need20 Jan 2021 — The short answer is that multi-head self-attention is nothing but a parallel repetition of self-at...

Additional References

  1. Source: researchgate.net
    Link: https://www.researchgate.net/publication/334117983_An_Analysis_of_Attention_Mechanisms_The_Case_of_Word_Sense_Disambiguation_in_Neural_Machine_Translation
    Source snippet

    (PDF) An Analysis of Attention Mechanisms: The Case of...1 Nov 2018 — Existing research on the interpretability of Transformers in machi...

  2. Source: youtu.be
    Link: https://youtu.be/1il-s4mgNdI?si=XaVxj6bsdy3VkgEX
    Source snippet

    If you're interested in the herculean task of interpreting what these large networks might actually be doing, the Transformer Circuits po...

  3. Source: youtube.com
    Link: https://www.youtube.com/watch?v=0PjHri8tc1c
    Source snippet

    L19.4.2 Self-Attention and Scaled Dot-Product AttentionWe are now introducing three trainable weight matrices that are multiplied with th...

  4. Source: youtube.com
    Link: https://www.youtube.com/watch?v=eMlx5fFNoYc&vl=en
    Source snippet

    Attention in transformers, step-by-step | Deep Learning...The attention block allows the model to move information encoded in one embedd...

  5. Source: youtube.com
    Link: https://www.youtube.com/watch?v=mlk0rddP3L4&list=PLuhqtP7jdD8CftMk831qdE8BlIteSaNzD&t=0s
    Source snippet

    "✔ Complete Logistic Regression Playlist: [https://www.youtube.com/watch?v=U1omz0B9FTw&list=PLuhqtP7jdD8Chy7QIo5U0zzKP8-emLdny&t=0s..."](https://www.youtube.com/watch?v=U1omz0B9FTw&list=PLuhqtP7jdD8Chy7QIo5U0zzKP8-emLdny&t=0s...")...

  6. Source: youtube.com
    Link: https://www.youtube.com/watch?v=U1omz0B9FTw&list=PLuhqtP7jdD8Chy7QIo5U0zzKP8-emLdny&t=0s
    Source snippet

    "✔ Complete Linear Regression Playlist: [https://www.youtube.com/watch?v=nwD5U2WxTdk&list=PLuhqtP7jdD8AFocJuxC6_Zz0HepAWL9cF&t=0s..."](https://www.youtube.com/watch?v=nwD5U2WxTdk&list=PLuhqtP7jdD8AFocJuxC6_Zz0HepAWL9cF&t=0s...")...

  7. Source: aryanupadhyay.com
    Title: scaled dot product attention explained why we divide by dₖ in transformers
    Link: https://www.aryanupadhyay.com/post/scaled-dot-product-attention-explained-why-we-divide-by-d%E2%82%96-in-transformers
    Source snippet

    In the original research paper Attention Is All You Need, the core idea remains the same, but there is...Read more...

  8. Source: reddit.com
    Link: https://www.reddit.com/r/MachineLearning/comments/qidpqx/d_how_to_truly_understand_attention_mechanism_in/
    Source snippet

    However it is not that easy to fully understand, and in my opinion, somewhat unintuitive...

  9. Source: youtube.com
    Link: https://www.youtube.com/watch?v=E5Z7FQp7AQQ&list=PLuhqtP7jdD8CD6rOWy20INGM44kULvrHu&t=0s
    Source snippet

    "✔ Complete Neural Network: [https://www.youtube.com/watch?v=mlk0rddP3L4&list=PLuhqtP7jdD8CftMk831qdE8BlIteSaNzD&t=0s..."](https://www.youtube.com/watch?v=mlk0rddP3L4&list=PLuhqtP7jdD8CftMk831qdE8BlIteSaNzD&t=0s...")...

  10. Source: stats.stackexchange.com
    Title: I’ve tried searching online, but all the resources
    Link: https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms
    Source snippet

    exactly are keys, queries, and values in attention...13 Aug 2019 — How should one understand the keys, queries, and values that are ofte...

Topic Tree

Follow this branch

Parent topic

Self attention How does attention find the right context?

Related pages 2