Within Transformer shift

Why Transformers remember context differently

Transformers made next-word guesses stronger by letting tokens draw on relevant earlier context instead of relying on one compressed memory.

On this page

  • The limits of a single recurrent state
  • How attention creates shorter information paths
  • Why direct context helps next token guesses
Preview for Why Transformers remember context differently

Introduction

Language models always make predictions from context. The crucial question is how that context is represented. Before Transformers, recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks carried information forward through a single evolving hidden state: a compressed summary of everything seen so far. The Transformer changed next-token prediction by giving each token a more direct way to access relevant earlier information through self-attention rather than forcing all information through one memory bottleneck. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…Published: June 12, 2017

Direct Context illustration 1 This change mattered because many language tasks depend on relationships between words that may be far apart in a sequence. By creating direct connections between relevant tokens, Transformers shortened the path that information must travel and made it easier for models to use long-range context when predicting the next token. [NeurIPS Papers]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…

The limits of a single recurrent state

In a recurrent model, text is processed one token at a time. After reading each token, the model updates an internal state that is supposed to contain everything important from the preceding context. The next prediction is based largely on this accumulated state.

This design creates a compression problem. As a sequence grows longer, more information must be packed into the same hidden representation. Even sophisticated recurrent architectures such as LSTMs were designed partly to reduce information loss, but they still rely on a single chain of state updates. Important details from earlier in the sequence can become weakened, distorted, or overshadowed by newer information. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…Published: June 12, 2017

Consider a sentence such as:

The report that the committee reviewed after months of discussion was finally approved.

To predict a word near the end, the model may need information introduced many tokens earlier. In a recurrent system, that information must survive every intermediate update step. If the relevant signal degrades along the way, prediction quality suffers.

The challenge becomes larger in long documents, conversations, or technical texts where critical information may appear hundreds or thousands of tokens before the point where it becomes relevant.

How attention creates shorter information paths

The key insight of self-attention is that a token does not have to retrieve information through a long chain of intermediate states. Instead, it can directly calculate which earlier tokens matter most and use them when constructing its representation. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…Published: June 12, 2017

The original Transformer paper highlighted an important property: self-attention creates much shorter paths between positions in a sequence than recurrent architectures. In recurrent networks, information often travels through a chain whose length grows with sequence size. In self-attention, any two positions can interact through a constant number of computational steps. [NeurIPS Papers+2Medium]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…

Why does this matter?

When information must travel through many steps, learning the relationship becomes harder. Signals used during training have farther to travel, making long-range dependencies more difficult to learn. Shorter paths make these dependencies easier to capture. The Transformer paper explicitly identified path length as a major factor in learning long-distance relationships. [NeurIPS Papers]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…

A useful way to think about the difference is:

  • Recurrent model: information is passed along a relay race from token to token.
  • Transformer: information can be accessed through a direct lookup of relevant earlier tokens.

The second approach dramatically reduces the distance between related pieces of information.

Direct Context illustration 2

Why direct context helps next-token guesses

Next-token prediction improves when the model can identify which parts of the previous text are actually relevant to the upcoming token.

Self-attention allows the model to assign different weights to different earlier tokens. Instead of treating all past information as a single blended memory, the model can focus strongly on a few useful locations while largely ignoring irrelevant ones. [Codecademy]codecademy.comTransformer Architecture Explained With Self-Attention…Self-attention is a mechanism where each token in the input pays atte…

For example, imagine the context:

Sarah put the violin in its case before she carried it downstairs.

When predicting words related to “it”, the model benefits from linking the pronoun to “violin” rather than to every intervening word. Self-attention allows this connection to be formed directly. The model does not need to recover the relationship from a compressed summary created many steps earlier. [Medium]medium.comTransformers Explained Visually (Part 1): Overview of…The Transformer architecture uses self-attention by relating every word in…

This ability becomes especially important for:

  • Pronoun resolution.
  • Subject–verb agreement across long sentences.
  • Tracking entities across paragraphs.
  • Following instructions that reference earlier text.
  • Maintaining topic consistency over long contexts.

In each case, the next-token prediction depends on identifying a specific part of the earlier context rather than recalling a general summary.

A shift from compression to retrieval

One way to understand the Transformer’s impact is that it changed context handling from primarily compression-based memory to something closer to selective retrieval.

Recurrent models try to preserve useful information inside a continuously updated hidden state. Transformers still build internal representations, but self-attention allows tokens to retrieve information from relevant positions directly when needed. [Harvard NLP]nlp.seas.harvard.eduHarvard NLPThe Annotated TransformerApr 3, 2018 — The Transformer is the first transduction model relying entirely on self-attention to c…

This does not mean Transformers have perfect memory. Their attention mechanisms still face practical limits, especially as context windows grow very large. Standard self-attention also becomes computationally expensive because attention calculations scale quadratically with sequence length. That limitation has motivated many later efforts to make attention more efficient. [arXiv]arxiv.orgarXiv Linformer: Self-Attention with Linear ComplexityLinformer: Self-Attention with Linear ComplexityJune 8, 2020…Published: June 8, 2020

Even so, the fundamental change remained: prediction no longer depended entirely on squeezing the past into one evolving state.

Why this became a turning point

The Transformer did not alter the objective of language modelling. Models still learn by predicting the next token from previous tokens. What changed was the accessibility of contextual information.

By allowing direct interactions between relevant tokens, self-attention reduced information bottlenecks, shortened dependency paths, and made long-range relationships easier to learn. Those advantages helped Transformers generate more accurate next-token predictions across longer contexts than earlier recurrent architectures could typically manage. [arXiv+2NeurIPS Papers]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…Published: June 12, 2017

That seemingly simple change—giving tokens direct access to relevant context instead of forcing everything through a single memory stream—became one of the central reasons why Transformer-based language models surpassed previous generations of next-token predictors. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…Published: June 12, 2017

Direct Context illustration 3

Amazon book picks

Further Reading

Books and field guides related to Why Transformers remember context differently. Use these as the next step if you want deeper reading beyond the article.

eBay marketplace picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Using USA

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/abs/1706.03762
    Source snippet

    arXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b...

    Published: June 12, 2017

  2. Source: nlp.seas.harvard.edu
    Link: https://nlp.seas.harvard.edu/2018/04/03/attention.html
    Source snippet

    Harvard NLPThe Annotated TransformerApr 3, 2018 — The Transformer is the first transduction model relying entirely on self-attention to c...

  3. Source: papers.neurips.cc
    Title: 7181 attention is all you need
    Link: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
    Source snippet

    NeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye...

  4. Source: codecademy.com
    Link: https://www.codecademy.com/article/transformer-architecture-self-attention-mechanism
    Source snippet

    Transformer Architecture Explained With Self-Attention...Self-attention is a mechanism where each token in the input pays atte...

  5. Source: sh-tsang.medium.com
    Link: https://sh-tsang.medium.com/review-attention-is-all-you-need-transformer-96c787ecdec1
    Source snippet

    Attention Is All You Need (Transformer)The path length between long-range dependencies in the network. A self-attention layer connects...

  6. Source: dataturbo.medium.com
    Link: https://dataturbo.medium.com/transformer-attention-is-all-you-need-fe6205c5be33
    Source snippet

    Transformer Clear Explanation: Attention Is All You Need!“Maximum Path Length” denotes path length between long-range dependencies...

  7. Source: medium.com
    Link: https://medium.com/data-science/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452
    Source snippet

    Transformers Explained Visually (Part 1): Overview of...The Transformer architecture uses self-attention by relating every word in...

  8. Source: arxiv.org
    Title: arXiv Linformer: Self-Attention with Linear Complexity
    Link: https://arxiv.org/abs/2006.04768
    Source snippet

    Linformer: Self-Attention with Linear ComplexityJune 8, 2020...

    Published: June 8, 2020

  9. Source: arxiv.org
    Title: arXiv Generating Long Sequences with Sparse Transformers
    Link: https://arxiv.org/abs/1904.10509

  10. Source: medium.com
    Link: https://medium.com/%40richardhightower/why-language-is-hard-for-ai-and-how-transformers-changed-everything-d8a1fa299f1e
    Source snippet

    ge relationships. Modern extensions push boundaries for...Read more...

  11. Source: medium.com
    Link: https://medium.com/%40adityathiruvengadam/transformer-architecture-attention-is-all-you-need-aeccd9f50d09
    Source snippet

    Transformer Architecture: Attention Is All You NeedIt proposes to encode each position and applying the attention mechanism, to relate tw...

  12. Source: medium.com
    Link: https://medium.com/%40Andrew-Whitman/self-attention-and-the-transformer-f20010b1734e
    Source snippet

    Self-attention and the Transformer | by Andrew WhitmanThe Transformer requires less than CNNs and less than RNNs when the length of the i...

  13. Source: medium.com
    Link: https://medium.com/data-science/all-you-need-to-know-about-attention-and-transformers-in-depth-[understanding
    Source snippet

    w about the Attention mechanism including Self-Attention, Query, Keys, Values, Multi...

  14. Source: medium.com
    Link: https://medium.com/%40adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144f
    Source snippet

    ism, and reshaped modern language models. Adnan Masood, PhD.Read more...

  15. Source: medium.com
    Link: https://medium.com/data-science-collective/the-hidden-mathematics-behind-transformer-attention-why-self-attention-actually-works-c6587311bcdd
    Source snippet

    The Hidden Mathematics Behind Transformer AttentionThe O(n²) memory and computational complexity of self-attention becomes prohibitive fo...

  16. Source: medium.com
    Link: https://medium.com/%40chilldenaya/transformer-attention-is-all-you-need-a-paper-summary-d5fa82ff65de
    Source snippet

    sentence or elements in a sequence concerning each other.Read more...

  17. Source: medium.com
    Link: https://medium.com/%40marvelous_catawba_otter_200/attention-is-all-you-need-f9fe38d6e2fc
    Source snippet

    Attention Is All You Need | by Xupeng WangSelf-Attention allows any two positions in the sequence to interact directly in a single comput...

  18. Source: medium.com
    Title: Attention Is All You Need!
    Link: https://medium.com/data-science-collective/attention-is-all-you-need-661cb8db5f21
    Source snippet

    Demystifying the Transformer…Self-attention is the cornerstone of the Transformer architecture — the mechanism that allows the model to f...

  19. Source: arxiv.org
    Link: https://arxiv.org/html/1706.03762v7

  20. Source: arxiv.org
    Link: https://arxiv.org/pdf/1706.03762
    Source snippet

    1706.03762v7 [cs.CL] 2 Aug 2023by A Vaswani · 2017 · Cited by 235924 — Figure 3: An example of the attention mechanism following lo...

Additional References

  1. Source: researchgate.net
    Link: https://www.researchgate.net/publication/323598131_Self-Attention_with_Relative_Position_Representations
    Source snippet

    Self-Attention with Relative Position RepresentationsIn this work we present an alternative approach, extending the self-attention mechan...

  2. Source: researchgate.net
    Link: https://www.researchgate.net/publication/334115572_The_Annotated_Transformer
    Source snippet

    The Annotated TransformerThe notion of attention is inspired by a brain mechanism that tends to focus on distinctive parts of memory when...

  3. Source: linkedin.com
    Link: https://www.linkedin.com/posts/alexxubyte_the-most-important-paper-attention-is-all-activity-7404924500187865088-7LDp
    Source snippet

    Transformer Model Explained: Attention Is All You NeedTransformers solve this with self attention. All tokens communicate with each other...

  4. Source: pub.towardsai.net
    Link: https://pub.towardsai.net/attention-is-all-you-need-a-deep-dive-into-the-revolutionary-transformer-architecture-52734fb355dc
    Source snippet

    Deep Dive into the Revolutionary Transformer Architecture10 Apr 2025 — This paper introduced the Transformer architecture, a novel approa...

  5. Source: ai.stackexchange.com
    Title: why does the transformer do better than rnn and lstm in long range context depen
    Link: https://ai.stackexchange.com/questions/20075/why-does-the-transformer-do-better-than-rnn-and-lstm-in-long-range-context-depen
    Source snippet

    does the transformer do better than RNN and LSTM in...Apr 7, 2020 — I am reading the article How Transformers Work where the author writ...

  6. Source: openaccess.thecvf.com
    Link: https://openaccess.thecvf.com/content/WACV2024/papers/Nagar_SEMA_Semantic_Attention_for_Capturing_Long-Range_Dependencies_in_Egocentric_Lifelogs_WACV_2024_paper.pdf
    Source snippet

    Self-attention in transformers: To draw global dependen- cies between the input sequence, we take inspiration from.Read more...

  7. Source: algodaily.com
    Link: https://algodaily.com/lessons/attention-is-all-you-need-summarized
    Source snippet

    RNNs require O(n) sequential steps because each hidden state depends on the...Read more...

  8. Source: reddit.com
    Link: https://www.reddit.com/r/MachineLearning/comments/16l3vx2/discussion_question_on_the_paper_named/
    Source snippet

    [Discussion] Question on the paper named, SELF...I just read the paper named " SELF-ATTENTION DOES NOT NEED O(n 2) MEMORY" from Google...

  9. Source: packtpub.com
    Title: paper in two minutes attention is all you need
    Link: https://www.packtpub.com/en-us/learning/how-to-tutorials/paper-in-two-minutes-attention-is-all-you-need?srsltid=AfmBOophkcIOcH4nV42apMHwuemVXXUclV-cKBEuf6mtx1l5Vwuc138Y
    Source snippet

    Paper in Two minutes: Attention Is All You Need5 Apr 2018 — A self-attention layer connects all positions with a constant number of seque...

  10. Source: linkedin.com
    Link: https://www.linkedin.com/pulse/understanding-groundbreaking-attention-all-you-need-research-disansa-becnc
    Source snippet

    ly on an attention mechanism to draw global dependencies (...Read more...

Topic Tree

Follow this branch

Parent topic

Transformer shift Why attention made prediction scale

Related pages 2