Why Transformers remember context differently

Introduction

Language models always make predictions from context. The crucial question is how that context is represented. Before Transformers, recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) networks, carried information forward through a single evolving hidden state: a compressed summary of everything seen so far. The Transformer changed next-token prediction by giving each token a more direct way to reach relevant earlier information through self-attention, rather than forcing all information through one memory bottleneck. [arXiv:1706.03762]

This change mattered because many language tasks depend on relationships between words that may be far apart in a sequence. By creating direct connections between relevant tokens, Transformers shortened the path that information must travel and made it easier for models to use long-range context when predicting the next token. [NeurIPS Papers]

The limits of a single recurrent state

In a recurrent model, text is processed one token at a time. After reading each token, the model updates an internal state that is supposed to contain everything important from the preceding context. The next prediction is based largely on this accumulated state.

This design creates a compression problem. As a sequence grows longer, more information must be packed into the same fixed-size hidden representation. LSTMs were designed partly to reduce this loss through gating, but they still rely on a single chain of state updates. Important details from earlier in the sequence can become weakened, distorted, or overshadowed by newer information. [arXiv:1706.03762]

Consider a sentence such as:

The report that the committee reviewed after months of discussion was finally approved.

To predict a word near the end, the model may need information introduced many tokens earlier. In a recurrent system, that information must survive every intermediate update step. If the relevant signal degrades along the way, prediction quality suffers.

The challenge becomes larger in long documents, conversations, or technical texts where critical information may appear hundreds or thousands of tokens before the point where it becomes relevant.
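
The decay described above can be illustrated with a deliberately simplified sketch. It assumes a scalar, linear recurrence h_t = retain * h_(t-1) + x_t, where `retain` is a made-up constant standing in for a learned recurrent weight; real RNNs and LSTMs use vectors, nonlinearities, and gates, so this shows only the general shape of the problem.

```python
def recurrent_influence(num_steps, retain=0.9):
    """Share of the first token's signal left in the state after num_steps updates.

    Toy scalar recurrence h_t = retain * h_{t-1} + x_t. `retain` is a
    hypothetical constant, not a value taken from any real model.
    """
    h = 1.0  # state right after reading the first token (x_1 = 1)
    for _ in range(num_steps):
        h = retain * h  # later inputs are set to 0 to isolate the first token
    return h


for gap in (1, 10, 50):
    print(f"gap={gap:>2}  remaining signal={recurrent_influence(gap):.4f}")
```

Under these assumptions the remaining signal shrinks geometrically with the number of intermediate updates, which is the intuition behind the claim that early information has to survive every step.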

How attention creates shorter information paths

The key insight of self-attention is that a token does not have to retrieve information through a long chain of intermediate states. Instead, it can directly calculate which earlier tokens matter most and use them when constructing its representation. [arXiv:1706.03762]

The original Transformer paper highlighted an important property: self-attention creates much shorter paths between positions in a sequence than recurrent architectures. In a recurrent network, information travels through a chain of state updates whose length grows with the distance between the two positions. Within a self-attention layer, any two positions are connected by a constant number of sequential operations. [NeurIPS Papers, Medium]

Why does this matter?

When information must travel through many steps, learning the relationship becomes harder. The forward and backward signals used during training have farther to travel, which makes long-range dependencies more difficult to learn. Shorter paths make these dependencies easier to capture. The Transformer paper explicitly identified path length as a major factor in learning long-distance relationships. [NeurIPS Papers]

A useful way to think about the difference is:

Recurrent model: information is passed along a relay race from token to token.
Transformer: information can be accessed through a direct lookup of relevant earlier tokens.

The second approach dramatically reduces the distance between related pieces of information.
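
The relay-versus-lookup contrast can be put into numbers with a small sketch. The two helper functions are hypothetical names introduced here for illustration; they count sequential steps between two token positions under the simplifying assumption of a single recurrent layer and a single self-attention layer.

```python
def recurrent_path_length(i, j):
    """Sequential state updates separating positions i and j in a recurrent chain."""
    return abs(j - i)


def attention_path_length(i, j):
    """Within one self-attention layer, any position reaches any other in one step."""
    return 0 if i == j else 1


# Compare how the path grows as the two tokens move farther apart.
for distance in (1, 10, 100, 1000):
    relay = recurrent_path_length(0, distance)
    lookup = attention_path_length(0, distance)
    print(f"distance={distance:>4}  recurrent={relay:>4}  self-attention={lookup}")
```

The recurrent path grows with the distance between tokens, while the self-attention path stays constant, which matches the path-length comparison attributed to the Transformer paper above.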

Why direct context helps next-token guesses

Next-token prediction improves when the model can identify which parts of the previous text are actually relevant to the upcoming token.

Self-attention allows the model to assign different weights to different earlier tokens. Instead of treating all past information as a single blended memory, the model can focus strongly on a few useful locations while largely ignoring irrelevant ones. [Codecademy]
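
As a rough illustration of this weighting, here is a minimal sketch of scaled dot-product attention for a single query, using only the Python standard library. The query, key, and value vectors are hand-picked toy numbers, not learned parameters, and the sketch omits the learned projections and multiple heads of a real Transformer layer.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product scores of one query against each earlier key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def attend(query, keys, values):
    """Weighted average of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]


# Toy example: three earlier tokens; the first key points the same way as the query.
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [4.0, 0.0]

print([round(w, 4) for w in attention_weights(query, keys)])
print([round(x, 4) for x in attend(query, keys, values)])
```

With these toy vectors almost all of the weight falls on the first earlier token, so the output is close to that token's value vector. In a trained model the vectors are learned, and the weights are usually less extreme.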

For example, imagine the context:

Sarah put the violin in its case before she carried it downstairs.

When predicting words related to “it”, the model benefits from linking the pronoun to “violin” rather than to every intervening word. Self-attention allows this connection to be formed directly. The model does not need to recover the relationship from a compressed summary created many steps earlier. [Medium]

This ability becomes especially important for:

Pronoun resolution.
Subject–verb agreement across long sentences.
Tracking entities across paragraphs.
Following instructions that reference earlier text.
Maintaining topic consistency over long contexts.

In each case, the next-token prediction depends on identifying a specific part of the earlier context rather than recalling a general summary.

A shift from compression to retrieval

One way to understand the Transformer’s impact is that it changed context handling from primarily compression-based memory to something closer to selective retrieval.

Recurrent models try to preserve useful information inside a continuously updated hidden state. Transformers still build internal representations, but self-attention allows tokens to retrieve information from relevant positions directly when needed. [Harvard NLP]

This does not mean Transformers have perfect memory. Their attention mechanisms still face practical limits, especially as context windows grow very large. Standard self-attention also becomes computationally expensive because attention calculations scale quadratically with sequence length. That limitation has motivated many later efforts to make attention more efficient. [arXiv:2006.04768]
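
The quadratic growth can be made concrete with simple counting. The sketch below assumes one attention head in which every query is scored against every key; the function names are illustrative, and the causal variant counts only the pairs where a token looks at itself or an earlier token.

```python
def attention_score_count(seq_len):
    """Query-key scores when every token attends to every token."""
    return seq_len * seq_len


def causal_score_count(seq_len):
    """Scores when each token attends only to itself and earlier tokens."""
    return seq_len * (seq_len + 1) // 2


for n in (1_000, 2_000, 4_000):
    print(f"n={n:>5}  full={attention_score_count(n):>10,}  "
          f"causal={causal_score_count(n):>10,}")
```

Doubling the sequence length roughly quadruples the number of scores in both cases, so the causal restriction lowers the constant but not the quadratic growth.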

Even so, the fundamental change remained: prediction no longer depended entirely on squeezing the past into one evolving state.

Why this became a turning point

The Transformer did not alter the objective of language modelling. Models still learn by predicting the next token from previous tokens. What changed was the accessibility of contextual information.
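
To make the unchanged objective concrete, the following sketch splits a token sequence into (context, next token) training pairs. The helper name and the word-level tokens are assumptions made for illustration; real systems use subword tokenizers and process many positions in parallel.

```python
def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


sentence = ["Sarah", "put", "the", "violin", "in", "its", "case"]
for context, target in next_token_pairs(sentence):
    print(f"{' '.join(context):<32} -> {target}")
```

Both recurrent models and Transformers can be trained on pairs like these. The difference discussed in this article lies in how the context on the left-hand side is made available to the model.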

By allowing direct interactions between relevant tokens, self-attention reduced information bottlenecks, shortened dependency paths, and made long-range relationships easier to learn. Those advantages helped Transformers generate more accurate next-token predictions across longer contexts than earlier recurrent architectures could typically manage. [arXiv:1706.03762, NeurIPS Papers]

That seemingly simple change, giving tokens direct access to relevant context instead of forcing everything through a single memory stream, became one of the central reasons why Transformer-based language models surpassed previous generations of next-token predictors. [arXiv:1706.03762]


Endnotes

Source: arxiv.org
Link: https://arxiv.org/abs/1706.03762
Source snippet
arXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b...

Published: June 12, 2017
Source: nlp.seas.harvard.edu
Link: https://nlp.seas.harvard.edu/2018/04/03/attention.html
Source snippet
Harvard NLPThe Annotated TransformerApr 3, 2018 — The Transformer is the first transduction model relying entirely on self-attention to c...
Source: papers.neurips.cc
Title: 7181 attention is all you need
Link: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
Source snippet
NeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye...
Source: codecademy.com
Link: https://www.codecademy.com/article/transformer-architecture-self-attention-mechanism
Source snippet
Transformer Architecture Explained With Self-Attention...Self-attention is a mechanism where each token in the input pays atte...
Source: sh-tsang.medium.com
Link: https://sh-tsang.medium.com/review-attention-is-all-you-need-transformer-96c787ecdec1
Source snippet
Attention Is All You Need (Transformer)The path length between long-range dependencies in the network. A self-attention layer connects...
Source: dataturbo.medium.com
Link: https://dataturbo.medium.com/transformer-attention-is-all-you-need-fe6205c5be33
Source snippet
Transformer Clear Explanation: Attention Is All You Need!“Maximum Path Length” denotes path length between long-range dependencies...
Source: medium.com
Link: https://medium.com/data-science/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452
Source snippet
Transformers Explained Visually (Part 1): Overview of...The Transformer architecture uses self-attention by relating every word in...
Source: arxiv.org
Title: arXiv Linformer: Self-Attention with Linear Complexity
Link: https://arxiv.org/abs/2006.04768
Source snippet
Linformer: Self-Attention with Linear ComplexityJune 8, 2020...

Published: June 8, 2020
Source: arxiv.org
Title: arXiv Generating Long Sequences with Sparse Transformers
Link: https://arxiv.org/abs/1904.10509
Source: medium.com
Link: https://medium.com/%40richardhightower/why-language-is-hard-for-ai-and-how-transformers-changed-everything-d8a1fa299f1e
Source snippet
ge relationships. Modern extensions push boundaries for...Read more...
Source: medium.com
Link: https://medium.com/%40adityathiruvengadam/transformer-architecture-attention-is-all-you-need-aeccd9f50d09
Source snippet
Transformer Architecture: Attention Is All You NeedIt proposes to encode each position and applying the attention mechanism, to relate tw...
Source: medium.com
Link: https://medium.com/%40Andrew-Whitman/self-attention-and-the-transformer-f20010b1734e
Source snippet
Self-attention and the Transformer | by Andrew WhitmanThe Transformer requires less than CNNs and less than RNNs when the length of the i...
Source: medium.com
Link: https://medium.com/data-science/all-you-need-to-know-about-attention-and-transformers-in-depth-[understanding
Source snippet
w about the Attention mechanism including Self-Attention, Query, Keys, Values, Multi...
Source: medium.com
Link: https://medium.com/%40adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144f
Source snippet
ism, and reshaped modern language models. Adnan Masood, PhD.Read more...
Source: medium.com
Link: https://medium.com/data-science-collective/the-hidden-mathematics-behind-transformer-attention-why-self-attention-actually-works-c6587311bcdd
Source snippet
The Hidden Mathematics Behind Transformer AttentionThe O(n²) memory and computational complexity of self-attention becomes prohibitive fo...
Source: medium.com
Link: https://medium.com/%40chilldenaya/transformer-attention-is-all-you-need-a-paper-summary-d5fa82ff65de
Source snippet
sentence or elements in a sequence concerning each other.Read more...
Source: medium.com
Link: https://medium.com/%40marvelous_catawba_otter_200/attention-is-all-you-need-f9fe38d6e2fc
Source snippet
Attention Is All You Need | by Xupeng WangSelf-Attention allows any two positions in the sequence to interact directly in a single comput...
Source: medium.com
Title: Attention Is All You Need!
Link: https://medium.com/data-science-collective/attention-is-all-you-need-661cb8db5f21
Source snippet
Demystifying the Transformer…Self-attention is the cornerstone of the Transformer architecture — the mechanism that allows the model to f...
Source: arxiv.org
Link: https://arxiv.org/html/1706.03762v7
Source: arxiv.org
Link: https://arxiv.org/pdf/1706.03762
Source snippet
1706.03762v7 [cs.CL] 2 Aug 2023by A Vaswani · 2017 · Cited by 235924 — Figure 3: An example of the attention mechanism following lo...

