Within Transformer shift
Why Transformers remember context differently
Transformers made next-word guesses stronger by letting tokens draw on relevant earlier context instead of relying on one compressed memory.
On this page
- The limits of a single recurrent state
- How attention creates shorter information paths
- Why direct context helps next token guesses
Page outline Jump by section
Introduction
Language models always make predictions from context. The crucial question is how that context is represented. Before Transformers, recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks carried information forward through a single evolving hidden state: a compressed summary of everything seen so far. The Transformer changed next-token prediction by giving each token a more direct way to access relevant earlier information through self-attention rather than forcing all information through one memory bottleneck. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…
This change mattered because many language tasks depend on relationships between words that may be far apart in a sequence. By creating direct connections between relevant tokens, Transformers shortened the path that information must travel and made it easier for models to use long-range context when predicting the next token. [NeurIPS Papers]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…
The limits of a single recurrent state
In a recurrent model, text is processed one token at a time. After reading each token, the model updates an internal state that is supposed to contain everything important from the preceding context. The next prediction is based largely on this accumulated state.
This design creates a compression problem. As a sequence grows longer, more information must be packed into the same hidden representation. Even sophisticated recurrent architectures such as LSTMs were designed partly to reduce information loss, but they still rely on a single chain of state updates. Important details from earlier in the sequence can become weakened, distorted, or overshadowed by newer information. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…
Consider a sentence such as:
The report that the committee reviewed after months of discussion was finally approved.
To predict a word near the end, the model may need information introduced many tokens earlier. In a recurrent system, that information must survive every intermediate update step. If the relevant signal degrades along the way, prediction quality suffers.
The challenge becomes larger in long documents, conversations, or technical texts where critical information may appear hundreds or thousands of tokens before the point where it becomes relevant.
How attention creates shorter information paths
The key insight of self-attention is that a token does not have to retrieve information through a long chain of intermediate states. Instead, it can directly calculate which earlier tokens matter most and use them when constructing its representation. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…
The original Transformer paper highlighted an important property: self-attention creates much shorter paths between positions in a sequence than recurrent architectures. In recurrent networks, information often travels through a chain whose length grows with sequence size. In self-attention, any two positions can interact through a constant number of computational steps. [NeurIPS Papers+2Medium]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…
Why does this matter?
When information must travel through many steps, learning the relationship becomes harder. Signals used during training have farther to travel, making long-range dependencies more difficult to learn. Shorter paths make these dependencies easier to capture. The Transformer paper explicitly identified path length as a major factor in learning long-distance relationships. [NeurIPS Papers]papers.neurips.cc7181 attention is all you needNeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye…
A useful way to think about the difference is:
- Recurrent model: information is passed along a relay race from token to token.
- Transformer: information can be accessed through a direct lookup of relevant earlier tokens.
The second approach dramatically reduces the distance between related pieces of information.
Why direct context helps next-token guesses
Next-token prediction improves when the model can identify which parts of the previous text are actually relevant to the upcoming token.
Self-attention allows the model to assign different weights to different earlier tokens. Instead of treating all past information as a single blended memory, the model can focus strongly on a few useful locations while largely ignoring irrelevant ones. [Codecademy]codecademy.comTransformer Architecture Explained With Self-Attention…Self-attention is a mechanism where each token in the input pays atte…
For example, imagine the context:
Sarah put the violin in its case before she carried it downstairs.
When predicting words related to “it”, the model benefits from linking the pronoun to “violin” rather than to every intervening word. Self-attention allows this connection to be formed directly. The model does not need to recover the relationship from a compressed summary created many steps earlier. [Medium]medium.comTransformers Explained Visually (Part 1): Overview of…The Transformer architecture uses self-attention by relating every word in…
This ability becomes especially important for:
- Pronoun resolution.
- Subject–verb agreement across long sentences.
- Tracking entities across paragraphs.
- Following instructions that reference earlier text.
- Maintaining topic consistency over long contexts.
In each case, the next-token prediction depends on identifying a specific part of the earlier context rather than recalling a general summary.
A shift from compression to retrieval
One way to understand the Transformer’s impact is that it changed context handling from primarily compression-based memory to something closer to selective retrieval.
Recurrent models try to preserve useful information inside a continuously updated hidden state. Transformers still build internal representations, but self-attention allows tokens to retrieve information from relevant positions directly when needed. [Harvard NLP]nlp.seas.harvard.eduHarvard NLPThe Annotated TransformerApr 3, 2018 — The Transformer is the first transduction model relying entirely on self-attention to c…
This does not mean Transformers have perfect memory. Their attention mechanisms still face practical limits, especially as context windows grow very large. Standard self-attention also becomes computationally expensive because attention calculations scale quadratically with sequence length. That limitation has motivated many later efforts to make attention more efficient. [arXiv]arxiv.orgarXiv Linformer: Self-Attention with Linear ComplexityLinformer: Self-Attention with Linear ComplexityJune 8, 2020…
Even so, the fundamental change remained: prediction no longer depended entirely on squeezing the past into one evolving state.
Why this became a turning point
The Transformer did not alter the objective of language modelling. Models still learn by predicting the next token from previous tokens. What changed was the accessibility of contextual information.
By allowing direct interactions between relevant tokens, self-attention reduced information bottlenecks, shortened dependency paths, and made long-range relationships easier to learn. Those advantages helped Transformers generate more accurate next-token predictions across longer contexts than earlier recurrent architectures could typically manage. [arXiv+2NeurIPS Papers]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…
That seemingly simple change—giving tokens direct access to relevant context instead of forcing everything through a single memory stream—became one of the central reasons why Transformer-based language models surpassed previous generations of next-token predictors. [arXiv]arxiv.orgarXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b…
Amazon book picks
Further Reading
Books and field guides related to Why Transformers remember context differently. Use these as the next step if you want deeper reading beyond the article.
Hands-On Large Language Models
Explains how attention lets models access relevant context across long sequences.
Natural Language Processing with Transformers
Directly addresses attention mechanisms and context handling in transformer models.
Build a Large Language Model (From Scratch)
Covers context representation, attention layers, and transformer internals.
Transformers for Natural Language Processing
Discusses self-attention and the advantages over recurrent memory bottlenecks.
Endnotes
-
Source: arxiv.org
Link: https://arxiv.org/abs/1706.03762Source snippet
arXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b...
Published: June 12, 2017
-
Source: nlp.seas.harvard.edu
Link: https://nlp.seas.harvard.edu/2018/04/03/attention.htmlSource snippet
Harvard NLPThe Annotated TransformerApr 3, 2018 — The Transformer is the first transduction model relying entirely on self-attention to c...
-
Source: papers.neurips.cc
Title: 7181 attention is all you need
Link: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdfSource snippet
NeurIPS PapersAttention is All you Needby A Vaswani · Cited by 245639 — In this section we compare various aspects of self-attention laye...
-
Source: codecademy.com
Link: https://www.codecademy.com/article/transformer-architecture-self-attention-mechanismSource snippet
Transformer Architecture Explained With Self-Attention...Self-attention is a mechanism where each token in the input pays atte...
-
Source: sh-tsang.medium.com
Link: https://sh-tsang.medium.com/review-attention-is-all-you-need-transformer-96c787ecdec1Source snippet
Attention Is All You Need (Transformer)The path length between long-range dependencies in the network. A self-attention layer connects...
-
Source: dataturbo.medium.com
Link: https://dataturbo.medium.com/transformer-attention-is-all-you-need-fe6205c5be33Source snippet
Transformer Clear Explanation: Attention Is All You Need!“Maximum Path Length” denotes path length between long-range dependencies...
-
Source: medium.com
Link: https://medium.com/data-science/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452Source snippet
Transformers Explained Visually (Part 1): Overview of...The Transformer architecture uses self-attention by relating every word in...
-
Source: arxiv.org
Title: arXiv Linformer: Self-Attention with Linear Complexity
Link: https://arxiv.org/abs/2006.04768Source snippet
Linformer: Self-Attention with Linear ComplexityJune 8, 2020...
Published: June 8, 2020
-
Source: arxiv.org
Title: arXiv Generating Long Sequences with Sparse Transformers
Link: https://arxiv.org/abs/1904.10509 -
Source: medium.com
Link: https://medium.com/%40richardhightower/why-language-is-hard-for-ai-and-how-transformers-changed-everything-d8a1fa299f1eSource snippet
ge relationships. Modern extensions push boundaries for...Read more...
-
Source: medium.com
Link: https://medium.com/%40adityathiruvengadam/transformer-architecture-attention-is-all-you-need-aeccd9f50d09Source snippet
Transformer Architecture: Attention Is All You NeedIt proposes to encode each position and applying the attention mechanism, to relate tw...
-
Source: medium.com
Link: https://medium.com/%40Andrew-Whitman/self-attention-and-the-transformer-f20010b1734eSource snippet
Self-attention and the Transformer | by Andrew WhitmanThe Transformer requires less than CNNs and less than RNNs when the length of the i...
-
Source: medium.com
Link: https://medium.com/data-science/all-you-need-to-know-about-attention-and-transformers-in-depth-[understandingSource snippet
w about the Attention mechanism including Self-Attention, Query, Keys, Values, Multi...
-
Source: medium.com
Link: https://medium.com/%40adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144fSource snippet
ism, and reshaped modern language models. Adnan Masood, PhD.Read more...
-
Source: medium.com
Link: https://medium.com/data-science-collective/the-hidden-mathematics-behind-transformer-attention-why-self-attention-actually-works-c6587311bcddSource snippet
The Hidden Mathematics Behind Transformer AttentionThe O(n²) memory and computational complexity of self-attention becomes prohibitive fo...
-
Source: medium.com
Link: https://medium.com/%40chilldenaya/transformer-attention-is-all-you-need-a-paper-summary-d5fa82ff65deSource snippet
sentence or elements in a sequence concerning each other.Read more...
-
Source: medium.com
Link: https://medium.com/%40marvelous_catawba_otter_200/attention-is-all-you-need-f9fe38d6e2fcSource snippet
Attention Is All You Need | by Xupeng WangSelf-Attention allows any two positions in the sequence to interact directly in a single comput...
-
Source: medium.com
Title: Attention Is All You Need!
Link: https://medium.com/data-science-collective/attention-is-all-you-need-661cb8db5f21Source snippet
Demystifying the Transformer…Self-attention is the cornerstone of the Transformer architecture — the mechanism that allows the model to f...
-
Source: arxiv.org
Link: https://arxiv.org/html/1706.03762v7 -
Source: arxiv.org
Link: https://arxiv.org/pdf/1706.03762Source snippet
1706.03762v7 [cs.CL] 2 Aug 2023by A Vaswani · 2017 · Cited by 235924 — Figure 3: An example of the attention mechanism following lo...
Additional References
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/323598131_Self-Attention_with_Relative_Position_RepresentationsSource snippet
Self-Attention with Relative Position RepresentationsIn this work we present an alternative approach, extending the self-attention mechan...
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/334115572_The_Annotated_TransformerSource snippet
The Annotated TransformerThe notion of attention is inspired by a brain mechanism that tends to focus on distinctive parts of memory when...
-
Source: linkedin.com
Link: https://www.linkedin.com/posts/alexxubyte_the-most-important-paper-attention-is-all-activity-7404924500187865088-7LDpSource snippet
Transformer Model Explained: Attention Is All You NeedTransformers solve this with self attention. All tokens communicate with each other...
-
Source: pub.towardsai.net
Link: https://pub.towardsai.net/attention-is-all-you-need-a-deep-dive-into-the-revolutionary-transformer-architecture-52734fb355dcSource snippet
Deep Dive into the Revolutionary Transformer Architecture10 Apr 2025 — This paper introduced the Transformer architecture, a novel approa...
-
Source: ai.stackexchange.com
Title: why does the transformer do better than rnn and lstm in long range context depen
Link: https://ai.stackexchange.com/questions/20075/why-does-the-transformer-do-better-than-rnn-and-lstm-in-long-range-context-depenSource snippet
does the transformer do better than RNN and LSTM in...Apr 7, 2020 — I am reading the article How Transformers Work where the author writ...
-
Source: openaccess.thecvf.com
Link: https://openaccess.thecvf.com/content/WACV2024/papers/Nagar_SEMA_Semantic_Attention_for_Capturing_Long-Range_Dependencies_in_Egocentric_Lifelogs_WACV_2024_paper.pdfSource snippet
Self-attention in transformers: To draw global dependen- cies between the input sequence, we take inspiration from.Read more...
-
Source: algodaily.com
Link: https://algodaily.com/lessons/attention-is-all-you-need-summarizedSource snippet
RNNs require O(n) sequential steps because each hidden state depends on the...Read more...
-
Source: reddit.com
Link: https://www.reddit.com/r/MachineLearning/comments/16l3vx2/discussion_question_on_the_paper_named/Source snippet
[Discussion] Question on the paper named, SELF...I just read the paper named " SELF-ATTENTION DOES NOT NEED O(n 2) MEMORY" from Google...
-
Source: packtpub.com
Title: paper in two minutes attention is all you need
Link: https://www.packtpub.com/en-us/learning/how-to-tutorials/paper-in-two-minutes-attention-is-all-you-need?srsltid=AfmBOophkcIOcH4nV42apMHwuemVXXUclV-cKBEuf6mtx1l5Vwuc138YSource snippet
Paper in Two minutes: Attention Is All You Need5 Apr 2018 — A self-attention layer connects all positions with a constant number of seque...
-
Source: linkedin.com
Link: https://www.linkedin.com/pulse/understanding-groundbreaking-attention-all-you-need-research-disansa-becncSource snippet
ly on an attention mechanism to draw global dependencies (...Read more...
Topic Tree



