Why faster training still hit a length limit

Introduction

Parallel training was the breakthrough that made Transformers practical to scale, but it did not eliminate every computational bottleneck. The same self-attention mechanism that allows a Transformer to process an entire sequence simultaneously also creates a growing cost as sequences become longer. A model can analyse thousands of words, lines of code, or conversation turns in parallel, yet the amount of work required by attention rises much faster than the length of the input. This is why extending context windows from a few thousand tokens to hundreds of thousands or even millions remains one of the most challenging engineering problems in modern AI. [arXiv 1706.03762, Attention Is All You Need, June 12, 2017]

The result is a central trade-off in Transformer design: the architecture scales exceptionally well across hardware, but long contexts place increasing pressure on computation, memory, and training budgets. Understanding this limitation helps explain why so much recent AI research focuses on efficient attention methods rather than simply making models larger. [hazyresearch.stanford.edu, Fast Transformer Training with Long Sequences, Jan 13, 2023]

Why attention cost grows with sequence length

The core issue comes from how self-attention works. Each token compares itself with every other token in the sequence to determine which pieces of information matter most. If a sequence contains 100 tokens, there are roughly 10,000 pairwise relationships. If the sequence grows to 1,000 tokens, the number of potential interactions rises to about one million. The growth is quadratic: doubling the sequence length roughly quadruples the amount of attention computation. [Shreyansh Singh, Paper Summary #7 - Efficient Transformers: A Survey, 10 Oct 2022; aiplanet.com]
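
The arithmetic behind those figures can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `attention_pairs` is made up for this example, and it assumes full (unmasked) self-attention in which every token is scored against every token, itself included.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in full (unmasked) self-attention."""
    return seq_len * seq_len


# Hypothetical sequence lengths, chosen to match the figures in the text.
for n in (100, 1_000, 2_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>12,} pairwise scores")
```

Going from 1,000 to 2,000 tokens doubles the input but yields four times as many score entries, which is the quadratic growth described above.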

This behaviour differs from the scaling story that made Transformers attractive in the first place. Training can be distributed across many GPUs because attention calculations are highly parallelisable matrix operations. However, parallel hardware does not change the underlying mathematics. Longer sequences still create many more token-to-token comparisons that must be computed somewhere. [arXiv 1706.03762, Attention Is All You Need, June 12, 2017]

A useful way to think about the problem is that attention treats a document as a dense network of relationships. Every new token potentially connects to all previous tokens. As context windows expand from 4,000 tokens to 32,000, 128,000, or beyond, the number of possible relationships explodes much faster than the visible increase in text length. [Shreyansh Singh, Paper Summary #7 - Efficient Transformers: A Survey, 10 Oct 2022]

Why longer prompts become disproportionately expensive

Many readers assume that processing a prompt twice as long should require roughly twice as much work. With standard self-attention, that intuition is wrong.

For attention-heavy workloads:

Doubling sequence length can roughly quadruple attention operations.
Four times the sequence length can require around sixteen times the attention work.
Memory requirements often grow alongside computation because attention scores and intermediate states must be stored during training. [arXiv 2107.05768, Full Attention Transformer with Sparse Computation Cost, 2021; Shreyansh Singh]
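
What counts as "attention-heavy" can be made a little more concrete with a rough operation count. The sketch below is an approximation under stated assumptions, not a profiler: it counts a multiply-add as 2 FLOPs, covers only the forward pass of one self-attention layer, ignores softmax, masking, normalisation, and the feed-forward block, and uses a hypothetical model width `d_model`. The function name `attention_flops` is invented for this example.

```python
def attention_flops(seq_len: int, d_model: int) -> dict:
    """Rough forward-pass FLOPs for one self-attention layer.

    Counts a multiply-add as 2 FLOPs and ignores softmax, masking,
    normalisation, and the feed-forward block.
    """
    # Q, K, V and output projections: four (seq_len x d_model) @ (d_model x d_model).
    projections = 8 * seq_len * d_model ** 2
    # Q @ K^T plus attention weights @ V, summed over all heads.
    score_and_mix = 4 * seq_len ** 2 * d_model
    return {"projections": projections, "score_and_mix": score_and_mix}


d_model = 4_096  # hypothetical model width, chosen only for illustration
for n in (2_048, 4_096, 8_192, 16_384):
    flops = attention_flops(n, d_model)
    share = flops["score_and_mix"] / (flops["projections"] + flops["score_and_mix"])
    print(f"n={n:>6}: quadratic term is {share:.0%} of attention-layer FLOPs")
```

Under these assumptions the projection cost doubles when the sequence doubles, while the score-and-mix cost quadruples, so the quadratic term takes a steadily larger share of the layer as contexts grow.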

This scaling behaviour explains why extending context windows is not merely a matter of buying proportionally more hardware. The costs can accelerate much faster than the apparent increase in usable context. [Michael Brenndoerfer, Quadratic Attention Bottleneck, 19 Jun 2025]

Memory pressure in long-input Transformer training

Computation is only part of the challenge. Training large Transformer models also requires storing activations, gradients, and attention-related intermediate values. As sequence length grows, memory consumption can become the limiting factor before raw processing speed does. [arXiv 2205.14135, Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022]

This memory pressure creates several practical constraints:

Smaller batch sizes may be required to fit long sequences into GPU memory.
Training throughput often falls because hardware spends more time moving data.
Additional GPUs may be needed simply to hold model state and attention information rather than perform new computation. [hazyresearch.stanford.edu, Fast Transformer Training with Long Sequences, Jan 13, 2023]
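
A back-of-the-envelope estimate shows how quickly the attention scores alone can fill accelerator memory. The sketch assumes a straightforward implementation that materialises the full score matrix for every head, stores each value in 2 bytes (16-bit floats), and looks at a single layer and a single sequence. The head count of 32 and the function name `score_matrix_bytes` are hypothetical choices for illustration.

```python
def score_matrix_bytes(seq_len: int, num_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one layer's attention scores for one sequence.

    Assumes the full seq_len x seq_len matrix is materialised per head,
    as in a straightforward (non-fused) attention implementation.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


GIB = 1024 ** 3
for n in (4_096, 32_768, 131_072):
    gib = score_matrix_bytes(n, num_heads=32) / GIB
    print(f"{n:>7} tokens: {gib:>8.1f} GiB per layer, per sequence")
```

With these assumed settings the score matrices take 1 GiB at 4,096 tokens, 64 GiB at 32,768 tokens, and 1,024 GiB at 131,072 tokens, before counting any weights, gradients, or other activations.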

Researchers working on long-context language models frequently report that memory becomes a dominant engineering concern. Even when sufficient computing power exists, storing the information required by attention can prevent straightforward scaling to longer documents. [arXiv 2310.03294, DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training, October 5, 2023]

The distinction matters because parallel training solved a different problem. Transformers removed the sequential dependency chain that limited recurrent neural networks. They did not remove the need to represent interactions across an entire context window. As contexts become larger, that interaction structure itself becomes expensive. [arXiv 1706.03762, Attention Is All You Need, June 12, 2017]

How newer parallelism and efficient attention try to help

The industry’s response has largely been to make attention more efficient rather than abandon the Transformer entirely.

One influential example is FlashAttention, which reorganises attention computations to reduce costly memory transfers between different levels of GPU memory. The key insight is that modern accelerators often spend significant time moving data rather than performing arithmetic. By processing the computation in blocks instead of storing the full attention matrix, FlashAttention brings attention memory down from quadratic to linear in sequence length and accelerates training while preserving exact attention behaviour. The number of token-to-token comparisons is unchanged; what improves is how they are scheduled and stored. [arXiv 2205.14135, Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022; OpenReview]
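
The idea of getting exact attention without holding every score at once can be illustrated in plain Python. What follows is a minimal sketch of blockwise attention with a running ("online") softmax, not the FlashAttention kernel itself, which is a fused GPU implementation. The function names, the single-query simplification, the toy vectors, and the omission of the usual scaling factor are all assumptions made to keep the example short.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Single-query attention that materialises every score at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time.

    Only a running maximum, a running normaliser, and a running weighted
    sum are kept, so the full row of scores is never stored.
    """
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Toy inputs, invented for illustration only.
q = [0.1, -0.3, 0.5]
keys = [[0.2, 0.1, -0.4], [1.0, 0.0, 0.3], [-0.5, 0.7, 0.2], [0.3, 0.3, 0.3], [0.0, -1.0, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(full, tiled)))
```

Both functions perform the same number of query-key comparisons. The blockwise version changes only how much has to be held in memory at any moment, which is the sense in which this family of methods is exact rather than approximate.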

Researchers have also explored distributed attention systems that spread long-context processing across multiple devices. Methods such as DistFlashAttn aim to support sequence lengths far beyond what a single GPU could comfortably handle. [arXiv 2310.03294, DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training, October 5, 2023]

Another major research direction is reducing the number of token interactions that must be calculated at all. Common approaches include:

Sparse attention, where tokens attend only to selected parts of the sequence.
Linear attention variants, which replace full pairwise comparison with more efficient approximations.
Memory and retrieval mechanisms, which try to access only relevant information from very long histories. [Medium, Making Transformers Efficient for Long Sequences]
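
To see how much work a sparse pattern can save, the sketch below compares full causal attention with one simple sparse scheme, a sliding window in which each token attends only to itself and a fixed number of preceding tokens. The window size of 1,024 and both function names are hypothetical, and real sparse-attention designs usually combine several patterns rather than this one alone.

```python
def full_causal_pairs(seq_len: int) -> int:
    """Query-key pairs when each token attends to itself and all earlier tokens."""
    return seq_len * (seq_len + 1) // 2


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Pairs when each token attends to itself and at most window - 1 earlier tokens."""
    return sum(min(position + 1, window) for position in range(seq_len))


window = 1_024  # hypothetical window size, chosen only for illustration
for n in (4_096, 32_768, 131_072):
    full = full_causal_pairs(n)
    sparse = sliding_window_pairs(n, window)
    print(f"{n:>7} tokens: full={full:>14,}  windowed={sparse:>12,}  ratio={full / sparse:6.1f}x")
```

The windowed count grows roughly in proportion to sequence length rather than its square. The saving comes at a price: a token can no longer look directly at anything outside its window.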

These approaches attempt to move beyond the quadratic scaling of standard attention, although they often introduce trade-offs involving accuracy, implementation complexity, or the ability to capture long-range relationships. [arXiv 2107.05768, Full Attention Transformer with Sparse Computation Cost, 2021]

The continuing trade-off behind long-context AI

Long-context models can read books, analyse large codebases, process lengthy conversations, and combine information spread across vast documents. Yet the cost of doing so remains one of the defining limitations of Transformer-based AI.

This creates a subtle but important lesson about the architecture’s success. Transformers became dominant because they aligned well with parallel hardware and large-scale training. However, the same attention mechanism that enabled that success still contains a scaling tension: every increase in context length asks the model to reason over a rapidly growing number of relationships. [arXiv 1706.03762, Attention Is All You Need, June 12, 2017; hazyresearch.stanford.edu]

Modern advances have made long contexts increasingly practical, but they mostly reduce, redistribute, or approximate the underlying cost rather than eliminating it entirely. As a result, attention efficiency remains one of the most active areas of research in the effort to build AI systems that can reliably work with ever larger amounts of information. [arXiv 2205.14135, Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022; Machine Learning At Scale]

Endnotes

Source: arxiv.org
Link: https://arxiv.org/abs/1706.03762
Source snippet
arXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — Jun 12, 2017 — We propose a new simple network architecture, the Transformer...

Published: June 12, 2017
Source: arxiv.org
Link: https://arxiv.org/abs/2107.05768
Source snippet
Full Attention Transformer with Sparse Computation Costby H Ren · 2021 · Cited by 112 — PDF of the paper titled ・ the key limitation...
Source: hazyresearch.stanford.edu
Title: 2023 01 12 flashattention long sequences
Link: https://hazyresearch.stanford.edu/blog/2023-01-12-flashattention-long-sequences
Source snippet
Fast Transformer Training with Long SequencesJan 13, 2023 — FlashAttention is a new algorithm to speed up attention and reduce its memory...
Source: arxiv.org
Link: https://arxiv.org/abs/2205.14135
Source snippet
Fast and Memory-Efficient Exact Attention with IO-Awarenessby T Dao · 2022 · Cited by 5165 — We propose FlashAttention, an IO-aware...
Source: aiplanet.com
Link: https://aiplanet.com/learn/llm-bootcamp/module-4/2343/transformers-attention-is-all-you-need
Source snippet
Transformers- Attention is all you needComputational Complexity: The traditional Attention mechanism involves pairwise comparisons betwee...
Source: arxiv.org
Link: https://arxiv.org/html/2310.03294v2
Source snippet
DistFlashAttn: Distributed Memory-efficient Attention for...Mar 31, 2024 — FlashAttention (Dao, 2023) effectively reduces the quadratic...
Source: openreview.net
Link: https://openreview.net/forum?id=H4DqfPSibmx
Source snippet
Fast and Memory-Efficient Exact Attention with IO-Awarenessby T Dao · 2022 · Cited by 5165 — We propose FlashAttention, an IO-aware exact...
Source: arxiv.org
Link: https://arxiv.org/abs/2310.03294
Source snippet
DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs TrainingOctober 5, 2023...

Published: October 5, 2023
Source: medium.com
Link: https://medium.com/%40kiranvutukuri/69-sparse-attention-making-transformers-efficient-for-long-sequences-859aa03b03f6
Source snippet
69. Making Transformers Efficient for Long Sequences:...By attending to a subset of tokens, attention cost drops from quadratic to...
Source: medium.com
Link: https://medium.com/%40dr.teck/efficient-alternatives-to-transformer-self-attention-397851f324ab
Source: arxiv.org
Link: https://arxiv.org/abs/2502.01659
Source snippet
Increasing Transformer Context Length with Sparse Graph...by N Tomczak · 2025 · Cited by 2 — In this work, we address this issue by prop...
Source: arxiv.org
Link: https://arxiv.org/html/2506.01963v1
Source snippet
A Non-Attention LLM for Ultra-Long Context Horizons9 May 2025 — We present a novel non-attention-based architecture for large language mo...

Published: May 2025
Source: arxiv.org
Link: https://arxiv.org/abs/2209.04881
Source snippet
On The Computational Complexity of Self-Attentionby FD Keles · 2022 · Cited by 379 — We prove that the time complexity of self-attention...
Source: medium.com
Link: https://medium.com/%40mridulrao674385/attention-mechanism-complexity-analysis-7314063459b1
Source snippet
Attention Mechanism Complexity Analysis | by Mridul RaoComplexity analysis is about estimating how the time required to execute an algori...
Source: machine-learning-made-simple.medium.com
Link: https://machine-learning-made-simple.medium.com/transformers-vs-mamba-vs-linear-attention-who-wins-long-context-f1dc8ceb5ede
Source snippet
vs Mamba vs Linear Attention: Who Wins Long...Transformer inference today faces a fundamental bottleneck — the quadratic cost of attention...
Source: medium.com
Link: https://medium.com/data-science-collective/transformers-the-game-changer-how-attention-is-all-you-need-architecture-changed-ai-forever-81a43344ce63
Source snippet
N, you compute N x N attention scores. For GPT-3 with 2048...Read more...
Source: medium.com
Link: https://medium.com/%40sailakkshmiallada/the-evolution-of-flash-attention-revolutionizing-transformer-efficiency-8a039918d507
Source snippet
ion, enabling longer context windows and faster training.Read more...
Source: sulbhajain.medium.com
Link: https://sulbhajain.medium.com/flash-attention-fast-and-memory-efficient-exact-attention-with-io-awareness-paper-review-79639127c5de
Source snippet
and Memory-Efficient Exact Attention with IO-Awareness...26 May 2025 — Solution: FlashAttention is an IO-aware exact attention algorithm...

Published: May 2025
Source: medium.com
Title: The Quadratic Burden
Link: https://medium.com/%40rajnish_khatri/the-quadratic-burden-45759ec6bd21
Source snippet
No recurrence. No convolution. Only attention — each token looking at every other token, weighing, selecting, composing...Read more...
Source: ahmdtaha.medium.com
Link: https://ahmdtaha.medium.com/flashattention-fast-and-memory-efficient-exact-attention-with-io-awareness-2a0aec52ed3d
Source snippet
medium.comFlashAttention: Fast and Memory-Efficient Exact Attention with...This paper [1] proposes an IO-aware algorithm that computes e...
Source: medium.com
Link: https://medium.com/%40kdk199604/kdks-review-attention-is-all-you-need-what-makes-the-transformer-so-revolutionary-c91f135583b0
Source snippet
Attention is All You Need: What makes the transformer so...In the Transformer model, the attention layer focuses on the input sequence i...
Source: openreview.net
Link: https://openreview.net/forum?id=pUEDkZyPDl&referrer=%5Bthe+profile+of+Dacheng+Li%5D%28%2Fprofile%3Fid%3D~Dacheng_Li1%29
Source snippet
DISTFLASHATTN: Distributed Memory-efficient Attention for...Aug 25, 2024 — Abstract: FlashAttention effectively reduces the quadratic pea...
Source: openreview.net
Link: https://openreview.net/pdf?id=Eh0Od2BJIM
Source: shreyansh26.github.io
Link: https://shreyansh26.github.io/post/2022-10-10_efficient_transformers_survey/
Source snippet
Shreyansh SinghPaper Summary #7 - Efficient Transformers: A Survey10 Oct 2022 — This is a survey paper on the various memory-efficiency...
Source: mbrenndoerfer.com
Title: quadratic attention bottleneck transformers long sequences
Link: https://mbrenndoerfer.com/writing/quadratic-attention-bottleneck-transformers-long-sequences
Source snippet
Michael BrenndoerferQuadratic Attention Bottleneck: Why Transformers Struggle...19 Jun 2025 — Understand why self-attention has O(n²) co...
Source: machinelearningatscale.substack.com
Title: Machine Learning At Scale64
Link: https://machinelearningatscale.substack.com/p/64-challenges-and-solutions-of-long
Source snippet
substack.com64. Breaking the Attention Barrier: A Deep Dive into Scaling...Flash Attention is an algorithm designed to address the memor...
Source: attention-survey.github.io
Link: https://attention-survey.github.io/files/Attention_Survey.pdf
Source snippet
ntion heads, and d = 128 head dimensionality in a 48-layer Transformer with MHA, the KV.Read more...
Source: mbrenndoerfer.com
Title: attention complexity quadratic scaling memory efficient transformers
Link: https://mbrenndoerfer.com/writing/attention-complexity-quadratic-scaling-memory-efficient-transformers
Source snippet
Attention Complexity: Quadratic Scaling, Memory Limits &...26 May 2025 — For autoregressive generation where the full context is reproce...

Published: May 2025
Source: mbrenndoerfer.com
Title: flashattention io aware exact attention long context language models
Link: https://mbrenndoerfer.com/writing/flashattention-io-aware-exact-attention-long-context-language-models
Source snippet
FlashAttention: IO-Aware Exact Attention for Long-Context...Jul 11, 2025 — This approach reduced memory complexity from quadratic to lin...
Source: mbrenndoerfer.com
Title: The Transformer: Attention Is All You Need
Link: https://mbrenndoerfer.com/writing/transformer-attention-is-all-you-need
Source snippet
Jun 7, 2025: A comprehensive guide to the Transformer architecture, including self-attention mechanisms, multi-head attention...
Source: github.com
Link: https://github.com/dao-ailab/flash-attention
Source snippet
s memory quadratic in sequence length, whereas FlashAttention has memory linear...

Additional References

Source: aussieai.com
Link: https://www.aussieai.com/research/attention
Source snippet
Attention OptimizationMemory-efficient attention algorithms are an inference optimization method that improves the QKV matrix computation...
Source: bentoml.com
Link: https://bentoml.com/llm/kernel-optimization/flashattention
Source snippet
FlashAttention | LLM Inference HandbookFlashAttention is a fast, memory-efficient attention algorithm for Transformers that accelerates L...
Source: nvidia.com
Link: https://www.nvidia.com/en-us/on-demand/session/gtc24-s62546/
Source snippet
FlashAttention: Fast and Memory-Efficient Exact Attention...We propose FlashAttention, an IO-aware exact attention algorithm that uses t...
Source: reddit.com
Link: https://www.reddit.com/r/LocalLLaMA/comments/17rme8v/regarding_long_context_and_quadratic_attention/
Source snippet
Regarding long context and quadratic attentionQuadratic scaling of attention is a problem, but not something any of the currently trained...
Source: stackoverflow.com
Link: https://stackoverflow.com/questions/65703260/computational-complexity-of-self-attention-in-the-transformer-model
Source snippet
Computational Complexity of Self-Attention in the...I recently went through the Transformer paper from Google Research describing how se...
Source: velog.io
Link: https://velog.io/%40chaewonkim0425/Why-Attention-was-all-we-needed
Source snippet
[Paper review] Why Attention Was All We NeededThe self-attention mechanism compares every token with every other token, causing quadratic...
Source: researchgate.net
Link: https://www.researchgate.net/publication/398527678_Efficient_Attention_and_Beyond_A_Survey_of_Advances_in_Optimizing_Transformer_Inference
Source snippet
A Survey of Advances in Optimizing Transformer Inference12 Dec 2025 — Efficient Attention and Beyond: A Survey of Advances in Optimizing...
Source: apxml.com
Link: https://apxml.com/courses/foundations-transformers-architecture/chapter-6-advanced-architectural-variants-analysis/self-attention-complexity
Source: wandb.ai
Link: https://wandb.ai/wandb_fc/tips/reports/The-Problem-with-Quadratic-Attention-in-Transformer-Architectures–Vmlldzo3MDE0Mzcz
Source snippet
The Problem with Quadratic Attention in Transformer...Mar 4, 2024 — This report provides a brief overview of the problem with vanilla se...
Source: youtube.com
Link: https://www.youtube.com/watch?v=eMlx5fFNoYc
Source snippet
"Kurt, W. (2024, August 12). Say what you mean: A response to 'Let Me Speak Freely'. Count Bayesie. https://www.countbayesie.com/blog/2024..."
