The Architecture Behind Modern AI
Transformers changed AI by letting models weigh relationships across inputs and train efficiently at large scale.
On this page
- What attention mechanisms compare
- Why parallel training mattered
- How Transformers spread beyond language
Introduction
Transformers are one of the main technical reasons modern artificial intelligence moved from narrow, hand-built systems towards large, general-purpose models. Their key idea is attention: instead of reading an input strictly from left to right, a model can compare many parts of the input with each other and decide which relationships matter. That made it easier to train bigger models on huge datasets, because much of the work could be parallelised on modern hardware rather than processed step by step. The 2017 paper Attention Is All You Need introduced the Transformer as an architecture built around attention rather than recurrence or convolution, reporting stronger translation results and shorter training time than leading alternatives of the period. [arXiv]
The importance of Transformers is not just that they improved machine translation. The same mechanism became a flexible template for systems that write text, answer questions, classify images, process speech, model proteins, and combine text with images. BERT showed how Transformer encoders could produce powerful language representations; GPT-style models showed how decoder-only Transformers could scale into general text generators; Vision Transformers showed that images could be treated as sequences of patches; and AlphaFold’s Evoformer used attention-like machinery to reason over biological sequence and structure information. [arXiv, Nature]
What attention mechanisms compare
Attention is easiest to understand as a learned comparison system. A Transformer breaks text, image patches, or other data into units often called tokens. For each token, the model forms three learned representations: a query, a key, and a value. The query asks, in effect, “what am I looking for?” The key says “what information do I contain?” The value carries the information that may be passed onward. The model compares queries with keys, turns those comparison scores into weights, and uses the weights to mix values from relevant tokens into a new representation. The original Transformer used “scaled dot-product attention”, a compact mathematical way of performing this comparison across the input. [arXiv]
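The query, key, and value comparison can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the vectors Q, K, and V are hand-picked toy values (a real Transformer produces them with learned projections), and the function names are chosen here for illustration.

```python
import math

def softmax(scores):
    """Turn raw comparison scores into weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors using the weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: three tokens with hand-picked 2-dimensional vectors (assumed values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1 and says how strongly one token draws on each of the others; each row of `outputs` is the corresponding blend of value vectors.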
This matters because meaning in real data often depends on relationships, not isolated symbols. In the sentence “The trophy would not fit in the suitcase because it was too large”, a useful model has to decide whether “it” refers to the trophy or the suitcase. A simpler system may represent each word mostly from its local neighbours. A self-attention layer can compare “it” with “trophy”, “suitcase”, “large”, and the rest of the sentence directly. That does not guarantee human-like understanding, but it gives the model a direct route for linking distant pieces of information.
Multi-head attention extends the idea by letting the model make several kinds of comparison at once. One attention head may learn patterns that resemble grammatical links; another may track repeated names; another may route information from nearby words; another may capture longer-range context. The heads are not hand-labelled rules, and their behaviour can be difficult to interpret cleanly, but the architectural effect is clear: each layer can move information between tokens in multiple learned ways before a feed-forward network refines each token’s representation. [Wikipedia]
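A rough sketch of the multi-head idea follows. It makes a simplifying assumption worth flagging: real multi-head attention gives each head its own learned projections and applies a learned output projection, whereas this sketch simply slices each token vector into equal chunks so that each head compares a different part of the representation. The function names and toy vectors are illustrative only.

```python
import math

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def split_heads(vectors, num_heads):
    """Slice every vector into num_heads equal chunks, one list of vectors per head."""
    dim = len(vectors[0])
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    size = dim // num_heads
    return [[vec[h * size:(h + 1) * size] for vec in vectors]
            for h in range(num_heads)]

def multi_head_self_attention(tokens, num_heads):
    """Run attention separately per head, then concatenate the head outputs."""
    per_head = [attention(h, h, h) for h in split_heads(tokens, num_heads)]
    return [[x for head in per_head for x in head[i]] for i in range(len(tokens))]

# Three toy tokens with 4-dimensional vectors (assumed values), two heads.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_self_attention(tokens, num_heads=2)
```

Because each head sees different features, the two heads can assign different weights to the same pair of tokens, which is the property the paragraph above describes.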
A common misunderstanding is that attention weights are a perfect explanation of what the model “looked at”. They are useful signals, but not a transparent map of causation. Research on attention flow has shown that information is mixed across layers, so raw attention weights can be unreliable as simple explanations of which input tokens mattered most. [arXiv]
Why parallel training mattered
Before Transformers, many strong language systems used recurrent neural networks, including long short-term memory models. These models processed sequences in order: the hidden state after one word helped process the next word. That was natural for language, but it created a training bottleneck. If the model needs the previous step before computing the next one, it is harder to exploit the massive parallelism of graphics processing units and specialised AI chips.
The Transformer changed that tradeoff. Because self-attention can compare all tokens in a sequence at once, training can be parallelised much more effectively. The original Transformer paper was explicit about this advantage: it dispensed with recurrence and convolutions, achieved strong translation results, and trained its English-to-French model in 3.5 days on eight GPUs, described as a small fraction of the training cost of earlier leading models. [arXiv]
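The difference in dependency structure can be shown with a toy contrast. This is only a sketch: tokens are single numbers, the recurrent weights `w` and `u` are hypothetical constants, and no real parallel hardware is involved. It shows that each recurrent state must wait for the previous one, while each attention output within a layer depends only on the inputs and can be computed in any order.

```python
import math

def recurrent_states(inputs, w=0.5, u=1.0):
    """Toy recurrent update: each state needs the state before it."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w * h + u * x)  # cannot start until the previous h exists
        states.append(h)
    return states

def attention_output(i, inputs):
    """Toy self-attention output for position i, using scalar tokens."""
    scores = [inputs[i] * x for x in inputs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return sum((e / total) * x for e, x in zip(exps, inputs))

inputs = [0.1, -0.4, 0.7, 0.2]
states = recurrent_states(inputs)

# Attention outputs depend only on the inputs, so any evaluation order works.
forward = [attention_output(i, inputs) for i in range(len(inputs))]
backward = [attention_output(i, inputs) for i in reversed(range(len(inputs)))]
backward.reverse()
```

The independence holds within a layer; stacked layers are still applied one after another.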
That shift fitted a broader pattern in AI: methods that can use more data and more computation tend to improve dramatically as hardware and datasets grow. Rich Sutton’s “bitter lesson” argued that, across AI history, general methods that leverage computation have repeatedly beaten approaches built around handcrafted human knowledge. Transformers became a practical example of that lesson because they were general, data-hungry, and well matched to parallel hardware. [Incomplete Ideas]
The later history of large language models is partly a history of this scaling. OpenAI’s GPT work used Transformer-based language modelling to show that pre-training on large text corpora could be adapted to many language tasks. GPT-3 then scaled an autoregressive Transformer to 175 billion parameters and tested it across many tasks using prompts rather than task-specific fine-tuning. [OpenAI CDN]
Scaling also became more systematic. The 2020 scaling-laws work found that language model loss followed predictable power-law relationships with model size, data size, and training compute across a wide range of experiments. DeepMind’s Chinchilla work later argued that many large language models had been undertrained on too little data for their size, and that compute-optimal training should scale model size and training tokens together. [arXiv]
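A back-of-envelope sketch shows what scaling model size and training tokens together can look like. It rests on two approximations that are not stated in this article and are labelled here as assumptions: training compute of roughly 6 FLOPs per parameter per token, and a rule of thumb of roughly 20 training tokens per parameter that is often associated with the Chinchilla results. The function name and the example budget are illustrative.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Assumes compute ~= flops_per_param_token * params * tokens and
    tokens ~= tokens_per_param * params. Both are rough approximations.
    """
    params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Hypothetical budgets: quadrupling compute doubles both parameters and tokens.
small_params, small_tokens = compute_optimal_split(1e21)
large_params, large_tokens = compute_optimal_split(4e21)
print(f"1e21 FLOPs: {small_params:.3g} parameters, {small_tokens:.3g} tokens")
print(f"4e21 FLOPs: {large_params:.3g} parameters, {large_tokens:.3g} tokens")
```

Under these assumptions neither quantity grows alone: spending more compute on parameters without adding data is exactly the undertraining pattern the Chinchilla work criticised.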
How Transformers reshaped language AI
The first large wave of Transformer impact came through natural language processing. BERT, introduced by Google researchers in 2018, used a Transformer encoder to learn bidirectional representations from unlabelled text. Instead of reading only from left to right, BERT was trained to condition on both left and right context, then fine-tuned for tasks such as question answering and language inference. The paper reported new state-of-the-art results on eleven NLP tasks, including GLUE, MultiNLI, and SQuAD benchmarks. [arXiv]
GPT-style models took a different route. They used decoder-only Transformers trained to predict the next token. That sounds simple, but at large scale it proved surprisingly flexible: the same model could summarise, translate, answer questions, write code, and follow instructions after further training and alignment. The GPT-4 technical report describes GPT-4 as a Transformer-based model pre-trained to predict the next token, then improved through post-training alignment; it also notes that infrastructure and optimisation methods allowed some performance trends to be predicted from smaller runs. [arXiv]
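Next-token prediction is usually paired with a causal mask, so that each position can only draw on itself and earlier positions. The sketch below illustrates that masking on toy vectors; the function name and the values are assumptions for illustration, and learned projections are omitted.

```python
import math

def causal_attention_weights(queries, keys):
    """Attention weights where position i only sees positions 0..i."""
    d_k = len(keys[0])
    rows = []
    for i, q in enumerate(queries):
        visible = keys[:i + 1]  # later positions are hidden from this query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in visible]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Hidden positions get a weight of exactly zero.
        rows.append([e / total for e in exps] + [0.0] * (len(keys) - i - 1))
    return rows

# Three toy tokens (assumed values) attending to themselves.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(tokens, tokens)
```

The resulting weight matrix is lower-triangular: the first token can only attend to itself, while the last token can attend to the whole prefix.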
This split between encoder, decoder, and encoder-decoder designs helps explain why “Transformer” is an architectural family rather than a single product. Encoder models are often useful for understanding and classifying inputs. Decoder models are central to text generation. Encoder-decoder models remain important for tasks where an input sequence is transformed into an output sequence, such as translation or summarisation. The shared core is attention-based representation and information routing, not a single fixed behaviour. [Stanford HAI]
How Transformers spread beyond language
Transformers spread because attention is not inherently tied to words. It works on any data that can be represented as a sequence or set of tokens: words, image patches, audio frames, protein residues, robot actions, or multimodal embeddings. That made the architecture portable.
Vision Transformers made this point vividly. The 2020 paper An Image is Worth 16x16 Words split images into fixed-size patches and treated those patches like tokens in a sequence. The authors argued that a pure Transformer could perform very well on image classification when pre-trained on large datasets, challenging the assumption that convolutional neural networks were always necessary for vision. [arXiv]
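The patch idea can be sketched as follows. The example assumes a single-channel 224 by 224 image filled with placeholder values and a patch size of 16; the image size and the function name are illustrative choices rather than details taken from the article. In a full model, each flattened patch would then be projected to an embedding and combined with position information.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches, row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Placeholder image: each pixel stores its own index, which makes patches easy to check.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = image_to_patches(image, 16)
print(len(patches), "patches of", len(patches[0]), "values each")
```

The sequence of patches plays the same role for an image that a sequence of word tokens plays for a sentence.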
In biology, AlphaFold did not simply drop in a standard language Transformer, but it used attention-based components inside a specialised architecture. Its Evoformer blocks exchanged information between multiple sequence alignments and pair representations, helping the network reason about evolutionary and spatial relationships in proteins. The Nature paper describes attention-based and non-attention-based components working together, with structural hypotheses refined across the network. [Nature]
This spread beyond language is why Transformers became central to the idea of foundation models: large models trained on broad data that can be adapted to many tasks. Stanford’s foundation model report describes these systems as models whose capabilities and risks span language, vision, robotics, reasoning, human interaction, and other domains. [Stanford CRFM]
The tradeoff: powerful context, expensive computation
Self-attention has a cost. Standard attention compares each token with every other token in the sequence, so its time and memory requirements grow quadratically with sequence length. Doubling the context length roughly quadruples the number of pairwise comparisons. This is one reason long documents, video, scientific data, and extended conversations remain technically challenging even for powerful models.
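The quadratic growth is easy to check with arithmetic. The sketch below counts the entries in one full attention score matrix and estimates its size if stored in full; the 2 bytes per score is an assumption (a 16-bit number format), and the function names are illustrative.

```python
def attention_score_count(seq_len):
    """Number of query-key comparisons in one full self-attention matrix."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one score matrix if stored in full (assumed 2 bytes per score)."""
    return attention_score_count(seq_len) * bytes_per_score

for n in (1024, 2048, 4096):
    mib = score_matrix_bytes(n) / (1024 ** 2)
    print(f"{n} tokens: {attention_score_count(n)} scores, {mib:.0f} MiB per matrix")
```

Each doubling of the sequence length multiplies both figures by four, and a model holds one such matrix per attention head in every layer unless the implementation avoids materialising it.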
The field has responded with more efficient attention methods, sparse patterns, memory optimisations, and hardware-aware implementations. FlashAttention is a notable example: rather than approximating attention, it reorganises exact attention computation to reduce costly reads and writes between GPU memory levels. Its authors reported faster Transformer training, reduced memory pressure, and improved handling of longer sequences. [arXiv]
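One ingredient that makes exact block-by-block attention possible is a running softmax: keys and values are processed in blocks, and earlier partial results are rescaled whenever a larger score appears. The sketch below shows only that numerical idea for a single query in plain Python. It is not the FlashAttention kernel and says nothing about GPU memory; the function names, block size, and toy values are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference version: all scores are held at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]

def attention_blockwise(q, keys, values, block_size):
    """Same result, but only one block of scores exists at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        scores = [dot(q, k) * scale for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_full(q, keys, values)
tiled = attention_blockwise(q, keys, values, block_size=2)
```

The two results agree up to floating-point rounding, which is the sense in which this kind of reorganisation is exact rather than an approximation.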
The cost problem also affects who can build frontier models. Large Transformers require expensive hardware, engineering expertise, energy, and data pipelines. This has concentrated leading model development in well-funded companies and research labs, even though open-source models and efficiency improvements have widened access. The architecture made scaling practical, but it did not make scaling cheap.
What Transformers do not solve by themselves
Transformers are often described as the architecture behind modern AI, but architecture alone does not explain today’s systems. A Transformer becomes useful through training data, objectives, optimisation methods, hardware, evaluation, product design, and alignment work. GPT-4, for example, is described not only as a Transformer-style model but also as a system shaped by pre-training, licensed and public data, reinforcement learning from human feedback, and safety evaluation. [OpenAI CDN]
They also do not remove the need for caution about reliability. A Transformer can model statistical relationships in text or other data without having a grounded, human-like understanding of the world. It can produce fluent but false answers, inherit biases from training data, struggle outside its training distribution, or fail on tasks that require robust causal reasoning. Debate around foundation models has repeatedly centred on this tension: the same scale and flexibility that make these systems useful can also make their failures broad and difficult to inspect. [Axios]
Even within the architecture, attention is not the whole story. Transformer blocks usually combine attention with feed-forward networks, residual connections, normalisation, positional information, and training tricks. Research has shown that pure attention without the surrounding machinery can have limiting behaviours, while practical Transformers depend on the interaction of multiple components. [arXiv]
Why this architecture still anchors modern AI
Transformers became dominant because they solved a practical bottleneck at the right historical moment. Attention gave models a flexible way to compare relationships across inputs. Parallel training made those models fit modern accelerators. Scaling laws gave researchers confidence that larger models, larger datasets, and more compute would often produce predictable improvements. The same architecture then generalised beyond language into vision, biology, multimodal systems, and other fields.
The most important takeaway is not that attention is magic. It is that Transformers turned relationship-weighing into a scalable computational primitive. That made it possible to train large, reusable models whose abilities are not coded task by task but emerge from broad pre-training and adaptation. In the wider project of understanding artificial intelligence, Transformers mark the point where architecture, data, and compute began to combine into the modern foundation-model paradigm.
Further Reading
Books and field guides related to The Architecture Behind Modern AI. Use these as the next step if you want deeper reading beyond the article.
Hands-on Machine Learning with Scikit-Learn, Keras, and Tenso...
Helps technically inclined readers move from concepts toward working neural-network models.
Artificial Intelligence
Gives the conceptual background needed to understand why transformer-based AI matters.
Natural Language Processing with Transformers
Directly covers transformer architectures and their use in modern NLP systems.
Deep Learning
Provides the neural-network foundations behind modern architectures, including attention-era systems.
Endnotes
-
Source: arxiv.org
Title: Attention Is All You Need
Link: https://arxiv.org/abs/1706.03762
Published: June 12, 2017
-
Source: arxiv.org
Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Link: https://arxiv.org/abs/1810.04805
Published: October 11, 2018
-
Source: arxiv.org
Title: Language Models are Few-Shot Learners
Link: https://arxiv.org/abs/2005.14165
-
Source: arxiv.org
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Link: https://arxiv.org/abs/2010.11929 -
Source: nature.com
Link: https://www.nature.com/articles/s41586-021-03819-2 -
Source: Wikipedia
Title: Transformer (deep learning)
Link: https://en.wikipedia.org/wiki/Transformer_%28deep_learning%29 -
Source: arxiv.org
Title: Quantifying Attention Flow in Transformers
Link: https://arxiv.org/abs/2005.00928 -
Source: cdn.openai.com
Title: Improving Language Understanding by Generative Pre
Link: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf -
Source: arxiv.org
Title: Scaling Laws for Neural Language Models
Link: https://arxiv.org/abs/2001.08361 -
Source: arxiv.org
Title: Training Compute-Optimal Large Language Models
Link: https://arxiv.org/abs/2203.15556 -
Source: arxiv.org
Title: GPT-4 Technical Report
Link: https://arxiv.org/abs/2303.08774 -
Source: hai.stanford.edu
Title: What is a Transformer?
Link: https://hai.stanford.edu/ai-definitions/what-is-a-transformer -
Source: crfm.stanford.edu
Link: https://crfm.stanford.edu/report.html -
Source: arxiv.org
Link: https://arxiv.org/abs/2205.14135 -
Source: hai.stanford.edu
Link: https://hai.stanford.edu/research/flashattention-fast-and-memory-efficient-exact-attention-with-io-awareness -
Source: cdn.openai.com
Title: GPT-4 Technical Report
Link: https://cdn.openai.com/papers/gpt-4.pdf -
Source: axios.com
Title: AI’s shaky foundations
Link: https://www.axios.com/2021/08/18/foundation-ai-models-stanford -
Source: arxiv.org
Link: https://arxiv.org/abs/2103.03404 -
Source: arxiv.org
Link: https://arxiv.org/html/2604.00965v1 -
Source: arxiv.org
Link: https://arxiv.org/pdf/1810.04805 -
Source: arxiv.org
Link: https://arxiv.org/html/1810.04805v2 -
Source: arxiv.org
Link: https://arxiv.org/pdf/2005.14165 -
Source: arxiv.org
Link: https://arxiv.org/html/2505.20098v2 -
Source: arxiv.org
Link: https://arxiv.org/pdf/2010.11929 -
Source: arxiv.org
Link: https://arxiv.org/html/2510.20387v1 -
Source: arxiv.org
Link: https://arxiv.org/pdf/2203.15556 -
Source: arxiv.org
Link: https://arxiv.org/abs/2507.19595 -
Source: ar5iv.labs.arxiv.org
Link: https://ar5iv.labs.arxiv.org/html/2203.15556 -
Source: ar5iv.labs.arxiv.org
Link: https://ar5iv.labs.arxiv.org/html/2001.08361 -
Source: arxiv.org
Link: https://arxiv.org/html/2410.09649v1 -
Source: arxiv.org
Link: https://arxiv.org/html/2303.08774v6 -
Source: arxiv.org
Link: https://arxiv.org/pdf/2012.11747 -
Source: arxiv.org
Link: https://arxiv.org/pdf/2407.09517 -
Source: OpenAI
Link: https://openai.com/ -
Source: hai.stanford.edu
Title: 2025 ai index report
Link: https://hai.stanford.edu/ai-index/2025-ai-index-report -
Source: hai.stanford.edu
Title: ai index 2025 state of ai in 10 charts
Link: https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts -
Source: hai.stanford.edu
Title: what are foundation models
Link: https://hai.stanford.edu/ai-definitions/what-are-foundation-models -
Source: hai.stanford.edu
Title: ai index
Link: https://hai.stanford.edu/ai-index -
Source: hai.stanford.edu
Title: what is a llm
Link: https://hai.stanford.edu/ai-definitions/what-is-a-llm -
Source: hai.stanford.edu
Title: transparency in ai is on the decline
Link: https://hai.stanford.edu/news/transparency-in-ai-is-on-the-decline -
Source: hai.stanford.edu
Title: research and development
Link: https://hai.stanford.edu/ai-index/2025-ai-index-report/research-and-development -
Source: hai.stanford.edu
Title: hai annualreport2025 digital v5 compressed
Link: https://hai.stanford.edu/assets/files/hai_annualreport2025_digital_v5_compressed.pdf -
Source: crfm.stanford.edu
Link: https://crfm.stanford.edu/assets/report.pdf -
Source: crfm.stanford.edu
Link: https://crfm.stanford.edu/ -
Source: Wikipedia
Title: Attention Is All You Need
Link: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need -
Source: Wikipedia
Title: Attention (machine learning)
Link: https://en.wikipedia.org/wiki/Attention_%28machine_learning%29 -
Source: Wikipedia
Title: Foundation model
Link: https://en.wikipedia.org/wiki/Foundation_model -
Source: Wikipedia
Title: BERT (language model)
Link: https://en.wikipedia.org/wiki/BERT_%28language_model%29 -
Source: Wikipedia
Title: AlphaFold
Link: https://en.wikipedia.org/wiki/AlphaFold -
Source: Wikipedia
Title: Neural scaling law
Link: https://en.wikipedia.org/wiki/Neural_scaling_law -
Source: Wikipedia
Title: Bitter lesson
Link: https://en.wikipedia.org/wiki/Bitter_lesson -
Source: Wikipedia
Title: OpenAI
Link: https://en.wikipedia.org/wiki/OpenAI -
Source: scholar.google.com
Link: https://scholar.google.com/citations?hl=en&user=KNr3vb4AAAAJ -
Source: nature.com
Link: https://www.nature.com/articles/s41392-023-01381-z -
Source: incompleteideas.net
Link: https://www.incompleteideas.net/IncIdeas/BitterLesson.html -
Source: cs.utexas.edu
Title: bitter lesson
Link: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf -
Source: github.com
Title: Machine Translation
Link: https://github.com/abmitra84/Machine_Translation -
Source: openreview.net
Link: https://openreview.net/pdf?id=H4DqfPSibmx -
Source: yenguage.github.io
Link: https://yenguage.github.io/natural%20language%20processing/GPT/ -
Source: linkedin.com
Link: https://www.linkedin.com/posts/stanfordhai_new-the-stanford-center-for-research-activity-7404221342213955586-cLzX -
Source: linkedin.com
Link: https://www.linkedin.com/company/openai -
Source: gogoduck912.github.io
Title: bitter lesson
Link: https://gogoduck912.github.io/blog/bitter-lesson/ -
Source: radical.vc
Title: stanford hai ai index report 2025
Link: https://radical.vc/stanford-hai-ai-index-report-2025/