The Architecture Behind Modern AI

Transformers changed AI by letting models weigh relationships across inputs and train efficiently at large scale.

Introduction

Transformers are one of the main technical reasons modern artificial intelligence moved from narrow, hand-built systems towards large, general-purpose models. Their key idea is attention: instead of reading an input strictly from left to right, a model can compare many parts of the input with each other and decide which relationships matter. That made it easier to train bigger models on huge datasets, because much of the work could be parallelised on modern hardware rather than processed step by step. The 2017 paper Attention Is All You Need introduced the Transformer as an architecture built around attention rather than recurrence or convolution, reporting stronger translation results and shorter training time than leading alternatives of the period. [arXiv]

The importance of Transformers is not just that they improved machine translation. The same mechanism became a flexible template for systems that write text, answer questions, classify images, process speech, model proteins, and combine text with images. BERT showed how Transformer encoders could produce powerful language representations; GPT-style models showed how decoder-only Transformers could scale into general text generators; Vision Transformers showed that images could be treated as sequences of patches; and AlphaFold’s Evoformer used attention-like machinery to reason over biological sequence and structure information. [arXiv, Nature]

What attention mechanisms compare

Attention is easiest to understand as a learned comparison system. A Transformer breaks text, image patches, or other data into units often called tokens. For each token, the model forms three learned representations: a query, a key, and a value. The query asks, in effect, “what am I looking for?” The key says “what information do I contain?” The value carries the information that may be passed onward. The model compares queries with keys, turns those comparison scores into weights, and uses the weights to mix values from relevant tokens into a new representation. The original Transformer used “scaled dot-product attention”, a compact mathematical way of performing this comparison across the input. [arXiv]
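
The query, key, and value comparison can be sketched in a few lines of NumPy. This is a minimal illustration rather than the original implementation: the projection matrices `W_q`, `W_k`, and `W_v` are random stand-ins for weights a real model would learn, and the toy sizes are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_tokens, d_k); V: (n_tokens, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compare every query with every key
    weights = softmax(scores, axis=-1)   # each row becomes weights summing to 1
    return weights @ V, weights          # mix values according to the weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 8, 4             # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))     # stand-in token embeddings
W_q = rng.normal(size=(d_model, d_k))        # learned in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row `i` of `weights` shows how strongly token `i` draws on every other token, and row `i` of `out` is the resulting mixture of values.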

This matters because meaning in real data often depends on relationships, not isolated symbols. In the sentence “The trophy would not fit in the suitcase because it was too large”, a useful model has to decide whether “it” refers to the trophy or the suitcase. A simpler system may represent each word mostly from its local neighbours. A self-attention layer can compare “it” with “trophy”, “suitcase”, “large”, and the rest of the sentence directly. That does not guarantee human-like understanding, but it gives the model a direct route for linking distant pieces of information.

Multi-head attention extends the idea by letting the model make several kinds of comparison at once. One attention head may learn patterns that resemble grammatical links; another may track repeated names; another may route information from nearby words; another may capture longer-range context. The heads are not hand-labelled rules, and their behaviour can be difficult to interpret cleanly, but the architectural effect is clear: each layer can move information between tokens in multiple learned ways before a feed-forward network refines each token’s representation. [Wikipedia]
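
A rough sketch of multi-head self-attention, assuming the common arrangement in which the model dimension is split evenly across heads and the head outputs are concatenated and passed through an output projection. All weight matrices here are random placeholders, and the function name is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Each head runs its own query-key comparison independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (n_heads, n, d_head)
    # Concatenate the heads back into one vector per token, then project.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2          # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each slice `weights[h]` is a separate attention pattern, which is what allows different heads to specialise in different kinds of relationship.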

A common misunderstanding is that attention weights are a perfect explanation of what the model “looked at”. They are useful signals, but not a transparent map of causation. Research on attention flow has shown that information is mixed across layers, so raw attention weights can be unreliable as simple explanations of which input tokens mattered most. [arXiv]
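
One way to see why raw weights mislead is to track how attention compounds across layers. The sketch below follows the "attention rollout" idea associated with the attention-flow work, as it is commonly described: each layer's head-averaged attention matrix is blended with the identity to stand in for the residual connection, and the layers are multiplied together. The random matrices are placeholders, and the equal 0.5 weighting is a simplifying assumption.

```python
import numpy as np

def attention_rollout(layer_attentions):
    """layer_attentions: list of (n_tokens, n_tokens) row-stochastic matrices,
    one per layer, assumed to be already averaged over heads."""
    n = layer_attentions[0].shape[-1]
    rollout = np.eye(n)
    for A in layer_attentions:
        # Blend in the identity so the residual path (a token keeping its own
        # information) is counted alongside the attention weights.
        A_res = 0.5 * A + 0.5 * np.eye(n)
        rollout = A_res @ rollout
    return rollout

rng = np.random.default_rng(0)
n_layers, n_tokens = 4, 5                      # toy sizes (assumed)
raw = rng.random(size=(n_layers, n_tokens, n_tokens))
layers = [A / A.sum(axis=-1, keepdims=True) for A in raw]

rollout = attention_rollout(layers)
```

Comparing `rollout` with `layers[-1]` shows that the final layer's weights alone do not describe how much each input token contributed once earlier mixing is taken into account.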

Why parallel training mattered

Before Transformers, many strong language systems used recurrent neural networks, including long short-term memory models. These models processed sequences in order: the hidden state after one word helped process the next word. That was natural for language, but it created a training bottleneck. If the model needs the previous step before computing the next one, it is harder to exploit the massive parallelism of graphics processing units and specialised AI chips.

The Transformer changed that tradeoff. Because self-attention can compare all tokens in a sequence at once, training can be parallelised much more effectively. The original Transformer paper was explicit about this advantage: it dispensed with recurrence and convolutions, achieved strong translation results, and trained its English-to-French model in 3.5 days on eight GPUs, described as a small fraction of the training cost of earlier leading models. [arXiv]
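
The difference in data dependencies can be made concrete with a toy comparison. In this sketch (random weights, illustrative sizes, and a plain tanh recurrence standing in for an LSTM), the recurrent loop must finish step t-1 before starting step t, whereas the attention path is a handful of matrix operations that cover every token pair at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 6, 4                              # toy sizes (assumed)
X = rng.normal(size=(n_tokens, d))

# Recurrent style: each hidden state depends on the previous one,
# so the time steps cannot be computed in parallel.
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    states.append(h)
rnn_states = np.stack(states)

# Attention style: all pairwise comparisons come from one matrix product,
# which maps naturally onto parallel hardware.
scores = X @ X.T / np.sqrt(d)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
attn_out = weights @ X
```

Both paths produce one vector per token, but only the second exposes the whole sequence to the hardware in a single step.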

That shift fitted a broader pattern in AI: methods that can use more data and more computation tend to improve dramatically as hardware and datasets grow. Rich Sutton’s “bitter lesson” argued that, across AI history, general methods that leverage computation have repeatedly beaten approaches built around handcrafted human knowledge. Transformers became a practical example of that lesson because they were general, data-hungry, and well matched to parallel hardware. [Incomplete Ideas]

The later history of large language models is partly a history of this scaling. OpenAI’s GPT work used Transformer-based language modelling to show that pre-training on large text corpora could be adapted to many language tasks. GPT-3 then scaled an autoregressive Transformer to 175 billion parameters and tested it across many tasks using prompts rather than task-specific fine-tuning. [OpenAI CDN]

Scaling also became more systematic. The 2020 scaling-laws work found that language model loss followed predictable power-law relationships with model size, data size, and training compute across a wide range of experiments. DeepMind’s Chinchilla work later argued that many large language models had been undertrained on too little data for their size, and that compute-optimal training should scale model size and training tokens together. [arXiv]
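
A back-of-the-envelope sketch of what scaling model size and tokens together means in practice. It relies on two widely quoted approximations that should be read as assumptions rather than exact laws: training compute of roughly 6 FLOPs per parameter per token, and a ratio of about 20 training tokens per parameter as a rule-of-thumb reading of the Chinchilla results. The function names are illustrative.

```python
def training_flops(n_params, n_tokens):
    # Approximation (assumed): about 6 FLOPs per parameter per training token.
    return 6.0 * n_params * n_tokens

def compute_optimal_split(flops_budget, tokens_per_param=20.0):
    # With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 r)).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's reported configuration: about 70B parameters, 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_params, n_tokens = compute_optimal_split(budget)
print(f"budget ~ {budget:.2e} FLOPs")
print(f"params ~ {n_params:.2e}, tokens ~ {n_tokens:.2e}")
```

Under these assumptions, quadrupling the compute budget doubles both the suggested parameter count and the suggested token count, rather than spending the whole increase on a larger model.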

How Transformers reshaped language AI

The first large wave of Transformer impact came through natural language processing. BERT, introduced by Google researchers in 2018, used a Transformer encoder to learn bidirectional representations from unlabelled text. Instead of reading only from left to right, BERT was trained to condition on both left and right context, then fine-tuned for tasks such as question answering and language inference. The paper reported new state-of-the-art results on eleven NLP tasks, including GLUE, MultiNLI, and SQuAD benchmarks. [arXiv]

GPT-style models took a different route. They used decoder-only Transformers trained to predict the next token. That sounds simple, but at large scale it proved surprisingly flexible: the same model could summarise, translate, answer questions, write code, and follow instructions after further training and alignment. The GPT-4 technical report describes GPT-4 as a Transformer-based model pre-trained to predict the next token, then improved through post-training alignment; it also notes that infrastructure and optimisation methods allowed some performance trends to be predicted from smaller runs. [arXiv]
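
What makes a decoder-only model suitable for next-token prediction is a causal mask: each position may attend only to itself and earlier positions. The sketch below illustrates the masking step alone, reusing the input as queries, keys, and values for brevity (an assumption; real models apply learned projections first).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    n_tokens, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions a token must not see.
    future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # masked scores get zero weight
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                      # toy sequence (assumed sizes)
out, weights = causal_self_attention(X, X, X)
```

The resulting `weights` matrix is lower-triangular, so the first token can attend only to itself. An encoder such as BERT omits this mask, which is what lets it condition on context from both directions.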

This split between encoder, decoder, and encoder-decoder designs helps explain why “Transformer” is an architectural family rather than a single product. Encoder models are often useful for understanding and classifying inputs. Decoder models are central to text generation. Encoder-decoder models remain important for tasks where an input sequence is transformed into an output sequence, such as translation or summarisation. The shared core is attention-based representation and information routing, not a single fixed behaviour. [Stanford HAI]

How Transformers spread beyond language

Transformers spread because attention is not inherently tied to words. It works on any data that can be represented as a sequence or set of tokens: words, image patches, audio frames, protein residues, robot actions, or multimodal embeddings. That made the architecture portable.

Vision Transformers made this point vividly. The 2020 paper An Image is Worth 16x16 Words split images into fixed-size patches and treated those patches like tokens in a sequence. The authors argued that a pure Transformer could perform very well on image classification when pre-trained on large datasets, challenging the assumption that convolutional neural networks were always necessary for vision. [arXiv]
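
The patch-to-token step is simple enough to sketch directly. The example below assumes a 224x224 RGB image and 16x16 patches, and shows only the reshaping; a real Vision Transformer would follow it with a learned linear projection and position embeddings. The function name is illustrative.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    if H % patch_size or W % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    gh, gw = H // patch_size, W // patch_size
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch: the "tokens" the Transformer will see.
    return patches.reshape(gh * gw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
image = rng.random(size=(224, 224, 3))     # stand-in image (assumed size)
tokens = image_to_patch_tokens(image)
print(tokens.shape)
```

With these sizes the image becomes a sequence of 196 tokens, each a 768-dimensional vector, which an attention layer can process in the same way as a sentence of 196 words.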

In biology, AlphaFold did not simply drop in a standard language Transformer, but it used attention-based components inside a specialised architecture. Its Evoformer blocks exchanged information between multiple sequence alignments and pair representations, helping the network reason about evolutionary and spatial relationships in proteins. The Nature paper describes attention-based and non-attention-based components working together, with structural hypotheses refined across the network. [Nature]

This spread beyond language is why Transformers became central to the idea of foundation models: large models trained on broad data that can be adapted to many tasks. Stanford’s foundation model report describes these systems as models whose capabilities and risks span language, vision, robotics, reasoning, human interaction, and other domains. [Stanford CRFM]

The tradeoff: powerful context, expensive computation

Self-attention has a cost. Standard attention compares each token with every other token in the sequence, so its time and memory requirements grow quadratically with sequence length. Doubling the context length therefore roughly quadruples the work of computing and storing the attention scores. This is one reason long documents, video, scientific data, and extended conversations remain technically challenging even for powerful models.
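
The quadratic growth is easy to quantify. The sketch below counts the entries of the attention score matrix and estimates the memory needed to store them, assuming 2 bytes per value (16-bit floats) and ignoring everything else a model keeps in memory. The helper names are illustrative.

```python
def attention_score_entries(seq_len, n_heads=1):
    # Every token is compared with every token, once per head.
    return n_heads * seq_len * seq_len

def attention_score_bytes(seq_len, n_heads=1, bytes_per_value=2):
    # bytes_per_value=2 assumes 16-bit floating point storage.
    return attention_score_entries(seq_len, n_heads) * bytes_per_value

for seq_len in (1_024, 2_048, 4_096, 8_192):
    mib = attention_score_bytes(seq_len) / 2**20
    print(f"{seq_len:>5} tokens -> {mib:>4.0f} MiB per head per layer")
```

Under these assumptions, a single head at 1,024 tokens needs 2 MiB for its score matrix and the figure quadruples with each doubling, reaching 128 MiB at 8,192 tokens before multiplying by the number of heads and layers.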

The field has responded with more efficient attention methods, sparse patterns, memory optimisations, and hardware-aware implementations. FlashAttention is a notable example: rather than approximating attention, it reorganises exact attention computation to reduce costly reads and writes between GPU memory levels. Its authors reported faster Transformer training, reduced memory pressure, and improved handling of longer sequences. [arXiv]
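
The claim that attention can be reorganised without approximation rests on a running ("online") softmax: keys and values are processed one block at a time while a running maximum and running sum keep the result exact. The NumPy sketch below shows only that arithmetic. It is not the FlashAttention kernel, which also tiles the queries and is written for on-chip GPU memory, and the block size here is an arbitrary assumption.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])        # full (n, n) score matrix
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full(n, -np.inf)             # running maximum per query
    row_sum = np.zeros(n)                     # running softmax denominator
    for start in range(0, K.shape[0], block):
        K_blk = K[start:start + block]
        V_blk = V[start:start + block]
        S = Q @ K_blk.T / np.sqrt(d)          # only an (n, block) slice of scores
        new_max = np.maximum(row_max, S.max(axis=-1))
        # Rescale what has been accumulated so far to the new maximum.
        correction = np.exp(row_max - new_max)
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=-1)
        out = out * correction[:, None] + P @ V_blk
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
same = np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V))
```

Here `same` confirms that the blockwise version matches the naive one to floating-point tolerance, even though the full score matrix is never held in memory at once.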

The cost problem also affects who can build frontier models. Large Transformers require expensive hardware, engineering expertise, energy, and data pipelines. This has concentrated leading model development in well-funded companies and research labs, even though open-source models and efficiency improvements have widened access. The architecture made scaling practical, but it did not make scaling cheap.

What Transformers do not solve by themselves

Transformers are often described as the architecture behind modern AI, but architecture alone does not explain today’s systems. A Transformer becomes useful through training data, objectives, optimisation methods, hardware, evaluation, product design, and alignment work. GPT-4, for example, is described not only as a Transformer-style model but also as a system shaped by pre-training, licensed and public data, reinforcement learning from human feedback, and safety evaluation. [OpenAI CDN]

They also do not remove the need for caution about reliability. A Transformer can model statistical relationships in text or other data without having a grounded, human-like understanding of the world. It can produce fluent but false answers, inherit biases from training data, struggle outside its training distribution, or fail on tasks that require robust causal reasoning. Debate around foundation models has repeatedly centred on this tension: the same scale and flexibility that make these systems useful can also make their failures broad and difficult to inspect. [Axios]

Even within the architecture, attention is not the whole story. Transformer blocks usually combine attention with feed-forward networks, residual connections, normalisation, positional information, and training tricks. Research has shown that stacks of pure attention layers, without residual connections or feed-forward networks, tend to collapse towards low-rank outputs as depth grows, while practical Transformers depend on the interaction of multiple components. [arXiv]
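
A compact sketch of how these pieces fit together in one block. It assumes the layout of the original Transformer (normalisation applied after each residual addition), a single attention head, sinusoidal position encodings, and layer normalisation without learned scale and shift. All weights are random placeholders and the parameter names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sinusoidal_positions(n_tokens, d_model):
    # Sine on even dimensions, cosine on odd ones (d_model assumed even).
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def transformer_block(X, p):
    # 1. Attention moves information between tokens.
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    attn = (weights @ V) @ p["W_o"]
    X = layer_norm(X + attn)                      # residual + normalisation
    # 2. The feed-forward network refines each token independently.
    hidden = np.maximum(0.0, X @ p["W_1"])        # ReLU
    return layer_norm(X + hidden @ p["W_2"])      # residual + normalisation

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 8, 32                # toy sizes (assumed)
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
          "W_v": (d_model, d_model), "W_o": (d_model, d_model),
          "W_1": (d_model, d_ff), "W_2": (d_ff, d_model)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

X = rng.normal(size=(n_tokens, d_model)) + sinusoidal_positions(n_tokens, d_model)
Y = transformer_block(X, params)
```

The output `Y` has the same shape as the input, which is what lets blocks be stacked. Without the position term, nothing in the block would distinguish one ordering of the tokens from another.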

Why this architecture still anchors modern AI

Transformers became dominant because they solved a practical bottleneck at the right historical moment. Attention gave models a flexible way to compare relationships across inputs. Parallel training made those models fit modern accelerators. Scaling laws gave researchers confidence that larger models, larger datasets, and more compute would often produce predictable improvements. The same architecture then generalised beyond language into vision, biology, multimodal systems, and other fields.

The most important takeaway is not that attention is magic. It is that Transformers turned relationship-weighing into a scalable computational primitive. That made it possible to train large, reusable models whose abilities are not coded task by task but emerge from broad pre-training and adaptation. In the wider project of understanding artificial intelligence, Transformers mark the point where architecture, data, and compute began to combine into the modern foundation-model paradigm.

Further Reading

Books and field guides related to The Architecture Behind Modern AI. Use these as the next step if you want deeper reading beyond the article.

Deep Learning

By Ian Goodfellow, Yoshua Bengio et al.

Provides the neural-network foundations behind modern architectures, including attention-era systems.

Endnotes

  1. Source: arxiv.org
    Title: Attention Is All You Need
    Link: https://arxiv.org/abs/1706.03762
    Published: June 12, 2017

  2. Source: arxiv.org
    Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    Link: https://arxiv.org/abs/1810.04805
    Published: October 11, 2018

  3. Source: arxiv.org
    Title: Language Models are Few-Shot Learners
    Link: https://arxiv.org/abs/2005.14165

  4. Source: arxiv.org
    Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    Link: https://arxiv.org/abs/2010.11929

  5. Source: nature.com
    Link: https://www.nature.com/articles/s41586-021-03819-2

  6. Source: Wikipedia
    Title: Transformer (deep learning)
    Link: https://en.wikipedia.org/wiki/Transformer_%28deep_learning%29

  7. Source: arxiv.org
    Title: Quantifying Attention Flow in Transformers
    Link: https://arxiv.org/abs/2005.00928

  8. Source: cdn.openai.com
    Title: Improving Language Understanding by Generative Pre
    Link: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

  9. Source: arxiv.org
    Title: Scaling Laws for Neural Language Models
    Link: https://arxiv.org/abs/2001.08361

  10. Source: arxiv.org
    Title: Training Compute-Optimal Large Language Models
    Link: https://arxiv.org/abs/2203.15556

  11. Source: arxiv.org
    Title: GPT-4 Technical Report
    Link: https://arxiv.org/abs/2303.08774

  12. Source: hai.stanford.edu
    Title: What is a Transformer?
    Link: https://hai.stanford.edu/ai-definitions/what-is-a-transformer

  13. Source: crfm.stanford.edu
    Link: https://crfm.stanford.edu/report.html

  14. Source: arxiv.org
    Link: https://arxiv.org/abs/2205.14135

  15. Source: hai.stanford.edu
    Link: https://hai.stanford.edu/research/flashattention-fast-and-memory-efficient-exact-attention-with-io-awareness

  16. Source: cdn.openai.com
    Title: GPT-4 Technical Report
    Link: https://cdn.openai.com/papers/gpt-4.pdf

  17. Source: axios.com
    Title: AI’s shaky foundations
    Link: https://www.axios.com/2021/08/18/foundation-ai-models-stanford

  18. Source: arxiv.org
    Link: https://arxiv.org/abs/2103.03404

  19. Source: arxiv.org
    Link: https://arxiv.org/html/2604.00965v1

  20. Source: arxiv.org
    Link: https://arxiv.org/pdf/1810.04805

  21. Source: arxiv.org
    Link: https://arxiv.org/html/1810.04805v2

  22. Source: arxiv.org
    Link: https://arxiv.org/pdf/2005.14165

  23. Source: arxiv.org
    Link: https://arxiv.org/html/2505.20098v2

  24. Source: arxiv.org
    Link: https://arxiv.org/pdf/2010.11929

  25. Source: arxiv.org
    Link: https://arxiv.org/html/2510.20387v1

  26. Source: arxiv.org
    Link: https://arxiv.org/pdf/2203.15556

  27. Source: arxiv.org
    Link: https://arxiv.org/abs/2507.19595

  28. Source: ar5iv.labs.arxiv.org
    Link: https://ar5iv.labs.arxiv.org/html/2203.15556

  29. Source: ar5iv.labs.arxiv.org
    Link: https://ar5iv.labs.arxiv.org/html/2001.08361

  30. Source: arxiv.org
    Link: https://arxiv.org/html/2410.09649v1

  31. Source: arxiv.org
    Link: https://arxiv.org/html/2303.08774v6

  32. Source: arxiv.org
    Link: https://arxiv.org/pdf/2012.11747

  33. Source: arxiv.org
    Link: https://arxiv.org/pdf/2407.09517

  34. Source: OpenAI
    Link: https://openai.com/

  35. Source: hai.stanford.edu
    Title: 2025 ai index report
    Link: https://hai.stanford.edu/ai-index/2025-ai-index-report

  36. Source: hai.stanford.edu
    Title: ai index 2025 state of ai in 10 charts
    Link: https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts

  37. Source: hai.stanford.edu
    Title: what are foundation models
    Link: https://hai.stanford.edu/ai-definitions/what-are-foundation-models

  38. Source: hai.stanford.edu
    Title: ai index
    Link: https://hai.stanford.edu/ai-index

  39. Source: hai.stanford.edu
    Title: what is a llm
    Link: https://hai.stanford.edu/ai-definitions/what-is-a-llm

  40. Source: hai.stanford.edu
    Title: transparency in ai is on the decline
    Link: https://hai.stanford.edu/news/transparency-in-ai-is-on-the-decline

  41. Source: hai.stanford.edu
    Title: research and development
    Link: https://hai.stanford.edu/ai-index/2025-ai-index-report/research-and-development

  42. Source: hai.stanford.edu
    Title: hai annualreport2025 digital v5 compressed
    Link: https://hai.stanford.edu/assets/files/hai_annualreport2025_digital_v5_compressed.pdf

  43. Source: crfm.stanford.edu
    Link: https://crfm.stanford.edu/assets/report.pdf

  44. Source: crfm.stanford.edu
    Link: https://crfm.stanford.edu/

  45. Source: Wikipedia
    Title: Attention Is All You Need
    Link: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need

  46. Source: Wikipedia
    Title: Attention (machine learning)
    Link: https://en.wikipedia.org/wiki/Attention_%28machine_learning%29

  47. Source: Wikipedia
    Title: Foundation model
    Link: https://en.wikipedia.org/wiki/Foundation_model

  48. Source: Wikipedia
    Title: BERT (language model)
    Link: https://en.wikipedia.org/wiki/BERT_%28language_model%29

  49. Source: Wikipedia
    Title: AlphaFold
    Link: https://en.wikipedia.org/wiki/AlphaFold

  50. Source: Wikipedia
    Title: Neural scaling law
    Link: https://en.wikipedia.org/wiki/Neural_scaling_law

  51. Source: Wikipedia
    Title: Bitter lesson
    Link: https://en.wikipedia.org/wiki/Bitter_lesson

  52. Source: Wikipedia
    Title: OpenAI
    Link: https://en.wikipedia.org/wiki/OpenAI

  53. Source: scholar.google.com
    Link: https://scholar.google.com/citations?hl=en&user=KNr3vb4AAAAJ

  54. Source: nature.com
    Link: https://www.nature.com/articles/s41392-023-01381-z

  55. Source: incompleteideas.net
    Link: https://www.incompleteideas.net/IncIdeas/BitterLesson.html

  56. Source: cs.utexas.edu
    Title: bitter lesson
    Link: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

  57. Source: github.com
    Title: Machine Translation
    Link: https://github.com/abmitra84/Machine_Translation

  58. Source: openreview.net
    Link: https://openreview.net/pdf?id=H4DqfPSibmx

  59. Source: yenguage.github.io
    Link: https://yenguage.github.io/natural%20language%20processing/GPT/

  60. Source: linkedin.com
    Link: https://www.linkedin.com/posts/stanfordhai_new-the-stanford-center-for-research-activity-7404221342213955586-cLzX

  61. Source: linkedin.com
    Link: https://www.linkedin.com/company/openai

  62. Source: gogoduck912.github.io
    Title: bitter lesson
    Link: https://gogoduck912.github.io/blog/bitter-lesson/

  63. Source: radical.vc
    Title: stanford hai ai index report 2025
    Link: https://radical.vc/stanford-hai-ai-index-report-2025/
