Did transformers really replace deep learning?

Transformers changed the main layer type inside language models, but they still rely on stacked neural layers trained on large datasets.

On this page

  • What makes a transformer a deep neural network
  • How stacked attention layers refine representations
  • Why the misconception matters

Introduction

When transformers became the dominant architecture for language models, many people concluded that attention had replaced deep learning. That interpretation is understandable but incorrect. Attention changed how modern language models process information, yet transformers remain deep neural networks trained with the same broad principles that define deep learning: large datasets, layered representations, gradient-based optimisation, and learned parameters distributed across many neural layers.

The transformer revolution was therefore an architectural shift within deep learning, not a departure from it. Attention replaced specific components such as recurrent processing in many language systems, but it did not replace the deeper idea of learning hierarchical representations through multi-layer neural networks. The distinction matters because it affects how we understand both the strengths and limitations of modern AI. [arXiv:1706.03762, Attention Is All You Need; Google Research]

Did transformers really replace deep learning?

The short answer is no.

Deep learning is a broad approach to machine learning in which neural networks learn representations through multiple layers of computation. A transformer is one particular neural-network architecture within that broader family. The famous 2017 paper Attention Is All You Need proposed replacing recurrence and convolution for sequence processing with attention mechanisms, but it did not abandon neural networks themselves. The authors explicitly introduced the transformer as a new network architecture rather than an alternative to deep learning. [arXiv:1706.03762, Attention Is All You Need; NeurIPS Papers]

Part of the confusion comes from the paper’s title. “Attention Is All You Need” is often interpreted as meaning that attention alone performs all the work. In practice, transformer models contain multiple interacting components, including attention layers, feed-forward neural networks, residual connections, normalisation mechanisms, and deep layer stacking. Attention is central, but it is only one part of the overall system. [arXiv:1706.03762, Attention Is All You Need; Medium]

A useful analogy is that jet engines did not replace aviation. They replaced an earlier propulsion method while remaining part of the broader field of flight. Similarly, attention replaced certain older neural-network mechanisms while remaining inside the deep-learning framework.

What makes a transformer a deep neural network?

A transformer satisfies the defining characteristics of deep learning.

First, it consists of many stacked layers. Models such as BERT are explicitly described as deep bidirectional transformers, with versions containing 12 or 24 transformer blocks arranged in sequence. Each layer transforms the representation produced by previous layers, creating progressively richer abstractions. [arXiv:1810.04805, BERT; ACL Anthology]

Second, transformers learn their behaviour from data rather than from hand-written rules. During training, the model's parameters (hundreds of millions in BERT, billions or more in the largest language models) are adjusted through gradient descent to reduce prediction error. This is the same learning process used across modern deep learning. [arXiv:1810.04805, BERT]
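The idea of adjusting parameters by gradient descent can be sketched with a deliberately tiny example. This is an illustration under stated assumptions, not how any real transformer is trained: the "model" has a single parameter `w`, the data follow y = 2x, and the names `loss`, `grad` and `train` are hypothetical.

```python
# Toy data generated by y = 2x; the "model" is y_hat = w * x with one learnable parameter.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Mean squared prediction error for parameter w."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(w=0.0, learning_rate=0.01, steps=200):
    """Repeatedly step w against the gradient, recording the loss at each step."""
    history = [loss(w)]
    for _ in range(steps):
        w -= learning_rate * grad(w)
        history.append(loss(w))
    return w, history

w, history = train()
print(f"learned w = {w:.4f}, loss {history[0]:.3f} -> {history[-1]:.6f}")
```

A transformer applies the same loop to a vastly larger set of parameters, with the gradients computed by backpropagation through every layer rather than by a hand-derived formula.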

Third, transformers build hierarchical representations. Early layers tend to capture relatively local or shallow patterns, while later layers develop more abstract semantic information. This layered progression is a hallmark of deep neural networks. Research examining transformer feed-forward layers and model internals has found evidence that different layers contribute different levels of abstraction rather than performing a single flat computation. [arXiv:2012.14913, Transformer Feed-Forward Layers Are Key-Value Memories]

If the attention mechanism were removed from a transformer, it would no longer be the same architecture. But if the deep layered structure were removed, it would no longer be a deep-learning model at all.

How stacked attention layers refine representations

Attention is powerful because it allows tokens to exchange information directly across a sequence. However, a single attention operation is not what gives transformers their remarkable capabilities.

The strength comes from repetition. A token representation is updated, passed to another layer, updated again, and refined through many stages. Each layer receives the results of previous computations and performs additional transformations. This iterative refinement is fundamentally a deep-learning process. [Medium: Attention Is All You Need: A Complete Guide to Transformers; poloclub.github.io]
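The update-and-pass-on loop can be sketched in plain Python. This is a minimal illustration under several assumptions, not a faithful implementation of any published model: it uses a single attention head, random untrained weights, no biases, a layer norm without learned scale, and hypothetical helper names such as `make_block` and `run_stack`.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(x, y):
    """Element-wise sum, used for the residual connections."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: tokens exchange information."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)

def feed_forward(x, w1, w2):
    """Position-wise two-layer network with a ReLU non-linearity."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_block(rng, d_model, d_ff):
    """Random (untrained) weights for one block; purely illustrative."""
    return {
        "wq": random_matrix(rng, d_model, d_model),
        "wk": random_matrix(rng, d_model, d_model),
        "wv": random_matrix(rng, d_model, d_model),
        "w1": random_matrix(rng, d_model, d_ff),
        "w2": random_matrix(rng, d_ff, d_model),
    }

def block(x, p):
    """Attention sub-layer, then feed-forward sub-layer, each with residual + norm."""
    x = layer_norm(add(x, self_attention(x, p["wq"], p["wk"], p["wv"])))
    return layer_norm(add(x, feed_forward(x, p["w1"], p["w2"])))

def run_stack(x, blocks):
    """Pass token representations through every block, keeping each intermediate state."""
    states = [x]
    for p in blocks:
        states.append(block(states[-1], p))
    return states

rng = random.Random(0)
d_model, d_ff, n_tokens, n_layers = 4, 16, 3, 6
tokens = random_matrix(rng, n_tokens, d_model)
blocks = [make_block(rng, d_model, d_ff) for _ in range(n_layers)]
states = run_stack(tokens, blocks)
for depth, state in enumerate(states):
    print(depth, [round(v, 2) for v in state[0]])
```

Printing the first token at each depth shows the same vector being rewritten layer after layer. The shape never changes; only the content of the representation does, which is the sense in which depth refines it.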

Consider a sentence containing ambiguity:

“The scientist thanked the engineer because she solved the problem.”

One layer might help connect “she” with potential referents. Later layers can combine broader context, grammatical cues, and semantic information to strengthen one interpretation over another. The final representation emerges through multiple stages rather than a single attention calculation. [poloclub.github.io: LLM Transformer Model Visually Explained]

Equally important, attention layers are paired with feed-forward neural networks. These feed-forward components introduce additional transformations and non-linearity, enabling the model to build complex representations. Analyses of transformers show that these feed-forward networks account for a large share of the model’s parameters and perform substantial representational work. [arXiv:2012.14913, Transformer Feed-Forward Layers Are Key-Value Memories; Medium]
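A rough count makes the parameter share concrete. The sketch below assumes the base configuration reported in the original transformer paper (model width 512, feed-forward width 2048) and counts only the weight matrices inside one block; biases, layer-norm parameters and embeddings are left out, and the function name is hypothetical.

```python
def block_parameter_counts(d_model, d_ff):
    """Weight-matrix parameter counts for one transformer block (biases and norms omitted)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # expand to d_ff, then project back to d_model
    return attention, feed_forward

attention, feed_forward = block_parameter_counts(d_model=512, d_ff=2048)
share = feed_forward / (attention + feed_forward)
print(attention, feed_forward, f"{share:.1%}")  # prints 1048576 2097152 66.7%
```

Under these assumptions the feed-forward sub-layer holds about two thirds of each block's weights. The exact share varies with the chosen widths and with how embeddings are counted.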

The practical lesson is that modern language models succeed because of the interaction between attention and depth, not because attention acts alone.

Why attention alone is not enough

Research has repeatedly shown that pure attention mechanisms are insufficient without the surrounding deep-learning machinery.

One theoretical analysis found that self-attention networks stripped of supporting components such as feed-forward layers and residual connections lose rank as depth increases: the output converges toward a matrix in which every token has nearly the same representation. The study argued that the additional neural-network structures are essential for preventing this collapse and maintaining expressive power. [arXiv:2103.03404, Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth]
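The tendency toward uniform outputs can be illustrated with a toy experiment. This is a sketch under strong assumptions rather than a reproduction of the cited analysis: the query, key and value projections are the identity instead of learned weights, the inputs are small random vectors, and `spread` is a hypothetical stand-in for the paper's formal rank measure.

```python
import math
import random

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def pure_attention(x):
    """One self-attention step with identity projections, no residual, no feed-forward."""
    scale = math.sqrt(len(x[0]))
    out = []
    for qi in x:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in x])
        # Each output token is a weighted average of all input tokens.
        out.append([sum(w * row[c] for w, row in zip(weights, x)) for c in range(len(qi))])
    return out

def spread(x):
    """Largest gap between any two tokens along any single feature."""
    return max(max(col) - min(col) for col in zip(*x))

rng = random.Random(0)
tokens = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(5)]

state = tokens
spreads = [spread(state)]
for _ in range(10):
    state = pure_attention(state)
    spreads.append(spread(state))

for depth, value in enumerate(spreads):
    print(f"layer {depth:2d}: spread = {value:.2e}")
```

Because every attention step replaces each token with an average of the others, the tokens are pulled together and the spread shrinks sharply within a few layers. In a full transformer, the residual connections and feed-forward layers keep re-injecting token-specific information, which is the role the cited analysis attributes to them.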

This finding highlights an important misconception. Attention determines which pieces of information should influence one another, but it does not automatically provide all the computation needed to build sophisticated internal representations. Other neural-network components perform critical roles, including transforming, storing, refining, and stabilising information. [arXiv:2103.03404, Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth]

Even the transformer architecture described in the original paper includes far more than attention. Every encoder and decoder layer combines attention mechanisms with feed-forward neural networks and other supporting operations. The architecture’s success comes from the system as a whole. [arXiv:1706.03762, Attention Is All You Need; Medium]

Why the misconception matters

Treating attention as a replacement for deep learning can lead to misunderstandings about how modern AI advances occur.

One misconception is that a single clever mechanism suddenly solved language understanding. In reality, progress emerged from combining several ideas: attention-based architectures, large-scale training data, powerful hardware, optimisation techniques, and increasingly deep models. Removing any of these elements significantly changes performance. [arXiv:1706.03762, Attention Is All You Need; Google Research]

Another misconception is that attention somehow eliminates the challenges associated with deep learning. Transformers still require vast amounts of data, extensive computation, careful training procedures, and large parameter counts. They inherit many of the same strengths and weaknesses as other deep neural networks. [arXiv:1810.04805, BERT]

Understanding this distinction also helps explain why researchers continue experimenting with new architectures. If attention were literally all that mattered, architectural innovation would have stopped. Instead, researchers continue studying alternative mechanisms, more efficient replacements, and modifications to self-attention itself. The ongoing search reflects the fact that attention is a highly successful component, not the final word in neural-network design. [arXiv:2308.07661, Attention Is Not All You Need Anymore]

The real takeaway: attention changed deep learning from the inside

The transformer era represents one of the most important architectural shifts in artificial intelligence. Attention replaced recurrence as the dominant way of handling language sequences and enabled models to scale far beyond earlier approaches. Yet transformers remain deeply rooted in the principles of deep learning.

A modern large language model is still a deep neural network. It learns from data, builds layered representations, adjusts parameters through optimisation, and relies on many stacked computational layers. Attention changed the internal machinery of those layers, but it did not replace the broader framework that made them possible. [arXiv:1706.03762, Attention Is All You Need; Wikipedia; Google Research]

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/abs/1706.03762
    Source snippet

    arXiv[1706.03762] Attention Is All You NeedJune 12, 2017 — 12 Jun 2017 — We propose a new simple network architecture, the Transformer, b...

    Published: June 12, 2017

  2. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
    Source snippet

    Attention Is All You NeedThe paper introduced a new deep learning architecture known as the transformer, based on the attention mechan...

  3. Source: papers.neurips.cc
    Title: 7181 attention is all you need
    Link: https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
    Source snippet

    NeurIPS PapersAttention is All you Needby A Vaswani · Cited by 252609 — We propose a new simple network architecture, the Transformer, ba...

  4. Source: medium.com
    Link: https://medium.com/%40alejandro.itoaramendia/attention-is-all-you-need-a-complete-guide-to-transformers-8670a3f09d02
    Source snippet

    Attention Is All You Need: A Complete Guide to TransformersThe transformer follows an architecture containing stacked attention la...

  5. Source: arxiv.org
    Link: https://arxiv.org/abs/1810.04805
    Source snippet

    BERT: Pre-training of Deep Bidirectional Transformers for...October 11, 2018 — by J Devlin · 2018 · Cited by 167535 — Unlike recent...

    Published: October 11, 2018

  6. Source: arxiv.org
    Title: arXiv Transformer Feed-Forward Layers Are Key-Value Memories
    Link: https://arxiv.org/abs/2012.14913
    Source snippet

    Transformer Feed-Forward Layers Are Key-Value MemoriesDecember 29, 2020...

    Published: December 29, 2020

  7. Source: poloclub.github.io
    Link: https://poloclub.github.io/transformer-explainer/
    Source snippet

    LLM Transformer Model Visually ExplainedThe core innovation and power of Transformers lie in their use of self-attention mechanism, which...

  8. Source: medium.com
    Link: https://medium.com/%40kuberca.io/deep-dive-into-transformer-layers-self-attention-feedforward-and-add-norm-1f59395d376b
    Source snippet

    ayer introduces non-linearity and complexity, and the Add & Norm...Read more...

  9. Source: arxiv.org
    Link: https://arxiv.org/abs/2103.03404
    Source snippet

    Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with DepthMarch 5, 2021...

    Published: March 5, 2021

  10. Source: arxiv.org
    Title: arXiv Attention Is Not All You Need Anymore
    Link: https://arxiv.org/abs/2308.07661
    Source snippet

    Attention Is Not All You Need AnymoreAugust 15, 2023...

    Published: August 15, 2023

  11. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Transformer
    Source snippet

    TransformerA transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or mu...

  12. Source: Wikipedia
    Title: BERT (language model)
    Link: https://en.wikipedia.org/wiki/BERT_%28language_model%29
    Source snippet

    BERT (language model)Bidirectional encoder representations from transformers (BERT) is a language model introduced in October 2018 by...

    Published: October 2018

  13. Source: Wikipedia
    Title: Transformer (deep learning)
    Link: https://en.wikipedia.org/wiki/Transformer_%28deep_learning%29
    Source snippet

    Transformer (deep learning)In deep learning, the transformer is an artificial neural network architecture based on the multi-head atte...

  14. Source: medium.com
    Link: https://medium.com/%40punya8147_26846/understanding-feed-forward-networks-in-transformers-77f4c1095c67
    Source snippet

    They take the context-rich outputs from self-attention layers and transform them...Read more...

  15. Source: medium.com
    Link: https://medium.com/data-science/guide-to-llm-part-1-bert-3d1bf880386a

  16. Source: medium.com
    Link: https://medium.com/%40tnodecode/bert-bidirectional-encoder-representations-from-transformers-0696d29f9d11
    Source snippet

    BERT — Bidirectional Encoder Representations from...Layer by layer, it constructs the deep contextual representations that made BERT suc...

  17. Source: medium.com
    Link: https://medium.com/the-owl/paper-reading-club-day-1-paper-1-bert-bidirectional-encoder-representations-from-transformers-b4c0d9a3a5ef
    Source snippet

    BERT: Bidirectional Encoder Representations from...BERT's model architecture is a multi-layer bidirectional Transformer encoder based on...

  18. Source: medium.com
    Title: Attention Is All You Need!
    Link: https://medium.com/data-science-collective/attention-is-all-you-need-661cb8db5f21
    Source snippet

    Demystifying the Transformer…The Transformer architecture represents one of the most significant breakthroughs in artificial intelligence...

  19. Source: medium.com
    Link: https://medium.com/image-processing-with-python/the-feedforward-network-ffn-in-the-transformer-model-6bb6e0ff18db
    Source snippet

    The Feedforward Network (FFN) in The Transformer ModelIn summary, the Feedforward Network is a cornerstone of the Transformer architectur...

  20. Source: medium.com
    Link: https://medium.com/%40adnanmasood/attention-is-all-you-need-explained-like-youre-smart-and-busy-2a3d7436144f
    Source snippet

    ism, and reshaped modern language models. Adnan Masood, PhD.Read more...

  21. Source: medium.com
    Title: attention is all you need the paper that revolutionized ai 6e606e6a847b
    Link: https://medium.com/%40vijay.poudel1/attention-is-all-you-need-the-paper-that-revolutionized-ai-6e606e6a847b
    Source snippet

    Attention Is All You Need: The Paper That Revolutionized AIIn 2017, a groundbreaking paper titled “Attention Is All You Need” was publish...

  22. Source: dilipkumar.medium.com
    Title: transformers neural network architecture a6fd825d2d5f
    Link: https://dilipkumar.medium.com/transformers-neural-network-architecture-a6fd825d2d5f
    Source snippet

    Neural network architecture | by Dilip KumarMasked Multi-Head Attention: This is a self-attention layer that looks at the sentence being...

  23. Source: medium.com
    Link: https://medium.com/read-a-paper/bert-read-a-paper-811b836141e9
    Source snippet

    Read A Paper | BERT | Language ModelA language representation model that is designed to pre-train deep bidirectional representations from...

  24. Source: medium.com
    Link: https://medium.com/%40robin5002234/attention-is-all-you-need-a-deep-dive-into-transformer-architecture-8c34753098c7
    Source snippet

    model to weigh the importance of different words in a sentence when encoding or...Read more...

  25. Source: shreyansh26.github.io
    Link: https://shreyansh26.github.io/post/2021-05-09_pretraining_deep_bidirectional_transformers_bert/
    Source snippet

    Paper Summary #4 - BERT: Pre-training of Deep...09 May 2021 — The underlying architecture of BERT is a multi-layer Transformer encoder...

    Published: May 2021

  26. Source: github.com
    Link: https://github.com/GitYCC/machine-learning-papers-summary/blob/master/nlp/bert.md
    Source snippet

    BERT uses masked language models to enable pre-trained...Read more...

  27. Source: youtube.com
    Title: Transformers, explained: Understand the model behind GPT, BERT, and T5
    Link: https://www.youtube.com/watch?v=SZorAJ4I-sA
    Source snippet

    What are Transformer Neural Networks?...

  28. Source: youtube.com
    Title: What are Transformers (Machine Learning Model)?
    Link: https://www.youtube.com/watch?v=ZXiruGOCn9s
    Source snippet

    Transformers Explained | Simple Explanation of Transformers...

  29. Source: youtube.com
    Title: Transformers Explained | Simple Explanation of Transformers
    Link: https://www.youtube.com/watch?v=ZhAz268Hdpw

  30. Source: research.google
    Link: https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/
    Source snippet

    Google ResearchTransformer: A Novel Neural Network Architecture for...In “Attention Is All You Need”, we introduce the Transformer, a no...

  31. Source: linkedin.com
    Link: https://www.linkedin.com/pulse/transformers-simplified-guide-attention-all-you-need-moiz-asghar-zdvmc
    Source snippet

    Transformers Simplified: A Guide to Attention Is All You NeedThe Transformer model consists of two main parts: the encoder and the decode...

  32. Source: aclanthology.org
    Link: https://aclanthology.org/N19-1423.pdf
    Source snippet

    (i.e., Transformer blocks) as L, the hidden size as. H, and the number of self-attention heads as A.3. We...Read more...

  33. Source: aclanthology.org
    Link: https://aclanthology.org/N19-1423/
    Source snippet

    We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.Re...

  34. Source: mbrenndoerfer.com
    Title: transformer feed forward networks
    Link: https://mbrenndoerfer.com/writing/transformer-feed-forward-networks
    Source snippet

    Michael BrenndoerferFeed-Forward Networks in Transformers: Architecture...Jun 8, 2025 — Learn how feed-forward networks provide nonlinea...

  35. Source: techradar.com
    Link: https://www.techradar.com/pro/what-are-transformer-models
    Source snippet

    Transformers utilize a structure composed of encoders, decoders, and a dynamic attention mechanism, allowing more efficient handling of l...

  36. Source: electronics-tutorials.ws
    Link: https://www.electronics-tutorials.ws/transformer/transformer-basics.html
    Source snippet

    Transformer Basics and Transformer PrinciplesTransformers are electrical devices consisting of two or more coils of wire used to transfer...

  37. Source: geeksforgeeks.org
    Title: getting started with transformers
    Link: https://www.geeksforgeeks.org/machine-learning/getting-started-with-transformers/
    Source snippet

    Transformers in Machine Learning11 May 2026 — Transformer is a neural network architecture used for various machine learning tasks, espec...

    Published: May 2026

  38. Source: transformerindia.com
    Link: https://www.transformerindia.com/
    Source snippet

    on transformers ranging from 250KVA to 10,000KVA and up to 33 kV...

  39. Source: linkedin.com
    Link: https://www.linkedin.com/posts/alexxubyte_the-most-important-paper-attention-is-all-activity-7404924500187865088-7LDp
    Source snippet

    Transformer Model Explained: Attention Is All You NeedA Transformer is simply a neural network where inputs talk to each other. That comm...

  40. Source: huggingface.co
    Title: attention is all you need
    Link: https://huggingface.co/blog/Esmail-AGumaan/attention-is-all-you-need
    Source snippet

    Transformers2 Jul 2024 — The paper titled "Attention Is All You Need" introduces a new network architecture called the Transformer, which...

  41. Source: codecademy.com
    Link: https://www.codecademy.com/article/transformer-architecture-self-attention-mechanism
    Source snippet

    Add & norm...Read more...

Additional References

  1. Source: d2l.ai
    Link: https://www.d2l.ai/chapter_attention-mechanisms-and-transformers/index.html
    Source snippet

    11. Attention Mechanisms and TransformersThe core idea behind the Transformer model is the attention mechanism, an innovation that was or...

  2. Source: linkedin.com
    Link: https://www.linkedin.com/pulse/bert-bidirectional-encoder-representations-from-shradha-agarwal-xnelc
    Source snippet

    BERT-Bidirectional Encoder Representations from...Unlike traditional models, BERT does not mask the upper triangle embeddings before app...

  3. Source: datacamp.com
    Link: https://www.datacamp.com/tutorial/how-transformers-work
    Source snippet

    How Transformers Work: A Detailed Exploration of...Transformers are neural network architectures that use self-attention mechanisms to p...

  4. Source: web.stanford.edu
    Link: https://web.stanford.edu/~jurafsky/slp3/8.pdf
    Source snippet

    stanford.edu8 Transformers... self-attention layer, includes three other kinds of layers: (1) a feedforward layer, (2) residual connectio...

  5. Source: d2l.ai
    Link: https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html

  6. Source: inspirehep.net
    Link: https://inspirehep.net/literature/2702854

  7. Source: towardsai.net
    Link: https://towardsai.net/p/machine-learning/attention-is-all-you-need-a-deep-dive-into-the-revolutionary-transformer-architecture
    Source snippet

    A Deep Dive into the Revolutionary Transformer Architecture10 Apr 2025 — This paper introduced the Transformer architecture, a novel appr...

  8. Source: blog.paperspace.com
    Title: bert pre training of deep bidirectional transformers for language understanding
    Link: https://blog.paperspace.com/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding/
    Source snippet

    paperspace.comBERT: Pre-training of Deep Bidirectional Transformers for...This study proposes improving fine-tuning-based techniques by...

  9. Source: quantpedia.com
    Title: bert model bidirectional encoder representations from transformers
    Link: https://quantpedia.com/bert-model-bidirectional-encoder-representations-from-transformers/
    Source snippet

    BERT Model – Bidirectional Encoder Representations from...12 Apr 2023 — The BERT model employs fine-tuning and bidirectional transformer...

  10. Source: geeksforgeeks.org
    Title: architecture and working of transformers in deep learning
    Link: https://www.geeksforgeeks.org/deep-learning/architecture-and-working-of-transformers-in-deep-learning/
    Source snippet

    18 Oct 2025 — Feed-Forward Neural Network: This sub-layer processes the combined output of the masked self-attention and encoder-decoder...
