Why did attention work beyond language?

Attention became portable because many problems can be represented as tokens, from image patches to protein residues.

On this page

  • Images as patch sequences
  • Protein relationships and specialised attention
  • What portability does not mean

Introduction

Transformers were invented for language, but one of the most important discoveries in modern artificial intelligence was that attention is not tied to words. Once researchers realised that many kinds of data could be represented as sequences of tokens, the same basic architecture began working in fields as different as computer vision and molecular biology. Images could be broken into patches and treated like visual “words”. Proteins could be represented as sequences of amino-acid residues and analysed as a kind of biological language. The result was a rapid expansion of Transformer-based systems far beyond text processing. [arXiv]

This portability mattered because it suggested that attention was capturing a more general principle: learning relationships between elements in a sequence, regardless of whether those elements were words, image regions, or biological building blocks. The success of Vision Transformers and protein language models provided some of the strongest evidence that the core ideas behind Transformers were not language-specific innovations but broadly useful computational tools. [arXiv]

Images as patch sequences

The move from language to images required a simple but powerful change in perspective. Traditional computer vision systems usually relied on convolutional neural networks (CNNs), which process images through local filters designed to exploit spatial structure. Transformers, by contrast, expected sequences.

Researchers behind the Vision Transformer (ViT) showed that an image could be divided into small patches, often 16×16 pixels, and that each patch could be converted into a token embedding by flattening its pixels and applying a learned linear projection. Once this transformation was performed, the image became a sequence much like a sentence. The standard Transformer encoder could then process the patch sequence using self-attention. [arXiv]
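The patch step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the ViT authors' code: the function names `patchify` and `embed_patches`, the embedding width of 64, and the random projection weights are all assumptions made for the example (in a real model the projection is learned).

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must divide by patch size"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right and top to bottom.
    return blocks.reshape(rows * cols, patch * patch * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear projection from raw patch pixels to token embeddings."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a real RGB image
patches = patchify(image, patch=16)          # shape (196, 768)

d_model = 64                                 # hypothetical embedding width
weight = rng.normal(scale=0.02, size=(patches.shape[1], d_model))  # random, not learned
bias = np.zeros(d_model)
tokens = embed_patches(patches, weight, bias)  # shape (196, 64)
```

A 224×224 image cut into 16×16 patches yields 14 × 14 = 196 tokens, each starting as 16 × 16 × 3 = 768 raw values, so the "sentence" the encoder sees is 196 tokens long.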

The significance of this result was not merely conceptual. ViT demonstrated that a largely unchanged Transformer architecture could achieve highly competitive image-classification performance when trained on sufficiently large datasets. Rather than hard-coding assumptions about local image structure, the model learned which patches should influence each other through attention. [arXiv]

Attention offered a particular advantage for capturing long-range relationships. In a photograph, a single object may span distant regions of the image. A Transformer can connect information from those regions directly within a single attention layer, whereas convolutional networks often need many stacked layers before distant pixels can influence each other. [arXiv]
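A small sketch makes the "direct connection" concrete. It assumes a single attention head with random, untrained weights and 196 random token vectors standing in for patch embeddings; the helper names are illustrative only. The point is structural: the attention matrix has one entry for every pair of tokens, so the first and last patches are linked in one step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # one score per pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model = 196, 32                   # e.g. a 14x14 grid of patches
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                 for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)

# Weight that the top-left patch (index 0) places on the bottom-right patch.
corner_to_corner = weights[0, -1]
```

With untrained weights the value of `corner_to_corner` is meaningless; training is what decides which of these pairwise links become strong.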

The success of ViT also encouraged researchers to import additional language-model ideas into vision. Techniques inspired by BERT, such as masking portions of the input and training the model to reconstruct missing information, were adapted to images. Models such as BEiT treated image patches as tokens and learned visual representations through masked prediction tasks that closely resembled language-model pretraining. [arXiv]

Protein relationships and specialised attention

Proteins provided a very different test. They are biological molecules built from chains of amino acids, with none of the grid structure of an image. Yet a protein can still be written as a sequence, which makes it a natural candidate for Transformer-style modelling.

Researchers began treating amino-acid sequences in a way analogous to sentences. Instead of predicting missing words, protein language models learned to predict missing amino acids from large databases of biological sequences. Through this process, Transformer models learned statistical patterns that reflected evolutionary and structural relationships within proteins. [Science, GitHub]
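The training signal described above can be sketched as a data-preparation step. The toy sequence, the `<mask>` token string, and the 15% masking rate (borrowed from the BERT convention) are assumptions for illustration, not details taken from any particular protein model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # one-letter codes for the 20 standard residues
MASK = "<mask>"                        # hypothetical mask token


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return the masked tokens and the answers."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]     # what the model should predict here
        tokens[pos] = MASK             # what the model actually sees
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"    # made-up 22-residue example
tokens, targets = mask_sequence(sequence)
```

A model trained on this task receives `tokens` and is scored on how well it recovers each entry in `targets`. To do that well it has to use the surrounding residues, which is where the learned evolutionary and structural patterns come from.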

One striking result was that models trained only on protein sequences could learn representations that correlated with biological properties such as structure and function. The ESM family of models demonstrated that large-scale Transformer training on protein data could capture information useful for structure prediction and protein design. [Science, GitHub]

AlphaFold2 provided an even more influential example. Although AlphaFold2 is not simply a standard language model, its Evoformer component uses attention-based mechanisms to reason about relationships between amino acids and evolutionary information derived from related protein sequences. Rather than treating protein prediction as a straightforward sequence task, the system combined sequence representations with pairwise relationship representations and repeatedly exchanged information between them. This attention-centred design was a major contributor to AlphaFold2’s breakthrough performance in predicting protein structures. [Nature, PMC]
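The idea of two representations exchanging information can be illustrated with a heavily simplified sketch. This is not AlphaFold2's Evoformer: it uses a single sequence instead of a multiple-sequence alignment, one attention head, random weights, and omits normalisation, gating, and the triangle updates. All function and variable names are hypothetical. It keeps only two ingredients: a pair representation that biases attention between residues, and an outer-product step that writes sequence information back into the pair representation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pair_biased_attention(seq, pair, w_q, w_k, w_v, w_bias):
    """Attention over residues; the pair representation shifts each logit."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    logits = q @ k.T / np.sqrt(k.shape[-1])   # (L, L)
    logits = logits + pair @ w_bias           # (L, L, c) @ (c,) -> (L, L)
    return softmax(logits) @ v                # (L, d)


def update_pair_from_seq(seq, pair, w_a, w_b, w_out):
    """Write sequence information back into the pair representation."""
    a, b = seq @ w_a, seq @ w_b               # each (L, h)
    outer = np.einsum("ih,jg->ijhg", a, b)    # (L, L, h, h)
    outer = outer.reshape(len(seq), len(seq), -1)
    return pair + outer @ w_out               # (L, L, c)


rng = np.random.default_rng(0)
L, d, c, h = 10, 16, 8, 4                     # toy sizes, chosen arbitrarily
seq = rng.normal(size=(L, d))                 # one vector per residue
pair = np.zeros((L, L, c))                    # one vector per residue pair
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
w_bias = rng.normal(size=c)
w_a, w_b = (rng.normal(scale=d ** -0.5, size=(d, h)) for _ in range(2))
w_out = rng.normal(scale=0.1, size=(h * h, c))

for _ in range(2):                            # two rounds of exchange
    seq = seq + pair_biased_attention(seq, pair, w_q, w_k, w_v, w_bias)
    pair = update_pair_from_seq(seq, pair, w_a, w_b, w_out)
```

In the first round the pair representation is all zeros and contributes nothing; from the second round onward, what the model has written about each residue pair changes how strongly those residues attend to each other.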

The biological setting highlights why attention proved useful. Amino acids that are far apart in a protein sequence can end up physically adjacent after the protein folds into its three-dimensional shape. Attention mechanisms provide a natural way to model these long-range dependencies because every position can potentially interact with every other position. [PMC]
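A toy geometry shows what "far apart in sequence, close in space" means in practice. The coordinates below describe an invented hairpin, not a real protein: the 3.8 Å spacing between neighbouring residues, the 5 Å gap between strands, and the 8 Å contact cut-off are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np


def hairpin_coords(n_per_strand=10, spacing=3.8, gap=5.0):
    """Toy chain: one straight strand out, then a second strand back alongside it."""
    out = [(spacing * i, 0.0, 0.0) for i in range(n_per_strand)]
    back = [(spacing * (n_per_strand - 1 - j), gap, 0.0) for j in range(n_per_strand)]
    return np.array(out + back)


def long_range_contacts(coords, max_dist=8.0, min_sep=12):
    """Pairs (i, j) that are close in space but at least min_sep apart in sequence."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_sep, n)
            if dist[i, j] < max_dist]


coords = hairpin_coords()                     # 20 residues
contacts = long_range_contacts(coords)

# Residues 0 and 19 are the two ends of the chain, yet sit side by side.
first_to_last = float(np.linalg.norm(coords[0] - coords[-1]))
```

Residues 0 and 19 are 19 positions apart in the chain but only one strand gap apart in space. A model that mixes information only between sequence neighbours would need many steps to relate them, whereas attention scores that pair directly.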

The impact extended beyond structure prediction. Protein Transformers are increasingly used for analysing mutations, predicting functional properties, generating novel proteins, and supporting drug-discovery research. Recent generations of ESM models have continued this trend by scaling Transformer-based approaches to billions of protein sequences. [EvolutionaryScale, Reuters]

Why attention transferred so successfully

The evidence from images and proteins points to a common explanation. Attention tends to work well when a task depends heavily on relationships among elements rather than on isolated features.

In language, the model must determine how words relate across a sentence or document. In images, it must determine how distant patches relate to form objects and scenes. In proteins, it must identify relationships among amino acids that influence folding and function. Although the data types differ dramatically, the computational challenge is similar: identify important interactions across a collection of tokens. [arXiv]

This observation helped establish Transformers as a general-purpose architecture. Instead of designing entirely different neural-network families for every domain, researchers could often adapt the same core attention machinery by changing the tokenisation strategy and training objective. [arXiv]

What portability does not mean

The success of Transformers across domains does not mean that all problems can be solved by treating everything as text.

Different domains still require specialised adaptations. Vision Transformers need methods for encoding spatial information because images possess geometric structure that sentences do not. Protein systems often incorporate evolutionary information, structural constraints, or specialised attention mechanisms designed for biological data. AlphaFold2’s Evoformer, for example, is far more specialised than a standard language Transformer. [PMC, Blopig]
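The need for spatial encoding follows from a property of plain self-attention that a short experiment can show: without position information, shuffling the tokens simply shuffles the outputs, so the layer cannot tell one arrangement of patches from another. The sketch below assumes a single head with random weights and a random stand-in for position embeddings (in ViT these are learned); all names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no position information)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


rng = np.random.default_rng(0)
n, d = 9, 8                                   # e.g. a 3x3 grid of patches
tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
perm = np.arange(n)[::-1]                     # rearrange patches: reverse their order

# Without positions: attending to shuffled tokens gives the same outputs, shuffled.
out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(tokens[perm], w_q, w_k, w_v)
order_blind = np.allclose(out[perm], out_shuffled)

# With a position vector added to each slot, the arrangement now matters.
pos = rng.normal(scale=0.5, size=(n, d))      # random stand-in for learned embeddings
out_pos = self_attention(tokens + pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(tokens[perm] + pos, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)
```

Both flags come out true: the bare layer is blind to where each patch sits, and adding per-slot position vectors is one simple way to restore that information.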

Portability also does not imply that attention is always the best solution. Vision Transformers initially required very large datasets and substantial computational resources before outperforming established convolutional approaches. Researchers continue to explore hybrid architectures and more efficient alternatives for particular tasks. [arXiv]

Perhaps the most important lesson is narrower. The spread of Transformers from text to images and proteins demonstrated that attention captures a broadly useful way of modelling relationships. The architecture’s success came not from a special understanding of language, but from the discovery that many seemingly different problems can be represented as interacting tokens whose relationships can be learned through attention. [Science, arXiv]

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/abs/2010.11929
    Source snippet: An Image is Worth 16x16 Words: Transformers for... (A Dosovitskiy, 2020; published October 22, 2020). A pure transformer...

  2. Source: arxiv.org
    Link: https://arxiv.org/pdf/2010.11929
    Source snippet: arXiv:2010.11929v2 [cs.CV], 3 Jun 2021 (A Dosovitskiy, 2020). Instead, we interpret an image as a sequence of patc...

  3. Source: nature.com
    Link: https://www.nature.com/articles/s41586-021-03819-2
    Source snippet: Highly accurate protein structure prediction with AlphaFold (J Jumper, 2021). Here we provide the first computat...

  4. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10410766/
    Source snippet: The transformative power of transformers in protein structure... (B Moussad, 2023). AlphaFold2 harnessed the power of...

  5. Source: arxiv.org
    Title: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
    Link: https://arxiv.org/abs/2111.15668

  6. Source: arxiv.org
    Title: BEiT: BERT Pre-Training of Image Transformers
    Link: https://arxiv.org/abs/2106.08254
    Source snippet: Published June 15, 2021.

  7. Source: github.com
    Link: https://github.com/facebookresearch/esm
    Source snippet: facebookresearch/esm: Evolutionary Scale Modeling. This repository contains code and pre-trained weights for Transformer protein langua...

  8. Source: blopig.com
    Link: https://www.blopig.com/blog/2021/07/[alphafold-2
    Source snippet: AlphaFold 2 is here: what's behind the structure prediction... (19 Jul 2021). The central idea behind the Evoformer is that the info...

  9. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8329862/
    Source snippet: structure prediction by AlphaFold2: are attention and... (N Bouatta, 2021). This review discusses the AlphaFold2 system...

  10. Source: evolutionaryscale.ai
    Title: esm3 release
    Link: https://www.evolutionaryscale.ai/blog/esm3-release
    Source snippet: Our API and open model allow scientists to explore the frontiers of protein design and synthetic biology.

  11. Source: reuters.com
    Link: https://www.reuters.com/business/healthcare-pharmaceuticals/zuckerbergs-philanthropic-venture-unveils-ai-world-model-drug-discovery-2026-05-27/
    Source snippet: Priscilla Chan, has launched a pioneering AI-powered world model for protein biology aimed at accelerating drug discovery. This model, ba...

  12. Source: nature.com
    Link: https://www.nature.com/articles/s41392-023-01381-z
    Source snippet: AlphaFold2 and its applications in the fields of biology and... (Z Yang, 2023). AlphaFold2 (AF2) is an artificial intel...

  13. Source: nature.com
    Link: https://www.nature.com/articles/s42003-025-08783-5
    Source snippet: Nature 596, 583–589 (2021).

  14. Source: nature.com
    Link: https://www.nature.com/articles/s41592-026-03050-9
    Source snippet: Compressing the collective knowledge of ESM into a single... (T Dinh, 2026). ESM models are pretrained with the masked la...

  15. Source: nature.com
    Title: AlphaFold’s new rival?
    Link: https://www.nature.com/articles/d41586-022-03539-1
    Source snippet: Meta AI predicts shape of 600 million proteins (E Callaway, 2022). Microbial molecules...

  16. Source: arxiv.org
    Link: https://arxiv.org/pdf/2206.04981
    Source snippet: 2206.04981v3 [cs.CV], 16 Feb 2023 (Z Zhang, 2022). Vision transformers (ViT) [Dosovitskiy et al., 2020; Zhao et al...

  17. Source: naokishibuya.github.io
    Title: ViT: Vision Transformer (2020)
    Link: https://naokishibuya.github.io/blog/2022-11-02-vit-vision-transformer-image-classifier-2020/
    Source snippet: Nov 2, 2022. The idea is simple: ViT splits an image into a sequence of image patch embeddings mixed with...

  18. Source: amiteshbadkul.github.io
    Link: https://amiteshbadkul.github.io/blog/2023/esm2-explained/
    Source snippet: Evolutionary Scale Modeling using Protein Language Models (29 Jul 2023). Building on insights from ESM-2, the ESMFold model enables fast, e...

  19. Source: science.org
    Link: https://www.science.org/doi/10.1126/science.ade2574
    Source snippet: Evolutionary-scale prediction of atomic-level protein... (Z Lin, 2023). We trained a family of transformer prot...

  20. Source: huggingface.co
    Link: https://huggingface.co/docs/transformers/en/model_doc/esm
    Source snippet: ESM. This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team.

  21. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8592092/
    Source snippet: (J Skolnick, 2021). Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or...

  22. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10011655/
    Source snippet: and after AlphaFold2: An overview of protein structure... (LMF Bertoline, 2023). In this mini-review, we provide an ove...

  23. Source: esmatlas.com
    Link: https://esmatlas.com/about
    Source snippet: ESM Metagenomic Atlas by Meta AI (01 Nov 2022). The embedding vector is obtained by averaging the final layer activations of the ESM2 trans...
