Why did attention work beyond language?

Introduction

Transformers were invented for language, but one of the most important discoveries in modern artificial intelligence was that attention is not tied to words. Once researchers realised that many kinds of data could be represented as sequences of tokens, the same basic architecture began working in fields as different as computer vision and molecular biology. Images could be broken into patches and treated like visual “words”. Proteins could be represented as sequences of amino-acid residues and analysed as a kind of biological language. The result was a rapid expansion of Transformer-based systems far beyond text processing. [arXiv]

This portability mattered because it suggested that attention was capturing a more general principle: learning relationships between elements in a sequence, regardless of whether those elements were words, image regions, or biological building blocks. The success of Vision Transformers and protein language models provided some of the strongest evidence that the core ideas behind Transformers were not language-specific innovations but broadly useful computational tools. [arXiv]

Images as patch sequences

The move from language to images required a simple but powerful change in perspective. Traditional computer vision systems usually relied on convolutional neural networks (CNNs), which process images through local filters designed to exploit spatial structure. Transformers, by contrast, expected sequences.

The researchers behind the Vision Transformer (ViT) showed that an image could be divided into small patches, often 16×16 pixels, and that each patch could be flattened and projected into a token embedding. After this step the image is a sequence much like a sentence, and a standard Transformer encoder can process the patch sequence using self-attention. [arXiv]
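The patch-splitting step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the ViT authors' code: the function name `patchify`, the 224×224 input size, the embedding width `d_model = 64`, and the random projection `w_embed` (a stand-in for a learned linear layer) are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right and top to bottom.
    return blocks.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # placeholder image, not real data
patches = patchify(image)                  # 14 x 14 = 196 patches of 16*16*3 = 768 values

d_model = 64                               # illustrative embedding width
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
tokens = patches @ w_embed                 # one embedding vector per patch

print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role that a word embedding plays in a language model, which is why the rest of the encoder can stay largely unchanged.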

The significance of this result was not merely conceptual. ViT demonstrated that a largely unchanged Transformer architecture could achieve highly competitive image-classification performance when trained on sufficiently large datasets. Rather than hard-coding assumptions about local image structure, the model learned which patches should influence each other through attention. [arXiv]

Attention offered a particular advantage for capturing long-range relationships. In a photograph, an object may occupy distant regions of the image. A Transformer can connect information from those regions directly in a single attention layer, whereas convolutional architectures, whose filters see only a local neighbourhood, often needed many layers of processing before distant pixels could influence each other strongly. [arXiv]
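A small sketch of single-head scaled dot-product self-attention shows why distance in the sequence does not matter: every token receives a weight for every other token in one step. The token count, dimensions, and random weight matrices below are illustrative assumptions, not values from any trained model.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (n, d) token matrix."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, n): one score per pair of tokens
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 196, 32                   # e.g. a 14 x 14 grid of patch tokens
tokens = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
first_to_last = weights[0, -1]                # top-left patch attending to bottom-right patch
print(weights.shape, first_to_last > 0)
```

The weight between the first and last token is non-zero after a single layer. Whether it is large is something the model has to learn; the architecture only guarantees that the connection is available.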

The success of ViT also encouraged researchers to import additional language-model ideas into vision. Techniques inspired by BERT, such as masking portions of the input and training the model to reconstruct missing information, were adapted to images. Models such as BEiT treated image patches as tokens and learned visual representations through masked prediction tasks that closely resembled language-model pretraining. [arXiv]

Protein relationships and specialised attention

Proteins provided a very different test. They are biological molecules rather than pictures, but they are built from linear chains of amino acids, so they are already sequences. That made them natural candidates for Transformer-style modelling.

Researchers began treating amino-acid sequences in a way analogous to sentences. Instead of predicting missing words, protein language models learned to predict missing amino acids from large databases of biological sequences. Through this process, Transformer models learned statistical patterns that reflected evolutionary and structural relationships within proteins. [Science, GitHub]
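The data preparation behind masked amino-acid prediction can be sketched as follows. This is an assumption-laden toy, not the ESM pipeline: the vocabulary covers only the 20 standard one-letter codes, `MASK_ID` is a hypothetical id for a mask token, the 15% masking rate follows the common BERT-style convention, and the example sequence is an arbitrary string of residues.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # the 20 standard one-letter codes
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)                     # hypothetical id reserved for the mask token

def encode(sequence: str) -> np.ndarray:
    """Map each residue letter to an integer token id."""
    return np.array([TOKEN_TO_ID[aa] for aa in sequence])

def mask_tokens(ids: np.ndarray, rng: np.random.Generator, mask_fraction: float = 0.15):
    """Hide a random subset of positions; a model would be trained to recover them."""
    n_mask = max(1, int(round(mask_fraction * len(ids))))
    positions = np.sort(rng.choice(len(ids), size=n_mask, replace=False))
    masked = ids.copy()
    masked[positions] = MASK_ID
    return masked, positions

rng = np.random.default_rng(0)
sequence = "MKTAYIAKQRQISFVKSHFSRQ"            # arbitrary example, not a curated protein
ids = encode(sequence)
masked, positions = mask_tokens(ids, rng)
targets = ids[positions]                       # what the model should predict at the masked slots
print(masked, positions, targets)
```

Training then amounts to a classification problem at each masked position, with the surrounding residues as context, which is the same objective shape as masked word prediction in text.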

One striking result was that models trained only on protein sequences could learn representations that correlated with biological properties such as structure and function. The ESM family of models demonstrated that large-scale Transformer training on protein data could capture information useful for structure prediction and protein design. [Science, GitHub]

AlphaFold2 provided an even more influential example. Although AlphaFold2 is not simply a standard language model, its Evoformer component uses attention-based mechanisms to reason about relationships between amino acids and evolutionary information derived from related protein sequences. Rather than treating protein prediction as a straightforward sequence task, the system combined sequence representations with pairwise relationship representations and repeatedly exchanged information between them. This attention-centred design was a major contributor to AlphaFold2’s breakthrough performance in predicting protein structures. [Nature, PMC]
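The idea of exchanging information between a per-residue representation and a pairwise representation can be illustrated with a deliberately simplified toy. It is loosely inspired by that description and is not AlphaFold2's Evoformer, which operates on a multiple-sequence-alignment representation and contains many further components. The function `exchange_step`, all parameter names, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exchange_step(seq, pair, p):
    """One toy round of two-way exchange between sequence and pair features."""
    # Pair -> sequence: pair features become an additive bias on the attention scores.
    q, k, v = seq @ p["w_q"], seq @ p["w_k"], seq @ p["w_v"]
    bias = pair @ p["w_bias"]                          # (L, L, c) @ (c,) -> (L, L)
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    seq = seq + softmax(scores, axis=-1) @ v           # residual update of the sequence features

    # Sequence -> pair: combine projected features of residues i and j.
    a, b = seq @ p["w_a"], seq @ p["w_b"]              # (L, c) each
    pair = pair + a[:, None, :] * b[None, :, :]        # (L, L, c)
    return seq, pair

rng = np.random.default_rng(0)
length, d, c = 12, 16, 4                               # illustrative sizes
params = {
    "w_q": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_k": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_v": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_bias": rng.normal(size=c),
    "w_a": rng.normal(size=(d, c)) / np.sqrt(d),
    "w_b": rng.normal(size=(d, c)) / np.sqrt(d),
}
seq = rng.normal(size=(length, d))
pair = np.zeros((length, length, c))
for _ in range(2):                                     # "repeatedly exchanged"
    seq, pair = exchange_step(seq, pair, params)
print(seq.shape, pair.shape)
```

On the first round the pair features are all zero, so the attention is unbiased; by the second round the pair features written from the sequence are steering which residues attend to each other.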

The biological setting highlights why attention proved useful. Amino acids that are far apart in a protein sequence can end up physically adjacent after the protein folds into its three-dimensional shape. Attention mechanisms provide a natural way to model these long-range dependencies because every position can potentially interact with every other position. [PMC]
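A toy geometric example makes the point concrete. The coordinates below describe an invented ten-residue chain folded into a flat hairpin on a unit grid; they are not taken from any real protein, and the contact threshold and minimum sequence separation are arbitrary choices for illustration.

```python
import numpy as np

# Residues 0-4 run left to right along y = 0, then residues 5-9 run back along y = 1.
coords = np.array(
    [(i, 0.0) for i in range(5)] + [(4 - i, 1.0) for i in range(5)],
    dtype=float,
)
n = len(coords)

# Pairwise spatial distance and pairwise separation along the chain.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
index = np.arange(n)
seq_sep = np.abs(index[:, None] - index[None, :])

# "Long-range contacts": close in space but far apart in the sequence.
contacts = (dist <= 1.0) & (seq_sep >= 4)
pairs = [(int(i), int(j)) for i, j in np.argwhere(np.triu(contacts))]
print(pairs)  # [(0, 9), (1, 8), (2, 7)]
```

Residues 0 and 9 sit at opposite ends of the chain yet are neighbours in space. A model that only mixes information between adjacent positions would need many steps to relate them, while attention can relate them directly.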

The impact extended beyond structure prediction. Protein Transformers are increasingly used for analysing mutations, predicting functional properties, generating novel proteins, and supporting drug-discovery research. Recent generations of ESM models have continued this trend by scaling Transformer-based approaches to billions of protein sequences. [EvolutionaryScale, Reuters]

Why attention transferred so successfully

The evidence from images and proteins points to a common explanation. Attention tends to work well when a task depends heavily on relationships among elements rather than on isolated features.

In language, the model must determine how words relate across a sentence or document. In images, it must determine how distant patches relate to form objects and scenes. In proteins, it must identify relationships among amino acids that influence folding and function. Although the data types differ dramatically, the computational challenge is similar: identify important interactions across a collection of tokens. [arXiv]

This observation helped establish Transformers as a general-purpose architecture. Instead of designing entirely different neural-network families for every domain, researchers could often adapt the same core attention machinery by changing the tokenisation strategy and training objective. [arXiv]

What portability does not mean

The success of Transformers across domains does not mean that all problems can be solved by treating everything as text.

Different domains still require specialised adaptations. Vision Transformers need a way to encode spatial information, because self-attention on its own ignores the order of its inputs while the arrangement of patches in an image carries geometric meaning. Protein systems often incorporate evolutionary information, structural constraints, or specialised attention mechanisms designed for biological data. AlphaFold2’s Evoformer, for example, is far more specialised than a standard language Transformer. [PMC, Blopig]
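The need for spatial encoding can be checked numerically. The sketch below shows that plain self-attention followed by mean pooling gives the same result when the patch tokens are reordered, and that adding a position-dependent vector to each slot removes this blindness. The random `pos` matrix is a stand-in for learned position embeddings, and all sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x, w_q, w_k, w_v):
    """Single-head self-attention over a (n, d) token matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

rng = np.random.default_rng(0)
n, d = 9, 8                                   # e.g. a 3 x 3 grid of patch tokens
patches = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
pos = rng.normal(size=(n, d))                 # stand-in for learned position embeddings
perm = np.roll(np.arange(n), 1)               # a fixed, non-trivial reordering of the patches

# Without position information: the pooled output cannot tell the two orderings apart.
plain = attend(patches, w_q, w_k, w_v).mean(axis=0)
plain_shuffled = attend(patches[perm], w_q, w_k, w_v).mean(axis=0)
same_without_pos = bool(np.allclose(plain, plain_shuffled))

# With position information added per slot: the two arrangements give different outputs.
with_pos = attend(patches + pos, w_q, w_k, w_v).mean(axis=0)
with_pos_shuffled = attend(patches[perm] + pos, w_q, w_k, w_v).mean(axis=0)
same_with_pos = bool(np.allclose(with_pos, with_pos_shuffled))

print(same_without_pos, same_with_pos)        # True False
```

A scrambled image and the original would therefore look identical to an attention-only model, which is why vision models add position information before the first attention layer.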

Portability also does not imply that attention is always the best solution. Vision Transformers initially required very large datasets and substantial computational resources before outperforming established convolutional approaches. Researchers continue to explore hybrid architectures and more efficient alternatives for particular tasks. [arXiv]

Perhaps the most important lesson is narrower. The spread of Transformers from text to images and proteins demonstrated that attention captures a broadly useful way of modelling relationships. The architecture’s success came not from a special understanding of language, but from the discovery that many seemingly different problems can be represented as interacting tokens whose relationships can be learned through attention. [Science, arXiv]

Endnotes

Source: arxiv.org
Link: https://arxiv.org/abs/2010.11929
Source snippet
An Image is Worth 16x16 Words: Transformers for...October 22, 2020 — by A Dosovitskiy · 2020 · Cited by 95916 — A pure transformer...

Published: October 22, 2020
Source: arxiv.org
Link: https://arxiv.org/pdf/2010.11929
Source snippet
arXiv:2010.11929v2 [cs.CV] 3 Jun 2021by A Dosovitskiy · 2020 · Cited by 97009 — Instead, we interpret an image as a sequence of patc...
Source: nature.com
Link: https://www.nature.com/articles/s41586-021-03819-2
Source snippet
Highly accurate protein structure prediction with AlphaFoldby J Jumper · 2021 · Cited by 50540 — Here we provide the first computat...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10410766/
Source snippet
The transformative power of transformers in protein structure...by B Moussad · 2023 · Cited by 48 — AlphaFold2 harnessed the power of...
Source: arxiv.org
Title: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Link: https://arxiv.org/abs/2111.15668
Source: arxiv.org
Title: BEiT: BERT Pre-Training of Image Transformers
Link: https://arxiv.org/abs/2106.08254
Source snippet
BEiT: BERT Pre-Training of Image TransformersJune 15, 2021...

Published: June 15, 2021
Source: github.com
Link: https://github.com/facebookresearch/esm
Source snippet
facebookresearch/esm: Evolutionary Scale Modeling...This repository contains code and pre-trained weights for Transformer protein langua...
Source: blopig.com
Link: https://www.blopig.com/blog/2021/07/[alphafold-2
Source snippet
AlphaFold 2 is here: what's behind the structure prediction...19 Jul 2021 — The central idea behind the Evoformer is that the info...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8329862/
Source snippet
structure prediction by AlphaFold2: are attention and...by N Bouatta · 2021 · Cited by 98 — This review discusses the AlphaFold2 system...
Source: evolutionaryscale.ai
Title: esm3 release
Link: https://www.evolutionaryscale.ai/blog/esm3-release
Source snippet
Our API and open model allow scientists to explore the frontiers of protein design and synthetic biology.Read more...
Source: reuters.com
Link: https://www.reuters.com/business/healthcare-pharmaceuticals/zuckerbergs-philanthropic-venture-unveils-ai-world-model-drug-discovery-2026-05-27/
Source snippet
Priscilla Chan, has launched a pioneering AI-powered world model for protein biology aimed at accelerating drug discovery. This model, ba...
Source: nature.com
Link: https://www.nature.com/articles/s41392-023-01381-z
Source snippet
AlphaFold2 and its applications in the fields of biology and...by Z Yang · 2023 · Cited by 706 — AlphaFold2 (AF2) is an artificial intel...
Source: nature.com
Link: https://www.nature.com/articles/s42003-025-08783-5
Source snippet
Nature 596, 583–589 (2021).Read more...
Source: nature.com
Link: https://www.nature.com/articles/s41592-026-03050-9
Source snippet
Compressing the collective knowledge of ESM into a single...by T Dinh · 2026 · Cited by 2 — ESM models are pretrained with the masked la...
Source: nature.com
Title: AlphaFold’s new rival?
Link: https://www.nature.com/articles/d41586-022-03539-1
Source snippet
Meta AI predicts shape of 600...by E Callaway · 2022 · Cited by 1 — Meta AI predicts shape of 600 million proteins. Microbial molecules...
Source: arxiv.org
Link: https://arxiv.org/pdf/2206.04981
Source snippet
2206.04981v3 [cs.CV] 16 Feb 2023by Z Zhang · 2022 · Cited by 17 — Vision transformers (ViT) [Dosovitskiy et al., 2020; Zhao et al...
Source: naokishibuya.github.io
Title: 2022 11 02 vit vision transformer image classifier 2020
Link: https://naokishibuya.github.io/blog/2022-11-02-vit-vision-transformer-image-classifier-2020/
Source snippet
ViT: Vision Transformer (2020)Nov 2, 2022 — The idea is simple: ViT splits an image into a sequence of image patch embeddings mixed with...
Source: amiteshbadkul.github.io
Link: https://amiteshbadkul.github.io/blog/2023/esm2-explained/
Source snippet
Evolutionary Scale Modeling using Protein Language Models29 Jul 2023 — Building on insights from ESM-2, the ESMFold model enables fast, e...
Source: science.org
Link: https://www.science.org/doi/10.1126/science.ade2574
Source snippet
Evolutionary-scale prediction of atomic-level protein...by Z Lin · 2023 · Cited by 6086 — We trained a family of transformer prot...
Source: huggingface.co
Link: https://huggingface.co/docs/transformers/en/model_doc/esm
Source snippet
ESMThis page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team.Re...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8592092/
Source snippet
by J Skolnick · 2021 · Cited by 323 — Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or...
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10011655/
Source snippet
and after AlphaFold2: An overview of protein structure...by LMF Bertoline · 2023 · Cited by 354 — In this mini-review, we provide an ove...
Source: esmatlas.com
Link: https://esmatlas.com/about
Source snippet
ESM Metagenomic Atlas by Meta AI01 Nov 2022 — The embedding vector is obtained by averaging the final layer activations of the ESM2 trans...

Additional References

Source: medium.com
Link: https://medium.com/%40EleventhHourEnthusiast/an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale-d5a9ad816a80
Source snippet
An Image is Worth 16x16 Words: Transformers for...The authors introduced ViT, a new approach to image classification, leveraging transfo...
Source: researchgate.net
Link: https://www.researchgate.net/figure/Encoding-an-image-an-example-Dosovitskiy-et-al-2021-An-image-is-split-into-N_fig1_370212894
Source snippet
An image is split into N patches. The transformer is a neural network component that can be used to learn useful representations of seque...
Source: medium.com
Link: https://medium.com/%40samaniloqman91/highly-accurate-protein-structure-prediction-with-alphafold-9e4cc8b6c692
Source snippet
Highly accurate protein structure prediction with AlphaFoldAlphaFold2 represents a breakthrough computational solution that utilizes mach...
Source: medium.com
Link: https://medium.com/%40anrizal05/protein-language-models-from-amino-acid-tokens-to-sequence-embeddings-e488e89a330e
Source snippet
Protein Language Models: From Amino Acid Tokens to...ESMFold achieves breakthrough MSA-free structure prediction by combining ESM-2 repr...
Source: mmclassification.readthedocs.io
Link: https://mmclassification.readthedocs.io/en/stable/papers/vision_transformer.html
Source snippet
Transformers for Image Recognition at ScaleA pure transformer applied directly to sequences of image patches can perform very well on ima...
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/An-Image-is-Worth-16x16-Words%3A-Transformers-for-at-Dosovitskiy-Beyer/268d347e8a55b5eb82fb5e7d2f800e33c75ab18a
Source snippet
Transformers for Image Recognition at ScaleThis paper investigates how to train ViTs with limited data and gives theoretical analyses tha...
Source: medium.com
Link: https://medium.com/%40akinduk619/vision-transformers-from-pixels-to-patches-to-[predictions
Source snippet
Vision Transformers — From Pixels to Patches to PredictionsIn this approach, the image is broken down into a sequence of patches, which a...
Source: training-docs.cerebras.ai
Link: https://training-docs.cerebras.ai/rel-2.9.0/model-zoo/models/nlp/esm2
Source snippet
cerebras.aiESM-2ESM-2 (Evolutionary Scale Modeling) is a family of transformer-based protein language models developed by Meta AI's Funda...
Source: disco.ethz.ch
Link: https://disco.ethz.ch/courses/fs23/seminar/talks/21_03_AlphaFold.pdf
Source snippet
accurate protein structure prediction with AlphaFoldParticipants are asked to predict the structure of Proteins. • Predictions are made o...
