Why did attention work beyond language?
Attention became portable because many problems can be represented as tokens, from image patches to protein residues.
Introduction
Transformers were invented for language, but one of the most important discoveries in modern artificial intelligence was that attention is not tied to words. Once researchers realised that many kinds of data could be represented as sequences of tokens, the same basic architecture began working in fields as different as computer vision and molecular biology. Images could be broken into patches and treated like visual “words”. Proteins could be represented as sequences of amino-acid residues and analysed as a kind of biological language. The result was a rapid expansion of Transformer-based systems far beyond text processing. [arXiv]
This portability mattered because it suggested that attention was capturing a more general principle: learning relationships between elements in a sequence, regardless of whether those elements were words, image regions, or biological building blocks. The success of Vision Transformers and protein language models provided some of the strongest evidence that the core ideas behind Transformers were not language-specific innovations but broadly useful computational tools. [arXiv]
Images as patch sequences
The move from language to images required a simple but powerful change in perspective. Traditional computer vision systems usually relied on convolutional neural networks (CNNs), which process images through local filters designed to exploit spatial structure. Transformers, by contrast, expected sequences.
Researchers behind the Vision Transformer (ViT) showed that an image could be divided into small patches (often 16×16 pixels), with each patch flattened and linearly projected into a token embedding. Once this transformation was performed, the image became a sequence much like a sentence. The standard Transformer encoder could then process the patch sequence using self-attention. [arXiv]
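To make the patch idea concrete, here is a minimal NumPy sketch of turning an image into a token sequence. The function names, the 64-dimensional embedding, and the random projection matrix are illustrative assumptions; in a trained model the projection is learned, and this is not the ViT reference code.

```python
import numpy as np

def split_into_patches(image, patch_size=16):
    """Cut an (H, W, C) image into non-overlapping patches, one flat vector per patch."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

def embed_patches(patches, embed_dim=64, seed=0):
    """Project each flat patch to embed_dim. A random matrix stands in for learned weights."""
    rng = np.random.default_rng(seed)
    projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ projection

image = np.random.default_rng(1).random((224, 224, 3))  # placeholder image
patches = split_into_patches(image)
tokens = embed_patches(patches)
print(patches.shape, tokens.shape)
```

A 224×224 image with 16×16 patches yields a 14×14 grid, so the encoder sees a sequence of 196 tokens, comparable in length to a short paragraph of text.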
The significance of this result was not merely conceptual. ViT demonstrated that a largely unchanged Transformer architecture could achieve highly competitive image-classification performance when trained on sufficiently large datasets. Rather than hard-coding assumptions about local image structure, the model learned which patches should influence each other through attention. [arXiv]
Attention offered a particular advantage for capturing long-range relationships. In a photograph, an object may occupy distant regions of the image. A Transformer can directly connect information from those regions through attention, whereas convolutional networks often need many stacked layers before distant pixels can influence each other strongly. [arXiv]
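The direct connection between distant regions can be seen in a minimal single-head self-attention sketch. The token values and weight matrices below are random placeholders rather than trained parameters, and the 196-token sequence assumes a 14×14 patch grid.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (num_tokens, dim) array."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # one score for every pair of tokens
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 196, 32  # e.g. a 14x14 grid of patch tokens
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
# Index 0 is the top-left patch and index 195 the bottom-right patch.
# They exchange information within this single layer.
print(weights.shape, weights[0, 195] > 0)
```

The weight matrix holds one entry for every pair of patches, so opposite corners of the image are linked in one step. The price is a cost that grows quadratically with the number of tokens.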
The success of ViT also encouraged researchers to import additional language-model ideas into vision. Techniques inspired by BERT, such as masking portions of the input and training the model to reconstruct missing information, were adapted to images. Models such as BEiT treated image patches as tokens and learned visual representations through masked prediction tasks that closely resembled language-model pretraining. [arXiv]
Protein relationships and specialised attention
Proteins provided a very different test. Unlike images, they are biological molecules built from chains of amino acids. Those chains are themselves sequences, however, which makes proteins natural candidates for Transformer-style modelling.
Researchers began treating amino-acid sequences in a way analogous to sentences. Instead of predicting missing words, protein language models learned to predict missing amino acids from large databases of biological sequences. Through this process, Transformer models learned statistical patterns that reflected evolutionary and structural relationships within proteins. [Science, GitHub]
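The masked-prediction setup can be sketched in a few lines. The vocabulary below is the 20 standard one-letter residue codes; the mask token id, the 15% masking rate (borrowed from BERT-style pretraining), and the example sequence are assumptions made for illustration, not the exact recipe of any particular protein model.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(AMINO_ACIDS)            # hypothetical id reserved for a mask token

def tokenize(sequence):
    """Map each residue letter to an integer token id."""
    return np.array([TOKEN_ID[aa] for aa in sequence])

def mask_tokens(token_ids, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return the masked ids and the hidden positions."""
    rng = np.random.default_rng(seed)
    n_mask = max(1, int(round(mask_fraction * len(token_ids))))
    positions = np.sort(rng.choice(len(token_ids), size=n_mask, replace=False))
    masked = token_ids.copy()
    masked[positions] = MASK_ID
    return masked, positions

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary illustrative sequence
ids = tokenize(sequence)
masked_ids, positions = mask_tokens(ids)
targets = ids[positions]  # what the model would be trained to recover
print(masked_ids, positions, targets)
```

Training then amounts to asking the model to recover `targets` from `masked_ids`. Production pipelines usually add refinements, such as sometimes substituting a random residue instead of the mask token, which are omitted here.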
One striking result was that models trained only on protein sequences could learn representations that correlated with biological properties such as structure and function. The ESM family of models demonstrated that large-scale Transformer training on protein data could capture information useful for structure prediction and protein design. [Science, GitHub]
AlphaFold2 provided an even more influential example. Although AlphaFold2 is not simply a standard language model, its Evoformer component uses attention-based mechanisms to reason about relationships between amino acids and evolutionary information derived from related protein sequences. Rather than treating protein prediction as a straightforward sequence task, the system maintained a representation of aligned related sequences alongside a representation of pairwise relationships between residues, and repeatedly exchanged information between the two. This attention-centred design was a major contributor to AlphaFold2’s breakthrough performance in predicting protein structures. [Nature, PMC]
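The two-way exchange can be illustrated with a heavily simplified toy, which is not AlphaFold2's implementation. It assumes a single per-residue representation in place of the full alignment of related sequences, omits the Evoformer's other modules, and uses random weights. It keeps only two ideas: pairwise features bias the attention between residues, and an outer product of residue features updates the pairwise features. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_sequence(seq_rep, pair_rep, w_q, w_k, w_v, w_bias):
    """Attention over residues, with pairwise features added as a bias on the logits."""
    q, k, v = seq_rep @ w_q, seq_rep @ w_k, seq_rep @ w_v
    bias = pair_rep @ w_bias                       # (L, L, c) @ (c,) -> (L, L)
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    return seq_rep + softmax(logits, axis=-1) @ v  # residual update

def update_pairs(seq_rep, pair_rep, w_a, w_b, w_out):
    """Outer product of projected residue features, written into the pair representation."""
    a, b = seq_rep @ w_a, seq_rep @ w_b            # (L, h) each
    n = seq_rep.shape[0]
    outer = np.einsum("ia,jb->ijab", a, b).reshape(n, n, -1)  # (L, L, h * h)
    return pair_rep + outer @ w_out                # residual update, (L, L, c)

rng = np.random.default_rng(0)
length, d, c, h = 10, 16, 4, 3  # toy sizes
seq_rep = rng.normal(size=(length, d))
pair_rep = np.zeros((length, length, c))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
w_bias = rng.normal(scale=0.1, size=c)
w_a, w_b = (rng.normal(scale=0.1, size=(d, h)) for _ in range(2))
w_out = rng.normal(scale=0.1, size=(h * h, c))

for _ in range(3):  # repeated exchange between the two representations
    pair_rep = update_pairs(seq_rep, pair_rep, w_a, w_b, w_out)
    seq_rep = update_sequence(seq_rep, pair_rep, w_q, w_k, w_v, w_bias)

print(seq_rep.shape, pair_rep.shape)
```

Each pass lets what the model believes about pairs of residues shape how residues attend to each other, and the reverse.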
The biological setting highlights why attention proved useful. Amino acids that are far apart in a protein sequence can end up physically adjacent after the protein folds into its three-dimensional shape. Attention mechanisms provide a natural way to model these long-range dependencies because every position can potentially interact with every other position. [PMC]
The impact extended beyond structure prediction. Protein Transformers are increasingly used for analysing mutations, predicting functional properties, generating novel proteins, and supporting drug-discovery research. Recent generations of ESM models have continued this trend by scaling Transformer-based approaches to billions of protein sequences. [EvolutionaryScale, Reuters]
Why attention transferred so successfully
The evidence from images and proteins points to a common explanation. Attention tends to work well when a task depends heavily on relationships among elements rather than on isolated features.
In language, the model must determine how words relate across a sentence or document. In images, it must determine how distant patches relate to form objects and scenes. In proteins, it must identify relationships among amino acids that influence folding and function. Although the data types differ dramatically, the computational challenge is similar: identify important interactions across a collection of tokens. [arXiv]
This observation helped establish Transformers as a general-purpose architecture. Instead of designing entirely different neural-network families for every domain, researchers could often adapt the same core attention machinery by changing the tokenisation strategy and training objective. [arXiv]
What portability does not mean
The success of Transformers across domains does not mean that all problems can be solved by treating everything as text.
Different domains still require specialised adaptations. Vision Transformers need methods for encoding spatial information because images possess geometric structure that sentences do not. Protein systems often incorporate evolutionary information, structural constraints, or specialised attention mechanisms designed for biological data. AlphaFold2’s Evoformer, for example, is far more specialised than a standard language Transformer. [PMC, Blopig]
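The need for spatial encoding follows from a property of plain self-attention: shuffling the input tokens merely shuffles the outputs, so the layer alone cannot tell where a patch came from. The sketch below demonstrates this with random placeholder weights and then adds a per-position vector (a random stand-in for learned position embeddings) to break the symmetry. All sizes and names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1) @ v

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
perm = np.arange(num_patches)[::-1]  # reverse the patch order

# Without position information, reordering patches only reorders the outputs.
out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(tokens[perm], w_q, w_k, w_v)
order_blind = np.allclose(out[perm], out_shuffled)

# With a vector tied to each slot in the sequence, the same reordering changes the result.
pos = rng.normal(size=(num_patches, dim))  # stand-in for learned position embeddings
out_pos = self_attention(tokens + pos, w_q, w_k, w_v)
out_pos_shuffled = self_attention(tokens[perm] + pos, w_q, w_k, w_v)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)

print(order_blind, order_aware)
```

Both flags come out true: the bare layer is blind to arrangement, and the added position vectors make arrangement matter.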
Portability also does not imply that attention is always the best solution. Vision Transformers initially required very large datasets and substantial computational resources before outperforming established convolutional approaches. Researchers continue to explore hybrid architectures and more efficient alternatives for particular tasks. [arXiv]
Perhaps the most important lesson is narrower. The spread of Transformers from text to images and proteins demonstrated that attention captures a broadly useful way of modelling relationships. The architecture’s success came not from a special understanding of language, but from the discovery that many seemingly different problems can be represented as interacting tokens whose relationships can be learned through attention. [Science, arXiv]
Endnotes
-
Source: arxiv.org
Link: https://arxiv.org/abs/2010.11929
Source snippet: An Image is Worth 16x16 Words: Transformers for... October 22, 2020 · by A Dosovitskiy · 2020 · Cited by 95916 · A pure transformer...
Published: October 22, 2020
-
Source: arxiv.org
Link: https://arxiv.org/pdf/2010.11929
Source snippet: arXiv:2010.11929v2 [cs.CV] 3 Jun 2021 · by A Dosovitskiy · 2020 · Cited by 97009 · Instead, we interpret an image as a sequence of patc...
-
Source: nature.com
Link: https://www.nature.com/articles/s41586-021-03819-2
Source snippet: Highly accurate protein structure prediction with AlphaFold · by J Jumper · 2021 · Cited by 50540 · Here we provide the first computat...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10410766/
Source snippet: The transformative power of transformers in protein structure... · by B Moussad · 2023 · Cited by 48 · AlphaFold2 harnessed the power of...
-
Source: arxiv.org
Title: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
Link: https://arxiv.org/abs/2111.15668
-
Source: arxiv.org
Title: BEiT: BERT Pre-Training of Image Transformers
Link: https://arxiv.org/abs/2106.08254
Source snippet: BEiT: BERT Pre-Training of Image Transformers · June 15, 2021...
Published: June 15, 2021
-
Source: github.com
Link: https://github.com/facebookresearch/esm
Source snippet: facebookresearch/esm: Evolutionary Scale Modeling... This repository contains code and pre-trained weights for Transformer protein langua...
-
Source: blopig.com
Link: https://www.blopig.com/blog/2021/07/[alphafold-2
Source snippet: AlphaFold 2 is here: what's behind the structure prediction... 19 Jul 2021 · The central idea behind the Evoformer is that the info...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8329862/
Source snippet: structure prediction by AlphaFold2: are attention and... · by N Bouatta · 2021 · Cited by 98 · This review discusses the AlphaFold2 system...
-
Source: evolutionaryscale.ai
Title: esm3 release
Link: https://www.evolutionaryscale.ai/blog/esm3-release
Source snippet: Our API and open model allow scientists to explore the frontiers of protein design and synthetic biology.
-
Source: reuters.com
Link: https://www.reuters.com/business/healthcare-pharmaceuticals/zuckerbergs-philanthropic-venture-unveils-ai-world-model-drug-discovery-2026-05-27/
Source snippet: Priscilla Chan, has launched a pioneering AI-powered world model for protein biology aimed at accelerating drug discovery. This model, ba...
-
Source: nature.com
Link: https://www.nature.com/articles/s41392-023-01381-z
Source snippet: AlphaFold2 and its applications in the fields of biology and... · by Z Yang · 2023 · Cited by 706 · AlphaFold2 (AF2) is an artificial intel...
-
Source: nature.com
Link: https://www.nature.com/articles/s42003-025-08783-5
Source snippet: Nature 596, 583–589 (2021).
-
Source: nature.com
Link: https://www.nature.com/articles/s41592-026-03050-9
Source snippet: Compressing the collective knowledge of ESM into a single... · by T Dinh · 2026 · Cited by 2 · ESM models are pretrained with the masked la...
-
Source: nature.com
Title: AlphaFold’s new rival?
Link: https://www.nature.com/articles/d41586-022-03539-1
Source snippet: Meta AI predicts shape of 600... · by E Callaway · 2022 · Cited by 1 · Meta AI predicts shape of 600 million proteins. Microbial molecules...
-
Source: arxiv.org
Link: https://arxiv.org/pdf/2206.04981
Source snippet: 2206.04981v3 [cs.CV] 16 Feb 2023 · by Z Zhang · 2022 · Cited by 17 · Vision transformers (ViT) [Dosovitskiy et al., 2020; Zhao et al...
-
Source: naokishibuya.github.io
Title: ViT: Vision Transformer (2020)
Link: https://naokishibuya.github.io/blog/2022-11-02-vit-vision-transformer-image-classifier-2020/
Source snippet: ViT: Vision Transformer (2020) · Nov 2, 2022 · The idea is simple: ViT splits an image into a sequence of image patch embeddings mixed with...
-
Source: amiteshbadkul.github.io
Link: https://amiteshbadkul.github.io/blog/2023/esm2-explained/
Source snippet: Evolutionary Scale Modeling using Protein Language Models · 29 Jul 2023 · Building on insights from ESM-2, the ESMFold model enables fast, e...
-
Source: science.org
Link: https://www.science.org/doi/10.1126/science.ade2574
Source snippet: Evolutionary-scale prediction of atomic-level protein... · by Z Lin · 2023 · Cited by 6086 · We trained a family of transformer prot...
-
Source: huggingface.co
Link: https://huggingface.co/docs/transformers/en/model_doc/esm
Source snippet: ESM · This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team.
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8592092/
Source snippet: by J Skolnick · 2021 · Cited by 323 · Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10011655/
Source snippet: and after AlphaFold2: An overview of protein structure... · by LMF Bertoline · 2023 · Cited by 354 · In this mini-review, we provide an ove...
-
Source: esmatlas.com
Link: https://esmatlas.com/about
Source snippet: ESM Metagenomic Atlas by Meta AI · 01 Nov 2022 · The embedding vector is obtained by averaging the final layer activations of the ESM2 trans...