Can proteins be read like a language?
Protein Transformers can learn useful signals about structure and function by predicting missing amino acids across huge biological sequence databases.
On this page
- How amino-acid sequences become model tokens
- What missing-residue prediction teaches the model
- Why learned protein representations matter
Introduction
Protein language models apply the core idea behind Transformer-based language systems to biology: instead of reading sentences made of words, they read proteins made of amino acids. A protein is a chain built from a small alphabet of amino-acid residues, and the order of those residues largely determines the molecule’s shape and biological role. By training on hundreds of millions of protein sequences and learning to predict missing amino acids, Transformer models discover statistical patterns that reflect evolution, structure, and function. Remarkably, many of these properties emerge without the model being explicitly taught chemistry or three-dimensional protein structures. Research on models such as ESM (Evolutionary Scale Modeling) has shown that large-scale sequence learning can produce representations useful for structure prediction, mutation analysis, and protein design. [Science, GitHub]
Can proteins be read like a language?
The analogy between language and proteins is not perfect, but it is surprisingly productive. Human languages are built from vocabularies and grammatical rules. Proteins are built from sequences of roughly twenty common amino acids arranged in specific orders. Just as changing a word can alter the meaning of a sentence, changing a single amino acid can alter a protein’s stability, shape, or function. [Elisa G. de Lope, PMC]
Protein language models therefore treat amino acids as tokens. A protein chain is converted into a series of token embeddings, one per residue, and processed by a Transformer. Instead of learning grammar and semantics, the model learns which amino acids tend to occur together, which positions are highly constrained by evolution, and which substitutions are likely or unlikely in a biological context. [ACS Publications, PMC]
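As a minimal sketch of the tokenization step, the snippet below maps each residue letter to an integer id. The vocabulary, the special tokens, and the function name `tokenize` are illustrative assumptions; real models such as ESM define their own vocabularies and tokenizers.

```python
# Hypothetical vocabulary: the 20 common amino acids plus a few special tokens.
# Real protein language models ship their own vocabulary; this one is for illustration only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<mask>", "<unk>"]
VOCAB = {token: index for index, token in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence):
    """Map each residue letter to an integer id; letters outside the vocabulary become <unk>."""
    unknown_id = VOCAB["<unk>"]
    return [VOCAB.get(residue, unknown_id) for residue in sequence.upper()]


token_ids = tokenize("MKTAYIAK")
print(token_ids)  # one integer id per residue, in sequence order
```

In a trained model, each id would then index a row of a learned embedding table, giving the Transformer one vector per residue.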
What makes this approach powerful is the scale of available data. Public biological databases contain hundreds of millions of protein sequences collected from organisms across the tree of life. These sequences provide a vast record of evolutionary experiments carried out over billions of years. Protein language models effectively compress patterns from that record into their internal representations. [Science, EvolutionaryScale]
How amino-acid sequences become model tokens
Unlike natural-language systems that must handle huge vocabularies, protein models work with a compact alphabet. Each amino acid receives its own token representation. The model then processes the sequence using self-attention, allowing every position to influence every other position. [ACS Publications]
This matters because proteins often contain long-range dependencies. Two amino acids that are far apart in the sequence may end up physically touching when the protein folds into its three-dimensional form. A Transformer can directly connect these distant positions through attention mechanisms rather than relying only on local neighbourhoods. [OpenReview]
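To make the idea of direct long-range connections concrete, here is a small pure-Python sketch of single-head scaled dot-product self-attention. It is a simplification under stated assumptions: the queries, keys, and values are the raw input vectors (no learned projections), and the toy embeddings are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(embeddings):
    """Single-head scaled dot-product attention.

    Simplification: queries, keys, and values are the input vectors themselves;
    a real Transformer applies learned projection matrices first.
    """
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[d] for w, value in zip(row, embeddings)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Invented toy embeddings: positions 0 and 4 are far apart but have matching vectors.
toy = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(toy)
print([round(w, 3) for w in weights[0]])  # attention from position 0 to every position
```

Position 0 attends to position 4 as strongly as to itself, regardless of the distance between them in the sequence, which is the property the paragraph above relies on.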
As training progresses, the model develops numerical representations known as embeddings. Proteins with similar biological properties tend to occupy nearby regions of this learned representation space. The embeddings become compressed summaries of information that is difficult to read directly from raw amino-acid strings. [PMC]
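The notion of proteins occupying nearby regions of representation space can be sketched as follows. Per-residue vectors are averaged into one vector per protein (mean pooling, one common choice among several), and proteins are compared by cosine similarity. The embedding values and the names `protein_a` and `protein_b` are invented; in practice the vectors would come from a pretrained model.

```python
import math


def mean_pool(residue_embeddings):
    """Average per-residue vectors into a single fixed-size protein vector."""
    count = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vector[d] for vector in residue_embeddings) / count for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, library):
    """Return the name of the library protein whose embedding is most similar to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))


# Invented two-dimensional embeddings, standing in for the output of a pretrained model.
library = {
    "protein_a": mean_pool([[1.0, 0.0], [0.8, 0.2]]),
    "protein_b": mean_pool([[0.0, 1.0], [0.2, 0.8]]),
}
query = mean_pool([[0.9, 0.1], [0.7, 0.3]])
print(nearest(query, library))
```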
What missing-residue prediction teaches the model
Most leading protein language models are trained with a masked language modelling objective similar to that used in BERT. During training, a random subset of amino acids (around 15% of each sequence in BERT-style setups) is hidden from the model. The model must predict the missing residues using the surrounding sequence context. [Science, orbion.life]
At first glance, this appears to be a simple prediction task. In practice, solving it requires learning deep biological regularities. Certain amino acids are chemically compatible with particular environments. Others tend to appear in active sites, structural motifs, or evolutionarily conserved regions. To predict the missing residue correctly, the model must infer these hidden constraints from sequence patterns alone. [Science, PNAS]
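The masking step described above can be sketched in a few lines. This is an illustrative simplification, not any particular model's data pipeline: the function name `mask_sequence`, the `<mask>` string, and the fixed seed are assumptions, and BERT-style training often also swaps some selected positions for random tokens or leaves them unchanged, which is omitted here.

```python
import random

MASK = "<mask>"  # hypothetical mask token string


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and return (masked tokens, hidden targets).

    targets maps each masked position to the residue the model would have to predict.
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for position in positions:
        targets[position] = tokens[position]
        tokens[position] = MASK
    return tokens, targets


tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFS")
print(tokens)
print(targets)
```

During training, the loss would be computed only at the masked positions, comparing the model's predicted distribution with the residues stored in `targets`.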
Evidence from large protein language models suggests that this training process teaches the model several kinds of biologically meaningful information:
- Evolutionary constraints: The model learns which substitutions are tolerated and which are strongly selected against because they would disrupt function. [PNAS]
- Structural relationships: The model learns patterns associated with residues that interact when a protein folds, even though explicit structural labels are not provided during training. [OpenReview]
- Functional signatures: Sequences associated with similar biochemical roles often acquire similar internal representations. [PMC]
One influential study found that Transformer attention maps learned information about residue contacts directly from the unsupervised prediction objective. In effect, the model discovered clues about protein structure simply by trying to fill in missing amino acids. [OpenReview]
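As a rough illustration of how contact clues might be read from an attention map, the sketch below applies two post-processing steps often used on pairwise score matrices: symmetrization and average product correction (APC). Whether this matches the cited study's exact pipeline is an assumption, and the 4x4 attention matrix is invented toy data.

```python
def symmetrize(matrix):
    """Make pairwise scores order-independent by adding the matrix to its transpose."""
    n = len(matrix)
    return [[matrix[i][j] + matrix[j][i] for j in range(n)] for i in range(n)]


def apc(matrix):
    """Average product correction: subtract the background expected from row and column totals.

    Assumes the matrix has a non-zero total, as a softmax attention map does.
    """
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_sums)
    return [
        [matrix[i][j] - row_sums[i] * col_sums[j] / total for j in range(n)]
        for i in range(n)
    ]


def top_contacts(attention, min_separation=2, k=1):
    """Return the k highest-scoring residue pairs at least min_separation apart in sequence."""
    scores = apc(symmetrize(attention))
    n = len(scores)
    pairs = [(scores[i][j], i, j) for i in range(n) for j in range(i + min_separation, n)]
    pairs.sort(reverse=True)
    return [(i, j) for _, i, j in pairs[:k]]


# Invented attention map in which positions 0 and 3 attend strongly to each other.
attention = [
    [0.1, 0.2, 0.1, 0.6],
    [0.3, 0.4, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.6, 0.1, 0.1, 0.2],
]
print(top_contacts(attention))
```

In published work a supervised layer is typically fitted on top of many such maps; this sketch only ranks pairs from a single map.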
What the evidence shows the models actually learn
A key question is whether protein language models merely memorise sequence statistics or genuinely learn biologically useful abstractions.
Several lines of evidence suggest the latter. The ESM programme demonstrated that scaling Transformer training on massive protein databases produced representations that could support accurate structure prediction from sequence alone. Researchers reported that biological structure and function emerged from large-scale unsupervised learning, even though the model was trained primarily on sequence prediction tasks. [GitHub]
Further evidence comes from mutation analysis. If a model assigns very low probability to a particular amino-acid substitution, that substitution often proves damaging in laboratory experiments. This indicates that the model has internalised information about which residues are important for maintaining a protein’s behaviour. [PMC]
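One simple way to turn model probabilities into a mutation score is a log-odds comparison between the mutant and the original (wild-type) residue at a masked position. The sketch below assumes this scoring rule for illustration; the probability values are invented, and the function name `mutation_score` is hypothetical.

```python
import math


def mutation_score(position_probs, wild_type, mutant):
    """Log-odds of the mutant versus the wild-type residue at one masked position.

    Negative scores mean the model finds the substitution less likely than the original residue.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Invented model output for a single masked position (not from a real model).
probs = {"L": 0.70, "I": 0.20, "V": 0.08, "P": 0.02}

print(round(mutation_score(probs, "L", "I"), 3))  # conservative swap, mildly negative
print(round(mutation_score(probs, "L", "P"), 3))  # unlikely swap, strongly negative
```

A strongly negative score flags a substitution the model considers implausible in that context, which is the signal the paragraph above links to damaging effects in experiments.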
Researchers have also shown that protein language models capture co-evolutionary patterns, that is, cases where changes at one position are linked to changes elsewhere. Such patterns frequently reflect structural or functional coupling within proteins. The ability to recover these relationships from sequence data alone suggests that the models learn more than surface-level frequency statistics. [PNAS]
Why learned protein representations matter
The most important outcome of this research is not simply better prediction of masked amino acids. The value lies in the learned representations that emerge during training.
These representations can be reused for downstream tasks, including:
- Predicting protein structure. [PMC]
- Estimating the effects of genetic mutations.
- Classifying protein functions. [Wikipedia]
- Identifying biologically related proteins.
- Assisting protein engineering and design. [Nature, PMC]
In practical terms, the model acts as a compressed statistical summary of evolutionary knowledge. Instead of analysing millions of related sequences individually, researchers can use embeddings generated by a pretrained model to access information about likely structure and function. [PMC]
This line of work has become one of the strongest demonstrations that Transformer architectures are not limited to human language. When trained on amino-acid sequences, they learn representations that reflect genuine biological organisation. The model begins with nothing more than strings of residues and a missing-token prediction task, yet it develops internal knowledge that aligns with how proteins fold, evolve, and function. [Science, OpenReview]
Further Reading
Books and field guides related to "Can proteins be read like a language?" Use these as the next step if you want deeper reading beyond the article.
Deep Learning for the Life Sciences
Directly addresses machine learning applications in biological sequence and molecular data.
Bioinformatics and Functional Genomics
Explains biological sequence analysis and genomic information central to protein language models.
Deep Learning
Covers representation learning concepts that underpin protein language models.
Transformers for Natural Language Processing
Protein language models borrow core Transformer concepts originally developed for language.
Endnotes
-
Source: github.com
Link: https://github.com/facebookresearch/esm
Source snippet: facebookresearch/esm: Evolutionary Scale Modeling... ESMFold harnesses the ESM-2 language model to generate accurate structure pred...
-
Source: elisagdelope.rbind.io
Title: Getting started with Protein Language Models - Elisa G. de Lope
Link: https://elisagdelope.rbind.io/post/plms/
Source snippet: 13 Sept 2024. Like sentences crafted from words, proteins are intr...
-
Source: pmc.ncbi.nlm.nih.gov
Title: The language of proteins: NLP, machine learning
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8050421/
Source snippet: NIH · by D Ofer · 2021 · Cited by 468. In this review, we present a modern view on applications of NLP methods to the study of protei...
-
Source: pubs.acs.org
Link: https://pubs.acs.org/doi/10.1021/acs.jproteome.5c00506
Source snippet: ACS Publications · Protein Language Models: Applications and Perspectives · 26 Dec 2025. Originally designed for language tasks, LLMs have bee...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12888012/
Source snippet: Protein Language Models: Applications and Perspectives - PMC · by M Leclercq · 2025 · Cited by 5. These models treat amino acid sequence...
-
Source: evolutionaryscale.ai
Title: esm cambrian
Link: https://www.evolutionaryscale.ai/blog/esm-cambrian
Source snippet: Revealing the mysteries of proteins with... · Dec 4, 2024. Today we're introducing ESM Cambrian, a next generation language model trained...
-
Source: openreview.net
Link: https://openreview.net/forum?id=fylclEqgvgd
Source snippet: Transformer protein language models are unsupervised... · by R Rao · Cited by 479. In this paper we demonstrate that Transformer...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12450373/
Source snippet: Fine-tuning protein language models unlocks the potential of... · by R Sawhney · 2025 · Cited by 5. Protein language models (pLMs) have...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12756192/
Source snippet: Pretrained Protein Language Model Embeddings... · by R Shaw · 2025 · Cited by 6. Trained on vast databases of protein sequences, these mod...
-
Source: orbion.life
Link: https://www.orbion.life/blog/protein-language-models-explained-for-bench-scientists
Source snippet: Protein Language Models Explained for Bench Scientists · 3 Apr 2026. Mask residues: Randomly hide ~15% of amino acids in each sequence...
-
Source: nature.com
Link: https://www.nature.com/articles/s41592-026-03050-9
Source snippet: Compressing the collective knowledge of ESM into a single... · by T Dinh · 2026 · Cited by 2. ESM models are pretrained with the mas...
-
Source: pnas.org
Link: https://www.pnas.org/doi/10.1073/pnas.2406285121
Source snippet: Protein language models learn evolutionary statistics of... · by Z Zhang · 2024 · Cited by 164. We developed a completely unsupervised...
-
Source: nature.com
Link: https://www.nature.com/articles/s41467-022-32007-7
Source snippet: ProtGPT2 is a deep unsupervised language model for... · by N Ferruz · 2022 · Cited by 1098. We describe ProtGPT2, a language model trained...
-
Source: nature.com
Link: https://www.nature.com/articles/s41592-025-02776-2
Source snippet: Biophysics-based protein language models for... · by S Gelman · 2025 · Cited by 60. Molecular modeling can generate large datasets reveali...
-
Source: amiteshbadkul.github.io
Title: esm2 explained
Link: https://amiteshbadkul.github.io/blog/2023/esm2-explained/
Source snippet: Evolutionary Scale Modeling using Protein Language Models · 29 Jul 2023. These protein language models (PLMs) treat amino acid sequences an...
-
Source: science.org
Link: https://www.science.org/doi/10.1126/science.ade2574
Source snippet: Although the training objective...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12806033/
Source snippet: The evolutionary scale modeling (ESM) series is promising to revolutionize protein science and engineering through large language models...
-
Source: pipebio.com
Title: protein language models
Link: https://pipebio.com/blog/protein-language-models
Source snippet: promises, pitfalls and applications · Jun 18, 2024. PLMs have proven very valuable to learn the underlying patterns of protein sequences ev...
-
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Protein
Source snippet: Protein · Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perfo...
-
Source: huggingface.co
Link: https://huggingface.co/docs/transformers/en/model_doc/esm
Source snippet: ESM · This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team. Re...
-
Source: betterhealth.vic.gov.au
Link: https://www.betterhealth.vic.gov.au/health/healthyliving/protein
Source snippet: Protein · Protein is a nutrient your body needs to grow and repair cells, and to work properly. Protein is found in a wide range of food and...
-
Source: britannica.com
Link: https://www.britannica.com/science/protein
Source snippet: Protein | Definition, Structure, & Classification · Apr 17, 2026. Protein, highly complex substance that is present in all living organisms...
-
Source: esmprep.com
Link: https://www.esmprep.com/9-12/college-admissions
Source snippet: ESM | College Admissions · ESM's mission is to help students across the world gain admission to the right school for them, from Princeton to...
-
Source: pubmed.ncbi.nlm.nih.gov
Link: https://pubmed.ncbi.nlm.nih.gov/30060014/
Source snippet: M Watford · 2018 · Cited by 141. Proteins are polymers of amino acids linked via α-peptide bonds. They can be represented as primary, se...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12621866/
Source snippet: by AM Subramanian · 2025 · Cited by 1. We show that although these models exhibit a prodigious latent capacity to access novel amino...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=uPoFdCUqBWk
Source snippet: Protein Language Models - MLCB24 · A protein language model by training on a lot of protein sequences can learn about what amino acids are s...