
Can proteins be read like a language?

Protein Transformers can learn useful signals about structure and function by predicting missing amino acids across huge biological sequence databases.

On this page

  • How amino acid sequences become model tokens
  • What missing residue prediction teaches the model
  • Why learned protein representations matter

Introduction

Protein language models apply the core idea behind Transformer-based language systems to biology: instead of reading sentences made of words, they read proteins made of amino acids. A protein is a chain built from a small alphabet of amino-acid residues, and the order of those residues largely determines the molecule’s shape and biological role. By training on hundreds of millions of protein sequences and learning to predict missing amino acids, Transformer models discover statistical patterns that reflect evolution, structure, and function. Remarkably, many of these properties emerge without the model being explicitly taught chemistry or three-dimensional protein structures. Research on models such as ESM (Evolutionary Scale Modeling) has shown that large-scale sequence learning can produce representations useful for structure prediction, mutation analysis, and protein design. [Science, GitHub]

Protein models illustration 1

Can proteins be read like a language?

The analogy between language and proteins is not perfect, but it is surprisingly productive. Human languages are built from vocabularies and grammatical rules. Proteins are built from sequences of roughly twenty common amino acids arranged in specific orders. Just as changing a word can alter the meaning of a sentence, changing a single amino acid can alter a protein’s stability, shape, or function. [Elisa G. de Lope, PMC]

Protein language models therefore treat amino acids as tokens. A protein chain is converted into a series of token embeddings, one per residue, and processed by a Transformer. Instead of learning grammar and semantics, the model learns which amino acids tend to occur together, which positions are highly constrained by evolution, and which substitutions are likely or unlikely in a biological context. [ACS Publications, PMC]

What makes this approach powerful is the scale of available data. Public biological databases contain hundreds of millions of protein sequences collected from organisms across the tree of life. These sequences provide a vast record of evolutionary experiments carried out over billions of years. Protein language models effectively compress patterns from that record into their internal representations. [Science, EvolutionaryScale]

How amino-acid sequences become model tokens

Unlike natural-language systems that must handle huge vocabularies, protein models work with a compact alphabet. Each amino acid receives its own token representation. The model then processes the sequence using self-attention, allowing every position to influence every other position. [ACS Publications]
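The tokenization step can be sketched in a few lines. This is a minimal illustration, not the vocabulary of any particular model: the two special tokens, the ordering of the alphabet, and the `tokenize` helper are assumptions made for the example, and real models typically define additional special tokens.

```python
# Hypothetical vocabulary: the 20 standard amino acids (one-letter codes)
# plus two special tokens. Real models usually define more special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<mask>"]

# Token -> integer id lookup; special tokens take the first ids.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence: str) -> list[int]:
    """Map each one-letter residue code to its integer token id.

    Raises KeyError for letters outside the 20-letter alphabet.
    """
    return [VOCAB[residue] for residue in sequence.upper()]


# An arbitrary short sequence used only for illustration.
token_ids = tokenize("MKTAYIAK")
```

Each id would then index a row of a learned embedding table, which is what the Transformer layers actually consume.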

This matters because proteins often contain long-range dependencies. Two amino acids that are far apart in the sequence may end up physically touching when the protein folds into its three-dimensional form. A Transformer can directly connect these distant positions through attention mechanisms rather than relying only on local neighbourhoods. [OpenReview]
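A stripped-down sketch shows why distance in the sequence does not limit attention. It assumes a single head with no learned projections and no positional encoding (queries, keys, and values are all the raw embeddings), and the toy embeddings are invented for illustration, so it is far simpler than a real Transformer layer.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: list[list[float]]) -> tuple[list[list[float]], list[list[float]]]:
    """Scaled dot-product attention with queries = keys = values = x.

    Simplifying assumption: no learned weight matrices, one head only.
    """
    d = len(x[0])
    weights = []
    for q in x:
        # Every position is scored against every other position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Toy embeddings for a 5-residue chain; the first and last residues are
# deliberately given the same vector.
embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
]
outputs, weights = self_attention(embeddings)
```

In this toy case `weights[0][4]` is as large as `weights[0][0]` and larger than `weights[0][1]`: the first residue attends to the last one as strongly as to itself, despite the three residues in between.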

As training progresses, the model develops numerical representations known as embeddings. Proteins with similar biological properties tend to occupy nearby regions of this learned representation space. The embeddings become compressed summaries of information that is difficult to read directly from raw amino-acid strings. [PMC]

What missing-residue prediction teaches the model

Most leading protein language models are trained with a masked language modelling objective similar to that used in BERT. During training, some amino acids are hidden from the model. The model must predict the missing residues using the surrounding sequence context. [Science, orbion.life]
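The masking step might look roughly like the sketch below. The 15% rate follows the figure quoted in the orbion.life source listed in the endnotes; the `<mask>` token string, the `mask_sequence` helper, and the example sequence are assumptions for illustration. BERT-style training also sometimes substitutes a random token or leaves the selected token unchanged, which this sketch omits.

```python
import random

MASK = "<mask>"  # hypothetical mask token string


def mask_sequence(
    sequence: str, mask_rate: float = 0.15, seed: int = 0
) -> tuple[list[str], dict[int, str]]:
    """Hide a random subset of residues.

    Returns the masked token list (the model input) and a dict mapping
    each masked position to the residue the model must recover.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # ground truth for the training loss
        tokens[pos] = MASK
    return tokens, targets


# An arbitrary 22-residue sequence used only for illustration.
original = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(original)
```

The training loss is computed only at the positions stored in `targets`; all other residues serve as context.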

At first glance, this appears to be a simple prediction task. In practice, solving it requires learning deep biological regularities. Certain amino acids are chemically compatible with particular environments. Others tend to appear in active sites, structural motifs, or evolutionarily conserved regions. To predict the missing residue correctly, the model must infer these hidden constraints from sequence patterns alone. [Science, PNAS]

Evidence from large protein language models suggests that this training process teaches the model several kinds of biologically meaningful information:

  • Evolutionary constraints: The model learns which substitutions are tolerated and which are strongly selected against because they would disrupt function. [PNAS]
  • Structural relationships: The model learns patterns associated with residues that interact when a protein folds, even though explicit structural labels are not provided during training. [OpenReview]
  • Functional signatures: Sequences associated with similar biochemical roles often acquire similar internal representations. [PMC]

One influential study found that Transformer attention maps learned information about residue contacts directly from the unsupervised prediction objective. In effect, the model discovered clues about protein structure simply by trying to fill in missing amino acids. [OpenReview]
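As a rough sketch of how an attention map can be turned into contact scores, the code below symmetrizes one map and applies average product correction (APC), a post-processing step commonly used in contact prediction. The article does not describe the study's exact pipeline, so treat this as an assumption-laden illustration: the 4x4 attention values are invented, and a real pipeline would combine many heads and layers with a small trained classifier.

```python
def symmetrize(a: list[list[float]]) -> list[list[float]]:
    """Contacts are symmetric, so combine attention i->j with j->i."""
    n = len(a)
    return [[a[i][j] + a[j][i] for j in range(n)] for i in range(n)]


def apc(a: list[list[float]]) -> list[list[float]]:
    """Average product correction: subtract the background expected from
    each position's overall row and column totals."""
    n = len(a)
    row = [sum(a[i]) for i in range(n)]
    col = [sum(a[i][j] for i in range(n)) for j in range(n)]
    total = sum(row)
    return [[a[i][j] - row[i] * col[j] / total for j in range(n)] for i in range(n)]


# Invented attention weights for a 4-residue toy protein (rows sum to 1).
attention = [
    [0.1, 0.2, 0.1, 0.6],
    [0.3, 0.4, 0.2, 0.1],
    [0.2, 0.3, 0.4, 0.1],
    [0.6, 0.1, 0.1, 0.2],
]
corrected = apc(symmetrize(attention))

# Highest-scoring pair of distinct positions: the candidate contact.
n = len(corrected)
best_pair = max(
    ((i, j) for i in range(n) for j in range(i + 1, n)),
    key=lambda pair: corrected[pair[0]][pair[1]],
)
```

For this toy map the top-scoring pair is positions 0 and 3, the two residues that attend strongly to each other.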

Protein models illustration 2

What the evidence shows the models actually learn

A key question is whether protein language models merely memorise sequence statistics or genuinely learn biologically useful abstractions.

Several lines of evidence suggest the latter. The ESM programme demonstrated that scaling Transformer training on massive protein databases produced representations that could support accurate structure prediction from sequence alone. Researchers reported that biological structure and function emerged from large-scale unsupervised learning, even though the model was trained primarily on sequence prediction tasks. [GitHub]

Further evidence comes from mutation analysis. If a model assigns very low probability to a particular amino-acid substitution, that substitution often proves damaging in laboratory experiments. This indicates that the model has internalised information about which residues are important for maintaining a protein’s behaviour. [PMC]
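One common way to turn those probabilities into a mutation score is a log-odds ratio between the mutant and the wild-type residue at a masked position. The sketch below assumes that scoring scheme; the probability values are made up for illustration and do not come from any model, and `mutation_score` is a hypothetical helper.

```python
import math


def mutation_score(probs: dict[str, float], wild_type: str, mutant: str) -> float:
    """Log-odds of the mutant versus the wild-type residue at one position.

    Strongly negative values mean the model finds the substitution
    implausible in this sequence context.
    """
    return math.log(probs[mutant]) - math.log(probs[wild_type])


# Hypothetical model output for one masked position (illustrative numbers).
position_probs = {"L": 0.60, "I": 0.25, "V": 0.10, "D": 0.05}

# Leucine -> isoleucine: a chemically similar swap.
conservative = mutation_score(position_probs, wild_type="L", mutant="I")
# Leucine -> aspartate: a hydrophobic residue replaced by a charged one.
disruptive = mutation_score(position_probs, wild_type="L", mutant="D")
```

With these numbers the aspartate substitution scores lower than the isoleucine one, matching the intuition that it is more likely to be damaging.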

Researchers have also shown that protein language models capture co-evolutionary patterns, in which changes at one position are linked to changes elsewhere. Such patterns frequently reflect structural or functional coupling within proteins. The ability to recover these relationships from sequence data alone suggests that the models learn more than surface-level frequency statistics. [PNAS]

Why learned protein representations matter

The most important outcome of this research is not simply better prediction of masked amino acids. The value lies in the learned representations that emerge during training.

These representations can be reused for downstream tasks, including:

  • Predicting protein structure. [PMC]
  • Estimating the effects of genetic mutations.
  • Classifying protein functions. [Wikipedia]
  • Identifying biologically related proteins.
  • Assisting protein engineering and design. [Nature, PMC]

In practical terms, the model acts as a compressed statistical summary of evolutionary knowledge. Instead of analysing millions of related sequences individually, researchers can use embeddings generated by a pretrained model to access information about likely structure and function. [PMC]
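A typical way to reuse embeddings is to average the per-residue vectors into one vector per protein and compare proteins by cosine similarity. The sketch below assumes that recipe; the three-dimensional vectors are invented stand-ins for real model outputs, which usually have hundreds or thousands of dimensions.

```python
import math


def mean_pool(residue_embeddings: list[list[float]]) -> list[float]:
    """Average per-residue vectors into one fixed-size protein vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[j] for vec in residue_embeddings) / n for j in range(dim)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up per-residue embeddings for three proteins of different lengths.
protein_a = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
protein_b = [[0.8, 0.2, 0.1], [0.6, 0.2, 0.0], [1.0, 0.2, 0.2]]
protein_c = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

vec_a, vec_b, vec_c = (mean_pool(p) for p in (protein_a, protein_b, protein_c))
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```

Because pooling yields a fixed-size vector regardless of sequence length, the same vectors can feed a nearest-neighbour search or a small classifier for the downstream tasks listed above.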

This has become one of the strongest demonstrations that Transformer architectures are not limited to human language. When trained on amino-acid sequences, they learn representations that reflect genuine biological organisation. The model begins with nothing more than strings of residues and a missing-token prediction task, yet it develops internal knowledge that aligns with how proteins fold, evolve, and function. [Science, OpenReview]

Protein models illustration 3


Endnotes

  1. Source: github.com
    Link: https://github.com/facebookresearch/esm
    Source snippet

    facebookresearch/esm: Evolutionary Scale Modeling...ESMFold harnesses the ESM-2 language model to generate accurate structure pred...

  2. Source: elisagdelope.rbind.io
    Title: Getting started with Protein Language Models - Elisa G. de Lope
    Link: https://elisagdelope.rbind.io/post/plms/
    Source snippet

    13 Sept 2024 — Like sentences crafted from words, proteins are intr...

  3. Source: pmc.ncbi.nlm.nih.gov
    Title: The language of proteins: NLP, machine learning
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC8050421/
    Source snippet

    by D Ofer · 2021 · Cited by 468 — In this review, we present a modern view on applications of NLP methods to the study of protei...

  4. Source: pubs.acs.org
    Link: https://pubs.acs.org/doi/10.1021/acs.jproteome.5c00506
    Source snippet

    ACS PublicationsProtein Language Models: Applications and Perspectives26 Dec 2025 — Originally designed for language tasks, LLMs have bee...

  5. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12888012/
    Source snippet

    Protein Language Models: Applications and Perspectives - PMCby M Leclercq · 2025 · Cited by 5 — These models treat amino acid sequence...

  6. Source: evolutionaryscale.ai
    Title: esm cambrian
    Link: https://www.evolutionaryscale.ai/blog/esm-cambrian
    Source snippet

    Revealing the mysteries of proteins with...Dec 4, 2024 — Today we're introducing ESM Cambrian, a next generation language model trained...

  7. Source: openreview.net
    Link: https://openreview.net/forum?id=fylclEqgvgd
    Source snippet

    Transformer protein language models are unsupervised...by R Rao · Cited by 479 — In this paper we demonstrate that Transformer...

  8. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12450373/
    Source snippet

    Fine-tuning protein language models unlocks the potential of...by R Sawhney · 2025 · Cited by 5 — Protein language models (pLMs) have...

  9. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12756192/
    Source snippet

    Pretrained Protein Language Model Embeddings...by R Shaw · 2025 · Cited by 6 — Trained on vast databases of protein sequences, these mod...

  10. Source: orbion.life
    Link: https://www.orbion.life/blog/protein-language-models-explained-for-bench-scientists
    Source snippet

    Protein Language Models Explained for Bench Scientists3 Apr 2026 — Mask residues: Randomly hide ~15% of amino acids in each sequence...

  11. Source: nature.com
    Link: https://www.nature.com/articles/s41592-026-03050-9
    Source snippet

    Compressing the collective knowledge of ESM into a single...by T Dinh · 2026 · Cited by 2 — ESM models are pretrained with the mas...

  12. Source: pnas.org
    Link: https://www.pnas.org/doi/10.1073/pnas.2406285121
    Source snippet

    Protein language models learn evolutionary statistics of...by Z Zhang · 2024 · Cited by 164 — We developed a completely unsupervised...

  13. Source: nature.com
    Link: https://www.nature.com/articles/s41467-022-32007-7
    Source snippet

    ProtGPT2 is a deep unsupervised language model for...by N Ferruz · 2022 · Cited by 1098 — We describe ProtGPT2, a language model trained...

  14. Source: nature.com
    Link: https://www.nature.com/articles/s41592-025-02776-2
    Source snippet

    Biophysics-based protein language models for...by S Gelman · 2025 · Cited by 60 — Molecular modeling can generate large datasets reveali...

  15. Source: amiteshbadkul.github.io
    Title: esm2 explained
    Link: https://amiteshbadkul.github.io/blog/2023/esm2-explained/
    Source snippet

    Evolutionary Scale Modeling using Protein Language Models29 Jul 2023 — These protein language models (PLMs) treat amino acid sequences an...

  16. Source: science.org
    Link: https://www.science.org/doi/10.1126/science.ade2574
    Source snippet

    Although the training objective...Read more...

  17. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12806033/
    Source snippet

    The evolutionary scale modeling (ESM) series is promising to revolutionize protein science and engineering through large language models...

  18. Source: pipebio.com
    Title: protein language models
    Link: https://pipebio.com/blog/protein-language-models
    Source snippet

    promises, pitfalls and applicationsJun 18, 2024 — PLMs have proven very valuable to learn the underlying patterns of protein sequences ev...

  19. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Protein
    Source snippet

    ProteinProteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues. Proteins perfo...

  20. Source: huggingface.co
    Link: https://huggingface.co/docs/transformers/en/model_doc/esm
    Source snippet

    ESMThis page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental AI Research Team.Re...

  21. Source: betterhealth.vic.gov.au
    Link: https://www.betterhealth.vic.gov.au/health/healthyliving/protein
    Source snippet

    ProteinProtein is a nutrient your body needs to grow and repair cells, and to work properly. Protein is found in a wide range of food and...

  22. Source: britannica.com
    Link: https://www.britannica.com/science/protein
    Source snippet

    Protein | Definition, Structure, & ClassificationApr 17, 2026 — Protein, highly complex substance that is present in all living organisms...

  23. Source: esmprep.com
    Link: https://www.esmprep.com/9-12/college-admissions
    Source snippet

    ESM | College AdmissionsESM's mission is to help students across the world gain admission to the right school for them, from Princeton to...

  24. Source: pubmed.ncbi.nlm.nih.gov
    Link: https://pubmed.ncbi.nlm.nih.gov/30060014/
    Source snippet

    M Watford · 2018 · Cited by 141 — Proteins are polymers of amino acids linked via α-peptide bonds. They can be represented as primary, se...

  25. Source: pmc.ncbi.nlm.nih.gov
    Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12621866/
    Source snippet

    by AM Subramanian · 2025 · Cited by 1 — We show that although these models exhibit a prodigious latent capacity to access novel amino...

  26. Source: youtube.com
    Link: https://www.youtube.com/watch?v=uPoFdCUqBWk
    Source snippet

    Protein Language Models - MLCB24A protein language model by training on a lot of protein sequences can learn about what amino acids are s...

