Why can next-token models do so much?

Many people first encounter artificial intelligence through chatbots, writing assistants, translation tools, or coding helpers. What makes these systems surprising is that they are often built on a remarkably simple training objective: predict the next piece of text. GPT-style models are trained to read a sequence of tokens and guess what comes next.

Why can next-token prediction produce so many abilities?

A GPT-style model is a decoder-only Transformer. Unlike the original encoder–decoder Transformer designed for translation, it uses causal (masked) self-attention: each token can attend only to itself and to earlier tokens in the sequence, never to later ones. During training, the model repeatedly sees text and learns to predict the next token given everything that came before it. [arXiv:1706.03762, “Attention Is All You Need”; Michael Brenndoerfer]
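The masking rule can be made concrete with a small sketch. It uses plain Python rather than a real deep-learning library, and the function names `causal_mask` and `masked_softmax` and the all-zero scores are illustrative assumptions, not part of any actual GPT implementation.

```python
import math

def causal_mask(n):
    """n x n matrix: entry [i][j] is True if position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over raw attention scores, giving zero weight to masked positions."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        top = max(s for s, ok in zip(row_scores, row_mask) if ok)
        exps = [math.exp(s - top) if ok else 0.0
                for s, ok in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal, for readability).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

Each row is one token's attention weights. The zeros to the right of the diagonal are the future positions the token is not allowed to see.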

This setup creates a powerful learning signal. To predict the next word accurately, the model must absorb many patterns hidden in text:

  • Grammar and sentence structure.
  • Facts and common associations.
  • Styles of writing.
  • Question-and-answer formats.
  • Programming syntax.
  • Translation correspondences between languages.
  • Patterns of reasoning expressed in text.

The model is never explicitly told, “this is a translation task” or “this is a coding task”. Instead, those activities appear within the training data. Predicting the next token pushes the model to capture the structures that generate language in many different contexts. [arXiv:2005.14165, “Language Models are Few-Shot Learners”]
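As a rough illustration of the training signal, the sketch below turns one token sequence into (context, target) pairs and scores a model by the average negative log-probability it assigns to each true next token. The `uniform_model` is a hypothetical stand-in that knows nothing about language; a trained model would assign higher probability to the correct tokens and so achieve a lower loss.

```python
import math

def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def average_loss(tokens, predict):
    """Mean negative log-probability the model assigns to each true next token."""
    pairs = next_token_pairs(tokens)
    return sum(-math.log(predict(ctx)[tgt]) for ctx, tgt in pairs) / len(pairs)

def uniform_model(context, vocab=("the", "cat", "sat", "down")):
    # Hypothetical stand-in: ignores the context and spreads probability evenly.
    return {tok: 1.0 / len(vocab) for tok in vocab}

tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)
loss = average_loss(tokens, uniform_model)

for context, target in pairs:
    print(context, "->", target)
print(round(loss, 3))  # 1.386, i.e. log(4) for a uniform guess over 4 tokens
```

Note that a single sequence of four tokens already yields three prediction problems, which is part of why raw text is such a dense source of supervision.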

An important consequence is that generation and understanding become closely linked. To continue text correctly, the model often needs to infer what the preceding text means. Although this is not the same as human understanding, it allows a single architecture to support many language-related activities.

Decoder-only Transformers and prediction

The decoder-only design proved particularly well suited to large-scale text generation. Because the model generates text one token at a time while conditioning on everything already written, the same mechanism can support many different outputs without changing the underlying architecture. [Michael Brenndoerfer, “Decoder Architecture: Causal Masking & Autoregressive…”]

Consider a prompt such as:

Translate to French: “Good morning”

The model does not switch into a dedicated translation module. It simply continues the text in a way that resembles translation examples seen during training. Likewise, a prompt beginning with a programming problem encourages continuation in the style of code, while a prompt beginning with a question encourages an answer.

This flexibility comes from treating all tasks as text completion. Translation, summarisation, dialogue, classification, and coding become variations of the same operation: predict the most plausible continuation of the current context. GPT-3 demonstrated that sufficiently large autoregressive models could perform many such tasks without gradient updates or task-specific fine-tuning, relying instead on text prompts and examples supplied in the input itself. [arXiv:2005.14165, “Language Models are Few-Shot Learners”; NeurIPS Proceedings]
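The “one operation, many tasks” idea can be sketched as a single generation loop. Everything here is a toy assumption: `TABLE` is a hand-written lookup keyed on the last token only, whereas a real model computes its probabilities from the whole context, and greedy selection is just one of several ways to pick the next token.

```python
def generate(prompt_tokens, next_token_probs, max_new_tokens=10, stop="<eos>"):
    """Autoregressive loop: repeatedly append the most probable next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # distribution given everything so far
        best = max(probs, key=probs.get)   # greedy choice
        if best == stop:
            break
        tokens.append(best)
    return tokens

# Hypothetical hand-written "model" standing in for a trained network.
TABLE = {
    "French:": {"Bonjour": 0.9, "Hello": 0.1},
    "Bonjour": {"<eos>": 0.8, "Bonjour": 0.2},
    "A:": {"4": 0.95, "5": 0.05},
    "4": {"<eos>": 1.0},
}

def toy_model(tokens):
    # Looks only at the last token; unknown contexts end the text.
    return TABLE.get(tokens[-1], {"<eos>": 1.0})

print(generate(["Translate", "to", "French:"], toy_model))
# ['Translate', 'to', 'French:', 'Bonjour']
print(generate(["Q:", "2+2?", "A:"], toy_model))
# ['Q:', '2+2?', 'A:', '4']
```

The same `generate` function handles both prompts. Only the context changes, which is the sense in which translation and question answering become variations of text completion.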

Pre-training, prompting, and task flexibility

The crucial step was pre-training at scale. GPT-style models are exposed to enormous amounts of text from books, articles, websites, code repositories, and other sources. During this stage they are not learning a specific application. They are learning broad statistical regularities about language and information. [arXiv:2005.14165, “Language Models are Few-Shot Learners”]

Once pre-training is complete, prompting provides a way to activate different behaviours. A prompt acts as a temporary specification of the task. For example:

  • A question prompt encourages question answering.
  • Input-output examples encourage translation or classification.
  • A partially written program encourages code completion.
  • A conversation transcript encourages dialogue.

Researchers described this behaviour as zero-shot, one-shot, and few-shot learning. In zero-shot use, only an instruction is given, with no worked examples. In one-shot or few-shot use, the prompt also includes one or more examples of the task. GPT-3 showed that larger language models became increasingly capable of adapting to new tasks through prompting alone. [arXiv:2005.14165, “Language Models are Few-Shot Learners”; NeurIPS Proceedings]
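The difference between the three settings is only in how the prompt text is assembled. The sketch below shows one possible layout; the helper name `build_prompt`, the `=>` separator, and the example word pairs are illustrative assumptions rather than a required format.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a prompt: instruction, optional worked examples, then the open query."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # left open for the model to complete
    return "\n".join(lines)

instruction = "Translate English to French:"

zero_shot = build_prompt(instruction, "cheese")
one_shot = build_prompt(instruction, "cheese", [("sea otter", "loutre de mer")])
few_shot = build_prompt(
    instruction,
    "cheese",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
)

print(few_shot)
```

In every case the model's parameters stay fixed. The examples live only in the input text, and the model is asked to continue the final, unfinished line.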

This phenomenon is often called in-context learning. Rather than updating its parameters during the interaction, the model uses patterns contained in the prompt itself to infer what kind of continuation is expected. Subsequent research has explored how this capability emerges and how much of it can be explained through the interaction of pre-training, memory, and pattern matching within context. [OpenReview, “Why In-Context Learning Models are Good Few-Shot…”; ACL Anthology]

A useful way to think about prompting is that it turns natural language into a universal interface. Instead of creating separate systems for each task, users describe the task using text, and the model responds by continuing that text appropriately.

Why scaling mattered

Early language models could generate coherent sentences but struggled to generalise beyond narrow patterns. As model size, training data, and computation increased, researchers observed substantial improvements across a wide range of tasks. GPT-3 became influential not simply because it was larger, but because it showed that a single next-token predictor could display useful performance on translation, question answering, text generation, and other tasks without task-specific retraining. [arXiv:2005.14165, “Language Models are Few-Shot Learners”; NeurIPS Papers]

Researchers later described some newly appearing capabilities as “emergent abilities”, meaning behaviours that seemed absent in smaller models but became visible in larger ones. Examples included stronger reasoning performance, improved instruction following, and more effective in-context learning. However, there remains active debate about how these behaviours arise and whether they are truly emergent or the result of more gradual scaling effects combined with prompting methods. [arXiv:2206.07682, “Emergent abilities of large language models”; OpenReview]

What is clear is that scale transformed next-token prediction from a specialised language-modelling task into a practical foundation for general-purpose text systems.

What generation reveals and hides

The success of GPT-style models can create the impression that they possess a unified understanding of the world. In reality, users observe only the generated output. The underlying process remains a sequence of probability estimates over possible next tokens. [arXiv:2005.14165, “Language Models are Few-Shot Learners”]
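A minimal sketch of what “probability estimates over possible next tokens” means in practice: raw scores are converted into a distribution with softmax, and one token is drawn from it. The scores in `logits` are invented for illustration, and sampling with `random.choices` is only one possible selection strategy.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the token after "The capital of France is".
logits = {"Paris": 4.0, "Lyon": 2.0, "London": 1.0}
probs = softmax(logits)
print({tok: round(p, 2) for tok, p in probs.items()})
# {'Paris': 0.84, 'Lyon': 0.11, 'London': 0.04}

# Drawing from the distribution: unlikely tokens are still chosen sometimes.
rng = random.Random(0)
draws = rng.choices(list(probs), weights=list(probs.values()), k=1000)
print({tok: draws.count(tok) for tok in probs})
```

The reader of the generated text sees only the token that was picked, not the distribution behind it, so a low-probability choice and a high-probability one look equally fluent on the page.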

This leads to both strengths and weaknesses.

Generation reveals:

  • Broad knowledge acquired during pre-training.
  • The ability to imitate formats and styles.
  • Adaptation to instructions and examples in prompts.
  • Flexible use across many language tasks. [researchgate.net, “Are Emergent Abilities in Large Language Models just In-Context Learning?”]

Generation can also hide:

  • Gaps in factual knowledge.
  • Uncertainty that is not always expressed explicitly.
  • Failures of reasoning behind fluent language.
  • Dependence on patterns learned from training data rather than genuine comprehension.

Because the model’s objective is to produce a plausible continuation, fluent output does not guarantee correctness. The same mechanism that enables creativity and flexibility can also produce confident mistakes.

The central lesson is that GPT-style systems became flexible generators not because engineers built separate modules for every task, but because large decoder-only Transformers learned an exceptionally broad predictive model of text. Once language itself became the interface, next-token prediction turned into a surprisingly general way of drafting, answering, translating, coding, and conversing. [arXiv:2005.14165, “Language Models are Few-Shot Learners”; NeurIPS Proceedings]

Endnotes

  1. Source: arxiv.org
    Title: arXiv Language Models are Few-Shot Learners
    Link: https://arxiv.org/abs/2005.14165
    Source snippet

    Language Models are Few-Shot LearnersMay 28, 2020...

    Published: May 28, 2020

  2. Source: arxiv.org
    Link: https://arxiv.org/abs/1706.03762
    Source snippet

    arXiv[1706.03762] Attention Is All You NeedJun 12, 2017 — We propose a new simple network architecture, the Transformer, based solely on...

  3. Source: proceedings.neurips.cc
    Title: Language Models are Few-Shot Learners
    Link: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
    Source snippet

    NeurIPS ProceedingsLanguage Models are Few-Shot Learnersby T Brown · 2020 · Cited by 74705 — We demonstrate that scaling up language mode...

  4. Source: openreview.net
    Link: https://openreview.net/forum?id=iLUcsecZJp
    Source snippet

    Why In-Context Learning Models are Good Few-Shot...by S Wu · Cited by 20 — Our findings show that ICL with transformers can ef...

  5. Source: arxiv.org
    Title: arXiv Are Emergent Abilities in Large Language Models just In-Context Learning?
    Link: https://arxiv.org/abs/2309.01809

  6. Source: arxiv.org
    Title: arXiv Emergent abilities of large language models
    Link: https://arxiv.org/pdf/2206.07682
    Source snippet

    Emergent abilities of large language modelsJune 15, 2022 — by J Wei · 2022 · Cited by 5470 — We first discuss emergent abilities in...

    Published: June 15, 2022

  7. Source: openreview.net
    Link: https://openreview.net/pdf?id=yzkSU5zdwD
    Source snippet

    Emergent Abilities of Large Language Modelsby J Wei · Cited by 5337 — Brown et al. (2020) proposed few-shot prompting, which in...

  8. Source: arxiv.org
    Link: https://arxiv.org/html/2503.05788v2
    Source snippet

    Emergent Abilities in Large Language Models: A SurveySummarizes in-context learning (ICL), the capability for few-shot generalization to...

  9. Source: mbrenndoerfer.com
    Link: https://mbrenndoerfer.com/writing/decoder-architecture-causal-masking-autoregressive-transformers
    Source snippet

    Michael BrenndoerferDecoder Architecture: Causal Masking & Autoregressive...Jun 17, 2025 — A decoder-only model consists of a stack of t...

  10. Source: papers.nips.cc
    Link: https://papers.nips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
    Source snippet

    Models are Few-Shot Learnersby T Brown · 2020 · Cited by 74011 — We demonstrate that scaling up language models greatly improves task-agn...

  11. Source: aclanthology.org
    Title: Are Emergent Abilities in Large Language Models just In-Context Learning?
    Link: https://aclanthology.org/2024.acl-long.279.pdf
    Source snippet

    Are Emergent Abilities in Large Language Models just In-...by S Lu · 2024 · Cited by 202 — In-Context Learning ICL is a learning paradig...

  12. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Language
    Source snippet

    LanguageLanguage is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which hum...

  13. Source: mbrenndoerfer.com
    Title: GPT-3: Scale, Few-Shot Learning & In-Context...
    Link: https://mbrenndoerfer.com/writing/gpt-3-scale-few-shot-in-context-learning
    Source snippet

    GPT-3: Scale, Few-Shot Learning & In-Context...Jul 19, 2025 — The paper introducing GPT-3, "Language Models are Few-Shot Learners" by Br...

Additional References

  1. Source: medium.com
    Link: https://medium.com/%40alejandro.itoaramendia/attention-is-all-you-need-a-complete-guide-to-transformers-8670a3f09d02
    Source snippet

    Attention Is All You Need: A Complete Guide to TransformersAt each of these steps, the model is auto-regressive, meaning that previously...

  2. Source: medium.com
    Link: https://medium.com/%40qmsoqm2/auto-regressive-vs-sequence-to-sequence-d7362eda001e
    Source snippet

    Encoder-Decoder vs. Decoder-OnlyThe straightforward answer is that the auto-regressive one only features a decoder stack (dec-only), whil...

  3. Source: medium.com
    Link: https://medium.com/%40akankshasinha247/few-shot-prompting-teaching-ai-with-just-a-few-examples-6819273fd6e2
    Source snippet

    Few-Shot Prompting: Teaching AI With Just a Few ExamplesFew-shot prompting is one of the most practical and powerful prompt engineering t...

  4. Source: slideshare.net
    Link: https://www.slideshare.net/slideshow/llm-gpt-3-language-models-are-few-shot-learners/268660713
    Source snippet

    LLM GPT-3: Language models are few-shot learners | PPTXThe document outlines the evolution and capabilities of the GPT language models fr...

  5. Source: medium.com
    Link: https://medium.com/%40manindersingh120996/hands-on-with-transformers-recreating-attention-is-all-you-need-in-pytorch-step-by-step-ecfbf3e1985b
    Source snippet

    Recreating 'Attention Is All You Need' in PyTorch, Step by...In this blog, I'll walk you through everything I built, step by step, in Py...

  6. Source: researchgate.net
    Link: https://www.researchgate.net/publication/341724146_Language_Models_are_Few-Shot_Learners
    Source snippet

    (PDF) Language Models are Few-Shot LearnersHere we show that scaling up language models greatly improves task-agnostic, few-shot performa...

  7. Source: reddit.com
    Link: https://www.reddit.com/r/MachineLearning/comments/gsivhg/r_language_models_are_fewshot_learners/
    Source snippet

    [R] Language Models are Few-Shot LearnersThe performance on few-shot and zero-shot tasks improves dramatically as they increase model siz...

  8. Source: researchgate.net
    Link: https://www.researchgate.net/publication/392255772_In-Context_Learning_in_Large_Language_Models_LLMs_Mechanisms_Capabilities_and_Implications_for_Advanced_Knowledge_Representation_and_Reasoning
    Source snippet

    In-Context Learning in Large Language Models (LLMs)Mar 16, 2026 — We investigate how LLMs encode and use knowledge via ICL, the evolving...

  9. Source: researchgate.net
    Link: https://www.researchgate.net/publication/373686518_Are_Emergent_Abilities_in_Large_Language_Models_just_In-Context_Learning
    Source snippet

    Are Emergent Abilities in Large Language Models just In-...Sep 4, 2023 — Large language models have exhibited emergent abilities, demons...

  10. Source: youtube.com
    Link: https://www.youtube.com/watch?v=5i-SC-roENM
    Source snippet

    GPT-3: Language Models are Few-shot Learnerstpt-3 achieves promising results under the regimes of zero shot one shot and few shot learnin...
