Within Language Models
Why chatbots do not really read words
Tokenization turns text into uneven pieces, so a model's basic units are often fragments rather than the words a reader sees.
On this page
- What tokens are and why they vary
- How token IDs and embeddings carry patterns
- Where token boundaries can change outputs
Page outline Jump by section
Introduction
Before a chatbot can answer a question, it must convert the text into units it can process. These units are called tokens. A token may be a whole word, part of a word, punctuation, or even a space combined with nearby characters. As a result, a language model does not begin with the same building blocks that a human reader sees on the screen. The way text is divided into tokens influences what patterns the model learns during training and how it generates answers later. Research increasingly shows that tokenisation is not merely a technical pre-processing step; it can affect reasoning, multilingual performance, spelling behaviour, efficiency, and even the probability of particular responses. [Microsoft Learn+2Clearbox AI]learn.microsoft.comLearn Understanding tokensMicrosoft LearnUnderstanding tokens -.NETMar 13, 2026 — Understand how large language models (LLMs) use tokens to analyze semantic relat…
Why chatbots do not really read words
Large language models work with sequences of token IDs rather than written words. The tokenizer converts text into numerical identifiers, and the model processes those identifiers throughout training and generation. Only after the model has produced a sequence of output tokens are those tokens converted back into readable text. [Microsoft Learn+2Clearbox AI]learn.microsoft.comLearn Understanding tokensMicrosoft LearnUnderstanding tokens -.NETMar 13, 2026 — Understand how large language models (LLMs) use tokens to analyze semantic relat…
This means that the apparent word structure seen by a user is not necessarily the structure seen by the model. A common word might be represented as a single token, while a rare word could be split into several pieces. For example, a tokenizer may store a frequent word as one unit but represent an unusual technical term as multiple fragments. Modern GPT-style systems commonly use variants of Byte Pair Encoding (BPE), which build vocabularies from frequently occurring character patterns rather than from complete dictionary words. [Hugging Face+2Sebastian Raschka, PhD]huggingface.coHugging FaceByte-Pair Encoding tokenizationByte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then u…
The practical consequence is that two phrases with nearly identical meanings may be represented internally in very different ways. One version may align neatly with familiar tokens, while another may be fragmented into many smaller pieces. The model’s predictions are therefore shaped partly by how easily the tokenizer can represent the text. [arXiv]arxiv.orgEffect of Tokenization on Performance of LLMsDecember 26, 2025 — by S Pawar · 2025 · Cited by 2 — In this paper, we hypothesize that…
What tokens are and why they vary
Token boundaries are uneven because tokenizers are designed to balance vocabulary size against efficiency. Frequent patterns receive their own tokens, while uncommon patterns are assembled from smaller pieces. On average, a token often corresponds to only a few bytes of text rather than a complete word. [GitHub]github.comtiktoken is a fast BPE tokeniser for use with OpenAI's models.It's reversible and lossless, so you can convert tokens back into the…
Several factors influence how text is tokenised:
- Word frequency: Common words and phrases often become single tokens.
- Language structure: Languages with complex morphology or compound words may be split into many more tokens.
- Writing system: Languages using different scripts can receive very different token allocations.
- Content type: Code, mathematical notation, and specialised terminology are often fragmented differently from ordinary prose. [OpenReview+3machinelearningplus+3arXiv]machinelearningplus.comHow LLM Tokenization Works: Build a BPE TokenizerBPE merges the most frequent byte pairs step by step; tiktoken is the…
This variation matters because context windows, memory limits, and computational costs are measured in tokens, not words. A paragraph that appears short to a reader may consume substantially more tokens depending on language and vocabulary choices. [machinelearningplus+2LinkedIn]machinelearningplus.comHow LLM Tokenization Works: Build a BPE TokenizerBPE merges the most frequent byte pairs step by step; tiktoken is the…
How token IDs and embeddings carry patterns
After tokenisation, each token is mapped to a numerical ID. Those IDs are then transformed into embeddings, which are mathematical vectors that capture statistical relationships learned during training. Tokens that often appear in similar contexts tend to develop related representations. [Microsoft Learn]learn.microsoft.comLearn Understanding tokensMicrosoft LearnUnderstanding tokens -.NETMar 13, 2026 — Understand how large language models (LLMs) use tokens to analyze semantic relat…
Because embeddings are attached to tokens rather than directly to words, the model’s understanding is partly shaped by the tokenizer’s decisions. If a concept is represented by a stable token or token sequence, the model can accumulate rich statistical knowledge about it. If the same concept is frequently fragmented into different token combinations, learning may be less efficient. Research examining tokenisation quality has found measurable relationships between token fragmentation and downstream model performance. [arXiv+2ACL Anthology]arxiv.orgEffect of Tokenization on Performance of LLMsDecember 26, 2025 — by S Pawar · 2025 · Cited by 2 — In this paper, we hypothesize that…
This helps explain why tokenisation is often described as part of the model’s representational system rather than merely a storage format. The tokenizer determines the pieces from which all later patterns are built. [Clearbox AI]clearbox.aiAIA quickA quick introduction to tokenizersIt's important to note that tokenization is always performed before training a language model. It is an…
Where token boundaries can change outputs
One of the most surprising findings in recent research is that token boundaries can influence the probability of generated text even when the visible wording appears almost identical.
Whitespace provides a simple example. Some tokenizers treat a word with a leading space as a different token from the same word without that space. Researchers have shown that such distinctions can alter the probabilities assigned to subsequent tokens, subtly changing generation behaviour. [PMC]pmc.ncbi.nlm.nih.govThis behavior introduces subtle biases that affect the model's language generation process and response characteristics…
More dramatic effects appear when prompts end at awkward token boundaries. Recent studies describe a “partial token” problem in which a prompt may end in the middle of what the tokenizer expects to be a larger token. In these cases, models can assign dramatically different probabilities to continuations compared with equivalent prompts aligned to token boundaries. The effect has been observed across leading language models and can be especially important in code, highly compounding languages, and scripts that do not rely on spaces between words. [arXiv]arxiv.orgAre you going to finish that? A Practical Study of the Tokenization Boundary ProblemJanuary 30, 2026…
Researchers have also documented broader forms of tokenisation bias. Because some languages, writing systems, and word structures are segmented more efficiently than others, models may devote different amounts of representational capacity to similar meanings. These differences can affect performance across languages and tasks. [arXiv+2Emergent Mind]arxiv.orgOpen source on arxiv.org.
Why tokenisation remains an active area of research
Tokenisation was once treated as a largely solved engineering problem. Increasingly, researchers view it as a factor that can shape model behaviour in ways that remain poorly understood. Studies have linked tokenizer choice to downstream task performance, multilingual fairness, reasoning consistency, and sensitivity to prompt formatting. [Hugging Face+2OpenReview]huggingface.coHugging FaceTokSuite: Measuring the Impact of Tokenizer Choice on…Dec 23, 2025 — TokSuite enables systematic研究 of tokenization effects…
Some newer work explores alternatives, including byte-level and token-free approaches that aim to reduce distortions introduced by fixed vocabularies. Other research focuses on improving token alignment, reducing tokenisation bias, or designing tokenizers that better match linguistic structure. [arXiv+2Emergent Mind]arxiv.orgOpen source on arxiv.org.
For anyone trying to understand artificial intelligence, the key insight is simple: a chatbot’s answers begin not with words but with tokens. The way those tokens are defined influences what the model learns, what patterns it can recognise efficiently, and sometimes which answer it ultimately produces. [Microsoft Learn+2Hugging Face]learn.microsoft.comLearn Understanding tokensMicrosoft LearnUnderstanding tokens -.NETMar 13, 2026 — Understand how large language models (LLMs) use tokens to analyze semantic relat…
Amazon book picks
Further Reading
Books and field guides related to Why chatbots do not really read words. Use these as the next step if you want deeper reading beyond the article.
Build a Large Language Model (From Scratch)
Includes tokenization, embeddings, and transformer fundamentals.
Natural Language Processing with Transformers
Covers tokenization and transformer-based NLP.
Speech and Language Processing: Pearson New International Edi...
Provides deep treatment of text representation and language processing.
Endnotes
-
Source: learn.microsoft.com
Title: Learn Understanding tokens
Link: https://learn.microsoft.com/en-us/dotnet/ai/conceptual/understanding-tokensSource snippet
Microsoft LearnUnderstanding tokens -.NETMar 13, 2026 — Understand how large language models (LLMs) use tokens to analyze semantic relat...
-
Source: clearbox.ai
Title: AIA quick
Link: https://www.clearbox.ai/blog/quick-introduction-tokenizersSource snippet
A quick introduction to tokenizersIt's important to note that tokenization is always performed before training a language model. It is an...
-
Source: linkedin.com
Link: https://www.linkedin.com/posts/prashant-lakhera-696119b_llm-nlp-machinelearning-activity-7344544552529088513-E5SASource snippet
Building a Small Language Model with BPE and tiktoken🏗️ Inside the Model Text → Tokenizer → Token IDs → Embedding Layer → Transformer Blo...
-
Source: github.com
Link: https://github.com/openai/tiktokenSource snippet
tiktoken is a fast BPE tokeniser for use with OpenAI's models.It's reversible and lossless, so you can convert tokens back into the...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2512.21933Source snippet
Effect of Tokenization on Performance of LLMsDecember 26, 2025 — by S Pawar · 2025 · Cited by 2 — In this paper, we hypothesize that...
Published: December 26, 2025
-
Source: machinelearningplus.com
Link: https://machinelearningplus.com/gen-ai/build-bpe-tokenizer/Source snippet
How LLM Tokenization Works: Build a BPE TokenizerBPE merges the most frequent byte pairs step by step; tiktoken is the...
-
Source: arxiv.org
Title: arXiv Impact of Tokenization on Language Models: An Analysis for Turkish
Link: https://arxiv.org/abs/2204.08832 -
Source: arxiv.org
Link: https://arxiv.org/abs/2305.17179 -
Source: openreview.net
Title: Open Review Do All Languages Cost the Same?
Link: https://openreview.net/forum?id=OUmxBN45GlSource snippet
Tokenization in the Era...by O Ahia · Cited by 201 — TL;DR: We investigate the effects of subword tokenization in LLMs across languages...
-
Source: linkedin.com
Link: https://www.linkedin.com/pulse/learnings-understanding-tokens-hidden-language-powers-blanchet-xohweSource snippet
Learnings- Understanding Tokens: The Hidden Language...AI models have a fixed "context window" measured in tokens, not words...
-
Source: learn.microsoft.com
Title: Learn What’s new in ML.NE T
Link: https://learn.microsoft.com/en-us/dotnet/[machine-learningSource snippet
Microsoft LearnWhat's new in ML.NETMay 16, 2025 — Tokenizers are responsible for breaking down a string of text into smaller, more manage...
Published: May 16, 2025
-
Source: learn.microsoft.com
Title: 300,000 tokens summed across all inputs. To learn more about
Link: https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/embeddingsSource snippet
Microsoft LearnHow to generate embeddings with Azure OpenAI in...May 13, 2026 — The maximum length of input text for the latest embeddin...
Published: May 13, 2026
-
Source: arxiv.org
Link: https://arxiv.org/abs/2601.23223Source snippet
Are you going to finish that? A Practical Study of the Tokenization Boundary ProblemJanuary 30, 2026...
Published: January 30, 2026
-
Source: arxiv.org
Link: https://arxiv.org/abs/2410.09303 -
Source: openreview.net
Title: Open Review Broken Tokens?
Link: https://openreview.net/forum?id=WrYWolqKh3&referrer=%5Bthe+profile+of+Noah+A.+Smith%5D%28%2Fprofile%3Fid%3D~Noah_A._Smith2%29Source snippet
Your Language Model can Secretly...The work focuses on an important yet under-explored aspect of LLMs—how BPE tokenization choices at in...
-
Source: arxiv.org
Link: https://arxiv.org/html/2601.14658v1Source snippet
When Tokenizer Betrays Reasoning in LLMs21 Jan 2026 — In this work, we show that tokenization can betray LLM reasoning through one-to-man...
-
Source: learn.microsoft.com
Title: what are tokens
Link: https://learn.microsoft.com/en-us/answers/questions/1343138/what-are-tokensSource snippet
are tokens? - Microsoft Q&A8 Aug 2023 — In the context of Azure OpenAI, tokens refer to the basic units of input and output that the serv...
-
Source: learn.microsoft.com
Title: token count for pdf input in azure openai
Link: https://learn.microsoft.com/en-in/answers/questions/5603760/token-count-for-pdf-input-in-azure-openaiSource snippet
count for PDF input in Azure OpenAIOct 30, 2025 — Please note that the following text has been translated into English using a translatio...
-
Source: learn.microsoft.com
Title: token count for pdf input in azure openai
Link: https://learn.microsoft.com/en-ie/answers/questions/5603760/token-count-for-pdf-input-in-azure-openaiSource snippet
count for PDF input in Azure OpenAIOct 30, 2025 — Please note that the following text has been translated into English using a translatio...
-
Source: learn.microsoft.com
Link: https://learn.microsoft.com/en-us/connectors/azureopenai/Source snippet
OpenAI (Preview) - ConnectorsAzure OpenAI On Your Data enables you to run advanced AI models such as GPT-35-Turbo and GPT-4 on your own e...
-
Source: learn.microsoft.com
Title: use tokenizers
Link: https://learn.microsoft.com/en-us/dotnet/ai/how-to/use-tokenizersSource snippet
for text tokenization -.NETApr 9, 2026 — Learn how to use the Microsoft.ML.Tokenizers library to tokenize text f...
-
Source: learn.microsoft.com
Title: token count for pdf input in azure openai
Link: https://learn.microsoft.com/en-ca/answers/questions/5603760/token-count-for-pdf-input-in-azure-openaiSource snippet
count for PDF input in Azure OpenAIOct 30, 2025 — Please note that the following text has been translated into English using a translatio...
-
Source: learn.microsoft.com
Link: https://learn.microsoft.com/en-us/answers/questions/2202015/unexplainable-token-usage-with-azure-[ai-searchSource snippet
token usage with azure ai search as...Mar 11, 2025 — You can view the detailed usage like active tokens, processed token etc, from Azure...
-
Source: learn.microsoft.com
Title: licensing prompt tokens
Link: https://learn.microsoft.com/en-us/ai-builder/licensing-prompt-tokensSource snippet
tokensJan 14, 2026 — We use the auto setting when passing images to Azure OpenAI. This means the token cost of an image depends on its in...
-
Source: learn.microsoft.com
Title: quotas limits
Link: https://learn.microsoft.com/en-us/azure/foundry/openai/quotas-limitsSource snippet
OpenAI in Microsoft Foundry Models Quotas and LimitsMay 27, 2026 — Tokens per minute (TPM) and requests per minute (RPM) limits are defin...
Published: May 27, 2026
-
Source: learn.microsoft.com
Title: how to enable usage report in azure open ai
Link: https://learn.microsoft.com/en-us/answers/questions/1293014/how-to-enable-usage-report-in-azure-open-aiSource snippet
to enable usage report in Azure Open AIMay 28, 2023 — How to enable the usage report in the Azure Open AI service, OpenAI uses tiktoken a...
Published: May 28, 2023
-
Source: learn.microsoft.com
Title: how do you calculate the token needed for azure op
Link: https://learn.microsoft.com/en-us/answers/questions/1660976/how-do-you-calculate-the-token-needed-for-azure-opSource snippet
Another simpler method is to check the token consumption for a request from the...Read more...
-
Source: learn.microsoft.com
Title: can azure genai gateway self hosted version enforc
Link: https://learn.microsoft.com/en-ca/answers/questions/2261794/can-azure-genai-gateway-self-hosted-version-enforcSource snippet
Azure GenAI Gateway self-hosted version enforce llm-...Apr 28, 2025 — Token counting depends on the model's tokenizer, which varies acro...
-
Source: learn.microsoft.com
Link: https://learn.microsoft.com/en-us/azure/developer/ai/gen-ai-concepts-considerations-developersSource snippet
concepts and considerations in [generative]({{ 'generative-ai/' | relative_url }}) AIJan 30, 2026 — Tokenization is the process of breaking text into tokens—the smallest units a...
-
Source: learn.microsoft.com
Title: word count limit for system message in gpt 4o real
Link: https://learn.microsoft.com/en-ca/answers/questions/2129642/word-count-limit-for-system-message-in-gpt-4o-realSource snippet
count limit for system message in gpt-4o-realtime API...Dec 12, 2024 — Hi Team, • Is there a word count limit for system message in gpt...
-
Source: learn.microsoft.com
Title: natural language processing
Link: https://learn.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/natural-language-processingSource snippet
a Natural Language Processing TechnologyApr 21, 2026 — This guide focuses on natural language processing capabilities available through A...
-
Source: learn.microsoft.com
Title: azure openai on your data stream options parameter
Link: https://learn.microsoft.com/en-sg/answers/questions/2280689/azure-openai-on-your-data-stream-options-parameterSource snippet
OpenAI "on your data": stream_options parameter...Jun 3, 2025 — I'm trying to implement streaming with token usage tracking for Azure Op...
-
Source: learn.microsoft.com
Title: azure open ai gpt 4o mini image input token limita
Link: https://learn.microsoft.com/en-us/answers/questions/2126187/azure-open-ai-gpt-4o-mini-image-input-token-limitaSource snippet
Open AI GPT 4o Mini - Image Input - Token...Dec 4, 2024 — The token limit error you are encountering when using the Azure OpenAI GPT-4o...
-
Source: arxiv.org
Link: https://arxiv.org/html/2412.10924v9Source snippet
Large language models, the distributional hypothesis, and...Nov 22, 2025 — Tokenization is a necessary component within the current arch...
-
Source: openreview.net
Title: Language Model Beats Diffusion
Link: https://openreview.net/forum?id=gzqrANCF4gSource snippet
Tokenizer is key to...by L Yu · Cited by 653 — With a good visual tokenizer, language model style transformer outperforms diffusion mode...
-
Source: openreview.net
Link: https://openreview.net/pdf?id=y1lzO94o65Source snippet
Parallel BPE Tokenization - BlockBPEby A You · Cited by 4 — By tokenizing inputs locally before making API calls, users can accurately es...
-
Source: linkedin.com
Link: https://www.linkedin.com/posts/code-byte_stanford-cs336-language-modeling-from-scratch-activity-7412696691868618752-jNUoSource snippet
ns) Most modern models rely on 𝗕𝘆𝘁𝗲 𝗣𝗮𝗶𝗿 𝗘𝗻𝗰𝗼𝗱𝗶𝗻𝗴 (𝗕𝗣𝗘).Read more...
-
Source: yannael.github.io
Link: https://yannael.github.io/video2blogpost/final_output/blogpost.htmlSource snippet
blogpostByte Pair Encoding (BPE) is an algorithm used for tokenizing text data. It works by iteratively finding the most common pair of b...
-
Source: github.blog
Link: https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/Source snippet
The byte-pair encoding (BPE) algorithm is such a tokenizer, used (for example) by...Read more...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZSource snippet
"our Discord channel: [https://discord.gg/3zy8kqD9Cp](https://discord.gg/3zy8kqD9Cp) my Twitter: [https://twitter.com/karpathy..."](https://twitter.com/karpathy...")...
-
Source: huggingface.co
Link: https://huggingface.co/learn/llm-course/en/chapter6/5Source snippet
Hugging FaceByte-Pair Encoding tokenizationByte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then u...
-
Source: sebastianraschka.com
Title: bpe from scratch
Link: https://sebastianraschka.com/blog/2025/bpe-from-scratch.htmlSource snippet
Sebastian Raschka, PhDBPE Tokenizer From Scratch17 Jan 2025 — This is a standalone notebook implementing the popular byte pair encoding (...
-
Source: aclanthology.org
Link: https://aclanthology.org/2025.ijcnlp-short.31.pdfSource snippet
ACL AnthologyEffect of Tokenization on Performance of LLMsby S Pawar · 2025 · Cited by 2 — In this paper, we hypothe- size that such brea...
-
Source: emergentmind.com
Title: tokenisation bias in language models
Link: https://www.emergentmind.com/topics/tokenisation-bias-in-language-modelsSource snippet
10 Feb 2026 — Tokenisation bias is a systematic distortion in LLMs arising from tokenization choices that fragment and misrepresent langu...
-
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11339515/Source snippet
This behavior introduces subtle biases that affect the model's language generation process and response characteristics...
-
Source: huggingface.co
Link: https://huggingface.co/papers/2512.20757Source snippet
Hugging FaceTokSuite: Measuring the Impact of Tokenizer Choice on...Dec 23, 2025 — TokSuite enables systematic研究 of tokenization effects...
-
Source: huggingface.co
Link: https://huggingface.co/blog/omarkamali/tokenizationSource snippet
is Killing our Multilingual LLM DreamMar 15, 2026 — Tokenization is the step that converts raw text into numbers a language model actuall...
-
Source: raw.githubusercontent.com
Link: https://raw.githubusercontent.com/mlresearch/v258/main/assets/rauba25a/rauba25a.pdfSource snippet
Visualizing token importance for black-box language modelsby P Rauba · Cited by 1 — Such an approach allows practitioners to visualize an...
Additional References
-
Source: medium.com
Link: https://medium.com/%40ashishpandey2062/understanding-the-secret-language-of-ai-how-gpt-bert-and-t5-actually-read-text-742d126ce8b7Source snippet
How AI Reads Text: Tokenization in GPT, BERT & T5...Learn how BPE, WordPiece, and Unigram tokenizers work in GPT, BERT, and T5. Complete...
-
Source: ndss-symposium.org
Link: https://www.ndss-symposium.org/ndss-paper/auto-draft-627/Source snippet
How Different Tokenization Algorithms Impact LLMs and...Through intrinsic evaluations, we compare tokenizers based on tokenization effic...
-
Source: ml4devs.com
Link: https://www.ml4devs.com/what-is/tokenization/Source snippet
Tokenization: How LLMs Split Text into TokensUnderstand how LLMs split text into tokens using BPE and WordPiece, and why token count gove...
-
Source: devopslearning.medium.com
Link: https://devopslearning.medium.com/day-4-of-50-days-of-building-a-small-language-model-from-scratch-understanding-byte-pair-78d4f22d9d3cSource snippet
Byte Pair Encoding (BPE) Tokenizer | by...Before any text can be processed by a model, it needs to be tokenized, that is, broken into sm...
-
Source: facebook.com
Link: https://www.facebook.com/groups/698593531630485/posts/1356741835815648/ -
Source: researchgate.net
Link: https://www.researchgate.net/publication/389707683_Tokenization_Changes_Meaning_in_Large_Language_Models_Evidence_from_ChineseSource snippet
Tokenization Changes Meaning in Large Language ModelsFeb 15, 2025 — Large language models segment many words into multiple tokens, and th...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=jDwrK_-s-UsSource snippet
How does OpenAI Tokenizer Work for Different GPT models...Video explains how OpenAI's tokenizer divides text into smaller units called t...
-
Source: youtube.com
Link: https://www.youtube.com/watch?v=gstdcCDqdlc -
Source: themoonlight.io
Link: https://www.themoonlight.io/en/review/how-does-a-language-specific-tokenizer-affect-llmsSource snippet
summary worldwide for the paper titled How does a Language-Specific Tokenizer affect LLMs?Read more...
-
Source: francescopochetti.com
Title: byte pair encoding building the gpt tokenizer with karpathy
Link: https://francescopochetti.com/byte-pair-encoding-building-the-gpt-tokenizer-with-karpathy/Source snippet
Byte Pair Encoding: building the GPT tokenizer with Karpathy -17 Mar 2024 — I'll specifically try to cover the Byte Pair Encoding (BPE) a...
Topic Tree



