Do tokens make some languages harder for AI?
Languages and writing systems can be tokenised unevenly, which may change cost, context use, and answer quality across users.
On this page
- Why scripts and word structures get unequal token budgets
- How fragmentation affects context, cost, and performance
- What fairer tokenisation would need to improve
Introduction
Tokenisation does not affect all languages equally. Because chatbots measure input, memory, and cost in tokens rather than words, people using different languages can receive different levels of efficiency from the same AI system. A sentence that fits comfortably into a chatbot’s context window in one language may consume far more tokens in another. This difference can increase costs, reduce the amount of information the model can remember, and sometimes lower answer quality. Research over the past several years has increasingly identified this phenomenon as a form of tokenisation bias: a structural inequality created before the model even begins generating a response. [arXiv: Language Model Tokenizers Introduce Unfairness, A Petrov, 2023]
Within the broader question of how tokenisation shapes chatbot answers, multilingual tokenisation bias is important because it affects billions of users who interact with AI in languages other than English. The issue is not simply translation quality. It concerns how efficiently different languages are represented inside the model itself. [arXiv: Tokenization Disparities as Infrastructure Bias, 14 Oct 2025]
Do tokens make some languages harder for AI?
The short answer is yes. Modern language models typically use subword tokenisers that build vocabularies from frequently occurring character patterns. Languages that resemble the data used to construct those vocabularies often compress efficiently into relatively few tokens. Languages with different writing systems, longer word structures, or lower representation in training data may be fragmented into many more pieces. [arXiv: Language Model Tokenizers Introduce Unfairness, A Petrov, 2023; ACL Anthology]
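As a rough illustration of how a frequency-driven vocabulary favours the text it was built from, the sketch below learns a few byte-pair-style merges from a tiny English-only word list, then segments one English word and one Russian word. The function names (`learn_merges`, `apply_merge`, `segment`) and the toy word list are assumptions made for this example; production tokenisers are far more elaborate.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def segment(word, merges):
    """Tokenise a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy, English-only training data (an assumption for illustration).
TRAINING_WORDS = ["the"] * 5 + ["then"] * 3 + ["there"] * 2
MERGES = learn_merges(TRAINING_WORDS, num_merges=3)

print(segment("the", MERGES))   # a word the vocabulary was built around
print(segment("вода", MERGES))  # Russian for "water", unseen during training
```

Under these assumptions, the frequent English word collapses into a single token, while the Russian word, whose characters never appeared in the training list, stays split into one token per character.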
Researchers studying multilingual tokenisation found that identical content translated into different languages can require dramatically different token counts, with some comparisons showing differences of more than an order of magnitude. The disparity appears before any reasoning or generation occurs; it is built into the text representation itself. [arXiv: Language Model Tokenizers Introduce Unfairness, A Petrov, 2023]
This means that two users asking the same question in different languages may not be consuming the same amount of the model’s available resources. One user’s request may occupy a small portion of the context window, while another’s may consume a much larger share despite conveying the same information. [ACL Anthology: Do All Languages Cost the Same? Tokenization in the Era…, O Ahia]
Why scripts and word structures get unequal token budgets
The unequal treatment of languages comes from several overlapping factors.
Writing systems matter. Many tokenisers were originally optimised using data dominated by English and other widely represented languages. Languages using Latin scripts often receive more efficient token allocations than languages written in other scripts. Studies examining hundreds of languages have found that non-Latin scripts frequently experience substantially higher token inflation. [arXiv: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency, October 14, 2025]
Word formation matters. Some languages pack extensive grammatical information into a single word. Languages with rich morphology can create long word forms that are uncommon enough to be broken into many token fragments. Each fragment consumes part of the model’s context budget. [arXiv: The Token Tax: Systematic Bias in Multilingual Tokenization, JM Lundin, 2025]
Vocabulary allocation matters. Tokenisers have limited vocabulary space. Languages that appear more often in training corpora tend to receive more dedicated token entries. Lower-resource languages may be forced to share vocabulary capacity, resulting in less efficient segmentation. Research on multilingual tokenizer design shows that training data composition strongly influences which languages receive efficient representations. [OpenReview: How Multilingual Dataset Composition Affects Tokenizer…, A Selvamurugan]
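One way to see why script alone can matter: many tokenisers start from UTF-8 bytes and only shorten text where learned merges exist. The sketch below counts bytes as a worst-case stand-in for token count, assuming a byte-level tokeniser that has learned no merges at all for the script in question. The sample words and the `byte_tokens` helper are illustrative choices, not taken from the cited studies.

```python
def byte_tokens(text: str) -> int:
    """Worst-case token count for a byte-level tokeniser with no merges."""
    return len(text.encode("utf-8"))


# The same word ("water") in three scripts; samples chosen for illustration.
SAMPLES = {
    "English (Latin)": "water",
    "Russian (Cyrillic)": "вода",
    "Hindi (Devanagari)": "पानी",
}

for label, word in SAMPLES.items():
    # Characters are Unicode code points; bytes are what a byte-level
    # tokeniser would see before any merges are applied.
    print(f"{label}: {len(word)} characters, {byte_tokens(word)} bytes")
```

Latin letters take one byte each in UTF-8, Cyrillic letters two, and Devanagari characters three, so a script with few dedicated vocabulary entries begins at a disadvantage before word structure is even considered. Real tokenisers narrow this gap wherever they have learned merges for the script.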
A useful concept in this area is token fertility, which measures how many tokens are needed to represent a unit of text. Higher fertility generally means greater fragmentation. Researchers increasingly use fertility as a way to quantify tokenisation inequality across languages. [arXiv: Analyzing STRR as a Metric for Multilingual Tokenization, MT Nayeem, 2025]
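A minimal sketch of a fertility calculation follows, assuming the "unit of text" is a whitespace-separated word (a common choice, though it does not suit languages written without spaces). The `toy_tokenize` function, which cuts each word into chunks of up to three characters, is a hypothetical stand-in for a real tokeniser.

```python
from typing import Callable, Iterable, List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokeniser: split each word into chunks of up to 3 characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    texts = list(texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


print(fertility(["the cat sat"], toy_tokenize))           # short words
print(fertility(["internationalisation"], toy_tokenize))  # one long word
```

With this toy tokeniser, three short words yield one token each, while a single 20-character word is cut into seven pieces. Swapping in a real tokeniser's encode function for `toy_tokenize` would give the fertility figures that studies compare across languages.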
How fragmentation affects context, cost, and performance
Less room for information
Large language models have fixed context windows measured in tokens. If a language requires more tokens to express the same ideas, users effectively receive less working memory from the system.
Imagine two users providing documents of similar meaning and length. If one language consumes twice as many tokens, that user may reach the context limit sooner. The chatbot then has less room available for instructions, examples, conversation history, or supporting evidence. [Interactive Optimization and Learning (pokutta.com): The Hidden Cost of Tokenization, May 14, 2026]
This can influence answer quality in subtle ways. The model may have to truncate information earlier or compress more aggressively, increasing the chance of omissions and misunderstandings. [Interactive Optimization and Learning (pokutta.com): The Hidden Cost of Tokenization, May 14, 2026]
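The two-user scenario can be put into numbers with a short sketch. The window size, document size, and inflation factor below are hypothetical values chosen only to make the arithmetic visible; `remaining_context` is an illustrative helper, not part of any real API.

```python
def remaining_context(window_tokens: int, baseline_doc_tokens: int, inflation: float) -> int:
    """Tokens left for instructions and history after the document is loaded.

    `inflation` is how many times more tokens the document needs than in
    the baseline language (1.0 means no extra tokens).
    """
    used = round(baseline_doc_tokens * inflation)
    return max(window_tokens - used, 0)


WINDOW = 8_000    # hypothetical context window
DOCUMENT = 3_000  # hypothetical document size in the baseline language

baseline_left = remaining_context(WINDOW, DOCUMENT, inflation=1.0)
inflated_left = remaining_context(WINDOW, DOCUMENT, inflation=2.0)

print(f"baseline language: {baseline_left} tokens left")
print(f"2x token inflation: {inflated_left} tokens left")
```

With these assumed figures, doubling the token count of the document leaves 2,000 tokens of working room instead of 5,000, so the second user has well under half the space for everything else.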
Higher usage costs
Many commercial AI services charge according to token counts. When the same meaning requires more tokens in one language than another, users can end up paying more for equivalent interactions. Researchers analysing multilingual API usage have described this as a fairness problem because pricing appears language-neutral while actual token consumption is not. [ACL Anthology: Do All Languages Cost the Same? Tokenization in the Era…, O Ahia]
Several studies refer to this phenomenon as a “token tax”: speakers of certain languages consume more computational resources and therefore incur greater costs for comparable tasks. [arXiv: The Token Tax: Systematic Bias in Multilingual Tokenization, JM Lundin, 2025]
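The cost effect is simple proportional arithmetic, sketched below. The price and token counts are hypothetical, not taken from any provider's price list, and `request_cost_usd` is an illustrative helper.

```python
def request_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a request under flat per-token pricing."""
    return tokens / 1_000_000 * usd_per_million_tokens


PRICE = 2.00           # hypothetical price in USD per million tokens
BASELINE_TOKENS = 500  # hypothetical request size in the baseline language
PREMIUM = 3            # hypothetical: the same request needs 3x the tokens

baseline_cost = request_cost_usd(BASELINE_TOKENS, PRICE)
taxed_cost = request_cost_usd(BASELINE_TOKENS * PREMIUM, PRICE)

print(f"baseline: ${baseline_cost:.4f}  with 3x tokens: ${taxed_cost:.4f}")
```

Because the price per token is the same for everyone, the cost ratio between the two users equals the token ratio: the pricing looks neutral, yet the bill scales with how heavily the tokeniser fragments the language.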
Lower task accuracy
The consequences extend beyond efficiency. Research evaluating large language models across African languages found that higher token fertility consistently predicted lower performance on knowledge and reasoning benchmarks. Languages requiring more fragmented representations tended to achieve lower accuracy scores. [ACL Anthology: The Token Tax: Systematic Bias in Multilingual Tokenization, JM Lundin, 2026]
The likely explanation is that fragmentation makes learning harder. When a concept is repeatedly broken into varying token combinations, the model receives a less stable representation of that concept during training. Over time, this can reduce the quality of the statistical patterns the model learns. [arXiv: The Token Tax: Systematic Bias in Multilingual Tokenization, JM Lundin, 2025]
Why this is a fairness issue, not just an engineering detail
Tokenisation bias is sometimes presented as a technical optimisation problem, but its effects are social as well as computational.
When speakers of some languages pay more, receive less context, and experience lower accuracy from the same system, access to AI becomes uneven. The disparity is especially significant for languages that already face disadvantages in digital resources and machine-learning datasets. [arXiv: Tokenization Disparities as Infrastructure Bias, 14 Oct 2025; ACL Anthology]
Researchers have therefore begun describing tokenisation disparities as a form of infrastructure bias. The concern is that inequality is embedded into the foundational representation layer of AI systems, influencing downstream performance before model reasoning even begins. [arXiv: Tokenization Disparities as Infrastructure Bias, 14 Oct 2025]
The problem does not mean multilingual chatbots are failing across all non-English languages. Many modern systems have improved substantially, and recent multilingual models demonstrate strong performance in a growing number of underrepresented languages. However, improvements in model capability do not automatically eliminate tokenisation disparities. A model can become more multilingual while still allocating context and computational resources unevenly across languages. [TechRadar]
What fairer tokenisation would need to improve
Researchers and developers are exploring several approaches to reduce multilingual tokenisation bias.
Better vocabulary allocation. Instead of allowing dominant languages to occupy most of the token vocabulary, tokenisers can be designed to distribute representation capacity more evenly across languages and scripts. [OpenReview: How Multilingual Dataset Composition Affects Tokenizer…, A Selvamurugan]
Language-aware tokenisation. Incorporating linguistic structure can help preserve meaningful units rather than splitting words into arbitrary fragments. This is particularly valuable for morphologically rich languages. [Hugging Face: Tokenization is Killing our Multilingual LLM Dream, Mar 15, 2026]
New fairness metrics. Researchers increasingly argue that simple token counts are not enough. New measures examine token premiums, relative tokenisation costs, vocabulary allocation, and cross-language parity to identify hidden inequities. [arXiv: Explaining and Mitigating Crosslingual Tokenizer Inequities, October 24, 2025]
Balanced multilingual training data. Studies indicate that the composition of multilingual corpora strongly affects tokeniser behaviour. More balanced language representation during tokeniser construction can improve efficiency and fairness. [OpenReview: How Multilingual Dataset Composition Affects Tokenizer…, A Selvamurugan]
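To give a feel for what a parity-style metric looks like, the sketch below computes a token premium as the ratio of token counts between a target language and a reference language over parallel sentences. This is one plausible formulation, assumed for illustration; the cited papers may define their metrics differently. The `byte_tokenize` stand-in and the one-pair sample are also assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokeniser: one token per UTF-8 byte (no learned merges)."""
    return list(text.encode("utf-8"))


def token_premium(
    parallel_pairs: Sequence[Tuple[str, str]],
    tokenize: Callable[[str], list],
) -> float:
    """Target-language tokens divided by reference-language tokens.

    Each pair is (reference_text, target_text) with the same meaning.
    A value of 1.0 means parity; higher means the target language pays more.
    """
    ref_total = sum(len(tokenize(ref)) for ref, _ in parallel_pairs)
    tgt_total = sum(len(tokenize(tgt)) for _, tgt in parallel_pairs)
    if ref_total == 0:
        raise ValueError("reference side has no tokens")
    return tgt_total / ref_total


# One illustrative pair: "water" in English and Russian.
PAIRS = [("water", "вода")]
print(token_premium(PAIRS, byte_tokenize))
```

A fairer tokeniser, in this framing, is one that pulls the premium toward 1.0 for as many languages as possible when measured on the same parallel content.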
The broader lesson is that multilingual chatbot quality depends not only on model size or training data volume but also on how language is broken into tokens. Tokenisation determines how much of a user’s language fits into the model’s memory, how much interaction costs, and how effectively the model learns linguistic patterns. As AI systems become global infrastructure, those seemingly small design choices increasingly shape who benefits equally from them. [arXiv: Language Model Tokenizers Introduce Unfairness, A Petrov, 2023; ACL Anthology]
Further Reading
Books and field guides related to this article's topic. Use these as the next step if you want deeper reading beyond the article.
Natural Language Processing with Transformers
Discusses tokenization and multilingual transformer workflows.
Build a Large Language Model (From Scratch)
Provides foundations for understanding multilingual token handling.
Speech and Language Processing: Pearson New International Edi...
Covers multilingual NLP and language representation issues.
Endnotes
-
Source: arxiv.org
Title: Language Model Tokenizers Introduce Unfairness
Link: https://arxiv.org/abs/2305.15425
Source snippet: by A Petrov · 2023 · Cited by 283. In this paper, we show how disp...
Published: May 17, 2023
-
Source: arxiv.org
Link: https://arxiv.org/html/2510.12389v1
Source snippet: Tokenization Disparities as Infrastructure Bias, 14 Oct 2025. This study conducts a large-scale cross-linguistic evaluation of tokeni...
-
Source: openreview.net
Link: https://openreview.net/forum?id=P2k908rWSP
Source snippet: How Multilingual Dataset Composition Affects Tokenizer... by A Selvamurugan · Cited by 1. TL;DR: Balanced multilingual dataset...
-
Source: openreview.net
Title: Do All Languages Cost the Same?
Link: https://openreview.net/forum?id=OUmxBN45Gl
Source snippet: Tokenization in the Era... by O Ahia · Cited by 191. We conduct a systematic analysis of the cost and utility of OpenAI's language model...
-
Source: arxiv.org
Link: https://arxiv.org/abs/2510.12389
Source snippet: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency...
Published: October 14, 2025
-
Source: arxiv.org
Title: The Token Tax: Systematic Bias in Multilingual Tokenization
Link: https://arxiv.org/abs/2509.05486
Source snippet: by JM Lundin · 2025 · Cited by 6. Abstract: Tokenizat...
Published: September 5, 2025
-
Source: openreview.net
Link: https://openreview.net/forum?id=UJiP3m3EXs
Source snippet: Rethinking Multilingual Tokenizer Design, by A Thakur · Cited by 1. This paper presents a systematic study of multilingual tokenizer desig...
-
Source: arxiv.org
Title: Analyzing STRR as a Metric for Multilingual Tokenization
Link: https://arxiv.org/abs/2510.09947
Source snippet: by MT Nayeem · 2025 · Cited by 3. We analyze six wid...
Published: October 11, 2025
-
Source: techradar.com
Link: https://www.techradar.com/pro/a-transformative-moment-research-shows-ai-could-become-the-king-of-babel-as-llms-master-rare-obscure-languages
Source snippet: Google’s Gemini Pro model, for instance, scored over 4.5 out of 5 in Kinyarwanda, a language spoken by around 12 million people in East A...
-
Source: arxiv.org
Title: Explaining and Mitigating Crosslingual Tokenizer Inequities
Link: https://arxiv.org/abs/2510.21909
Source snippet: Explaining and Mitigating Crosslingual Tokenizer Inequities...
Published: October 24, 2025
-
Source: arxiv.org
Link: https://arxiv.org/html/2510.09947v1
Source snippet: Analyzing STRR as a Metric for Multilingual Tokenization... Oct 11, 2025. Tokenization is a foundational step in large language models (...
-
Source: aclanthology.org
Title: Do All Languages Cost the Same?
Link: https://aclanthology.org/anthology-files/anthology-files/pdf/emnlp/2023.emnlp-main.614.pdf
Source snippet: Tokenization in the Era... by O Ahia · Cited by 201. Many commercial LMs are multilingual, and text from languages that suffer from exce...
-
Source: pokutta.com
Link: https://www.pokutta.com/blog/hidden-cost-tokenization/
Source snippet: Interactive Optimization and Learning: The Hidden Cost of Tokenization. The basic point is simple: tokenization is not a neutr...
Published: May 14, 2026
-
Source: aclanthology.org
Link: https://aclanthology.org/2026.africanlp-main.10/
Source snippet: The Token Tax: Systematic Bias in Multilingual Tokenization, by JM Lundin · 2026 · Cited by 6. We evaluate 10 Large Language...
-
Source: huggingface.co
Link: https://huggingface.co/blog/omarkamali/tokenization
Source snippet: Tokenization is Killing our Multilingual LLM Dream, Mar 15, 2026. The bad tokenizer destroys both plural and case information...
-
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Language
Source snippet: Language is a structured system of communication that consists of grammar and vocabulary. It is the primary means by which hum...
-
Source: aclanthology.org
Title: 2025.unlp 1.1
Link: https://aclanthology.org/2025.unlp-1.1.pdf
Source snippet: From English-Centric to Effective Bilingual: LLMs with... by A Kiulian · 2025 · Cited by 11. In this paper, we propose a model-agnostic...
Additional References
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/396459552_Beyond_Fertility_Analyzing_STRR_as_a_Metric_for_Multilingual_Tokenization_Evaluation
Source snippet: (PDF) Beyond Fertility: Analyzing STRR as a Metric for... Oct 28, 2025. We analyze six widely used tokenizers across seven languages and...
-
Source: linkedin.com
Link: https://www.linkedin.com/pulse/llm-tokenization-explained-your-guide-how-large-language-models-du7ff
Source snippet: Your Guide to How Large Language Models Understand Text. Tokenization isn't just about slicing text; it also affects how much content can b...
-
Source: axios.com
Link: https://www.axios.com/2024/02/13/open-source-ai-languages
Source snippet: Aya was developed by pre-training a base model with diverse language data and then fine-tuning it for the same languages. This initiative...
-
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/Language-Model-Tokenizers-Introduce-Unfairness-Petrov-Malfa/879a7f5abdb7ab803d48172d4f0830965f989d46
Source snippet: [PDF] Language Model Tokenizers Introduce Unfairness... It is shown how disparity in the treatment of different languages arises at the t...
-
Source: medium.com
Link: https://medium.com/%40geosar/the-importance-of-tokenizers-for-multilingual-llms-a-case-study-on-greek-af5301b0bacf
Source snippet: The Importance of Tokenizers for Multilingual LLMs. In this blog post, we explore the impact of tokenizers on the cost and downstream appli...
-
Source: medium.com
Link: https://medium.com/%40adnanmasood/history-and-state-of-llms-for-low-resource-languages-lrls-987986a3f2f5
Source snippet: Ali et al. (2023) conducted a comprehensive study and found...
-
Source: reddit.com
Link: https://www.reddit.com/r/MachineLearning/comments/1n0r8b7/i_built_a_tool_to_benchmark_tokenizers_across_100/
Source snippet: te way more vocabulary to English patterns. These compound...
-
Source: gigaspaces.com
Link: https://www.gigaspaces.com/question/are-there-specific-tokenization-strategies-for-multilingual-llm
Source snippet: text into a form that the model can understand & process efficiently...
-
Source: aipmguru.substack.com
Title: the invisible upgrade how tokenization
Link: https://aipmguru.substack.com/p/the-invisible-upgrade-how-tokenization
Source snippet: Tokenization Quietly Got Better (And Why Your AI Costs... What this means for you as a PM: multilingual products are suddenly much cheape...
-
Source: emergentmind.com
Title: tokenisation bias in language models
Link: https://www.emergentmind.com/topics/tokenisation-bias-in-language-models
Source snippet: 10 Feb 2026. Tokenisation bias is a systematic distortion in LLMs arising from tokenization choices that fragment and misrepresent langu...