Do tokens make some languages harder for AI?

Introduction

Tokenisation does not affect all languages equally. Because chatbots measure input, memory, and cost in tokens rather than words, people using different languages can receive different levels of efficiency from the same AI system. A sentence that fits comfortably into a chatbot’s context window in one language may consume far more tokens in another. This difference can increase costs, reduce the amount of information the model can remember, and sometimes lower answer quality. Research over the past several years has increasingly identified this phenomenon as a form of tokenisation bias: a structural inequality created before the model even begins generating a response. [arXiv: Petrov et al., 2023, “Language Model Tokenizers Introduce Unfairness”]

Within the broader question of how tokenisation shapes chatbot answers, multilingual tokenisation bias is important because it affects billions of users who interact with AI in languages other than English. The issue is not simply translation quality. It concerns how efficiently different languages are represented inside the model itself. [arXiv: “Tokenization Disparities as Infrastructure Bias”, 2025]

Do tokens make some languages harder for AI?

The short answer is yes. Modern language models typically use subword tokenisers that build vocabularies from frequently occurring character patterns. Languages that resemble the data used to construct those vocabularies often compress efficiently into relatively few tokens. Languages with different writing systems, longer word structures, or lower representation in training data may be fragmented into many more pieces. [arXiv: Petrov et al., 2023, “Language Model Tokenizers Introduce Unfairness”; ACL Anthology]

Researchers studying multilingual tokenisation found that identical content translated into different languages can require dramatically different token counts, with some comparisons showing differences of more than an order of magnitude. The disparity appears before any reasoning or generation occurs; it is built into the text representation itself. [arXiv: Petrov et al., 2023, “Language Model Tokenizers Introduce Unfairness”]

This means that two users asking the same question in different languages may not be consuming the same amount of the model’s available resources. One user’s request may occupy a small portion of the context window, while another’s may consume a much larger share despite conveying the same information. [ACL Anthology: Ahia et al., 2023, “Do All Languages Cost the Same?”]

Why scripts and word structures get unequal token budgets

The unequal treatment of languages comes from several overlapping factors.

Writing systems matter. Many tokenisers were originally optimised using data dominated by English and other widely represented languages. Languages using Latin scripts often receive more efficient token allocations than languages written in other scripts. Studies examining hundreds of languages have found that non-Latin scripts frequently experience substantially higher token inflation. [arXiv: “Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency”, 2025]
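One mechanism that can contribute to script-level differences is easy to see in isolation. The sketch below assumes a byte-level tokeniser, which is a common design but not something the cited studies are tied to: text is first turned into UTF-8 bytes, and without any learned merges each byte would become one token. The sample words are illustrative greetings chosen for this example, not data from the studies.

```python
# Sketch: UTF-8 bytes per character for short samples in different scripts.
# A byte-level tokeniser with no learned merges would emit one token per byte,
# so these figures are a worst-case bound on fragmentation, not real token counts.

SAMPLES = {
    "English (Latin)": "hello",
    "Russian (Cyrillic)": "привет",
    "Hindi (Devanagari)": "नमस्ते",
}


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)


for name, text in SAMPLES.items():
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {len(text)} code points, {n_bytes} bytes, "
          f"{bytes_per_char(text):.1f} bytes/char")
```

Real tokenisers recover much of this gap through learned merges, but only for byte sequences that were frequent in the data used to build the vocabulary.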

Word formation matters. Some languages pack extensive grammatical information into a single word. Languages with rich morphology can create long word forms that are uncommon enough to be broken into many token fragments. Each fragment consumes part of the model’s context budget. [arXiv: Lundin et al., 2025, “The Token Tax: Systematic Bias in Multilingual Tokenization”]

Vocabulary allocation matters. Tokenisers have limited vocabulary space. Languages that appear more often in training corpora tend to receive more dedicated token entries. Lower-resource languages may be forced to share vocabulary capacity, resulting in less efficient segmentation. Research on multilingual tokeniser design shows that training data composition strongly influences which languages receive efficient representations. [OpenReview: Selvamurugan, “How Multilingual Dataset Composition Affects Tokenizer…”]
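The effect of corpus composition on vocabulary allocation can be reproduced with a toy byte-pair-encoding (BPE) trainer. This is a minimal sketch, not the training procedure of any production tokeniser: the corpus, the merge budget of 15, and the tie-breaking rule are all assumptions chosen to keep the example small and deterministic. Three English words appear 50 times each, while two Swahili words appear once.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy BPE)."""
    vocab = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical skewed corpus: English dominates, Swahili is barely present.
corpus = ["language", "model", "token"] * 50 + ["lugha", "mfano"]
merges = train_bpe(corpus, num_merges=15)

for word in ["language", "model", "token", "lugha", "mfano"]:
    print(word, tokenize(word, merges))
```

With this budget every merge is spent on the frequent words, so each of them ends up as a single token while the rare words stay close to character level. Rebalancing the corpus or raising the merge budget changes that outcome, which is the point the paragraph above makes about data composition.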

A useful concept in this area is token fertility, which measures how many tokens are needed to represent a unit of text, most commonly the average number of tokens per word. Higher fertility generally means greater fragmentation. Researchers increasingly use fertility as a way to quantify tokenisation inequality across languages. [arXiv: Nayeem, 2025, “Analyzing STRR as a Metric for Multilingual Tokenization…”]
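Fertility is simple to compute once a tokeniser is available. The sketch below assumes the tokens-per-word definition and splits words on whitespace, which does not suit languages written without spaces. The `chunk_tokenizer` function is a hypothetical stand-in that cuts words into pieces of at most four characters; in practice a real tokeniser would be passed in instead.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """Hypothetical tokeniser: fixed-size character chunks (illustration only)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


print(fertility("the cat sat", chunk_tokenizer))                   # 1.0
print(fertility("internationalisation matters", chunk_tokenizer))  # 3.5
```

Short words fit in one chunk each, while the 20-character word needs five and the 7-character word needs two, giving 7 tokens over 2 words.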

How fragmentation affects context, cost, and performance

Less room for information

Large language models have fixed context windows measured in tokens. If a language requires more tokens to express the same ideas, users effectively receive less working memory from the system.

Imagine two users providing documents of similar meaning and length. If one language consumes twice as many tokens, that user may reach the context limit sooner. The chatbot then has less room available for instructions, examples, conversation history, or supporting evidence. [Interactive Optimization and Learning (pokutta.com): “The Hidden Cost of Tokenization”, 2026]
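The arithmetic behind this scenario can be written down directly. In the sketch below the window size, the reserved budget, and both fertility values are hypothetical numbers picked so that one language needs exactly twice as many tokens per word as the other.

```python
def words_that_fit(context_tokens: int, tokens_per_word: float,
                   reserved_tokens: int = 0) -> int:
    """Approximate number of words that fit after reserving part of the window."""
    usable = context_tokens - reserved_tokens
    if usable < 0:
        raise ValueError("reserved_tokens exceeds the context window")
    if tokens_per_word <= 0:
        raise ValueError("tokens_per_word must be positive")
    return int(usable // tokens_per_word)


# Hypothetical setup: 8,000-token window, 1,000 tokens kept for instructions.
WINDOW, RESERVED = 8_000, 1_000

print(words_that_fit(WINDOW, 1.25, RESERVED))  # 5600 words
print(words_that_fit(WINDOW, 2.5, RESERVED))   # 2800 words
```

Doubling the tokens needed per word halves the amount of text that fits, even though the window size advertised to both users is identical.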

This can influence answer quality in subtle ways. Earlier material may have to be truncated or compressed more aggressively to stay within the limit, increasing the chance of omissions and misunderstandings. [Interactive Optimization and Learning (pokutta.com): “The Hidden Cost of Tokenization”, 2026]

Higher usage costs

Many commercial AI services charge according to token counts. When the same meaning requires more tokens in one language than another, users can end up paying more for equivalent interactions. Researchers analysing multilingual API usage have described this as a fairness problem because pricing appears language-neutral while actual token consumption is not. [ACL Anthology: Ahia et al., 2023, “Do All Languages Cost the Same?”]

Several studies refer to this phenomenon as a “token tax”: speakers of certain languages consume more computational resources and therefore incur greater costs for comparable tasks. [arXiv: Lundin et al., 2025, “The Token Tax: Systematic Bias in Multilingual Tokenization”]
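A rough way to put numbers on the token tax is to price the same request in two languages. The sketch below assumes flat per-token pricing; the price of 2.00 per million tokens and the two token counts are hypothetical values, and `token_premium` is simply the ratio of token counts for the same content.

```python
def request_cost(n_tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a request under flat per-token pricing."""
    return n_tokens * price_per_million_tokens / 1_000_000


def token_premium(tokens_lang: int, tokens_reference: int) -> float:
    """How many times more tokens a language needs than the reference language."""
    if tokens_reference <= 0:
        raise ValueError("tokens_reference must be positive")
    return tokens_lang / tokens_reference


# Hypothetical: the same message takes 1,000 tokens in one language
# and 2,500 tokens in another, at a price of 2.00 per million tokens.
PRICE = 2.00
print(request_cost(1_000, PRICE))   # cost in the reference language
print(request_cost(2_500, PRICE))   # cost in the more fragmented language
print(token_premium(2_500, 1_000))  # 2.5
```

Because the price per token is the same for everyone, the cost ratio between the two users equals the token premium.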

Lower task accuracy

The consequences extend beyond efficiency. Research evaluating large language models across African languages found that higher token fertility consistently predicted lower performance on knowledge and reasoning benchmarks. Languages requiring more fragmented representations tended to achieve lower accuracy scores. [ACL Anthology: Lundin et al., 2026, “The Token Tax: Systematic Bias in Multilingual Tokenization”]

A likely explanation is that fragmentation makes learning harder. When a concept is repeatedly broken into varying token combinations, the model receives a less stable representation of that concept during training. Over time, this can reduce the quality of the statistical patterns the model learns. [arXiv: Lundin et al., 2025, “The Token Tax: Systematic Bias in Multilingual Tokenization”]

Why this is a fairness issue, not just an engineering detail

Tokenisation bias is sometimes presented as a technical optimisation problem, but its effects are social as well as computational.

When speakers of some languages pay more, receive less context, and experience lower accuracy from the same system, access to AI becomes uneven. The disparity is especially significant for languages that already face disadvantages in digital resources and machine-learning datasets. [arXiv: “Tokenization Disparities as Infrastructure Bias”, 2025; ACL Anthology]

Researchers have therefore begun describing tokenisation disparities as a form of infrastructure bias. The concern is that inequality is embedded into the foundational representation layer of AI systems, influencing downstream performance before model reasoning even begins. [arXiv: “Tokenization Disparities as Infrastructure Bias”, 2025]

The problem does not mean multilingual chatbots are failing across all non-English languages. Many modern systems have improved substantially, and recent multilingual models demonstrate strong performance in a growing number of underrepresented languages. However, improvements in model capability do not automatically eliminate tokenisation disparities. A model can become more multilingual while still allocating context and computational resources unevenly across languages. [TechRadar]

What fairer tokenisation would need to improve

Researchers and developers are exploring several approaches to reduce multilingual tokenisation bias.

Better vocabulary allocation. Instead of allowing dominant languages to occupy most of the token vocabulary, tokenisers can be designed to distribute representation capacity more evenly across languages and scripts. [OpenReview: Selvamurugan, “How Multilingual Dataset Composition Affects Tokenizer…”]

Language-aware tokenisation. Incorporating linguistic structure can help preserve meaningful units rather than splitting words into arbitrary fragments. This is particularly valuable for morphologically rich languages. [Hugging Face: “Tokenization is Killing our Multilingual LLM Dream”, 2026]

New fairness metrics. Researchers increasingly argue that simple token counts are not enough. New measures examine token premiums, relative tokenisation costs, vocabulary allocation, and cross-language parity to identify hidden inequities. [arXiv: “Explaining and Mitigating Crosslingual Tokenizer Inequities”, 2025; arXiv]

Balanced multilingual training data. Studies indicate that the composition of multilingual corpora strongly affects tokeniser behaviour. More balanced language representation during tokeniser construction can improve efficiency and fairness. [OpenReview: Selvamurugan, “How Multilingual Dataset Composition Affects Tokenizer…”]
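One widely used heuristic for rebalancing a multilingual corpus is exponent-smoothed sampling, where each language is drawn with probability proportional to its corpus size raised to a power between 0 and 1. The sketch below illustrates that heuristic only; it is not claimed to be the method of the work cited above, and the corpus sizes and `alpha` values are hypothetical.

```python
from typing import Dict


def smoothed_sampling_probs(sizes: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Sampling probability per language, proportional to size ** alpha.

    alpha = 1.0 keeps the raw proportions; alpha = 0.0 samples uniformly.
    """
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical corpus sizes (number of sentences per language).
sizes = {"en": 1_000_000, "sw": 10_000}

for alpha in (1.0, 0.5, 0.0):
    probs = smoothed_sampling_probs(sizes, alpha)
    print(alpha, {lang: round(p, 4) for lang, p in probs.items()})
```

Lowering `alpha` shifts sampling weight toward the smaller language, so a tokeniser trained on the resampled data sees its character patterns more often. How far to push the exponent is a design trade-off, since heavy upsampling repeats the same small corpus many times.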

The broader lesson is that multilingual chatbot quality depends not only on model size or training data volume but also on how language is broken into tokens. Tokenisation determines how much of a user’s language fits into the model’s memory, how much interaction costs, and how effectively the model learns linguistic patterns. As AI systems become global infrastructure, those seemingly small design choices increasingly shape who benefits equally from them. [arXiv: Petrov et al., 2023, “Language Model Tokenizers Introduce Unfairness”; ACL Anthology]

Endnotes

Source: arxiv.org
Title: Language Model Tokenizers Introduce Unfairness
First author: A Petrov
Published: May 17, 2023
Link: https://arxiv.org/abs/2305.15425

Source: arxiv.org
Title: Tokenization Disparities as Infrastructure Bias
Published: October 14, 2025
Link: https://arxiv.org/html/2510.12389v1

Source: openreview.net
Title: How Multilingual Dataset Composition Affects Tokenizer...
First author: A Selvamurugan
Link: https://openreview.net/forum?id=P2k908rWSP

Source: openreview.net
Title: Do All Languages Cost the Same? Tokenization in the Era...
First author: O Ahia
Link: https://openreview.net/forum?id=OUmxBN45Gl

Source: arxiv.org
Title: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Published: October 14, 2025
Link: https://arxiv.org/abs/2510.12389

Source: arxiv.org
Title: The Token Tax: Systematic Bias in Multilingual Tokenization
First author: JM Lundin
Published: September 5, 2025
Link: https://arxiv.org/abs/2509.05486

Source: openreview.net
Title: Rethinking Multilingual Tokenizer Design
First author: A Thakur
Link: https://openreview.net/forum?id=UJiP3m3EXs

Source: arxiv.org
Title: Analyzing STRR as a Metric for Multilingual Tokenization...
First author: MT Nayeem
Published: October 11, 2025
Link: https://arxiv.org/abs/2510.09947

Source: techradar.com
Source snippet: Google’s Gemini Pro model, for instance, scored over 4.5 out of 5 in Kinyarwanda, a language spoken by around 12 million people in East A...
Link: https://www.techradar.com/pro/a-transformative-moment-research-shows-ai-could-become-the-king-of-babel-as-llms-master-rare-obscure-languages

Source: arxiv.org
Title: Explaining and Mitigating Crosslingual Tokenizer Inequities
Published: October 24, 2025
Link: https://arxiv.org/abs/2510.21909

Source: arxiv.org
Title: Analyzing STRR as a Metric for Multilingual Tokenization...
Published: October 11, 2025
Link: https://arxiv.org/html/2510.09947v1

Source: aclanthology.org
Title: Do All Languages Cost the Same? Tokenization in the Era...
First author: O Ahia
Link: https://aclanthology.org/anthology-files/anthology-files/pdf/emnlp/2023.emnlp-main.614.pdf

Source: pokutta.com
Title: The Hidden Cost of Tokenization (Interactive Optimization and Learning)
Published: May 14, 2026
Link: https://www.pokutta.com/blog/hidden-cost-tokenization/

Source: aclanthology.org
Title: The Token Tax: Systematic Bias in Multilingual Tokenization
First author: JM Lundin
Published: 2026
Link: https://aclanthology.org/2026.africanlp-main.10/

Source: huggingface.co
Title: Tokenization is Killing our Multilingual LLM Dream
Published: March 15, 2026
Link: https://huggingface.co/blog/omarkamali/tokenization

Source: Wikipedia
Title: Language
Link: https://en.wikipedia.org/wiki/Language

Source: aclanthology.org
Title: From English-Centric to Effective Bilingual: LLMs with...
First author: A Kiulian
Published: 2025
Link: https://aclanthology.org/2025.unlp-1.1.pdf
