Do tokens make some languages harder for AI?

Languages and writing systems can be tokenised unevenly, which may change cost, context use, and answer quality across users.


Introduction

Tokenisation does not affect all languages equally. Because chatbots measure input, context, and cost in tokens rather than words, people using different languages can get different efficiency from the same AI system. A sentence that fits comfortably into a chatbot’s context window in one language may consume far more tokens in another. This difference can increase costs, reduce the amount of information the model can hold in context, and sometimes lower answer quality. Research over the past several years has increasingly identified this phenomenon as a form of tokenisation bias: a structural inequality created before the model even begins generating a response. [arXiv: A Petrov, "Language Model Tokenizers Introduce Unfairness", May 17, 2023]

Within the broader question of how tokenisation shapes chatbot answers, multilingual tokenisation bias is important because it affects billions of users who interact with AI in languages other than English. The issue is not simply translation quality. It concerns how efficiently different languages are represented inside the model itself. [arXiv: "Tokenization Disparities as Infrastructure Bias", October 14, 2025]

Do tokens make some languages harder for AI?

The short answer is yes. Modern language models typically use subword tokenisers that build vocabularies from frequently occurring character patterns. Languages that resemble the data used to construct those vocabularies often compress efficiently into relatively few tokens. Languages with different writing systems, longer word structures, or lower representation in training data may be fragmented into many more pieces. [arXiv: A Petrov, "Language Model Tokenizers Introduce Unfairness", May 17, 2023; ACL Anthology]
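As a rough illustration of how a subword vocabulary ends up favouring whatever dominates its training data, the sketch below trains a toy tokeniser in the style of byte-pair encoding (BPE), one common subword algorithm. The corpus, the word choices, and the function names `merge_pair`, `train_bpe`, and `encode` are assumptions made for this example. Production tokenisers add pre-tokenisation, byte fallback, and far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical corpus: Latin-script words dominate, one Cyrillic word is rare.
corpus = ["token"] * 40 + ["tokens"] * 20 + ["spoken"] * 20 + ["токен"] * 2
merges = train_bpe(corpus, num_merges=6)

for word in ("token", "токен"):
    pieces = encode(word, merges)
    print(word, len(pieces), pieces)
```

With a small merge budget, every merge goes to the frequent Latin-script patterns, so the rare Cyrillic word stays split into more pieces than its Latin-script counterpart. Real vocabularies are far larger, but the same pressure applies when one language dominates the tokeniser's training data.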

Researchers studying multilingual tokenisation found that identical content translated into different languages can require dramatically different token counts, with some comparisons showing differences of more than an order of magnitude. The disparity appears before any reasoning or generation occurs; it is built into the text representation itself. [arXiv: A Petrov, "Language Model Tokenizers Introduce Unfairness", May 17, 2023]

This means that two users asking the same question in different languages may not be consuming the same amount of the model’s available resources. One user’s request may occupy a small portion of the context window, while another’s may consume a much larger share despite conveying the same information. [ACL Anthology: O Ahia, "Do All Languages Cost the Same? Tokenization in the Era…"]
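One way to make this comparison concrete is to divide each language's token count for the same content by the count for a reference language. The sketch below does that. The function name `token_premiums`, the language labels, and the token counts are hypothetical values chosen for illustration, not measurements from the cited studies.

```python
def token_premiums(token_counts, reference):
    """Return each language's token count as a multiple of the reference language's count."""
    ref = token_counts[reference]
    if ref <= 0:
        raise ValueError("reference token count must be positive")
    return {lang: n / ref for lang, n in token_counts.items()}


# Hypothetical token counts for one sentence translated into three languages.
counts = {"lang_a": 12, "lang_b": 30, "lang_c": 96}

premiums = token_premiums(counts, reference="lang_a")
for lang, ratio in sorted(premiums.items(), key=lambda item: item[1]):
    print(f"{lang}: {ratio:.1f}x the reference token count")
```

A ratio of 1.0 means parity with the reference language. Anything above 1.0 is extra context and cost spent on the same meaning.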

Why scripts and word structures get unequal token budgets

The unequal treatment of languages comes from several overlapping factors.

Writing systems matter. Many tokenisers were originally optimised using data dominated by English and other widely represented languages. Languages using Latin scripts often receive more efficient token allocations than languages written in other scripts. Studies examining hundreds of languages have found that non-Latin scripts frequently experience substantially higher token inflation. [arXiv: "Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency", October 14, 2025]

Word formation matters. Some languages pack extensive grammatical information into a single word. Languages with rich morphology can create long word forms that are uncommon enough to be broken into many token fragments. Each fragment consumes part of the model’s context budget. [arXiv: JM Lundin, "The Token Tax: Systematic Bias in Multilingual Tokenization", September 5, 2025]

Vocabulary allocation matters. Tokenisers have limited vocabulary space. Languages that appear more often in training corpora tend to receive more dedicated token entries. Lower-resource languages may be forced to share vocabulary capacity, resulting in less efficient segmentation. Research on multilingual tokeniser design shows that training data composition strongly influences which languages receive efficient representations. [OpenReview: A Selvamurugan, "How Multilingual Dataset Composition Affects Tokenizer…"]

A useful concept in this area is token fertility, which measures how many tokens are needed to represent a unit of text, typically the average number of tokens per word. Higher fertility generally means greater fragmentation. Researchers increasingly use fertility as a way to quantify tokenisation inequality across languages. [arXiv: MT Nayeem, "Analyzing STRR as a Metric for Multilingual Tokenization…", October 11, 2025]
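Fertility can be sketched in a few lines. The example below assumes the tokens-per-word definition and, so that it runs without any external library, uses a stand-in tokeniser that emits one token per UTF-8 byte with no learned merges. That stand-in is an assumption for illustration: real tokenisers merge frequent byte sequences, so their figures will be lower. Whitespace splitting is also a simplification that does not suit languages written without spaces.

```python
def byte_level_tokens(word):
    """Stand-in tokeniser (assumption): one token per UTF-8 byte, no learned merges."""
    return list(word.encode("utf-8"))


def fertility(text, tokenise=byte_level_tokens):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    total_tokens = sum(len(tokenise(word)) for word in words)
    return total_tokens / len(words)


# The same greeting in three scripts; the sample sentences are illustrative only.
samples = {
    "Latin": "hello world",
    "Cyrillic": "привет мир",
    "Devanagari": "नमस्ते दुनिया",
}

for script, text in samples.items():
    print(f"{script}: {fertility(text):.1f} tokens per word")
```

To measure a real system, pass its own encoding function as `tokenise`. The byte-level baseline mainly shows why scripts that need two or three bytes per character start at a disadvantage whenever the vocabulary offers them few merges.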

How fragmentation affects context, cost, and performance

Less room for information

Large language models have fixed context windows measured in tokens. If a language requires more tokens to express the same ideas, users effectively receive less working memory from the system.

Imagine two users providing documents of similar meaning and length. If one language consumes twice as many tokens, that user may reach the context limit sooner. The chatbot then has less room available for instructions, examples, conversation history, or supporting evidence. [Interactive Optimization and Learning (pokutta.com): "The Hidden Cost of Tokenization", May 14, 2026]

This can influence answer quality in subtle ways. The model may have to truncate information earlier or compress more aggressively, increasing the chance of omissions and misunderstandings. [Interactive Optimization and Learning (pokutta.com): "The Hidden Cost of Tokenization", May 14, 2026]
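The effect on working memory is simple arithmetic. The sketch below estimates how many words of user text fit once some tokens are set aside for instructions and the reply. The window size, the reserved amount, and the tokens-per-word figures are hypothetical, and `words_that_fit` is a name invented for this example.

```python
def words_that_fit(context_tokens, reserved_tokens, tokens_per_word):
    """Estimate how many words fit in the context window after reserving some tokens."""
    usable = context_tokens - reserved_tokens
    if usable < 0:
        raise ValueError("reserved tokens exceed the context window")
    if tokens_per_word <= 0:
        raise ValueError("tokens_per_word must be positive")
    return int(usable // tokens_per_word)


# Hypothetical figures: a 7,000-token window with 1,000 tokens reserved.
for label, tokens_per_word in (("compact language", 1.5), ("fragmented language", 3.0)):
    capacity = words_that_fit(7000, 1000, tokens_per_word)
    print(f"{label}: about {capacity} words")
```

Doubling the tokens needed per word halves the text that fits, so the second user reaches truncation with half as much material.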

Higher usage costs

Many commercial AI services charge according to token counts. When the same meaning requires more tokens in one language than another, users can end up paying more for equivalent interactions. Researchers analysing multilingual API usage have described this as a fairness problem because pricing appears language-neutral while actual token consumption is not. [ACL Anthology: O Ahia, "Do All Languages Cost the Same? Tokenization in the Era…"]

Several studies refer to this phenomenon as a “token tax”: speakers of certain languages consume more computational resources and therefore incur greater costs for comparable tasks. [arXiv: JM Lundin, "The Token Tax: Systematic Bias in Multilingual Tokenization", September 5, 2025]
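Under per-token pricing, the "token tax" falls straight out of the token counts. The sketch below assumes a flat price per million tokens. The price, the request sizes, and the function name `request_cost` are hypothetical and do not reflect any particular provider.

```python
def request_cost(num_tokens, price_per_million_tokens):
    """Cost of one request under flat per-token pricing."""
    if num_tokens < 0 or price_per_million_tokens < 0:
        raise ValueError("token count and price must be non-negative")
    return num_tokens * price_per_million_tokens / 1_000_000


PRICE = 2.0  # hypothetical price per million tokens, in any currency

# The same request, needing a different number of tokens in two languages.
cost_compact = request_cost(1_200, PRICE)
cost_fragmented = request_cost(3_000, PRICE)

print(f"compact language:    {cost_compact:.4f}")
print(f"fragmented language: {cost_fragmented:.4f}")
print(f"extra paid for the same content: {cost_fragmented - cost_compact:.4f}")
```

The price list is identical for both users. The gap comes entirely from how many tokens the tokeniser produces for each language.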

Lower task accuracy

The consequences extend beyond efficiency. Research evaluating large language models across African languages found that higher token fertility consistently predicted lower performance on knowledge and reasoning benchmarks. Languages requiring more fragmented representations tended to achieve lower accuracy scores. [ACL Anthology: JM Lundin, "The Token Tax: Systematic Bias in Multilingual Tokenization", 2026]

The likely explanation is that fragmentation makes learning harder. When a concept is repeatedly broken into varying token combinations, the model receives a less stable representation of that concept during training. Over time, this can reduce the quality of the statistical patterns the model learns. [arXiv: JM Lundin, "The Token Tax: Systematic Bias in Multilingual Tokenization", September 5, 2025]

Why this is a fairness issue, not just an engineering detail

Tokenisation bias is sometimes presented as a technical optimisation problem, but its effects are social as well as computational.

When speakers of some languages pay more, receive less context, and experience lower accuracy from the same system, access to AI becomes uneven. The disparity is especially significant for languages that already face disadvantages in digital resources and machine-learning datasets. [arXiv: "Tokenization Disparities as Infrastructure Bias", October 14, 2025; ACL Anthology]

Researchers have therefore begun describing tokenisation disparities as a form of infrastructure bias. The concern is that inequality is embedded into the foundational representation layer of AI systems, influencing downstream performance before model reasoning even begins. [arXiv: "Tokenization Disparities as Infrastructure Bias", October 14, 2025]

This does not mean multilingual chatbots are failing across all non-English languages. Many modern systems have improved substantially, and recent multilingual models demonstrate strong performance in a growing number of underrepresented languages. However, improvements in model capability do not automatically eliminate tokenisation disparities. A model can become more multilingual while still allocating context and computational resources unevenly across languages. [TechRadar]

What fairer tokenisation would need to improve

Researchers and developers are exploring several approaches to reduce multilingual tokenisation bias.

Better vocabulary allocation. Instead of allowing dominant languages to occupy most of the token vocabulary, tokenisers can be designed to distribute representation capacity more evenly across languages and scripts. [OpenReview: A Selvamurugan, "How Multilingual Dataset Composition Affects Tokenizer…"]

Language-aware tokenisation. Incorporating linguistic structure can help preserve meaningful units rather than splitting words into arbitrary fragments. This is particularly valuable for morphologically rich languages. [Hugging Face: "Tokenization is Killing our Multilingual LLM Dream", March 15, 2026]

New fairness metrics. Researchers increasingly argue that simple token counts are not enough. New measures examine token premiums, relative tokenisation costs, vocabulary allocation, and cross-language parity to identify hidden inequities. [arXiv: "Explaining and Mitigating Crosslingual Tokenizer Inequities", October 24, 2025]

Balanced multilingual training data. Studies indicate that the composition of multilingual corpora strongly affects tokeniser behaviour. More balanced language representation during tokeniser construction can improve efficiency and fairness. [OpenReview: A Selvamurugan, "How Multilingual Dataset Composition Affects Tokenizer…"]
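One common way to rebalance a tokeniser's training mix is to sample each language with probability proportional to its raw share raised to a smoothing exponent. The sketch below shows that idea only. It is not necessarily the method used in the studies cited above, and the corpus sizes, the exponent values, and the name `smoothed_sampling_probs` are assumptions for illustration.

```python
def smoothed_sampling_probs(line_counts, alpha):
    """Sampling probability per language, proportional to (share of corpus) ** alpha.

    alpha = 1.0 keeps the raw proportions; alpha = 0.0 makes all languages equally likely.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if any(n <= 0 for n in line_counts.values()):
        raise ValueError("every language needs a positive line count")
    total = sum(line_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in line_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical corpus sizes (lines of text) for three languages.
corpus_lines = {"high_resource": 900_000, "mid_resource": 90_000, "low_resource": 10_000}

for alpha in (1.0, 0.5, 0.0):
    probs = smoothed_sampling_probs(corpus_lines, alpha)
    shares = ", ".join(f"{lang}={p:.3f}" for lang, p in probs.items())
    print(f"alpha={alpha}: {shares}")
```

Lower exponents give small languages a larger share of the data the tokeniser sees, which lets more of their frequent patterns earn vocabulary entries. The trade-off is some loss of compression for the dominant languages, so the exponent is a design choice rather than a fixed rule.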

The broader lesson is that multilingual chatbot quality depends not only on model size or training data volume but also on how language is broken into tokens. Tokenisation determines how much of a user’s language fits into the model’s context, how much an interaction costs, and how effectively the model learns linguistic patterns. As AI systems become global infrastructure, those seemingly small design choices increasingly shape who benefits from them and how equally. [arXiv: A Petrov, "Language Model Tokenizers Introduce Unfairness", May 17, 2023; ACL Anthology]

Endnotes

  1. Source: arxiv.org
    Title: Language Model Tokenizers Introduce Unfairness
    Author: A Petrov
    Link: https://arxiv.org/abs/2305.15425
    Published: May 17, 2023

  2. Source: arxiv.org
    Title: Tokenization Disparities as Infrastructure Bias
    Link: https://arxiv.org/html/2510.12389v1
    Published: October 14, 2025

  3. Source: openreview.net
    Title: How Multilingual Dataset Composition Affects Tokenizer...
    Author: A Selvamurugan
    Link: https://openreview.net/forum?id=P2k908rWSP

  4. Source: openreview.net
    Title: Do All Languages Cost the Same? Tokenization in the Era...
    Author: O Ahia
    Link: https://openreview.net/forum?id=OUmxBN45Gl

  5. Source: arxiv.org
    Title: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
    Link: https://arxiv.org/abs/2510.12389
    Published: October 14, 2025

  6. Source: arxiv.org
    Title: The Token Tax: Systematic Bias in Multilingual Tokenization
    Author: JM Lundin
    Link: https://arxiv.org/abs/2509.05486
    Published: September 5, 2025

  7. Source: openreview.net
    Title: Rethinking Multilingual Tokenizer Design
    Author: A Thakur
    Link: https://openreview.net/forum?id=UJiP3m3EXs

  8. Source: arxiv.org
    Title: Analyzing STRR as a Metric for Multilingual Tokenization...
    Author: MT Nayeem
    Link: https://arxiv.org/abs/2510.09947
    Published: October 11, 2025

  9. Source: techradar.com
    Link: https://www.techradar.com/pro/a-transformative-moment-research-shows-ai-could-become-the-king-of-babel-as-llms-master-rare-obscure-languages

  10. Source: arxiv.org
    Title: Explaining and Mitigating Crosslingual Tokenizer Inequities
    Link: https://arxiv.org/abs/2510.21909
    Published: October 24, 2025

  11. Source: arxiv.org
    Title: Analyzing STRR as a Metric for Multilingual Tokenization...
    Link: https://arxiv.org/html/2510.09947v1
    Published: October 11, 2025

  12. Source: aclanthology.org
    Title: Do All Languages Cost the Same? Tokenization in the Era...
    Author: O Ahia
    Link: https://aclanthology.org/anthology-files/anthology-files/pdf/emnlp/2023.emnlp-main.614.pdf

  13. Source: pokutta.com
    Title: The Hidden Cost of Tokenization (Interactive Optimization and Learning)
    Link: https://www.pokutta.com/blog/hidden-cost-tokenization/
    Published: May 14, 2026

  14. Source: aclanthology.org
    Title: The Token Tax: Systematic Bias in Multilingual Tokenization
    Author: JM Lundin
    Link: https://aclanthology.org/2026.africanlp-main.10/
    Published: 2026

  15. Source: huggingface.co
    Title: Tokenization is Killing our Multilingual LLM Dream
    Link: https://huggingface.co/blog/omarkamali/tokenization
    Published: March 15, 2026

  16. Source: Wikipedia
    Title: Language
    Link: https://en.wikipedia.org/wiki/Language

  17. Source: aclanthology.org
    Title: From English-Centric to Effective Bilingual: LLMs with...
    Author: A Kiulian
    Link: https://aclanthology.org/2025.unlp-1.1.pdf
    Published: 2025
