Can cleaning data make AI less fair?

Dataset cleaning can remove spam and low-quality text, but it can also erase dialects, identity terms, and marginalised voices.

Introduction

Large language models learn from patterns in the text they are given. Because the modern web contains spam, duplicated pages, machine-generated content, pornography, malware, and other low-value material, developers typically filter large web datasets before training. That cleaning process is often presented as a technical necessity, but it is also a powerful editorial decision. The filter determines which voices remain visible, which language varieties are treated as acceptable, and which topics appear frequently enough for a model to learn them. In practice, a language model learns not only from the web but also from the rules used to clean the web. Research on major training datasets has shown that some filtering methods remove disproportionate amounts of text associated with minority groups, dialects, and identity-related discussions, raising concerns about fairness and representation. [arXiv; Dr Alan D. Thompson – LifeArchitect.ai]

Why web-scale datasets need filtering

Web-scale datasets are assembled from enormous internet archives such as Common Crawl, which contain billions of pages. Without filtering, training data would include large amounts of duplicated text, search-engine spam, corrupted pages, automatically generated content, and material that contributes little to language understanding. Cleaning therefore serves legitimate goals: improving data quality, reducing noise, and making training more efficient. [Sites@Rutgers]

Typical filters remove:

  • Duplicate or near-duplicate pages.
  • Extremely short or malformed documents.
  • Non-target languages.
  • Boilerplate website text.
  • Content containing words from profanity or block lists.
  • Pages judged to be low quality by automated heuristics. [Sites@Rutgers]
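
As a rough illustration of how several of these rules combine, the Python sketch below applies three of them in sequence: exact-duplicate removal, a minimum length, and a word block list. Everything in it is an assumption made for the example: the `clean` function, the `MIN_WORDS` threshold, the one-word `BLOCKLIST`, and the sample documents are hypothetical and are not taken from C4 or any real pipeline.

```python
import re

# Hypothetical, illustrative block list; real lists are much longer.
BLOCKLIST = {"sex"}
MIN_WORDS = 5  # assumed threshold for "extremely short" documents


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def clean(documents):
    """Return (kept, removed), where removed is a list of (doc, reason)."""
    kept, removed, seen = [], [], set()
    for doc in documents:
        words = tokenize(doc)
        key = " ".join(words)  # crude exact-duplicate key after normalisation
        if key in seen:
            removed.append((doc, "duplicate"))
        elif len(words) < MIN_WORDS:
            removed.append((doc, "too_short"))
        elif BLOCKLIST & set(words):
            removed.append((doc, "blocklist"))
        else:
            seen.add(key)
            kept.append(doc)
    return kept, removed


# Invented sample documents, used only to exercise each rule once.
docs = [
    "Buy cheap watches now",
    "The council approved the new library budget on Tuesday.",
    "The council approved the new library budget on Tuesday.",
    "Our clinic offers sex education resources for teenagers and parents.",
]
kept, removed = clean(docs)
for doc, reason in removed:
    print(reason, "|", doc)
```

In this toy run the last document is dropped by the block list even though it is an ordinary health-education sentence, because the rule looks only at vocabulary and not at how a word is used.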

The challenge is that language is social as well as technical. A filter cannot easily distinguish between genuinely harmful content and legitimate discussion that uses the same vocabulary. As a result, cleaning rules may remove valuable cultural and linguistic information alongside the material they were designed to exclude. [arXiv]

What C4 audits revealed about removed text

One of the most influential examples comes from the Colossal Clean Crawled Corpus (C4), a large dataset created from Common Crawl and used in prominent language-model research. Researchers who audited C4 examined not only what remained in the dataset but also what had been removed during cleaning. Their findings showed that block-list filtering disproportionately excluded documents associated with minority groups. [arXiv]

The audit found several notable patterns:

  • Documents written in dialects associated with African American and Hispanic communities were removed at higher rates than text associated with White American English. [Dr Alan D. Thompson – LifeArchitect.ai]
  • Text discussing gender identity, sexual orientation, race, and religion was frequently filtered because identity-related terms overlapped with words appearing on offensive-language block lists. [Maarten Sap]
  • Many excluded documents were not abusive or hateful; they simply contained vocabulary that triggered automated filters. [arXiv]

Researchers concluded that the filtering process systematically changed the composition of the dataset. Rather than merely removing noise, it altered which communities and forms of expression were represented in the training corpus. [arXiv]
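
One way to picture this kind of audit is to compare removal rates across groups of documents. The sketch below is a minimal Python illustration, assuming each document already carries a group label and a flag saying whether the filter removed it. The labels `variety_A` and `variety_B`, the counts, and the `removal_rates` helper are invented for the example and do not reproduce the method or the figures of the C4 audit.

```python
from collections import Counter

# Hypothetical audit records: (group_label, was_removed_by_filter).
# The labels and counts are invented for illustration, not taken from any audit.
records = (
    [("variety_A", True)] * 30 + [("variety_A", False)] * 70
    + [("variety_B", True)] * 10 + [("variety_B", False)] * 190
)


def removal_rates(records):
    """Return {group: fraction of that group's documents that were removed}."""
    total, removed = Counter(), Counter()
    for group, was_removed in records:
        total[group] += 1
        if was_removed:
            removed[group] += 1
    return {group: removed[group] / total[group] for group in total}


rates = removal_rates(records)
ratio = rates["variety_A"] / rates["variety_B"]
print(rates)
print(f"variety_A is removed {ratio:.1f}x as often as variety_B")
```

A ratio well above 1 would suggest that the filter is not affecting the two groups evenly, which is the sort of disparity the auditors reported.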

These findings became especially important because C4 was widely reused in language-model development. A filtering choice made once during dataset construction could therefore influence many downstream systems. [ACL Anthology]

How cleaning choices become model assumptions

A language model learns statistical regularities from whatever survives the filtering stage. If a particular dialect appears less often in the training data, the model receives fewer opportunities to learn its vocabulary, grammar, and cultural references. If discussions of certain identities are systematically removed, the model may learn weaker or distorted associations about those groups. [Dr Alan D. Thompson – LifeArchitect.ai]
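
The arithmetic behind this effect is simple. The Python sketch below uses invented document counts and invented removal rates to show how uneven filtering shrinks one variety's share of the corpus. The names `majority_variety` and `minority_variety` and all numbers are assumptions chosen only to make the calculation visible.

```python
def share(counts):
    """Convert raw document counts into fractions of the whole corpus."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Invented counts and removal rates, chosen only to make the arithmetic visible.
before = {"majority_variety": 9000, "minority_variety": 1000}
removal_rate = {"majority_variety": 0.05, "minority_variety": 0.40}

# Documents that survive filtering, per variety.
after = {name: round(n * (1 - removal_rate[name])) for name, n in before.items()}

print("share before:", share(before)["minority_variety"])
print("share after: ", share(after)["minority_variety"])
```

With these made-up numbers, the minority variety falls from 10% of the corpus to roughly 6.6%, even though most of the removed documents overall come from the majority variety.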

This process creates an important feedback effect. Developers may believe they are removing undesirable content, but they are also shaping the model’s picture of normal language use. The model can come to treat the remaining text as the default version of reality because alternative forms of expression were filtered out before training began. [arXiv]

Researchers have argued that large web datasets already tend to overrepresent dominant social and cultural perspectives. Heavy filtering can amplify this tendency if it disproportionately removes language from marginalised communities while preserving mainstream sources. [Dr Alan D. Thompson – LifeArchitect.ai]

The result is not usually an obvious failure. Models often remain fluent and capable. Instead, the effects may appear in subtler ways:

  • Reduced performance on underrepresented dialects.
  • Less accurate responses about minority communities.
  • Greater reliance on majority-language norms.
  • Lower visibility of alternative cultural viewpoints.
  • Difficulty distinguishing offensive uses of identity terms from self-description or community discussion. [Dr Alan D. Thompson – LifeArchitect.ai; arXiv]

The central trade-off

The evidence does not suggest that web filtering should be abandoned. Unfiltered web data contains significant amounts of spam, misinformation, duplicated content, and harmful material that can degrade model quality. The key lesson is that filtering is not a neutral housekeeping step. It is part of the training signal itself. [Sites@Rutgers]

Understanding artificial intelligence therefore requires looking beyond model architecture and examining dataset construction. Every decision about what to keep, remove, or down-rank changes what the model encounters during learning. When a language model appears biased, insensitive to certain communities, or unusually confident about some viewpoints, part of the explanation may lie not in the model’s design but in the web-filtering choices that shaped its training data. [arXiv; Knowing Machines]

Endnotes

  1. Source: arxiv.org
    Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (J. Dodge et al., 2021)
    Link: https://arxiv.org/abs/2104.08758

  2. Source: s10251.pcdn.co (Dr Alan D. Thompson – LifeArchitect.ai)
    Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, arXiv:2104.08758v1 [cs.CL], 18 Apr 2021
    Link: https://s10251.pcdn.co/pdf/2021-dodge-c4.pdf

  3. Source: sites.rutgers.edu (Sites@Rutgers)
    Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (J. Dodge et al.)
    Link: https://sites.rutgers.edu/critical-ai/wp-content/uploads/sites/586/2021/09/dodge2021documentingC4.pdf
    Published: September 13, 2021

  4. Source: s10251.pcdn.co (Dr Alan D. Thompson – LifeArchitect.ai)
    Title: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? (E. M. Bender et al.)
    Link: https://s10251.pcdn.co/pdf/2021-bender-parrots.pdf

  5. Source: knowingmachines.org
    Title: The case of 'Colossal Cleaned Common Crawl' (C4)
    Link: https://knowingmachines.org/publications/9-ways-to-see/essays/c4

  6. Source: aclanthology.org (ACL Anthology)
    Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (J. Dodge et al., 2021), 2021.emnlp-main.98
    Link: https://aclanthology.org/2021.emnlp-main.98/

  7. Source: maartensap.com (Maarten Sap)
    Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (J. Dodge et al.)
    Link: https://maartensap.com/pdfs/dodge2021documentingC4.pdf

  8. Source: github.com
    Link: https://github.com/allenai/allennlp/discussions/5265
    Source snippet: We now have almost 27TB of clean-ish data, in 101 different languages (plus the "undetected"...
