Within Web Filters

The dataset audit that found missing voices

The C4 dataset audit showed that removed web text was not random, with minority dialects and identity-related pages filtered out more often.

On this page

  • What researchers checked in the C4 removals
  • Which communities were disproportionately affected
  • Why one dataset's choices spread downstream
Preview for The dataset audit that found missing voices

Introduction

The audit of the Colossal Clean Crawled Corpus (C4) changed how many researchers think about training data. C4 was not just another web dataset: it became a foundation for influential language-model projects, including Google’s T5 family. When researchers later examined what had been removed during C4’s cleaning process, they found that the missing material was not distributed evenly across the web. Text associated with certain communities, dialects, and identity discussions disappeared at much higher rates than others. The finding mattered because it suggested that dataset filtering was not simply removing noise. It was also shaping whose language and experiences remained available for models to learn. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

C4 audit illustration 1

What researchers checked in the C4 removals

The most influential investigation came from the 2021 study Documenting Large Webtext Corpora, which compared multiple versions of C4. Rather than examining only the final dataset, the researchers analysed both the retained documents and the material excluded during cleaning. This allowed them to ask a straightforward question: what kinds of text were disproportionately removed? [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

A central focus was C4’s blocklist filter. The dataset creators had used a list of words intended to exclude pornography, obscenity, hateful content, and other undesirable material. Any document containing a blocked term could be removed. On paper, this looked like a practical quality-control step. In practice, the audit showed that many legitimate discussions also contained words that appeared on the blocklist. As a result, entire pages could be excluded even when their purpose was educational, descriptive, political, religious, or community-oriented rather than abusive. [Dr Alan D. Thompson – LifeArchitect.ai]s10251.pcdn.coDr Alan DThompson – LifeArchitect.aiarXiv:2104.08758v1 [cs.CL] 18 Apr 2021April 20, 2021 — 18 Apr 2021 — One of the main components of the C4 pipe…Published: April 20, 2021

The researchers therefore treated the removed documents as evidence in their own right. Instead of assuming the filter had successfully identified low-quality content, they investigated whether the exclusions followed identifiable social patterns. The answer was yes. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

Which communities were disproportionately affected

One of the audit’s most striking findings involved dialect representation. Using established methods for identifying dialectal patterns in text, the researchers estimated how often documents associated with different varieties of American English were filtered out.

The results showed a dramatic imbalance. Documents classified as African American English were removed at a rate of roughly 42%, while documents associated with Hispanic-aligned English were removed at about 32%. By comparison, documents associated with White American English were removed at only about 6.2%. [Maarten Sap]maartensap.comMaarten Sap A Case Study on the Colossal Clean Crawled CorpusMaarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec…Published: September 30, 2021

The final dataset reflected those differences. After filtering, the overwhelming majority of dialect-labelled documents in C4 were classified as White American English, while African American English and Hispanic-aligned English appeared only in very small proportions. The audit therefore suggested that the cleaning process did not merely reduce volume; it altered the linguistic composition of the corpus. [Maarten Sap]maartensap.comMaarten Sap A Case Study on the Colossal Clean Crawled CorpusMaarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec…Published: September 30, 2021

Identity-related content was also affected. The researchers found that pages discussing sexual orientation, gender identity, race, ethnicity, and religion were more likely to be removed because many community-specific terms overlapped with words appearing on the blocklist. Educational or supportive discussions could therefore be filtered alongside genuinely offensive material. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

A frequently cited example concerns references to sexual orientation. Analyses associated with the C4 audit found that mentions of terms such as “gay” and “lesbian” were disproportionately filtered, and that a substantial share of the excluded documents containing those words were not offensive at all. [Stanford CS324]stanford-cs324.github.ion, gay) more likely to be filtered out; of those…Read more…

The significance of these findings was not that the dataset intentionally targeted particular groups. Rather, automated rules designed for one purpose ended up removing legitimate language associated with particular communities at much higher rates than mainstream language varieties. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

C4 audit illustration 2

Why one dataset’s choices spread downstream

Many datasets have biases, but C4 attracted special attention because of its influence. It was not an obscure research resource. It became one of the most widely used web-scale corpora for language-model pretraining, helping shape research directions across academia and industry. [ACL Anthology]aclanthology.org2021.emnlp main.98ACL AnthologyA Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — In this work we provide some of the firs…

That meant the effects of filtering could propagate beyond a single project. If a major corpus under-represented certain dialects or identity-related discussions, models trained on that corpus would encounter fewer examples of those forms of language. Researchers worried that this could affect everything from language understanding to the quality of generated responses when discussing under-represented communities. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

The audit also became an important case study in dataset governance. For years, large web corpora were often described mainly through their size. The C4 investigation demonstrated that documentation of filtering decisions could be just as important as the number of tokens collected. A dataset containing billions of words can still encode strong assumptions if its cleaning rules systematically remove particular kinds of text. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

What the audit changed in AI data discussions

The lasting contribution of the C4 audit was not simply identifying one problematic filter. It showed that “cleaning” is never a purely technical operation. Every filtering rule embodies assumptions about what counts as acceptable language, useful content, or high-quality text.

After the audit, researchers increasingly began examining not only what datasets contain but also what they exclude. Questions about representation, dialect diversity, identity-related language, and documentation became more prominent in discussions of training data. The C4 case provided concrete evidence that missing voices can emerge from ordinary preprocessing choices rather than explicit decisions to exclude particular groups. [arXiv+2ACL Anthology]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

Within the broader story of how web filtering changes what language models learn, the C4 audit remains a landmark example. It revealed that the absence of certain voices in training data is often not random. Instead, it can be a direct consequence of the rules used to decide which parts of the web are worth keeping. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…Published: April 18, 2021

C4 audit illustration 3

Amazon book picks

Further Reading

Books and field guides related to The dataset audit that found missing voices. Use these as the next step if you want deeper reading beyond the article.

eBay marketplace picks

Marketplace Samples

Example marketplace items related to this page. Use the search link to explore similar finds on eBay.

Using USA

Endnotes

  1. Source: arxiv.org
    Link: https://arxiv.org/abs/2104.08758
    Source snippet

    Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021...

    Published: April 18, 2021

  2. Source: s10251.pcdn.co
    Title: Dr Alan D
    Link: https://s10251.pcdn.co/pdf/2021-dodge-c4.pdf
    Source snippet

    Thompson – LifeArchitect.aiarXiv:2104.08758v1 [cs.CL] 18 Apr 2021April 20, 2021 — 18 Apr 2021 — One of the main components of the C4 pipe...

    Published: April 20, 2021

  3. Source: arxiv.org
    Link: https://arxiv.org/html/2309.04027v2
    Source snippet

    Textual Identity Detection for Evaluating and Augmenting...12 Jan 2024 — In this paper, we present a dataset coupled with an approach to...

  4. Source: aclanthology.org
    Title: 2021.emnlp main.98
    Link: https://aclanthology.org/2021.emnlp-main.98/
    Source snippet

    ACL AnthologyA Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — In this work we provide some of the firs...

  5. Source: maartensap.com
    Title: Maarten Sap A Case Study on the Colossal Clean Crawled Corpus
    Link: https://maartensap.com/pdfs/dodge2021documentingC4.pdf
    Source snippet

    Maarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec...

    Published: September 30, 2021

  6. Source: stanford-cs324.github.io
    Link: https://stanford-cs324.github.io/winter2022/lectures/data/
    Source snippet

    n, gay) more likely to be filtered out; of those...Read more...

  7. Source: aclanthology.org
    Title: 2021.emnlp main.98
    Link: https://aclanthology.org/2021.emnlp-main.98.pdf
    Source snippet

    A Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — NOCLEAN), which is the snapshot of Common Crawl ident...

  8. Source: aclanthology.org
    Link: https://aclanthology.org/2023.acl-long.507v1.pdf
    Source snippet

    WinoQueer: A Community-in-the-Loop Benchmark for Anti-...by VK Felkner · Cited by 171 — This dataset is not specifically focused on scru...

  9. Source: antmarakis.github.io
    Title: documenting large corpora
    Link: https://antmarakis.github.io/2021/documenting_large_corpora/
    Source snippet

    Documenting Large Webtext Corpora21 Oct 2021 — The Colossal Clean Crawled Corpus (C4) is a corpus curated for pretraining large language...

Additional References

  1. Source: aiaaic.org
    Link: https://www.aiaaic.org/aiaaic-repository/ai-algorithmic-and-[automation
    Source snippet

    C4 datasetStudy finds Amazon Rekognition suffers from racial and gender bias · BDD100K dataset · Deepfake CFO scams finance worker for US...

  2. Source: semanticscholar.org
    Link: https://www.semanticscholar.org/paper/Documenting-the-English-Colossal-Clean-Crawled-Dodge-Sap/40c3327a6ddb0603b6892344509c7f428ab43d81
    Source snippet

    Documenting the English Colossal Clean Crawled CorpusThis work provides the first documentation for the Colossal Clean Crawled Corpus (C4...

  3. Source: sites.rutgers.edu
    Link: https://sites.rutgers.edu/critical-ai/wp-content/uploads/sites/586/2021/09/dodge2021documentingC4.pdf

  4. Source: researchgate.net
    Link: https://www.researchgate.net/publication/372915406_WinoQueer_A_Community-in-the-Loop_Benchmark_for_Anti-LGBTQ_Bias_in_Large_Language_Models
    Source snippet

    WinoQueer: A Community-in-the-Loop Benchmark for Anti-...For LGBTQ+ specifically, recent work notes a gap in sexuality -focused represen...

  5. Source: sh-tsang.medium.com
    Link: https://sh-tsang.medium.com/review-documenting-largewebtext-corpora-a-case-study-on-the-colossal-clean-crawled-corpus-0bcc6554e4b6
    Source snippet

    Large Webtext Corpora: A Case Study on the...The English Colossal Clean Crawled Corpus (C4) is created by taking the April 2019 snapshot...

    Published: April 2019

  6. Source: eurac.edu
    Title: riding the third wave what s new in minority language media research
    Link: https://www.eurac.edu/en/blogs/midas/riding-the-third-wave-what-s-new-in-minority-language-media-research
    Source snippet

    Riding the 'Third Wave': What's New in Minority Language...7 Oct 2024 — Craig Willis is a researcher at the European Centre for Minority...

  7. Source: researchgate.net
    Title: 350991473 Documenting the English Colossal Clean Crawled Corpus
    Link: https://www.researchgate.net/publication/350991473_Documenting_the_English_Colossal_Clean_Crawled_Corpus
    Source snippet

    Documenting the English Colossal Clean Crawled Corpus18 Apr 2021 — In this work we provide the first documentation for the Colossal Clean...

  8. Source: proceedings.neurips.cc
    Title: 1c6bed78d3813886d3d72595dbecb80b Paper Datasets and [Benchmarks]({{ ‘benchmarks/’ | relative_url }})
    Link: https://proceedings.neurips.cc/paper_files/paper/2023/file/1c6bed78d3813886d3d72595dbecb80b-Paper-Datasets_and_Benchmarks.pdf
    Source snippet

    C4: An Open, Billion-scale Corpus of Images...by W Zhu · 2023 · Cited by 269 — Documenting large webtext corpora: A case study on the co...

  9. Source: deepai.org
    Title: documenting the english colossal clean crawled corpus
    Link: https://www.deepai.org/publication/documenting-the-english-colossal-clean-crawled-corpus
    Source snippet

    18 Apr 2021 — In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset...

  10. Source: direct.mit.edu
    Title: Quality at a Glance An Audit of Web Crawled
    Link: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00447/109285/Quality-at-a-Glance-An-Audit-of-Web-Crawled
    Source snippet

    MIT Press DirectQuality at a Glance: An Audit of Web-Crawled [Multilingual]({{ 'language-bias/' | relative_url }})...by J Kreutzer · 2022 · Cited by 313 — We manually audit the...

Topic Tree

Follow this branch

Parent topic

Web Filters Can cleaning data make AI less fair?

Related pages 2