Within Web Filters
The dataset audit that found missing voices
The C4 dataset audit showed that removed web text was not random, with minority dialects and identity-related pages filtered out more often.
On this page
- What researchers checked in the C4 removals
- Which communities were disproportionately affected
- Why one dataset's choices spread downstream
Page outline Jump by section
Introduction
The audit of the Colossal Clean Crawled Corpus (C4) changed how many researchers think about training data. C4 was not just another web dataset: it became a foundation for influential language-model projects, including Google’s T5 family. When researchers later examined what had been removed during C4’s cleaning process, they found that the missing material was not distributed evenly across the web. Text associated with certain communities, dialects, and identity discussions disappeared at much higher rates than others. The finding mattered because it suggested that dataset filtering was not simply removing noise. It was also shaping whose language and experiences remained available for models to learn. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
What researchers checked in the C4 removals
The most influential investigation came from the 2021 study Documenting Large Webtext Corpora, which compared multiple versions of C4. Rather than examining only the final dataset, the researchers analysed both the retained documents and the material excluded during cleaning. This allowed them to ask a straightforward question: what kinds of text were disproportionately removed? [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
A central focus was C4’s blocklist filter. The dataset creators had used a list of words intended to exclude pornography, obscenity, hateful content, and other undesirable material. Any document containing a blocked term could be removed. On paper, this looked like a practical quality-control step. In practice, the audit showed that many legitimate discussions also contained words that appeared on the blocklist. As a result, entire pages could be excluded even when their purpose was educational, descriptive, political, religious, or community-oriented rather than abusive. [Dr Alan D. Thompson – LifeArchitect.ai]s10251.pcdn.coDr Alan DThompson – LifeArchitect.aiarXiv:2104.08758v1 [cs.CL] 18 Apr 2021April 20, 2021 — 18 Apr 2021 — One of the main components of the C4 pipe…
The researchers therefore treated the removed documents as evidence in their own right. Instead of assuming the filter had successfully identified low-quality content, they investigated whether the exclusions followed identifiable social patterns. The answer was yes. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
Which communities were disproportionately affected
One of the audit’s most striking findings involved dialect representation. Using established methods for identifying dialectal patterns in text, the researchers estimated how often documents associated with different varieties of American English were filtered out.
The results showed a dramatic imbalance. Documents classified as African American English were removed at a rate of roughly 42%, while documents associated with Hispanic-aligned English were removed at about 32%. By comparison, documents associated with White American English were removed at only about 6.2%. [Maarten Sap]maartensap.comMaarten Sap A Case Study on the Colossal Clean Crawled CorpusMaarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec…
The final dataset reflected those differences. After filtering, the overwhelming majority of dialect-labelled documents in C4 were classified as White American English, while African American English and Hispanic-aligned English appeared only in very small proportions. The audit therefore suggested that the cleaning process did not merely reduce volume; it altered the linguistic composition of the corpus. [Maarten Sap]maartensap.comMaarten Sap A Case Study on the Colossal Clean Crawled CorpusMaarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec…
Identity-related content was also affected. The researchers found that pages discussing sexual orientation, gender identity, race, ethnicity, and religion were more likely to be removed because many community-specific terms overlapped with words appearing on the blocklist. Educational or supportive discussions could therefore be filtered alongside genuinely offensive material. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
A frequently cited example concerns references to sexual orientation. Analyses associated with the C4 audit found that mentions of terms such as “gay” and “lesbian” were disproportionately filtered, and that a substantial share of the excluded documents containing those words were not offensive at all. [Stanford CS324]stanford-cs324.github.ion, gay) more likely to be filtered out; of those…Read more…
The significance of these findings was not that the dataset intentionally targeted particular groups. Rather, automated rules designed for one purpose ended up removing legitimate language associated with particular communities at much higher rates than mainstream language varieties. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
Why one dataset’s choices spread downstream
Many datasets have biases, but C4 attracted special attention because of its influence. It was not an obscure research resource. It became one of the most widely used web-scale corpora for language-model pretraining, helping shape research directions across academia and industry. [ACL Anthology]aclanthology.org2021.emnlp main.98ACL AnthologyA Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — In this work we provide some of the firs…
That meant the effects of filtering could propagate beyond a single project. If a major corpus under-represented certain dialects or identity-related discussions, models trained on that corpus would encounter fewer examples of those forms of language. Researchers worried that this could affect everything from language understanding to the quality of generated responses when discussing under-represented communities. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
The audit also became an important case study in dataset governance. For years, large web corpora were often described mainly through their size. The C4 investigation demonstrated that documentation of filtering decisions could be just as important as the number of tokens collected. A dataset containing billions of words can still encode strong assumptions if its cleaning rules systematically remove particular kinds of text. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
What the audit changed in AI data discussions
The lasting contribution of the C4 audit was not simply identifying one problematic filter. It showed that “cleaning” is never a purely technical operation. Every filtering rule embodies assumptions about what counts as acceptable language, useful content, or high-quality text.
After the audit, researchers increasingly began examining not only what datasets contain but also what they exclude. Questions about representation, dialect diversity, identity-related language, and documentation became more prominent in discussions of training data. The C4 case provided concrete evidence that missing voices can emerge from ordinary preprocessing choices rather than explicit decisions to exclude particular groups. [arXiv+2ACL Anthology]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
Within the broader story of how web filtering changes what language models learn, the C4 audit remains a landmark example. It revealed that the absence of certain voices in training data is often not random. Instead, it can be a direct consequence of the rules used to decide which parts of the web are worth keeping. [arXiv]arxiv.orgDocumenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021…
Amazon book picks
Further Reading
Books and field guides related to The dataset audit that found missing voices. Use these as the next step if you want deeper reading beyond the article.
The Atlas of AI
Explains how AI datasets are assembled and why those choices matter socially.
Algorithms of Oppression
Complements C4 audit concerns by showing how digital systems marginalise groups.
Weapons of Math Destruction
Frames why seemingly neutral technical choices can scale into unfair outcomes.
Endnotes
-
Source: arxiv.org
Link: https://arxiv.org/abs/2104.08758Source snippet
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled CorpusApril 18, 2021...
Published: April 18, 2021
-
Source: s10251.pcdn.co
Title: Dr Alan D
Link: https://s10251.pcdn.co/pdf/2021-dodge-c4.pdfSource snippet
Thompson – LifeArchitect.aiarXiv:2104.08758v1 [cs.CL] 18 Apr 2021April 20, 2021 — 18 Apr 2021 — One of the main components of the C4 pipe...
Published: April 20, 2021
-
Source: arxiv.org
Link: https://arxiv.org/html/2309.04027v2Source snippet
Textual Identity Detection for Evaluating and Augmenting...12 Jan 2024 — In this paper, we present a dataset coupled with an approach to...
-
Source: aclanthology.org
Title: 2021.emnlp main.98
Link: https://aclanthology.org/2021.emnlp-main.98/Source snippet
ACL AnthologyA Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — In this work we provide some of the firs...
-
Source: maartensap.com
Title: Maarten Sap A Case Study on the Colossal Clean Crawled Corpus
Link: https://maartensap.com/pdfs/dodge2021documentingC4.pdfSource snippet
Maarten SapA Case Study on the Colossal Clean Crawled CorpusSeptember 30, 2021 — by J Dodge · Cited by 875 — Using the most likely dialec...
Published: September 30, 2021
-
Source: stanford-cs324.github.io
Link: https://stanford-cs324.github.io/winter2022/lectures/data/Source snippet
n, gay) more likely to be filtered out; of those...Read more...
-
Source: aclanthology.org
Title: 2021.emnlp main.98
Link: https://aclanthology.org/2021.emnlp-main.98.pdfSource snippet
A Case Study on the Colossal Clean Crawled Corpusby J Dodge · 2021 · Cited by 876 — NOCLEAN), which is the snapshot of Common Crawl ident...
-
Source: aclanthology.org
Link: https://aclanthology.org/2023.acl-long.507v1.pdfSource snippet
WinoQueer: A Community-in-the-Loop Benchmark for Anti-...by VK Felkner · Cited by 171 — This dataset is not specifically focused on scru...
-
Source: antmarakis.github.io
Title: documenting large corpora
Link: https://antmarakis.github.io/2021/documenting_large_corpora/Source snippet
Documenting Large Webtext Corpora21 Oct 2021 — The Colossal Clean Crawled Corpus (C4) is a corpus curated for pretraining large language...
Additional References
-
Source: aiaaic.org
Link: https://www.aiaaic.org/aiaaic-repository/ai-algorithmic-and-[automationSource snippet
C4 datasetStudy finds Amazon Rekognition suffers from racial and gender bias · BDD100K dataset · Deepfake CFO scams finance worker for US...
-
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/Documenting-the-English-Colossal-Clean-Crawled-Dodge-Sap/40c3327a6ddb0603b6892344509c7f428ab43d81Source snippet
Documenting the English Colossal Clean Crawled CorpusThis work provides the first documentation for the Colossal Clean Crawled Corpus (C4...
-
Source: sites.rutgers.edu
Link: https://sites.rutgers.edu/critical-ai/wp-content/uploads/sites/586/2021/09/dodge2021documentingC4.pdf -
Source: researchgate.net
Link: https://www.researchgate.net/publication/372915406_WinoQueer_A_Community-in-the-Loop_Benchmark_for_Anti-LGBTQ_Bias_in_Large_Language_ModelsSource snippet
WinoQueer: A Community-in-the-Loop Benchmark for Anti-...For LGBTQ+ specifically, recent work notes a gap in sexuality -focused represen...
-
Source: sh-tsang.medium.com
Link: https://sh-tsang.medium.com/review-documenting-largewebtext-corpora-a-case-study-on-the-colossal-clean-crawled-corpus-0bcc6554e4b6Source snippet
Large Webtext Corpora: A Case Study on the...The English Colossal Clean Crawled Corpus (C4) is created by taking the April 2019 snapshot...
Published: April 2019
-
Source: eurac.edu
Title: riding the third wave what s new in minority language media research
Link: https://www.eurac.edu/en/blogs/midas/riding-the-third-wave-what-s-new-in-minority-language-media-researchSource snippet
Riding the 'Third Wave': What's New in Minority Language...7 Oct 2024 — Craig Willis is a researcher at the European Centre for Minority...
-
Source: researchgate.net
Title: 350991473 Documenting the English Colossal Clean Crawled Corpus
Link: https://www.researchgate.net/publication/350991473_Documenting_the_English_Colossal_Clean_Crawled_CorpusSource snippet
Documenting the English Colossal Clean Crawled Corpus18 Apr 2021 — In this work we provide the first documentation for the Colossal Clean...
-
Source: proceedings.neurips.cc
Title: 1c6bed78d3813886d3d72595dbecb80b Paper Datasets and [Benchmarks]({{ ‘benchmarks/’ | relative_url }})
Link: https://proceedings.neurips.cc/paper_files/paper/2023/file/1c6bed78d3813886d3d72595dbecb80b-Paper-Datasets_and_Benchmarks.pdfSource snippet
C4: An Open, Billion-scale Corpus of Images...by W Zhu · 2023 · Cited by 269 — Documenting large webtext corpora: A case study on the co...
-
Source: deepai.org
Title: documenting the english colossal clean crawled corpus
Link: https://www.deepai.org/publication/documenting-the-english-colossal-clean-crawled-corpusSource snippet
18 Apr 2021 — In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset...
-
Source: direct.mit.edu
Title: Quality at a Glance An Audit of Web Crawled
Link: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00447/109285/Quality-at-a-Glance-An-Audit-of-Web-CrawledSource snippet
MIT Press DirectQuality at a Glance: An Audit of Web-Crawled [Multilingual]({{ 'language-bias/' | relative_url }})...by J Kreutzer · 2022 · Cited by 313 — We manually audit the...
Topic Tree



