The dataset audit that found missing voices

Introduction

The audit of the Colossal Clean Crawled Corpus (C4) changed how many researchers think about training data. C4 was not just another web dataset: it became a foundation for influential language-model projects, including Google’s T5 family. When researchers later examined what had been removed during C4’s cleaning process, they found that the missing material was not distributed evenly across the web. Text associated with certain communities, dialects, and identity discussions disappeared at much higher rates than other text. The finding mattered because it suggested that dataset filtering was not simply removing noise. It was also shaping whose language and experiences remained available for models to learn. [arXiv]

What researchers checked in the C4 removals

The most influential investigation came from the 2021 study Documenting Large Webtext Corpora, which compared multiple versions of C4. Rather than examining only the final dataset, the researchers analysed both the retained documents and the material excluded during cleaning. This allowed them to ask a straightforward question: what kinds of text were disproportionately removed? [arXiv]

A central focus was C4’s blocklist filter. The dataset creators had used a list of words intended to exclude pornography, obscenity, hateful content, and other undesirable material. Any document containing even one blocked term was removed in full, regardless of context. On paper, this looked like a practical quality-control step. In practice, the audit showed that many legitimate discussions also contained words that appeared on the blocklist. As a result, entire pages could be excluded even when their purpose was educational, descriptive, political, religious, or community-oriented rather than abusive. [Dr Alan D. Thompson – LifeArchitect.ai]
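
The whole-document behaviour described above can be illustrated with a minimal Python sketch. The blocklist entries, the word-matching rule, and the function name passes_blocklist are placeholders invented for this illustration; they are not the actual C4 list or pipeline code.

```python
import re

# Hypothetical placeholder terms; the real C4 blocklist is not reproduced here.
BLOCKLIST = {"blockedterm", "anotherblockedterm"}


def passes_blocklist(document: str, blocklist=BLOCKLIST) -> bool:
    """Return True if the document contains no blocked term.

    Assumption: simple lowercase whole-word matching. A single hit rejects
    the entire document, with no regard for context or intent.
    """
    words = set(re.findall(r"[a-z']+", document.lower()))
    return words.isdisjoint(blocklist)


documents = [
    "A health clinic page that explains blockedterm in plain, educational language.",
    "A recipe for lentil soup.",
]

kept = [d for d in documents if passes_blocklist(d)]
removed = [d for d in documents if not passes_blocklist(d)]

print("kept:", kept)
print("removed:", removed)
```

In this toy case the educational page is dropped for containing a single listed word, which is the kind of false positive the audit examined.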

The researchers therefore treated the removed documents as evidence in their own right. Instead of assuming the filter had successfully identified low-quality content, they investigated whether the exclusions followed identifiable social patterns. The answer was yes. [arXiv]

Which communities were disproportionately affected

One of the audit’s most striking findings involved dialect representation. Using established methods for identifying dialectal patterns in text, the researchers estimated how often documents associated with different varieties of American English were filtered out.

The results showed a dramatic imbalance. Documents classified as African American English were removed at a rate of roughly 42%, while documents associated with Hispanic-aligned English were removed at about 32%. By comparison, documents associated with White American English were removed at only about 6.2%. [Maarten Sap]
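
To make the size of that gap concrete, the following Python sketch computes removal rates per dialect label and compares each with the lowest-rate group. The document counts are invented and were chosen only so that the resulting rates match the percentages quoted above; they are not the audit's actual counts, and the function name removal_rate is hypothetical.

```python
# Hypothetical counts per dialect label, picked so the rates match the
# percentages quoted in the text. These are NOT the audit's real counts.
counts = {
    "African American English": {"before": 1000, "removed": 420},
    "Hispanic-aligned English": {"before": 1000, "removed": 320},
    "White American English": {"before": 1000, "removed": 62},
}


def removal_rate(before: int, removed: int) -> float:
    """Fraction of documents in a group that the filter removed."""
    return removed / before


rates = {label: removal_rate(**c) for label, c in counts.items()}
baseline = rates["White American English"]

for label, rate in rates.items():
    print(f"{label}: {rate:.1%} removed, {rate / baseline:.1f}x the baseline rate")
```

On the quoted percentages, documents labelled African American English were removed at about 6.8 times the rate of documents labelled White American English, and Hispanic-aligned English documents at about 5.2 times that rate.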

The final dataset reflected those differences. After filtering, the overwhelming majority of dialect-labelled documents in C4 were classified as White American English, while African American English and Hispanic-aligned English appeared only in very small proportions. The audit therefore suggested that the cleaning process did not merely reduce volume; it altered the linguistic composition of the corpus. [Maarten Sap]
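
The link between unequal removal rates and a changed composition can be sketched in a few lines of Python. The removal rates are the ones quoted earlier in this section; the pre-filter document counts are invented purely for illustration and do not come from the audit.

```python
# Removal rates quoted earlier in this section.
removal_rate = {
    "African American English": 0.42,
    "Hispanic-aligned English": 0.32,
    "White American English": 0.062,
}

# Hypothetical pre-filter document counts (invented for illustration).
before = {
    "African American English": 1_000,
    "Hispanic-aligned English": 1_000,
    "White American English": 98_000,
}

# Expected number of documents surviving the filter in each group.
after = {label: n * (1 - removal_rate[label]) for label, n in before.items()}


def shares(counts):
    """Convert per-group counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


share_before = shares(before)
share_after = shares(after)

for label in before:
    print(f"{label}: {share_before[label]:.2%} -> {share_after[label]:.2%}")
```

Whatever the starting mix, groups removed at above-average rates end up with a smaller share of the filtered corpus, while the group with the lowest removal rate ends up with a larger one.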

Identity-related content was also affected. The researchers found that pages discussing sexual orientation, gender identity, race, ethnicity, and religion were more likely to be removed because many community-specific terms overlapped with words appearing on the blocklist. Educational or supportive discussions could therefore be filtered alongside genuinely offensive material. [arXiv]

A frequently cited example concerns references to sexual orientation. Analyses associated with the C4 audit found that mentions of terms such as “gay” and “lesbian” were disproportionately filtered, and that a substantial share of the excluded documents containing those words were not offensive at all. [Stanford CS324]
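
One way to put a number on “disproportionately filtered” is pointwise mutual information (PMI) between a document mentioning a term and that document being removed. The Python sketch below assumes this measure for illustration only; every count in it is invented, and the function name pmi is hypothetical.

```python
import math


def pmi(n_total: int, n_term: int, n_removed: int, n_both: int) -> float:
    """PMI in bits between 'mentions the term' and 'was removed'.

    Positive values mean the two events co-occur more often than they
    would if mentions and removals were independent.
    """
    p_term = n_term / n_total
    p_removed = n_removed / n_total
    p_both = n_both / n_total
    return math.log2(p_both / (p_term * p_removed))


# Invented counts for illustration only.
score = pmi(n_total=10_000, n_term=200, n_removed=1_000, n_both=60)
print(f"PMI = {score:.2f} bits")
```

With these invented counts, 60 removed documents mention the term where independence would predict 20, so the score is positive. A score near zero would indicate that mentioning the term made no difference to whether a document was removed.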

The significance of these findings was not that the dataset intentionally targeted particular groups. Rather, automated rules designed for one purpose ended up removing legitimate language associated with particular communities at much higher rates than mainstream language varieties. [arXiv]

Why one dataset’s choices spread downstream

Many datasets have biases, but C4 attracted special attention because of its influence. It was not an obscure research resource. It became one of the most widely used web-scale corpora for language-model pretraining, helping shape research directions across academia and industry. [ACL Anthology]

That meant the effects of filtering could propagate beyond a single project. If a major corpus under-represented certain dialects or identity-related discussions, models trained on that corpus would encounter fewer examples of those forms of language. Researchers worried that this could affect everything from language understanding to the quality of generated responses when discussing under-represented communities. [arXiv]

The audit also became an important case study in dataset governance. For years, large web corpora were often described mainly through their size. The C4 investigation demonstrated that documentation of filtering decisions could be just as important as the number of tokens collected. A dataset containing billions of words can still encode strong assumptions if its cleaning rules systematically remove particular kinds of text. [arXiv]

What the audit changed in AI data discussions

The lasting contribution of the C4 audit was not simply identifying one problematic filter. It showed that “cleaning” is never a purely technical operation. Every filtering rule embodies assumptions about what counts as acceptable language, useful content, or high-quality text.

After the audit, researchers increasingly began examining not only what datasets contain but also what they exclude. Questions about representation, dialect diversity, identity-related language, and documentation became more prominent in discussions of training data. The C4 case provided concrete evidence that missing voices can emerge from ordinary preprocessing choices rather than explicit decisions to exclude particular groups. [arXiv] [ACL Anthology]

Within the broader story of how web filtering changes what language models learn, the C4 audit remains a landmark example. It revealed that the absence of certain voices in training data is often not random. Instead, it can be a direct consequence of the rules used to decide which parts of the web are worth keeping. [arXiv]

Endnotes

Source: arxiv.org
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Link: https://arxiv.org/abs/2104.08758
Published: April 18, 2021

Source: s10251.pcdn.co (Dr Alan D. Thompson – LifeArchitect.ai)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (arXiv:2104.08758v1)
Link: https://s10251.pcdn.co/pdf/2021-dodge-c4.pdf
Published: April 20, 2021

Source: arxiv.org
Title: Textual Identity Detection for Evaluating and Augmenting...
Link: https://arxiv.org/html/2309.04027v2
Published: January 12, 2024

Source: aclanthology.org
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Link: https://aclanthology.org/2021.emnlp-main.98/

Source: maartensap.com
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Link: https://maartensap.com/pdfs/dodge2021documentingC4.pdf
Published: September 30, 2021

Source: stanford-cs324.github.io
Link: https://stanford-cs324.github.io/winter2022/lectures/data/

Source: aclanthology.org
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus (PDF)
Link: https://aclanthology.org/2021.emnlp-main.98.pdf

Source: aclanthology.org
Title: WinoQueer: A Community-in-the-Loop Benchmark for Anti-...
Link: https://aclanthology.org/2023.acl-long.507v1.pdf

Source: antmarakis.github.io
Title: Documenting Large Webtext Corpora
Link: https://antmarakis.github.io/2021/documenting_large_corpora/
Published: October 21, 2021
