Can cleaning data make AI less fair?

Introduction

Large language models learn from patterns in the text they are given. Because the modern web contains spam, duplicated pages, machine-generated content, pornography, malware, and other low-value material, developers typically filter large web datasets before training. That cleaning process is often presented as a technical necessity, but it is also a powerful editorial decision. The filter determines which voices remain visible, which language varieties are treated as acceptable, and which topics appear frequently enough for a model to learn them. In practice, a language model learns not only from the web but also from the rules used to clean it. Research on major training datasets has shown that some filtering methods remove disproportionate amounts of text associated with minority groups, dialects, and identity-related discussions, raising concerns about fairness and representation. [arXiv]


Why web-scale datasets need filtering

Web-scale datasets are assembled from enormous internet archives such as Common Crawl, which contain billions of pages. Without filtering, training data would include large amounts of duplicated text, search-engine spam, corrupted pages, automatically generated content, and material that contributes little to language understanding. Cleaning therefore serves legitimate goals: improving data quality, reducing noise, and making training more efficient. [Sites@Rutgers]

Typical filters remove:

Duplicate or near-duplicate pages.
Extremely short or malformed documents.
Non-target languages.
Boilerplate website text.
Content containing words from profanity or block lists.
Pages judged to be low quality by automated heuristics. [Sites@Rutgers]

The challenge is that language is social as well as technical. A filter that matches on vocabulary cannot easily distinguish genuinely harmful content from legitimate discussion that uses the same words. As a result, cleaning rules may remove valuable cultural and linguistic information alongside the material they were designed to exclude. [arXiv]
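To make the problem concrete, the filters listed above can be sketched as a toy pipeline. This is a minimal sketch under stated assumptions, not the code used to build any real dataset: the block-list terms, the `MIN_WORDS` threshold, and the function names are all hypothetical. It shows that a word-level block list treats a document that discusses a listed term exactly like a document that abuses it.

```python
import hashlib
import re

# Hypothetical placeholder terms; real block lists are far longer.
BLOCKLIST = {"badword", "slur"}
# Assumed length threshold, chosen only for illustration.
MIN_WORDS = 5


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def filter_corpus(docs):
    """Return (kept, removed); removed holds (document, reason) pairs."""
    kept, removed, seen = [], [], set()
    for doc in docs:
        words = tokenize(doc)
        # Exact-duplicate detection on the normalised token sequence.
        digest = hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()
        if len(words) < MIN_WORDS:
            removed.append((doc, "too_short"))
        elif digest in seen:
            removed.append((doc, "duplicate"))
        elif BLOCKLIST.intersection(words):
            # Any single listed word removes the whole document.
            removed.append((doc, "blocklist"))
        else:
            kept.append(doc)
        seen.add(digest)
    return kept, removed


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",
    "Too short.",
    "A support group reclaimed the word badword to describe themselves.",
    "Click here badword badword badword to win now.",
]
kept, removed = filter_corpus(docs)
for doc, reason in removed:
    print(reason, "|", doc)
```

The fourth and fifth documents are removed for the same `blocklist` reason, although only one of them is spam. Returning the removed documents with a reason, rather than silently discarding them, is what makes a later audit of the filter possible.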

What C4 audits revealed about removed text

One of the most influential examples comes from the Colossal Clean Crawled Corpus (C4), a large dataset created from Common Crawl and used in prominent language-model research. Researchers who audited C4 examined not only what remained in the dataset but also what had been removed during cleaning. Their findings showed that block-list filtering disproportionately excluded documents associated with minority groups. [arXiv]

The audit found several notable patterns:

Documents written in dialects associated with African American and Hispanic communities were removed at higher rates than text associated with White American English. [Dr Alan D. Thompson – LifeArchitect.ai]
Text discussing gender identity, sexual orientation, race, and religion was frequently filtered because identity-related terms overlapped with words appearing on offensive-language block lists. [Maarten Sap]
Many excluded documents were not abusive or hateful; they simply contained vocabulary that triggered automated filters. [arXiv]

Researchers concluded that the filtering process systematically changed the composition of the dataset. Rather than merely removing noise, it altered which communities and forms of expression were represented in the training corpus. [arXiv]
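At its core, an audit of this kind compares removal rates across groups. The sketch below assumes every document has already been tagged with a language-variety or topic label, which in practice is itself an error-prone classification step. The records are invented toy data, not figures from the C4 audit, and the names `removal_rates`, `variety_a`, and `variety_b` are hypothetical.

```python
from collections import Counter


def removal_rates(records):
    """records: iterable of (group, was_removed) pairs -> {group: removal rate}."""
    totals, removed = Counter(), Counter()
    for group, was_removed in records:
        totals[group] += 1
        if was_removed:
            removed[group] += 1
    return {group: removed[group] / totals[group] for group in totals}


# Toy records invented for illustration only.
records = (
    [("variety_a", True)] * 4
    + [("variety_a", False)] * 6
    + [("variety_b", True)] * 1
    + [("variety_b", False)] * 9
)

rates = removal_rates(records)
# Ratio of removal rates: values far from 1 indicate a disparate impact.
disparity = rates["variety_a"] / rates["variety_b"]
print(rates, disparity)
```

A ratio well above 1 says that the filter removes one group's text much more often than another's. It does not say why, so a real audit would also read samples of the removed documents.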

These findings became especially important because C4 was widely reused in language-model development. A filtering choice made once during dataset construction could therefore influence many downstream systems. [ACL Anthology]

How cleaning choices become model assumptions

A language model learns statistical regularities from whatever survives the filtering stage. If a particular dialect appears less often in the training data, the model receives fewer opportunities to learn its vocabulary, grammar, and cultural references. If discussions of certain identities are systematically removed, the model may learn weaker or distorted associations about those groups. [Dr Alan D. Thompson – LifeArchitect.ai]
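The arithmetic behind this is simple: unequal removal rates shrink a group's share of the corpus even when every group loses some text. A small sketch with invented counts follows. The group names and numbers are hypothetical, chosen only to make the effect visible.

```python
def shares(counts):
    """Convert per-group document counts into fractions of the corpus."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Invented document counts before filtering.
before = {"majority_variety": 900, "minority_variety": 100}
# Invented removal counts: 10% of the majority text, 40% of the minority text.
removed = {"majority_variety": 90, "minority_variety": 40}

after = {group: before[group] - removed[group] for group in before}

share_before = shares(before)
share_after = shares(after)
for group in before:
    print(group, round(share_before[group], 3), "->", round(share_after[group], 3))
```

In this toy case the minority variety starts at a tenth of the corpus and ends below that, so the model sees proportionally less of it than the unfiltered web contained.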

This process has an important side effect. Developers may believe they are only removing undesirable content, but they are also shaping the model’s picture of normal language use. The model can come to treat the remaining text as the default version of reality because alternative forms of expression were filtered out before training began. [arXiv]

Researchers have argued that large web datasets already tend to overrepresent dominant social and cultural perspectives. Heavy filtering can amplify this tendency if it disproportionately removes language from marginalised communities while preserving mainstream sources. [Dr Alan D. Thompson – LifeArchitect.ai]

The result is not usually an obvious failure. Models often remain fluent and capable. Instead, the effects may appear in subtler ways:

Reduced performance on underrepresented dialects.
Less accurate responses about minority communities.
Greater reliance on majority-language norms.
Lower visibility of alternative cultural viewpoints.
Difficulty distinguishing offensive uses of identity terms from self-description or community discussion. [Dr Alan D. Thompson – LifeArchitect.ai; arXiv]

The central trade-off

The evidence does not suggest that web filtering should be abandoned. Unfiltered web data contains significant amounts of spam, misinformation, duplicated content, and harmful material that can degrade model quality. The key lesson is that filtering is not a neutral housekeeping step. It is part of the training signal itself. [Sites@Rutgers]

Understanding artificial intelligence therefore requires looking beyond model architecture and examining dataset construction. Every decision about what to keep, remove, or down-rank changes what the model encounters during learning. When a language model appears biased, insensitive to certain communities, or unusually confident about some viewpoints, part of the explanation may lie not in the model’s design but in the web-filtering choices that shaped its training data. [arXiv; Knowing Machines]

Endnotes

Source: arxiv.org (arXiv)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, by J. Dodge et al., 2021
Link: https://arxiv.org/abs/2104.08758

Source: s10251.pcdn.co (Dr Alan D. Thompson – LifeArchitect.ai)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, arXiv:2104.08758v1 [cs.CL], 18 Apr 2021
Link: https://s10251.pcdn.co/pdf/2021-dodge-c4.pdf

Source: sites.rutgers.edu (Sites@Rutgers)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, by J. Dodge et al.
Link: https://sites.rutgers.edu/critical-ai/wp-content/uploads/sites/586/2021/09/dodge2021documentingC4.pdf
Published: September 13, 2021

Source: s10251.pcdn.co (Dr Alan D. Thompson – LifeArchitect.ai)
Title: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, by E. M. Bender et al.
Link: https://s10251.pcdn.co/pdf/2021-bender-parrots.pdf

Source: knowingmachines.org (Knowing Machines)
Title: The case of 'Colossal Cleaned Common Crawl' (C4)
Link: https://knowingmachines.org/publications/9-ways-to-see/essays/c4

Source: aclanthology.org (ACL Anthology)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, by J. Dodge et al., 2021
Link: https://aclanthology.org/2021.emnlp-main.98/

Source: maartensap.com (Maarten Sap)
Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, by J. Dodge et al.
Link: https://maartensap.com/pdfs/dodge2021documentingC4.pdf

Source: github.com
Link: https://github.com/allenai/allennlp/discussions/5265
Source snippet
We now have almost 27TB of clean-ish data, in 101 different languages (plus the "undetected"...
