When face AI fails some people first

Introduction

Face-analysis systems became one of the clearest examples of how an AI model can appear highly accurate while still failing badly for specific groups of people. The problem was not simply that algorithms made mistakes. It was that the datasets used to train and evaluate them often contained far fewer darker-skinned women than lighter-skinned men. When benchmarks were dominated by certain faces, strong overall accuracy scores could hide serious weaknesses that only became visible when results were broken down by both skin tone and gender. The best-known evidence came from the Gender Shades research project, which showed that darker-skinned women experienced dramatically higher error rates than other groups and helped change how researchers think about AI evaluation. [Proceedings of Machine Learning Research]


When face AI fails some people first

The central lesson from facial-analysis bias is that representation matters at multiple stages. If a dataset contains relatively few examples of a subgroup, the model receives fewer opportunities to learn its patterns. If the benchmark used for testing is similarly unbalanced, poor performance may never be detected before deployment.

In the years before the Gender Shades study, several influential face datasets were widely used as benchmarks for facial analysis systems. Researchers often treated high scores on these datasets as evidence that a system worked well in general. However, the composition of the benchmarks themselves received less scrutiny than the algorithms being tested. [Proceedings of Machine Learning Research]

What the Gender Shades study tested

In 2018, Joy Buolamwini and Timnit Gebru examined three commercial gender-classification systems from major technology companies. Rather than reporting only an overall accuracy figure, they evaluated performance across four intersectional groups: lighter-skinned women, lighter-skinned men, darker-skinned women and darker-skinned men. This approach combined gender and skin tone rather than treating them as separate categories. [Proceedings of Machine Learning Research]

The results were striking. Across the systems tested, darker-skinned women were consistently the most misclassified group, with error rates of up to 34.7%. For lighter-skinned men, the best-served group, the error rate never exceeded 0.8%. Looking only at average accuracy would have concealed much of this disparity. [Proceedings of Machine Learning Research; MIT News]

The study did more than expose unequal outcomes. It also investigated the datasets that underpinned many facial-analysis evaluations, revealing a structural reason why such disparities could emerge. [Proceedings of Machine Learning Research]

How benchmark composition hid subgroup errors

A key finding of Gender Shades was that commonly used facial-analysis benchmarks were heavily skewed toward lighter-skinned individuals. The researchers reported that the IJB-A benchmark contained about 79.6% lighter-skinned subjects, while the Adience benchmark contained about 86.2% lighter-skinned subjects. [Proceedings of Machine Learning Research; Just Tech]
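A composition check of this kind can be sketched in a few lines of Python. The records below are invented placeholders, not data from IJB-A or Adience, and the field names `skin` and `gender` are assumptions made for illustration.

```python
from collections import Counter

def composition(records, keys):
    """Return the share of records in each subgroup defined by `keys`."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical benchmark metadata (invented counts, illustration only).
records = (
    [{"skin": "lighter", "gender": "male"}] * 50
    + [{"skin": "lighter", "gender": "female"}] * 30
    + [{"skin": "darker", "gender": "male"}] * 15
    + [{"skin": "darker", "gender": "female"}] * 5
)

by_skin = composition(records, ["skin"])            # one dimension
by_both = composition(records, ["skin", "gender"])  # intersectional view

# Print subgroups from least to most represented.
for group, share in sorted(by_both.items(), key=lambda kv: kv[1]):
    print(group, f"{share:.0%}")
```

Counting by skin tone alone shows the overall skew; counting by both attributes shows which intersectional subgroup is smallest.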

The imbalance became even more apparent when examining subgroup representation. In IJB-A, darker-skinned women represented only a small fraction of the dataset. Because benchmark scores aggregate results across all examples, strong performance on the majority group could dominate the final metric. A system could therefore achieve an impressive overall score while making frequent mistakes on underrepresented faces. [ResearchGate]

This is a classic measurement problem. Imagine a test in which most questions cover one topic and only a few cover another. A student could perform very well overall while still lacking competence in the less-tested area. Similarly, a facial-analysis model evaluated mainly on lighter-skinned faces could appear successful despite weak performance elsewhere. [Proceedings of Machine Learning Research]
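The arithmetic behind this effect is simple: aggregate accuracy is the average of subgroup accuracies, weighted by each subgroup's share of the benchmark. The sketch below uses invented shares and accuracies, not figures from the study, to show how a small subgroup barely moves the headline number.

```python
# Hypothetical subgroup shares and accuracies (invented for illustration).
groups = {
    "majority": {"share": 0.95, "accuracy": 0.99},
    "minority": {"share": 0.05, "accuracy": 0.65},
}

def overall_accuracy(groups):
    """Aggregate accuracy is the share-weighted mean of subgroup accuracies."""
    return sum(g["share"] * g["accuracy"] for g in groups.values())

overall = overall_accuracy(groups)
minority_error = 1 - groups["minority"]["accuracy"]

print(f"overall accuracy:    {overall:.1%}")
print(f"minority error rate: {minority_error:.1%}")
```

With these made-up numbers the benchmark reports roughly 97% accuracy even though about one in three minority-group faces is misclassified.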

The issue was not necessarily deliberate exclusion. Many datasets were assembled from available image sources, public figures, media archives or existing collections. Yet the selection process reflected broader patterns of visibility and representation. Once these datasets became standard benchmarks, their composition shaped both model development and perceptions of success. [Proceedings of Machine Learning Research]


Why subgroup testing changes the story

Before intersectional evaluation became more common, researchers often reported performance by a single category, such as gender alone. That approach can hide important interactions.

Suppose a system performs reasonably well on women overall and reasonably well on darker-skinned people overall. Those averages do not guarantee strong performance for darker-skinned women. The Gender Shades study demonstrated that combining demographic dimensions can reveal failures invisible in broader statistics. [Proceedings of Machine Learning Research]
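A small numerical sketch makes the point concrete. The counts below are invented for illustration and are not taken from Gender Shades; they are chosen so that both single-attribute error rates look tolerable while the intersectional cell does not.

```python
# Hypothetical audit counts per (skin tone, gender) cell: (n_faces, n_errors).
cells = {
    ("lighter", "male"): (400, 4),
    ("lighter", "female"): (400, 16),
    ("darker", "male"): (180, 9),
    ("darker", "female"): (20, 7),
}

def error_rate(cells, skin=None, gender=None):
    """Error rate over every cell matching the given filters (None = any)."""
    n = errors = 0
    for (s, g), (count, wrong) in cells.items():
        if skin in (None, s) and gender in (None, g):
            n += count
            errors += wrong
    return errors / n

women = error_rate(cells, gender="female")         # all women
darker = error_rate(cells, skin="darker")          # all darker-skinned faces
darker_women = error_rate(cells, skin="darker", gender="female")

print(f"women: {women:.1%}, darker-skinned: {darker:.1%}, "
      f"darker-skinned women: {darker_women:.1%}")
```

Here the error rate is about 5.5% for women and 8% for darker-skinned faces, yet 35% for darker-skinned women, because that cell is too small to shift either single-attribute average.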

This insight changed the conversation around AI fairness. Instead of asking whether a model is accurate, researchers increasingly ask:

Accurate for whom?
Under what conditions?
Compared with which groups?
Using what benchmark?

These questions shift attention from headline accuracy figures to the distribution of errors across populations. [Proceedings of Machine Learning Research]
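In practice, answering those questions means reporting errors per subgroup alongside the headline figure. The following is a minimal sketch, assuming each evaluation row carries a subgroup label; the `audit` helper, the group names and the rows themselves are hypothetical.

```python
from collections import defaultdict

def audit(rows):
    """Per-group error rates from (group, true_label, predicted_label) rows."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for group, truth, pred in rows:
        totals[group] += 1
        wrong[group] += truth != pred  # True counts as 1
    return {g: wrong[g] / totals[g] for g in totals}

# Hypothetical evaluation rows (invented for illustration).
rows = (
    [("lighter_male", "M", "M")] * 4
    + [("lighter_female", "F", "F")] * 3 + [("lighter_female", "F", "M")]
    + [("darker_male", "M", "M")] * 4
    + [("darker_female", "F", "F")] * 2 + [("darker_female", "F", "M")] * 2
)

headline = sum(t != p for _, t, p in rows) / len(rows)
rates = audit(rows)
worst_group = max(rates, key=rates.get)
gap = rates[worst_group] - min(rates.values())

print(f"headline error: {headline:.1%}")
print(f"worst group: {worst_group} ({rates[worst_group]:.1%}), gap: {gap:.1%}")
```

Reporting the worst-group rate and the gap between best and worst groups keeps the distribution of errors visible instead of folding it into a single average.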

The Gender Shades researchers responded to the benchmark problem by creating a more balanced evaluation dataset, the Pilot Parliaments Benchmark (PPB), making it easier to measure performance across different combinations of skin tone and gender. Their work helped establish subgroup auditing as a standard practice in AI fairness research. [Proceedings of Machine Learning Research; ResearchGate]
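One generic way to approximate a balanced evaluation set from an existing, skewed pool is to draw the same number of examples from every subgroup. The sketch below illustrates that idea only; it is not how the Gender Shades authors built their dataset, and the `balanced_sample` helper and the pool contents are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(items, key, per_group, seed=0):
    """Draw up to `per_group` items from each subgroup returned by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    buckets = defaultdict(list)
    for item in items:
        buckets[key(item)].append(item)
    sample = []
    for group in sorted(buckets):
        members = buckets[group]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical skewed pool of (skin tone, gender, image id) records.
pool = (
    [("lighter", "male", i) for i in range(50)]
    + [("lighter", "female", i) for i in range(30)]
    + [("darker", "male", i) for i in range(15)]
    + [("darker", "female", i) for i in range(5)]
)

subset = balanced_sample(pool, key=lambda r: (r[0], r[1]), per_group=5)
counts = Counter((skin, gender) for skin, gender, _ in subset)
print(counts)
```

The smallest subgroup caps how large a balanced subset can be, which is one reason collecting new, deliberately balanced data can be preferable to resampling an old benchmark.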

The broader lesson about datasets

The case of darker-skinned women in facial analysis illustrates a broader principle in artificial intelligence: datasets do not merely provide examples for learning; they define what success looks like. If a benchmark contains mostly one kind of person, a model can optimise for that majority and still appear highly successful.

The significance of Gender Shades was therefore not only the discovery of unequal error rates. It showed how benchmark design itself can hide those errors. Once researchers examined performance at the intersection of skin tone and gender, the apparent reliability of several systems looked very different. The episode remains a foundational example of why training data and evaluation data must be examined not only for size and quality, but also for who is represented within them. [Proceedings of Machine Learning Research; Just Tech]


Endnotes

Source: news.mit.edu
Link: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
Source snippet
MIT NewsStudy finds gender and skin-type bias in commercial...Feb 11, 2018 — Examination of facial-analysis software shows error rate of...
Source: researchgate.net
Link: https://www.researchgate.net/publication/323722163_Gender_shades_intersectional_phenotypic_and_demographic_evaluation_of_face_datasets_and_gender_classifiers
Source snippet
Gender shades: intersectional phenotypic and...The datasets evaluated are overwhelming lighter skinned: 79.6% - 86.24%...
Source: proceedings.mlr.press
Link: https://proceedings.mlr.press/v81/buolamwini18a.html
Source snippet
Proceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in...by J Buolamwini · 2018 · Cited by 10687...
Source: proceedings.mlr.press
Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
Source snippet
Proceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in...by J Buolamwini · 2018 · Cited by 10687...
Source: just-tech.ssrc.org
Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classi%EF%AC%81cation/
Source snippet
Just TechGender Shades: Intersectional Accuracy Disparities in...We evaluate 3 commercial gender classiﬁcation systems using our dataset...
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Gender
Source snippet
GenderGender is the range of social, psychological, cultural, and behavioral aspects of being a man (or boy), woman (or girl), or port...

Additional References

Source: ars.electronica.art
Link: https://ars.electronica.art/outofthebox/en/gender-shades/
Source snippet
Gender Shades – Out of the BoxThe study reveals that popular applications that are already part of the programming display...
Source: digitalgovernmenthub.org
Link: https://digitalgovernmenthub.org/library/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
Source snippet
Gender Shades: Intersectional Accuracy Disparities in...Darker-skinned females experience the highest misclassification rates (up to 34...
Source: gs.ajl.org
Link: https://gs.ajl.org/
Source snippet
Shades... ” - Dr. Buolamwini. Average face of a darker females in PPB. Average face of a darker male in PPB. Average face of a lighter fe...
Source: commons.opencivics.co
Link: https://commons.opencivics.co/Gender-Shades-Intersectional-Accuracy-Disparities-in-Commercial-Gender-Classification-2fb06d2570f281d0bd85de0fc932264a
Source snippet
Shades: Intersectional Accuracy Disparities in...The paper systematically documents that commercial facial recognition systems exhibit d...
Source: gendershades.org
Link: https://gendershades.org/overview.html
Source: fr.scribd.com
Link: https://fr.scribd.com/document/470788356/Gender-Shades-Intersectional-Accuracy-Disparities-in-Commercial-Gender-Classification-pdf
Source snippet
Terms of Service). None of the commercial gen- faces (20.8% − 34.7% error rate) der classifiers chosen for...Read more...
Source: youtube.com
Link: https://www.youtube.com/watch?v=TWWsW1w-BVo
Source snippet
Gender ShadesI wanted to see how well different gender classification systems worked across different peoples faces and if the results ch...
Source: klover.ai
Title: dr timnit gebru translating gender shades into corporate governance
Link: https://www.klover.ai/dr-timnit-gebru-translating-gender-shades-into-corporate-governance/
Source snippet
Dr. Timnit Gebru: Translating 'Gender Shades' into...23 Jun 2025 — Gender Shades proved that algorithmic bias often originates not in th...
Source: dataprivacyadvisory.com
Title: how gender shades sheds light on bias in machine learning
Link: https://www.dataprivacyadvisory.com/how-gender-shades-sheds-light-on-bias-in-machine-learning/
Source snippet
How 'Gender Shades' Sheds Light on Bias in Machine Learning17 Jan 2024 — For darker-skinned males, the systems performed better than for...
Source: maquinacoes.rafaelg.net.br
Link: https://maquinacoes.rafaelg.net.br/gender-shades
Source snippet
de gênero: disparidades interseccionais de acurácia em...All classifiers perform worst on darker female faces (20.8% − 34.7% error rate)...


