When face AI fails some people first
Facial analysis failures show how an accurate-looking model can hide serious subgroup errors when training data is uneven.
On this page
- What the Gender Shades study tested
- How benchmark composition hid subgroup errors
- Why subgroup testing changes the story
Introduction
Face-analysis systems became one of the clearest examples of how an AI model can appear highly accurate while still failing badly for specific groups of people. The problem was not simply that algorithms made mistakes. It was that the datasets used to train and evaluate them often contained far fewer darker-skinned women than lighter-skinned men. When benchmarks were dominated by certain faces, strong overall accuracy scores could hide serious weaknesses that only became visible when results were broken down by both skin tone and gender. The best-known evidence came from the Gender Shades research project, which showed that darker-skinned women experienced dramatically higher error rates than other groups and helped change how researchers think about AI evaluation. [Proceedings of Machine Learning Research]
The central lesson from facial-analysis bias is that representation matters at multiple stages. If a dataset contains relatively few examples of a subgroup, the model receives fewer opportunities to learn its patterns. If the benchmark used for testing is similarly unbalanced, poor performance may never be detected before deployment.
In the years before the Gender Shades study, several influential face datasets were widely used as benchmarks for facial analysis systems. Researchers often treated high scores on these datasets as evidence that a system worked well in general. However, the composition of the benchmarks themselves received less scrutiny than the algorithms being tested. [Proceedings of Machine Learning Research]
What the Gender Shades study tested
In 2018, Joy Buolamwini and Timnit Gebru examined three commercial gender-classification systems from major technology companies. Rather than reporting only an overall accuracy figure, they evaluated performance across four intersectional groups: lighter-skinned women, lighter-skinned men, darker-skinned women and darker-skinned men. This approach combined gender and skin tone rather than treating them as separate categories. [Proceedings of Machine Learning Research]
The results were striking. Across the systems tested, darker-skinned women were consistently the most misclassified group, with error rates between 20.8% and 34.7% depending on the system. For lighter-skinned men, the best-performing group, the error rate never exceeded 0.8%. Looking only at average accuracy would have concealed much of this disparity. [Proceedings of Machine Learning Research; MIT News]
The study did more than expose unequal outcomes. It also investigated the datasets that underpinned many facial-analysis evaluations, revealing a structural reason why such disparities could emerge. [Proceedings of Machine Learning Research]
How benchmark composition hid subgroup errors
A key finding of Gender Shades was that commonly used facial-analysis benchmarks were heavily skewed toward lighter-skinned individuals. The researchers reported that the IJB-A benchmark contained about 79.6% lighter-skinned subjects, while the Adience benchmark contained about 86.2% lighter-skinned subjects. [Proceedings of Machine Learning Research; Just Tech]
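A composition check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the method used in the study: the labels, the group counts, the 10% threshold and the function names `composition` and `underrepresented` are all hypothetical.

```python
from collections import Counter

def composition(labels):
    """Return each subgroup's share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, min_share):
    """List subgroups whose share falls below a chosen threshold."""
    return sorted(g for g, s in shares.items() if s < min_share)

# Hypothetical (skin type, gender) annotations for a 100-image benchmark.
# These counts are made up for illustration, not taken from IJB-A or Adience.
labels = (
    [("lighter", "male")] * 50
    + [("lighter", "female")] * 30
    + [("darker", "male")] * 15
    + [("darker", "female")] * 5
)

shares = composition(labels)
flagged = underrepresented(shares, min_share=0.10)  # threshold is an assumption

for group, share in sorted(shares.items()):
    print(group, f"{share:.0%}")
print("below threshold:", flagged)
```

In this made-up benchmark, lighter-skinned subjects make up 80% of the images and only one intersectional group falls below the chosen threshold. The threshold itself is a judgment call that depends on how the benchmark will be used.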
The imbalance became even more apparent when examining subgroup representation. In IJB-A, darker-skinned women represented only a small fraction of the dataset. Because benchmark scores aggregate results across all examples, strong performance on the majority group could dominate the final metric. A system could therefore achieve an impressive overall score while making frequent mistakes on underrepresented faces. [ResearchGate]
This is a classic measurement problem. Imagine a test in which most questions cover one topic and only a few cover another. A student could perform very well overall while still lacking competence in the less-tested area. Similarly, a facial-analysis model evaluated mainly on lighter-skinned faces could appear successful despite weak performance elsewhere. [Proceedings of Machine Learning Research]
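The arithmetic behind this masking effect is simple: aggregate accuracy is the share-weighted average of subgroup accuracy. The sketch below uses hypothetical shares and error rates chosen only to show the effect; they are not figures from Gender Shades, and the function names are illustrative.

```python
# Hypothetical benchmark: each subgroup's share of test images and its error rate.
# Illustrative numbers only, not results from the Gender Shades study.
groups = {
    "lighter_men":   {"share": 0.50, "error": 0.01},
    "lighter_women": {"share": 0.30, "error": 0.05},
    "darker_men":    {"share": 0.15, "error": 0.10},
    "darker_women":  {"share": 0.05, "error": 0.30},
}

def overall_accuracy(groups):
    """Aggregate accuracy: subgroup accuracy weighted by subgroup share."""
    return sum(g["share"] * (1.0 - g["error"]) for g in groups.values())

def worst_group(groups):
    """Return the subgroup with the highest error rate and its accuracy."""
    name = max(groups, key=lambda k: groups[k]["error"])
    return name, 1.0 - groups[name]["error"]

overall = overall_accuracy(groups)
name, acc = worst_group(groups)
print(f"overall accuracy: {overall:.3f}")
print(f"worst group: {name} at {acc:.3f}")
```

With these assumed numbers the headline figure is 95% accuracy, even though the smallest group is classified correctly only 70% of the time. Because that group contributes just 5% of the test images, its errors barely move the aggregate.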
The issue was not necessarily deliberate exclusion. Many datasets were assembled from available image sources, public figures, media archives or existing collections. Yet the selection process reflected broader patterns of visibility and representation. Once these datasets became standard benchmarks, their composition shaped both model development and perceptions of success. [Proceedings of Machine Learning Research]
Why subgroup testing changes the story
Before intersectional evaluation became more common, researchers often reported performance by a single category, such as gender alone. That approach can hide important interactions.
Suppose a system performs reasonably well on women overall and reasonably well on darker-skinned people overall. Those averages do not guarantee strong performance for darker-skinned women. The Gender Shades study demonstrated that combining demographic dimensions can reveal failures invisible in broader statistics. [Proceedings of Machine Learning Research]
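A small worked example makes the gap between single-category and intersectional reporting concrete. The counts below are hypothetical and were picked so that both single-category averages look acceptable; `accuracy_by` is an illustrative helper, not part of any library.

```python
from collections import defaultdict

# (gender, skin_type, n_images, n_correct): hypothetical counts for illustration.
counts = [
    ("male", "lighter", 100, 99),
    ("female", "lighter", 100, 95),
    ("male", "darker", 100, 94),
    ("female", "darker", 20, 13),
]

def accuracy_by(counts, key):
    """Pool counts under the grouping chosen by `key` and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [images, correct]
    for gender, skin, n, correct in counts:
        k = key(gender, skin)
        totals[k][0] += n
        totals[k][1] += correct
    return {k: c / n for k, (n, c) in totals.items()}

by_gender = accuracy_by(counts, lambda g, s: g)       # gender alone
by_skin = accuracy_by(counts, lambda g, s: s)         # skin type alone
by_both = accuracy_by(counts, lambda g, s: (g, s))    # intersection of both

print("women overall:  ", round(by_gender["female"], 3))
print("darker overall: ", round(by_skin["darker"], 3))
print("darker women:   ", round(by_both[("female", "darker")], 3))
```

Under these assumptions, accuracy is 90% for women overall and about 89% for darker-skinned people overall, yet only 65% for darker-skinned women. The single-category figures pool the small intersectional group with larger, better-served groups, which is why the weakness only appears in the combined breakdown.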
This insight changed the conversation around AI fairness. Instead of asking whether a model is accurate, researchers increasingly ask:
- Accurate for whom?
- Under what conditions?
- Compared with which groups?
- Using what benchmark?
These questions shift attention from headline accuracy figures to the distribution of errors across populations. [Proceedings of Machine Learning Research]
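In practice, answering "accurate for whom?" means reporting error rates per group rather than a single number. The following is a minimal sketch of such a disaggregated report, assuming each evaluation record carries a subgroup tag alongside the true and predicted labels. The record format, the toy data and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def error_rates(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    n = defaultdict(int)
    wrong = defaultdict(int)
    for group, truth, pred in records:
        n[group] += 1
        if truth != pred:
            wrong[group] += 1
    return {g: wrong[g] / n[g] for g in n}

def largest_gap(rates):
    """Return the worst group, the best group, and the error-rate gap between them."""
    worst = max(rates, key=rates.get)
    best = min(rates, key=rates.get)
    return worst, best, rates[worst] - rates[best]

# Toy evaluation records, invented for this example.
records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_female", "female", "female"),
    ("lighter_female", "female", "female"),
    ("darker_male", "male", "male"),
    ("darker_male", "male", "male"),
    ("darker_female", "female", "male"),
    ("darker_female", "female", "female"),
]

rates = error_rates(records)
worst, best, gap = largest_gap(rates)
print(rates)
print(f"largest gap: {worst} vs {best} = {gap:.2f}")
```

Reporting the gap next to the per-group rates keeps the comparison between groups visible instead of leaving it buried in an average. A real audit would also need enough examples per group for the rates to be statistically meaningful, which this toy data does not provide.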
The Gender Shades researchers responded to the benchmark problem by creating a more balanced evaluation dataset, the Pilot Parliaments Benchmark (PPB), making it easier to measure performance across different combinations of skin tone and gender. Their work helped establish subgroup auditing as a standard practice in AI fairness research. [Proceedings of Machine Learning Research; ResearchGate]
The broader lesson about datasets
The case of darker-skinned women in facial analysis illustrates a broader principle in artificial intelligence: datasets do not merely provide examples for learning; they define what success looks like. If a benchmark contains mostly one kind of person, a model can optimise for that majority and still appear highly successful.
The significance of Gender Shades was therefore not only the discovery of unequal error rates. It showed how benchmark design itself can hide those errors. Once researchers examined performance at the intersection of skin tone and gender, the apparent reliability of several systems looked very different. The episode remains a foundational example of why training data and evaluation data must be examined not only for size and quality, but also for who is represented within them. [Proceedings of Machine Learning Research; Just Tech]
Further Reading
Books related to the themes of this article, for readers who want to go deeper.
Weapons of Math Destruction
Explains how data and metrics can create unequal outcomes across groups.
Artificial Intelligence
Provides foundational coverage of learning objectives and evaluation.
Race After Technology
Directly addresses how technology can reproduce racial inequities.
Endnotes
- Source: news.mit.edu
  Link: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
  Source snippet: MIT News. Study finds gender and skin-type bias in commercial... Feb 11, 2018. Examination of facial-analysis software shows error rate of...
- Source: researchgate.net
  Link: https://www.researchgate.net/publication/323722163_Gender_shades_intersectional_phenotypic_and_demographic_evaluation_of_face_datasets_and_gender_classifiers
  Source snippet: Gender shades: intersectional phenotypic and... The datasets evaluated are overwhelming lighter skinned: 79.6% - 86.24%...
- Source: proceedings.mlr.press
  Link: https://proceedings.mlr.press/v81/buolamwini18a.html
  Source snippet: Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... by J Buolamwini, 2018.
- Source: proceedings.mlr.press
  Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
  Source snippet: Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... by J Buolamwini, 2018.
- Source: just-tech.ssrc.org
  Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classi%EF%AC%81cation/
  Source snippet: Just Tech. Gender Shades: Intersectional Accuracy Disparities in... We evaluate 3 commercial gender classification systems using our dataset...
- Source: Wikipedia
  Link: https://en.wikipedia.org/wiki/Gender
  Source snippet: Gender. Gender is the range of social, psychological, cultural, and behavioral aspects of being a man (or boy), woman (or girl), or...
Additional References
- Source: ars.electronica.art
  Link: https://ars.electronica.art/outofthebox/en/gender-shades/
  Source snippet: Gender Shades – Out of the Box. The study reveals that popular applications that are already part of the programming display...
- Source: digitalgovernmenthub.org
  Link: https://digitalgovernmenthub.org/library/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
  Source snippet: Gender Shades: Intersectional Accuracy Disparities in... Darker-skinned females experience the highest misclassification rates (up to 34...
- Source: gs.ajl.org
  Link: https://gs.ajl.org/
- Source: commons.opencivics.co
  Link: https://commons.opencivics.co/Gender-Shades-Intersectional-Accuracy-Disparities-in-Commercial-Gender-Classification-2fb06d2570f281d0bd85de0fc932264a
  Source snippet: Shades: Intersectional Accuracy Disparities in... The paper systematically documents that commercial facial recognition systems exhibit d...
- Source: gendershades.org
  Link: https://gendershades.org/overview.html
- Source: fr.scribd.com
  Link: https://fr.scribd.com/document/470788356/Gender-Shades-Intersectional-Accuracy-Disparities-in-Commercial-Gender-Classification-pdf
  Source snippet: ...faces (20.8% − 34.7% error rate)...
- Source: youtube.com
  Link: https://www.youtube.com/watch?v=TWWsW1w-BVo
  Source snippet: Gender Shades. I wanted to see how well different gender classification systems worked across different peoples faces and if the results ch...
- Source: klover.ai
  Title: dr timnit gebru translating gender shades into corporate governance
  Link: https://www.klover.ai/dr-timnit-gebru-translating-gender-shades-into-corporate-governance/
  Source snippet: Dr. Timnit Gebru: Translating 'Gender Shades' into... 23 Jun 2025. Gender Shades proved that algorithmic bias often originates not in th...
- Source: dataprivacyadvisory.com
  Title: how gender shades sheds light on bias in machine learning
  Link: https://www.dataprivacyadvisory.com/how-gender-shades-sheds-light-on-bias-in-machine-learning/
  Source snippet: How 'Gender Shades' Sheds Light on Bias in Machine Learning. 17 Jan 2024. For darker-skinned males, the systems performed better than for...
- Source: maquinacoes.rafaelg.net.br
  Link: https://maquinacoes.rafaelg.net.br/gender-shades
  Source snippet: de gênero: disparidades interseccionais de acurácia em... All classifiers perform worst on darker female faces (20.8% − 34.7% error rate)...