When face AI fails some people first

Facial analysis failures show how an accurate-looking model can hide serious subgroup errors when training data is uneven.

On this page

  • What the Gender Shades study tested
  • How benchmark composition hid subgroup errors
  • Why subgroup testing changes the story

Introduction

Face-analysis systems became one of the clearest examples of how an AI model can appear highly accurate while still failing badly for specific groups of people. The problem was not simply that algorithms made mistakes. It was that the datasets used to train and evaluate them often contained far fewer darker-skinned women than lighter-skinned men. When benchmarks were dominated by certain faces, strong overall accuracy scores could hide serious weaknesses that only became visible when results were broken down by both skin tone and gender. The best-known evidence came from the Gender Shades research project, which showed that darker-skinned women experienced dramatically higher error rates than other groups and helped change how researchers think about AI evaluation. [Proceedings of Machine Learning Research]

When face AI fails some people first

The central lesson from facial-analysis bias is that representation matters at multiple stages. If a dataset contains relatively few examples of a subgroup, the model receives fewer opportunities to learn its patterns. If the benchmark used for testing is similarly unbalanced, poor performance may never be detected before deployment.

In the years before the Gender Shades study, several influential face datasets were widely used as benchmarks for facial-analysis systems. Researchers often treated high scores on these datasets as evidence that a system worked well in general. However, the composition of the benchmarks themselves received less scrutiny than the algorithms being tested. [Proceedings of Machine Learning Research]

What the Gender Shades study tested

In 2018, Joy Buolamwini and Timnit Gebru examined three commercial gender-classification systems from major technology companies. Rather than reporting only an overall accuracy figure, they evaluated performance across four intersectional groups: lighter-skinned women, lighter-skinned men, darker-skinned women and darker-skinned men. This approach combined gender and skin tone rather than treating them as separate categories. [Proceedings of Machine Learning Research]

The results were striking. Across the systems tested, darker-skinned women were consistently the most misclassified group, with error rates reaching as high as 34.7%. For lighter-skinned men, the best-performing group, the error rate never exceeded 0.8%. Looking only at average accuracy would have concealed much of this disparity. [Proceedings of Machine Learning Research; MIT News]

The study did more than expose unequal outcomes. It also investigated the datasets that underpinned many facial-analysis evaluations, revealing a structural reason why such disparities could emerge. [Proceedings of Machine Learning Research]

How benchmark composition hid subgroup errors

A key finding of Gender Shades was that commonly used facial-analysis benchmarks were heavily skewed toward lighter-skinned individuals. The researchers reported that the IJB-A benchmark contained about 79.6% lighter-skinned subjects, while the Adience benchmark contained about 86.2% lighter-skinned subjects. [Proceedings of Machine Learning Research; Just Tech]

The imbalance became even more apparent when examining subgroup representation. In IJB-A, darker-skinned women represented only a small fraction of the dataset. Because benchmark scores aggregate results across all examples, strong performance on the majority group could dominate the final metric. A system could therefore achieve an impressive overall score while making frequent mistakes on underrepresented faces. [ResearchGate]

This is a classic measurement problem. Imagine a test in which most questions cover one topic and only a few cover another. A student could perform very well overall while still lacking competence in the less-tested area. Similarly, a facial-analysis model evaluated mainly on lighter-skinned faces could appear successful despite weak performance elsewhere. [Proceedings of Machine Learning Research]
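
The arithmetic behind this masking effect can be sketched in a few lines of Python. The subgroup shares and accuracies below are hypothetical values chosen for illustration; they are not figures from Gender Shades or from any real benchmark. The sketch only shows that an aggregate score is a share-weighted average, so a small subgroup contributes little to it.

```python
# Hypothetical benchmark: subgroup shares and per-subgroup accuracy.
# Illustrative numbers only; they are not taken from any study.
groups = {
    "lighter men":   {"share": 0.50, "accuracy": 0.99},
    "lighter women": {"share": 0.30, "accuracy": 0.95},
    "darker men":    {"share": 0.15, "accuracy": 0.90},
    "darker women":  {"share": 0.05, "accuracy": 0.70},
}


def overall_accuracy(groups):
    """Aggregate accuracy is the share-weighted mean of subgroup accuracies."""
    return sum(g["share"] * g["accuracy"] for g in groups.values())


overall = overall_accuracy(groups)

# Find the subgroup with the lowest accuracy.
worst_name, worst = min(groups.items(), key=lambda kv: kv[1]["accuracy"])

print(f"overall accuracy: {overall:.3f}")
print(f"worst subgroup:   {worst_name} at {worst['accuracy']:.2f}")
```

With these assumed numbers the headline figure comes out at 95%, even though the smallest subgroup is classified correctly only 70% of the time. That subgroup makes up 5% of the examples, so its errors move the overall score by only a point or two.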

The issue was not necessarily deliberate exclusion. Many datasets were assembled from available image sources, public figures, media archives or existing collections. Yet the selection process reflected broader patterns of visibility and representation. Once these datasets became standard benchmarks, their composition shaped both model development and perceptions of success. [Proceedings of Machine Learning Research]

Why subgroup testing changes the story

Before intersectional evaluation became more common, researchers often reported performance by a single category, such as gender alone. That approach can hide important interactions.

Suppose a system performs reasonably well on women overall and reasonably well on darker-skinned people overall. Those averages do not guarantee strong performance for darker-skinned women. The Gender Shades study demonstrated that combining demographic dimensions can reveal failures invisible in broader statistics. [Proceedings of Machine Learning Research]
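
A minimal sketch of this effect, assuming a hypothetical audit table of error counts per subgroup. The counts, the `counts` dictionary and the `error_rates` helper are invented for illustration and do not reproduce any published result. The same records are summarised three ways: by gender alone, by skin tone alone, and by both together.

```python
from collections import defaultdict

# Hypothetical audit counts: (gender, skin_tone) -> (errors, total).
# Illustrative numbers only; they are not results from any study.
counts = {
    ("woman", "lighter"): (9, 300),
    ("man", "lighter"): (2, 500),
    ("woman", "darker"): (15, 50),
    ("man", "darker"): (6, 150),
}


def error_rates(counts, key_fn):
    """Pool errors and totals by key_fn(gender, skin_tone), then divide."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for (gender, tone), (err, n) in counts.items():
        key = key_fn(gender, tone)
        errors[key] += err
        totals[key] += n
    return {key: errors[key] / totals[key] for key in totals}


by_gender = error_rates(counts, lambda gender, tone: gender)
by_tone = error_rates(counts, lambda gender, tone: tone)
by_both = error_rates(counts, lambda gender, tone: (gender, tone))

for label, table in [("gender", by_gender), ("skin tone", by_tone), ("both", by_both)]:
    for key, rate in table.items():
        print(f"{label:>9} | {key!s:<22} | {rate:.1%}")
```

With these assumed counts, the single-axis views show error rates of roughly 7% for women overall and 10.5% for darker-skinned people overall, while the intersectional view shows 30% for darker-skinned women. Neither marginal figure on its own would flag that subgroup.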

This insight changed the conversation around AI fairness. Instead of asking whether a model is accurate, researchers increasingly ask:

  • Accurate for whom?
  • Under what conditions?
  • Compared with which groups?
  • Using what benchmark?

These questions shift attention from headline accuracy figures to the distribution of errors across populations. [Proceedings of Machine Learning Research]

The Gender Shades researchers responded to the benchmark problem by creating a more balanced evaluation dataset, the Pilot Parliaments Benchmark (PPB), making it easier to measure performance across different combinations of skin tone and gender. Their work helped establish subgroup auditing as a standard practice in AI fairness research. [Proceedings of Machine Learning Research; ResearchGate]
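
One way to see why balance matters is to compare a pooled score with a score that gives every subgroup equal weight. The sketch below is an illustration under stated assumptions, not the method used in the study: the `results` table and the three helper functions are hypothetical. On a benchmark where every subgroup had the same number of examples, the pooled score would equal the equal-weight score, so the second figure approximates what a perfectly balanced benchmark would report.

```python
# Hypothetical per-subgroup results: name -> (correct, total).
# Illustrative numbers only; they are not taken from any study.
results = {
    "lighter men": (495, 500),
    "lighter women": (285, 300),
    "darker men": (135, 150),
    "darker women": (35, 50),
}


def micro_accuracy(results):
    """Pooled accuracy: every image counts equally, so large groups dominate."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total


def macro_accuracy(results):
    """Unweighted mean of subgroup accuracies: every subgroup counts equally."""
    per_group = [c / n for c, n in results.values()]
    return sum(per_group) / len(per_group)


def worst_group(results):
    """Return the (name, accuracy) pair with the lowest accuracy."""
    pairs = ((name, c / n) for name, (c, n) in results.items())
    return min(pairs, key=lambda pair: pair[1])


micro = micro_accuracy(results)
macro = macro_accuracy(results)
name, acc = worst_group(results)

print(f"pooled (micro) accuracy:       {micro:.3f}")
print(f"equal-weight (macro) accuracy: {macro:.3f}")
print(f"worst subgroup:                {name} at {acc:.2f}")
```

Reporting the worst subgroup alongside both averages is a design choice rather than a requirement: the equal-weight average still blends groups together, whereas the worst-group figure states directly who the system fails most often.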

The broader lesson about datasets

The case of darker-skinned women in facial analysis illustrates a broader principle in artificial intelligence: datasets do not merely provide examples for learning; they define what success looks like. If a benchmark contains mostly one kind of person, a model can optimise for that majority and still appear highly successful.

The significance of Gender Shades was therefore not only the discovery of unequal error rates. It showed how benchmark design itself can hide those errors. Once researchers examined performance at the intersection of skin tone and gender, the apparent reliability of several systems looked very different. The episode remains a foundational example of why training data and evaluation data must be examined not only for size and quality, but also for who is represented within them. [Proceedings of Machine Learning Research; Just Tech]

Endnotes

  1. Source: news.mit.edu
    Link: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
    Source snippet

    MIT News. Study finds gender and skin-type bias in commercial... Feb 11, 2018. Examination of facial-analysis software shows error rate of...

  2. Source: researchgate.net
    Link: https://www.researchgate.net/publication/323722163_Gender_shades_intersectional_phenotypic_and_demographic_evaluation_of_face_datasets_and_gender_classifiers
    Source snippet

    Gender shades: intersectional phenotypic and... The datasets evaluated are overwhelming lighter skinned: 79.6% - 86.24%...

  3. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a.html
    Source snippet

    Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... by J Buolamwini, 2018.

  4. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
    Source snippet

    Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... by J Buolamwini, 2018.

  5. Source: just-tech.ssrc.org
    Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classi%EF%AC%81cation/
    Source snippet

    Just Tech. Gender Shades: Intersectional Accuracy Disparities in... We evaluate 3 commercial gender classification systems using our dataset...

  6. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Gender
    Source snippet

    Gender. Gender is the range of social, psychological, cultural, and behavioral aspects of being a man (or boy), woman (or girl), or port...
