
The Study That Changed Face AI Testing

The Gender Shades study made facial-analysis bias visible by measuring errors across skin tone and gender together.

On this page

  • What the researchers tested
  • Why darker-skinned women stood out
  • How intersectional results changed the debate
  • Why the evidence remains important

Introduction

In 2018, the Gender Shades study transformed the discussion about fairness in artificial intelligence by showing that face-analysis systems did not make mistakes equally across all people. Rather than asking whether commercial systems were accurate on average, researchers Joy Buolamwini and Timnit Gebru examined how accuracy changed when skin tone and gender were considered together. Their results revealed a striking pattern: darker-skinned women experienced far higher error rates than any other group. At worst, the error rate for darker-skinned women reached 34.7%, while the highest error rate for lighter-skinned men in any system was only 0.8%. The study became a landmark because it made a previously hidden problem measurable and impossible to ignore. [Proceedings of Machine Learning Research]


What the researchers tested

Before Gender Shades, many developers and companies reported overall accuracy scores for facial-analysis systems. Those aggregate figures could create the impression that a system worked well for everyone, even if performance varied dramatically between groups.

The researchers took a different approach. They built a benchmark balanced by gender and skin type, the Pilot Parliaments Benchmark, and used it to evaluate commercial gender-classification systems from IBM, Microsoft and Face++. Instead of reporting one overall score, they measured performance separately for four groups:

  • Lighter-skinned men [DSpace]
  • Lighter-skinned women [MIT News]
  • Darker-skinned men [The Gradient]
  • Darker-skinned women [MIT News]

This method is known as an intersectional evaluation because it examines how multiple characteristics combine rather than treating them independently. The study showed that analysing gender alone or skin tone alone would have missed some of the most important disparities. [Proceedings of Machine Learning Research]
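
The subgroup-by-subgroup measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the record format `(true_gender, skin_group, predicted_gender)`, the sample records and the helper name `subgroup_error_rates` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (true gender, skin-type group, classifier's predicted gender).
# These values are invented for illustration and are not data from the study.
RECORDS = [
    ("female", "darker", "male"), ("female", "darker", "male"),
    ("female", "darker", "female"), ("female", "darker", "female"),
    ("male", "darker", "female"), ("male", "darker", "male"),
    ("male", "darker", "male"), ("male", "darker", "male"),
    ("female", "lighter", "male"), ("female", "lighter", "female"),
    ("female", "lighter", "female"), ("female", "lighter", "female"),
    ("male", "lighter", "male"), ("male", "lighter", "male"),
    ("male", "lighter", "male"), ("male", "lighter", "male"),
]


def subgroup_error_rates(records):
    """Hypothetical helper: return {(gender, skin_group): error rate} per subgroup."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for true_gender, skin_group, predicted in records:
        key = (true_gender, skin_group)  # the intersectional cell
        totals[key] += 1
        if predicted != true_gender:
            errors[key] += 1
    # Error rate = misclassified faces / all faces in that subgroup.
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    for (gender, skin), rate in sorted(subgroup_error_rates(RECORDS).items()):
        print(f"{skin}-skinned {gender}: {rate:.0%} error")
```

Keying the counts on the (gender, skin group) pair, rather than on either attribute alone, is what makes the evaluation intersectional.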

A crucial part of the research was its examination of the datasets used to evaluate face systems. The researchers found that widely used benchmarks were heavily skewed towards lighter-skinned subjects: about 79.6% of subjects in IJB-A and 86.2% in Adience were lighter-skinned. IJB-A, for example, included only 4.4% darker-skinned women. With so few darker-skinned subjects in the test data, poor performance on those groups could remain largely invisible. [Proceedings of Machine Learning Research; DSpace]
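
Checking a benchmark's composition before trusting its results can be sketched the same way. The counts below are invented (they only loosely echo the lighter-skinned skew described above), the helper names are hypothetical, and the 10% `min_share` threshold is an arbitrary assumption rather than a standard from the study.

```python
# Hypothetical benchmark composition: number of faces per (gender, skin group).
# Invented counts for illustration; not taken from IJB-A, Adience or the study.
COUNTS = {
    ("male", "lighter"): 500,
    ("female", "lighter"): 300,
    ("male", "darker"): 150,
    ("female", "darker"): 50,
}


def composition(counts):
    """Hypothetical helper: each subgroup's share of the whole benchmark."""
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(counts, min_share=0.10):
    """List subgroups whose share falls below min_share (an assumed cut-off)."""
    shares = composition(counts)
    return [group for group, share in shares.items() if share < min_share]


if __name__ == "__main__":
    shares = composition(COUNTS)
    lighter = sum(s for (_, skin), s in shares.items() if skin == "lighter")
    print(f"lighter-skinned share: {lighter:.1%}")
    print("below threshold:", underrepresented(COUNTS))
```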

Why darker-skinned women stood out

The most influential finding was not simply that errors existed, but that they clustered around a specific intersection of characteristics.

Across the tested systems, darker-skinned women were consistently the most misclassified group. Their error rates ranged from roughly 20.8% to 34.7%, depending on the system. By contrast, error rates for lighter-skinned men never exceeded 0.8% in any system. [Proceedings of Machine Learning Research]

This mattered because the disparity could not be explained by gender alone. Women generally experienced higher error rates than men, but darker-skinned women faced substantially larger problems than lighter-skinned women. Nor could the gap be explained by skin tone alone, because the systems generally performed better on darker-skinned men than on darker-skinned women. The interaction between the two categories revealed a distinct pattern of disadvantage that became visible only when both were measured together. [Computer Science Classes]
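
A small worked example shows why single-attribute comparisons understate the worst case. The error counts below are invented and the helper names are hypothetical; the point is only that pooling over one attribute dilutes the highest-error cell.

```python
# Hypothetical (errors, faces) per intersectional cell; invented numbers, not study results.
CELLS = {
    ("male", "lighter"): (1, 100),
    ("female", "lighter"): (7, 100),
    ("male", "darker"): (6, 100),
    ("female", "darker"): (30, 100),
}


def marginal_error(cells, position, value):
    """Pool every cell whose attribute at `position` equals `value` (0 = gender, 1 = skin)."""
    errors = sum(e for key, (e, _) in cells.items() if key[position] == value)
    faces = sum(n for key, (_, n) in cells.items() if key[position] == value)
    return errors / faces


def cell_error(cells, gender, skin):
    """Error rate for a single gender x skin-group cell."""
    errors, faces = cells[(gender, skin)]
    return errors / faces


if __name__ == "__main__":
    print(f"women (gender only): {marginal_error(CELLS, 0, 'female'):.1%}")
    print(f"darker skin (skin only): {marginal_error(CELLS, 1, 'darker'):.1%}")
    print(f"darker-skinned women (both): {cell_error(CELLS, 'female', 'darker'):.1%}")
```

With these made-up counts, both one-attribute views show a gap of roughly 18%, while the intersectional cell shows 30%: the marginals detect a disparity but not its full size.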

The study therefore challenged a common assumption in AI evaluation: a system could appear highly accurate overall while still performing poorly for a relatively small subgroup. If that subgroup was underrepresented in the test data, its errors would barely affect the headline accuracy figure. [Proceedings of Machine Learning Research]
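
The arithmetic behind this point is a weighted average: overall error is each subgroup's error rate multiplied by its share of the test set, summed across subgroups. The sketch below applies the same hypothetical per-group error rates to two invented test-set compositions, one skewed and one balanced; none of the numbers come from the study, and `overall_accuracy` is a hypothetical helper.

```python
# Hypothetical per-subgroup error rates (invented for illustration).
ERROR_RATES = {
    ("male", "lighter"): 0.01,
    ("female", "lighter"): 0.07,
    ("male", "darker"): 0.06,
    ("female", "darker"): 0.30,
}

# Two invented test-set compositions; each set of shares sums to 1.0.
SKEWED = {("male", "lighter"): 0.50, ("female", "lighter"): 0.30,
          ("male", "darker"): 0.15, ("female", "darker"): 0.05}
BALANCED = {group: 0.25 for group in ERROR_RATES}


def overall_accuracy(error_rates, shares):
    """Headline accuracy = 1 - sum(subgroup share * subgroup error rate)."""
    overall_error = sum(shares[group] * error_rates[group] for group in error_rates)
    return 1.0 - overall_error


if __name__ == "__main__":
    print(f"skewed test set: {overall_accuracy(ERROR_RATES, SKEWED):.1%} accurate")
    print(f"balanced test set: {overall_accuracy(ERROR_RATES, BALANCED):.1%} accurate")
```

With these made-up numbers the skewed test set reports about 95% accuracy and the balanced one about 89%, even though the classifier is identical and misclassifies 30% of darker-skinned women in both cases. Darker-skinned women make up only 5% of the skewed set, so their errors contribute just 1.5 points to its 5-point overall error.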


How intersectional results changed the debate

The most lasting contribution of Gender Shades was methodological. The study shifted attention from average performance to subgroup performance.

Before the study, discussions about facial-analysis bias often focused on anecdotal failures. Gender Shades provided systematic evidence. The researchers showed that the issue was measurable, reproducible and connected to how datasets and benchmarks were constructed. This changed the conversation from isolated mistakes to structural evaluation problems. [Proceedings of Machine Learning Research]

The study also demonstrated why intersectional testing matters. If researchers had only compared men and women, or only compared lighter and darker skin tones, they would have detected disparities but missed the full scale of the problem affecting darker-skinned women. The findings became one of the most widely cited examples of intersectional bias in machine learning and inspired later fairness audits across computer vision systems. [The Gradient]

Another important consequence was increased scrutiny of benchmark design. Gender Shades argued that fairness cannot be evaluated solely through overall accuracy scores. Evaluation datasets must contain sufficient representation from different groups, and results should be reported separately rather than hidden inside aggregate metrics. [Proceedings of Machine Learning Research]
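
A disaggregated report can be as simple as one row per subgroup plus the gap between the best- and worst-served groups. This is a hypothetical sketch: the helper name `disaggregated_report`, the table layout and the input numbers are all assumptions for illustration, not a reporting format prescribed by the study.

```python
# Hypothetical per-subgroup results: group label -> (misclassified, total faces).
# Invented numbers for illustration only.
RESULTS = {
    "darker-skinned women": (30, 100),
    "darker-skinned men": (6, 100),
    "lighter-skinned women": (7, 100),
    "lighter-skinned men": (1, 100),
}


def disaggregated_report(results):
    """Return report lines: one per subgroup (worst first), then the error-rate gap."""
    rates = {group: errors / total for group, (errors, total) in results.items()}
    lines = [f"{'subgroup':<24}{'n':>6}{'error':>9}"]
    for group in sorted(rates, key=rates.get, reverse=True):
        lines.append(f"{group:<24}{results[group][1]:>6}{rates[group]:>9.1%}")
    worst, best = max(rates.values()), min(rates.values())
    # State the spread explicitly so it cannot hide inside an average.
    lines.append(f"gap between worst and best subgroup: {(worst - best) * 100:.1f} points")
    return lines


if __name__ == "__main__":
    print("\n".join(disaggregated_report(RESULTS)))
```

Listing the sample size next to each rate also makes it obvious when a subgroup is too small for its error rate to be trusted.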

Why the evidence remains important

The headline comparison of 34.7% versus 0.8% (the highest error rate recorded for darker-skinned women against the highest for lighter-skinned men) became memorable because it illustrated how a seemingly successful AI system could produce very different experiences for different people. The study did not merely identify a technical flaw; it exposed a measurement blind spot. When benchmarks were dominated by lighter-skinned faces, systems could achieve impressive overall results while repeatedly failing darker-skinned women. [MIT News]

Gender Shades remains a foundational case for understanding artificial intelligence because it showed that evaluating AI is not only about how often a model is correct. It is also about asking who is being measured, who is missing from the data, and whether performance is being examined across the groups most likely to reveal hidden weaknesses. The study made darker-skinned women’s error rates visible and, in doing so, changed expectations for how face AI systems should be tested and reported. [Proceedings of Machine Learning Research; Gender Shades]



Endnotes

  1. Source: news.mit.edu
    Title: Study finds gender and skin-type bias in commercial...
    Link: https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
    Snippet: MIT News, Feb 11, 2018. Examination of facial-analysis software shows error rate of...

  2. Source: dspace.mit.edu
    Link: https://dspace.mit.edu/bitstream/handle/1721.1/114068/1026503582-MIT.pdf
    Snippet: IJB-A includes only 24.6% female and 4.4% darker female, and features 59.4% lighter...

  3. Source: who.int
    Link: https://www.who.int/health-topics/gender
    Snippet: Gender and health. Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, b...

  4. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a.html
    Snippet: Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... J Buolamwini · 2018 · Cited by 10758...

  5. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a
    Snippet: The substantial disparities in the accuracy of classifying darker females, lighter females, darker...

  6. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
    Snippet: Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... J Buolamwini · 2018 · Cited by 11167...

  7. Source: classes.cs.uchicago.edu
    Title: Darker and Lighter Error Rates. To conduct a phenotypic performance analysis
    Link: https://www.classes.cs.uchicago.edu/archive/2020/winter/20370-1/readings/gendershadesAIbias.pdf
    Snippet: Computer Science Classes. Gender Shades: Intersectional Accuracy Disparities in... J Buolamwini · Cited by 11167. The FPR for females i...

  8. Source: thegradient.pub
    Title: A Brief Overview of Gender Bias in AI
    Link: https://thegradient.pub/gender-bias-in-ai/
    Snippet: The Gradient, 8 Apr 2024. ...darker female faces (with error rates up to 34.7%). In contrast, the ma...

  9. Source: gendershades.org
    Link: https://gendershades.org/
    Snippet: Gender Shades. How well do IBM, Microsoft, and Face++ AI services guess the gender of a...

  10. Source: studocu.com
    Link: https://www.studocu.com/en-us/document/stanford-university/game-studies-issues-in-design-technology-and-player-creativity/gender-shades-intersectional-accuracy-disparities-in-ml-algorithms-mlr-2018/144319630
    Snippet: ...t accuracy disparities based on gender and skin type.

  11. Source: gendershades.org
    Link: https://gendershades.org/overview.html
    Snippet: Gender Shades. Table of subgroup error rates. IBM had the largest gap in accuracy, with a difference of 34.4% in error rate between li...

  12. Source: ars.electronica.art
    Link: https://ars.electronica.art/outofthebox/en/gender-shades/
    Snippet: Gender Shades – Out of the Box. The study reveals that popular applications that are already part of the programming display obvious discri...

  13. Source: Wikipedia
    Link: https://en.wikipedia.org/wiki/Gender
    Snippet: Gender. Gender is the range of social, psychological, cultural, and behavioral aspects of being a man (or boy), woman (or girl), or port...

  14. Source: digitalgovernmenthub.org
    Link: https://digitalgovernmenthub.org/library/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
    Snippet: ...7%), while lighter-skinned males have much lower error rates (as low as 0.8%)...

  15. Source: arxiv.org
    Link: https://arxiv.org/pdf/2505.20637
    Snippet: ...Gender Shades study exemplified this, showing error rates of 34.7% for darker-skinned women versus 0.8% for lighter-skinned men in...

  16. Source: youtube.com
    Link: https://www.youtube.com/watch?v=TWWsW1w-BVo
    Snippet: Gender Shades. The Gender Shades Project pilots an intersectional approach to inclusive product testing for AI. Gender Shades is a prelimin...

  17. Source: scispace.com
    Link: https://scispace.com/papers/gender-shades-intersectional-accuracy-disparities-in-4qgeu0c1i3
    Snippet: ...while the most accurate result is for light-skinned men, in commercial...

  18. Source: opencasebook.org
    Link: https://opencasebook.org/casebooks/2554-governing-digital-technology/resources/5.1.2.2-gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification-by-joy-buolamwini-and-timnit-gebru-conference-of-fairness-accountability-and-transparency-2018/
    Snippet: ...fication” by Joy Buolamwini and Timnit Gebru, Conference of Fairness...

  19. Source: bibbase.org
    Title: Buolamwini, J. & Gebru, T. Proceedings of Machine Learning Research.
    Link: https://bibbase.org/network/publication/buolamwini-gebru-gendershadesintersectionalaccuracydisparitiesincommercialgenderclassification-2018
    Snippet: Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classifi...

