When Benchmarks Decide Who Counts

Skewed benchmark datasets made facial-analysis tools look more dependable than they were for underrepresented groups.

Introduction

Benchmarks do more than measure artificial intelligence systems. They shape which systems are trusted, funded, deployed and improved. In facial analysis, some of the most influential benchmarks gave the impression that algorithms were performing reliably across populations when, in reality, important groups were barely represented in the tests. As a result, failures affecting darker-skinned women often remained hidden behind strong overall accuracy scores. The problem was not simply that datasets were unbalanced; it was that the benchmarks used to judge success were unbalanced as well. When the tests themselves overlooked certain populations, developers, customers and regulators had fewer signals that something was wrong. [Proceedings of Machine Learning Research]

Understanding this dynamic helps explain why concerns about facial-analysis bias emerged relatively late despite years of reported progress in the field. Benchmark design influenced what researchers noticed, what companies celebrated and what users came to trust. [Proceedings of Machine Learning Research]

When Benchmarks Decide Who Counts

A benchmark is a standard test dataset used to compare AI systems. Researchers often rely on benchmark scores as evidence that a model works well. High scores can influence scientific publications, commercial adoption and public confidence.

The difficulty arises when benchmark populations differ substantially from the populations that systems will encounter in the real world. If a benchmark contains mostly lighter-skinned faces, a model can achieve impressive overall results by performing well on those faces while performing poorly on others. Because benchmark results are usually summarised into a single headline number, subgroup failures may receive little attention. [Proceedings of Machine Learning Research]

In this way, trust becomes tied not only to algorithm quality but also to benchmark composition. A benchmark effectively determines whose experiences count when accuracy claims are made.

What IJB-A and Adience Overrepresented

The Gender Shades study examined two influential facial-analysis benchmarks: IJB-A, a government-supported facial recognition benchmark, and Adience, a widely used benchmark for age and gender classification. Researchers found that both datasets were heavily skewed towards lighter-skinned subjects. Approximately 79.6% of IJB-A subjects and 86.2% of Adience subjects were classified as lighter-skinned. [Proceedings of Machine Learning Research]

The imbalance became even more striking when gender and skin tone were examined together. In IJB-A, lighter-skinned men represented nearly 60% of subjects, while darker-skinned women accounted for only about 4.4%. Adience also contained very small proportions of some darker-skinned subgroups. [Computer Science Classes]
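An intersectional composition check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the study's own tooling: the record fields `skin_type` and `gender`, the function name `subgroup_shares` and the ten toy records are all assumptions made up for the sketch.

```python
from collections import Counter


def subgroup_shares(records):
    """Return the share of each (skin_type, gender) pair among the records."""
    counts = Counter((r["skin_type"], r["gender"]) for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Toy, made-up records (not real benchmark data):
# 6 lighter male, 2 lighter female, 1 darker male, 1 darker female.
records = (
    [{"skin_type": "lighter", "gender": "male"}] * 6
    + [{"skin_type": "lighter", "gender": "female"}] * 2
    + [{"skin_type": "darker", "gender": "male"}]
    + [{"skin_type": "darker", "gender": "female"}]
)

shares = subgroup_shares(records)

# List the smallest subgroups first, since those are the easiest to overlook.
for group, share in sorted(shares.items(), key=lambda item: item[1]):
    print(group, f"{share:.0%}")
```

Counting gender and skin type jointly, rather than one attribute at a time, is what makes the smallest intersections visible.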

These numbers mattered because benchmark influence extends beyond the datasets themselves. Researchers trained and evaluated systems against benchmarks that implicitly treated certain faces as typical and others as marginal. When a subgroup represents only a small fraction of test cases, its failures contribute relatively little to the final score. A model can therefore look dependable even when it struggles with that subgroup. [Proceedings of Machine Learning Research]

Why Small Subgroups Were Easy to Miss

The key issue was not merely underrepresentation but statistical invisibility.

Imagine a benchmark in which a subgroup constitutes only a few percent of all examples. Even if performance on that subgroup is poor, the impact on the overall accuracy figure may be small. Developers focusing on aggregate scores could conclude that a system was ready for deployment. Investors, customers and journalists reviewing the published metrics might reach the same conclusion. [Proceedings of Machine Learning Research]
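The arithmetic behind this can be made concrete. In the sketch below, overall accuracy is treated as the share-weighted average of per-group accuracy. The 4.4% share echoes the IJB-A figure cited earlier; the two accuracy values (99% and 65%) are hypothetical numbers chosen only to illustrate the effect, and the function name `overall_accuracy` is an assumption of this example.

```python
def overall_accuracy(groups):
    """Overall accuracy as the share-weighted average of per-group accuracy.

    `groups` maps a group name to a (share_of_benchmark, accuracy) pair.
    """
    return sum(share * accuracy for share, accuracy in groups.values())


# Shares sum to 1. The accuracy values are hypothetical, not measured results.
groups = {
    "all other subjects": (0.956, 0.99),
    "small subgroup": (0.044, 0.65),
}

overall = overall_accuracy(groups)
print(f"overall accuracy: {overall:.3f}")  # prints "overall accuracy: 0.975"
```

Under these assumed numbers the headline score stays at about 97.5% even though roughly one in three predictions for the small subgroup is wrong.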

This created several reinforcing effects:

  • Aggregate metrics dominated reporting. Many evaluations highlighted a single accuracy number rather than performance broken down by demographic group. [Proceedings of Machine Learning Research]
  • Benchmark success became a proxy for reliability. Strong benchmark performance was often interpreted as evidence of broad real-world effectiveness. [cs4fn]
  • Error patterns remained hidden. When subgroup sizes were small, systematic failures could remain undetected until deployment or specialised audits. [Proceedings of Machine Learning Research]
  • Research incentives favoured benchmark optimisation. Teams frequently focused on improving benchmark scores because those scores influenced publication and adoption decisions. If benchmarks underrepresented certain groups, incentives to improve performance for those groups were weaker. [Proceedings of Machine Learning Research]

The result was a feedback loop. Benchmarks suggested that systems were reliable, that apparent reliability increased trust, and increased trust reduced pressure to investigate who might be experiencing higher error rates.

The Gender Shades Challenge to Existing Trust

The significance of Gender Shades was not simply that it found errors. Researchers changed the evaluation method itself by introducing a benchmark designed to provide more balanced representation across gender and skin-tone groups. [Proceedings of Machine Learning Research]

Using this more balanced approach, the study found large disparities that earlier benchmark practices had obscured. Darker-skinned women experienced the highest error rates, reaching as high as 34.7% in the commercial gender classifiers tested, while the maximum error rate for lighter-skinned men was 0.8%. [Proceedings of Machine Learning Research]

These findings challenged a widely held assumption: that strong benchmark performance automatically implied equitable performance. The systems had not suddenly become worse. Instead, the measurement framework had become better at revealing weaknesses. [Proceedings of Machine Learning Research]

This distinction is important for understanding AI trust. Trust can be misplaced not because evaluations are absent, but because evaluations are incomplete.

How Balanced Benchmarks Improve Scrutiny

Balanced benchmarks change what researchers are able to see.

The Pilot Parliaments Benchmark (PPB), introduced alongside Gender Shades, was designed to provide much more even representation across lighter-skinned and darker-skinned subjects as well as across gender categories. Compared with IJB-A and Adience, it contained substantially stronger representation of groups that had previously been underrepresented. [Proceedings of Machine Learning Research]

More balanced benchmarks improve scrutiny in several ways:

  • They expose disparities earlier. Problems can be identified before systems are widely deployed. [Proceedings of Machine Learning Research]
  • They encourage subgroup reporting. Researchers become more likely to publish results for different demographic groups instead of relying solely on aggregate metrics. [Proceedings of Machine Learning Research]
  • They improve accountability. Companies cannot rely on a single impressive score if detailed evaluations reveal large performance gaps. [Just Tech]
  • They redefine success. A system is judged not only by average accuracy but also by whether performance is consistent across populations. [arXiv]
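Subgroup reporting of the kind described in the list above can be sketched as follows. This is an illustrative outline under stated assumptions, not the Gender Shades evaluation code: the function names `accuracy_by_group` and `summarize`, the neutral group labels and the ten toy predictions are all invented for the example.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Per-group accuracy from (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in examples:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


def summarize(examples):
    """Report the headline score alongside the per-group breakdown."""
    per_group = accuracy_by_group(examples)
    overall = sum(truth == predicted for _, truth, predicted in examples) / len(examples)
    worst = min(per_group, key=per_group.get)
    best = max(per_group, key=per_group.get)
    return {
        "overall": overall,
        "per_group": per_group,
        "worst_group": worst,
        "gap": per_group[best] - per_group[worst],
    }


# Toy, made-up predictions: group_a has 8 examples (all correct),
# group_b has 2 examples (1 correct).
examples = [("group_a", "f", "f")] * 8 + [("group_b", "f", "f"), ("group_b", "f", "m")]

report = summarize(examples)
print(report)
```

In this toy case the headline accuracy is 90%, while the breakdown shows one group at 100% and the other at 50%. Publishing the worst-group figure and the gap next to the average is one simple way to make such a difference hard to overlook.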

The broader lesson extends beyond facial analysis. Benchmarks are not neutral scoreboards. They help determine which failures are visible and which remain hidden. In the case of face datasets and darker-skinned women, unbalanced benchmarks shaped trust by making some weaknesses difficult to detect. More balanced benchmarks did not create those weaknesses; they revealed them. [Proceedings of Machine Learning Research]

Endnotes

  1. Source: cs4fn.blog
    Title: the gender shades audit
    Link: https://cs4fn.blog/2023/06/05/the-gender-shades-audit/
    Source snippet

    The gender shades audit – cs4fn5 Jun 2023 — Joy Buolamwini and Timnit Gebru tested three different commercial systems and found that...

  2. Source: arxiv.org
    Link: https://arxiv.org/html/2502.02309v1
    Source snippet

    Review of Demographic Bias in Face Recognition4 Feb 2025 — IJB-A, Adience, GN, ST, Demonstrated lowest classifier performance for da...

  3. Source: proceedings.mlr.press
    Link: https://proceedings.mlr.press/v81/buolamwini18a
    Source snippet

    Proceedings of Machine Learning Research. Gender Shades: Intersectional Accuracy Disparities in... January 21, 2018, by J Buolamwini · 201...

    Published: January 21, 2018

  4. Source: classes.cs.uchicago.edu
    Link: https://www.classes.cs.uchicago.edu/archive/2020/winter/20370-1/readings/gendershadesAIbias.pdf
    Source snippet

    Computer Science ClassesGender Shades: Intersectional Accuracy Disparities in...by J Buolamwini · Cited by 11378 — Adience has the most...

  5. Source: just-tech.ssrc.org
    Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
    Source snippet

    Just TechGender Shades: Intersectional Accuracy Disparities in...We evaluate 3 commercial gender classification systems using our datase...

  6. Source: proceedings.mlr.press
    Title: Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
    Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
    Source snippet

    Shades: Intersectional Accuracy Disparities in...by J Buolamwini · 2018 · Cited by 11391 — Only 4.4% of subjects in Adience are darker-s...

Additional References

  1. Source: gendershades.org
    Link: https://gendershades.org/overview.html
    Source snippet

    Gender ShadesThe Gender Shades project evaluates the accuracy of AI powered gender classification products. This evaluation focuses on ge...

  2. Source: ars.electronica.art
    Link: https://ars.electronica.art/outofthebox/en/gender-shades/
    Source snippet

    Shades – Out of the BoxThe study reveals that popular applications that are already part of the programming display obvious discriminatio...

  3. Source: rrapp.spia.princeton.edu
    Link: https://rrapp.spia.princeton.edu/algorithmic-bias-in-facial-recognition-technology-on-the-basis-of-gender-and-skin-tone/
    Source snippet

    13 Oct 2020 — Researchers identify discrepancies in classification of gender and skin tone by facial recognition technology indicati...

  4. Source: researchgate.net
    Link: https://www.researchgate.net/publication/323722163_Gender_shades_intersectional_phenotypic_and_demographic_evaluation_of_face_datasets_and_gender_classifiers
    Source snippet

    IJB-A includes only 24.6% female and 4.4% darker female, and features 59.4% lighter...Read more...

  5. Source: semanticscholar.org
    Link: https://www.semanticscholar.org/paper/Gender-shades-%3A-intersectional-phenotypic-and-of-Buolamwini/a73bc5398c1ecf9ab8c755ad6af4d7e4774ca7ec
    Source snippet

    group. This thesis (1) characterizes the gender and skin type distribution of IJB-A, a government facial recognition benchmark, and Adien...

  6. Source: academia.edu
    Link: https://www.academia.edu/117322358/Gender_Shades_Intersectional_Accuracy_Disparities_in_Commercial_Gender_Classification
    Source snippet

    Gender Shades: Intersectional Accuracy Disparities in...Existing datasets like IJB-A and Adience are skewed towards lighter-skinned subj...

  7. Source: youtube.com
    Title: Dr. Joy Buolamwini reflects on decoding algorithmic bias and the future of AI
    Link: https://www.youtube.com/watch?v=6n3zvya2lHs
    Source snippet

    AJL Gender Shades 5th Anniversary Celebration...

  8. Source: youtube.com
    Title: The Dangers of Supremely White Data and The Coded Gaze
    Link: https://www.youtube.com/watch?v=ZSJXKoD6mA8
    Source snippet

    Joy Buolamwini and Sam Altman | Unmasking the Future of AI...

  9. Source: youtube.com
    Title: Joy Buolamwini and Sam Altman | Unmasking the Future of AI
    Link: https://www.youtube.com/watch?v=BpOi5Icizjc
    Source snippet

    A Conversation with Dr. Joy Buolamwini | SXSW 2024...

  10. Source: youtube.com
    Title: AJL Gender Shades 5th Anniversary Celebration
    Link: https://www.youtube.com/watch?v=8JSxbZyivuE
    Source snippet

    The Dangers of Supremely White Data and The Coded Gaze...
