When Benchmarks Decide Who Counts

Introduction

Benchmarks do more than measure artificial intelligence systems. They shape which systems are trusted, funded, deployed and improved. In facial analysis, some of the most influential benchmarks gave the impression that algorithms were performing reliably across populations when, in reality, important groups were barely represented in the tests. As a result, failures affecting darker-skinned women often remained hidden behind strong overall accuracy scores. The problem was not simply that datasets were unbalanced; it was that the benchmarks used to judge success were unbalanced as well. When the tests themselves overlooked certain populations, developers, customers and regulators had fewer signals that something was wrong. [Proceedings of Machine Learning Research]

Understanding this dynamic helps explain why concerns about facial-analysis bias emerged relatively late despite years of reported progress in the field. Benchmark design influenced what researchers noticed, what companies celebrated and what users came to trust. [Proceedings of Machine Learning Research]

When Benchmarks Decide Who Counts

A benchmark is a standard test dataset used to compare AI systems. Researchers often rely on benchmark scores as evidence that a model works well. High scores can influence scientific publications, commercial adoption and public confidence.

The difficulty arises when benchmark populations differ substantially from the populations that systems will encounter in the real world. If a benchmark contains mostly lighter-skinned faces, a model can achieve impressive overall results by performing well on those faces while performing poorly on others. Because benchmark results are usually summarised into a single headline number, subgroup failures may receive little attention. [Proceedings of Machine Learning Research]
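The arithmetic behind a headline number can be written down directly. As a sketch, assume the benchmark splits into subgroups g, each making up a share p_g of the test set and scoring accuracy a_g on it. These symbols are introduced here for illustration and are not notation taken from the Gender Shades paper.

```latex
A_{\text{overall}} = \sum_{g} p_g \, a_g , \qquad \sum_{g} p_g = 1
```

However low a subgroup's own accuracy a_g falls, it can move the headline figure by no more than its share p_g.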

In this way, trust becomes tied not only to algorithm quality but also to benchmark composition. A benchmark effectively determines whose experiences count when accuracy claims are made.

What IJB-A and Adience Overrepresented

The Gender Shades study examined two influential facial-analysis benchmarks: IJB-A, a government-supported facial recognition benchmark, and Adience, a widely used benchmark for age and gender classification. Researchers found that both datasets were heavily skewed towards lighter-skinned subjects. Approximately 79.6% of IJB-A subjects and 86.2% of Adience subjects were classified as lighter-skinned. [Proceedings of Machine Learning Research]

The imbalance became even more striking when gender and skin tone were examined together. In IJB-A, lighter-skinned men represented nearly 60% of subjects, while darker-skinned women accounted for only about 4.4%. Adience also contained very small proportions of some darker-skinned subgroups. [Computer Science Classes]

These numbers mattered because benchmark influence extends beyond the datasets themselves. Researchers trained and evaluated systems against benchmarks that implicitly treated certain faces as typical and others as marginal. When a subgroup represents only a small fraction of test cases, its failures contribute relatively little to the final score. A model can therefore look dependable even when it struggles with that subgroup. [Proceedings of Machine Learning Research]
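To make the weighting concrete, the sketch below combines subgroup shares with per-subgroup accuracies. The shares for lighter-skinned men (59.4%) and darker-skinned women (4.4%) are the IJB-A figures reported in the study. The other two shares are derived here by subtraction from the 79.6% lighter-skinned total and should be read as approximate. The accuracy values are hypothetical, chosen only for illustration, and are not results from any real system.

```python
# Subgroup shares of IJB-A subjects. Lighter men and darker women are the
# figures reported in Gender Shades; the other two are derived by subtraction
# from the 79.6% lighter-skinned total and should be read as approximate.
shares = {
    "lighter_men": 0.594,
    "lighter_women": 0.202,
    "darker_men": 0.160,
    "darker_women": 0.044,
}

# HYPOTHETICAL per-subgroup accuracies, invented for this illustration.
accuracy = {
    "lighter_men": 0.99,
    "lighter_women": 0.95,
    "darker_men": 0.93,
    "darker_women": 0.65,
}


def overall_accuracy(shares, accuracy):
    """Headline accuracy as the share-weighted average of subgroup accuracies."""
    return sum(shares[g] * accuracy[g] for g in shares)


headline = overall_accuracy(shares, accuracy)

# Error-rate points that each subgroup contributes to the headline error.
contribution = {g: shares[g] * (1 - accuracy[g]) for g in shares}

print(f"headline accuracy: {headline:.3f}")
for g, c in contribution.items():
    print(f"{g:>14}: {c:.4f} of overall error")
```

Under these assumed numbers the headline is about 95.7%, even though more than one in three darker-skinned women would be misclassified. Their subgroup adds only about 1.5 percentage points to the overall error.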

Why Small Subgroups Were Easy to Miss

The key issue was not merely underrepresentation but statistical invisibility.

Imagine a benchmark in which a subgroup constitutes only a few percent of all examples. Even if performance on that subgroup is poor, the impact on the overall accuracy figure may be small. Developers focusing on aggregate scores could conclude that a system was ready for deployment. Investors, customers and journalists reviewing the published metrics might reach the same conclusion. [Proceedings of Machine Learning Research]

This created several reinforcing effects:

Aggregate metrics dominated reporting. Many evaluations highlighted a single accuracy number rather than performance broken down by demographic group. [Proceedings of Machine Learning Research]
Benchmark success became a proxy for reliability. Strong benchmark performance was often interpreted as evidence of broad real-world effectiveness. [cs4fn]
Error patterns remained hidden. When subgroup sizes were small, systematic failures could remain undetected until deployment or specialised audits. [Proceedings of Machine Learning Research]
Research incentives favoured benchmark optimisation. Teams frequently focused on improving benchmark scores because those scores influenced publication and adoption decisions. If benchmarks underrepresented certain groups, incentives to improve performance for those groups were weaker. [Proceedings of Machine Learning Research]

The result was a feedback loop. Benchmarks suggested systems were reliable, that reliability increased trust, and increased trust reduced pressure to investigate who might be experiencing higher error rates.
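Reporting results by subgroup alongside the aggregate is one way to interrupt this loop. The following is a minimal sketch, not the evaluation code used in any published audit: the function name `accuracy_report`, the record layout and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return (overall accuracy, {group: (accuracy, n)}) for labelled predictions.

    Each record is a (group, true_label, predicted_label) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    per_group = {g: (correct[g] / total[g], total[g]) for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Toy data, invented for illustration: 95 examples from one group, 5 from another.
records = (
    [("group_a", "female", "female")] * 94
    + [("group_a", "female", "male")] * 1
    + [("group_b", "female", "female")] * 3
    + [("group_b", "female", "male")] * 2
)

overall, per_group = accuracy_report(records)
print(f"overall: {overall:.2f}")
for group, (acc, n) in sorted(per_group.items()):
    print(f"{group}: accuracy={acc:.2f} n={n}")
```

Printing the subgroup size n next to each accuracy matters: a figure based on five examples is weak evidence, which is itself a signal that the benchmark needs more coverage of that group.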

The Gender Shades Challenge to Existing Trust

The significance of Gender Shades was not simply that it found errors. Researchers changed the evaluation method itself by introducing a benchmark designed to provide more balanced representation across gender and skin-tone groups. [Proceedings of Machine Learning Research]

Using this more balanced approach, the study found large disparities that earlier benchmark practices had obscured. Darker-skinned women experienced the highest error rates, reaching up to 34.7% across the commercial systems tested, while the maximum error rate for lighter-skinned men was 0.8%. [Proceedings of Machine Learning Research]

These findings challenged a widely held assumption: that strong benchmark performance automatically implied equitable performance. The systems had not suddenly become worse. Instead, the measurement framework had become better at revealing weaknesses. [Proceedings of Machine Learning Research]

This distinction is important for understanding AI trust. Trust can be misplaced not because evaluations are absent, but because evaluations are incomplete.

How Balanced Benchmarks Improve Scrutiny

Balanced benchmarks change what researchers are able to see.

The Pilot Parliaments Benchmark (PPB), introduced alongside Gender Shades, was designed to provide much more even representation across lighter-skinned and darker-skinned subjects as well as across gender categories. Compared with IJB-A and Adience, it contained substantially stronger representation of groups that had previously been underrepresented. [Proceedings of Machine Learning Research]
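A benchmark's composition can be checked before any model is run. The sketch below shows one possible way to do it: it tallies subgroup shares from a list of labels and reports the ratio of the smallest share to the largest. The `balance_ratio` measure and both sets of counts are hypothetical, introduced here for illustration; they are not taken from PPB, IJB-A or Adience.

```python
from collections import Counter


def subgroup_shares(labels):
    """Map each subgroup label to its share of the benchmark."""
    counts = Counter(labels)
    n = sum(counts.values())
    return {group: count / n for group, count in counts.items()}


def balance_ratio(shares):
    """Smallest share divided by largest share; 1.0 means perfectly even."""
    return min(shares.values()) / max(shares.values())


def make_labels(counts):
    """Expand {group: count} into a flat list of labels, one per subject."""
    return [group for group, count in counts.items() for _ in range(count)]


# HYPOTHETICAL compositions, invented for illustration only.
skewed = make_labels({"lighter_men": 600, "lighter_women": 200,
                      "darker_men": 150, "darker_women": 50})
balanced = make_labels({"lighter_men": 260, "lighter_women": 240,
                        "darker_men": 250, "darker_women": 250})

for name, labels in [("skewed", skewed), ("balanced", balanced)]:
    shares = subgroup_shares(labels)
    print(f"{name}: ratio={balance_ratio(shares):.2f}")
```

A single ratio is a coarse summary. In practice the full table of shares is more informative, since it shows which subgroup is the small one.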

More balanced benchmarks improve scrutiny in several ways:

They expose disparities earlier. Problems can be identified before systems are widely deployed. [Proceedings of Machine Learning Research]
They encourage subgroup reporting. Researchers become more likely to publish results for different demographic groups instead of relying solely on aggregate metrics. [Proceedings of Machine Learning Research]
They improve accountability. Companies cannot rely on a single impressive score if detailed evaluations reveal large performance gaps. [Just Tech]
They redefine success. A system is judged not only by average accuracy but also by whether performance is consistent across populations. [arXiv]
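Judging a system by consistency as well as average accuracy can be expressed as an explicit pass/fail rule. This is a sketch under stated assumptions: the function `consistency_check`, the thresholds `min_accuracy` and `max_gap`, and the accuracy figures are all hypothetical and are not drawn from any standard or from the Gender Shades results.

```python
def consistency_check(per_group_accuracy, min_accuracy=0.95, max_gap=0.05):
    """Judge a system by its weakest subgroup as well as the spread between groups.

    min_accuracy and max_gap are HYPOTHETICAL thresholds for illustration.
    """
    worst_group = min(per_group_accuracy, key=per_group_accuracy.get)
    best_group = max(per_group_accuracy, key=per_group_accuracy.get)
    worst = per_group_accuracy[worst_group]
    gap = per_group_accuracy[best_group] - worst
    return {
        "worst_group": worst_group,
        "worst_accuracy": worst,
        "gap": gap,
        "passes": worst >= min_accuracy and gap <= max_gap,
    }


# HYPOTHETICAL accuracies, not results from any real system.
uneven = {"lighter_men": 0.99, "lighter_women": 0.93,
          "darker_men": 0.88, "darker_women": 0.65}
even = {"lighter_men": 0.97, "lighter_women": 0.96,
        "darker_men": 0.96, "darker_women": 0.95}

print(consistency_check(uneven))
print(consistency_check(even))
```

The rule fails the first system on its weakest subgroup regardless of how strong its best subgroup is, which is the shift in emphasis that balanced benchmarks make possible.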

The broader lesson extends beyond facial analysis. Benchmarks are not neutral scoreboards. They help determine which failures are visible and which remain hidden. In the case of face datasets and darker-skinned women, unbalanced benchmarks shaped trust by making some weaknesses difficult to detect. More balanced benchmarks did not create those weaknesses; they revealed them. [Proceedings of Machine Learning Research]

Endnotes

Source: cs4fn.blog
Title: The gender shades audit
Link: https://cs4fn.blog/2023/06/05/the-gender-shades-audit/
Source snippet
5 Jun 2023: Joy Buolamwini and Timnit Gebru tested three different commercial systems and found that...
Source: arxiv.org
Title: Review of Demographic Bias in Face Recognition
Link: https://arxiv.org/html/2502.02309v1
Source snippet
4 Feb 2025: IJB-A, Adience, GN, ST, Demonstrated lowest classifier performance for da...
Source: proceedings.mlr.press
Title: Gender Shades: Intersectional Accuracy Disparities in...
Link: https://proceedings.mlr.press/v81/buolamwini18a
Source snippet
Proceedings of Machine Learning Research, by J Buolamwini. Published: January 21, 2018
Source: classes.cs.uchicago.edu
Title: Gender Shades: Intersectional Accuracy Disparities in...
Link: https://www.classes.cs.uchicago.edu/archive/2020/winter/20370-1/readings/gendershadesAIbias.pdf
Source snippet
by J Buolamwini: Adience has the most...
Source: just-tech.ssrc.org
Title: Gender Shades: Intersectional Accuracy Disparities in...
Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/
Source snippet
We evaluate 3 commercial gender classification systems using our datase...
Source: proceedings.mlr.press
Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
Source snippet
by J Buolamwini, 2018: Only 4.4% of subjects in Adience are darker-s...

