Within Face Bias
When Benchmarks Decide Who Counts
Skewed benchmark datasets made facial-analysis tools look more dependable than they were for underrepresented groups.
On this page
- What IJB A and Adience overrepresented
- Why small subgroups were easy to miss
- How balanced benchmarks improve scrutiny
Page outline Jump by section
Introduction
Benchmarks do more than measure artificial intelligence systems. They shape which systems are trusted, funded, deployed and improved. In facial analysis, some of the most influential benchmarks gave the impression that algorithms were performing reliably across populations when, in reality, important groups were barely represented in the tests. As a result, failures affecting darker-skinned women often remained hidden behind strong overall accuracy scores. The problem was not simply that datasets were unbalanced; it was that the benchmarks used to judge success were unbalanced as well. When the tests themselves overlooked certain populations, developers, customers and regulators had fewer signals that something was wrong. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
Understanding this dynamic helps explain why concerns about facial-analysis bias emerged relatively late despite years of reported progress in the field. Benchmark design influenced what researchers noticed, what companies celebrated and what users came to trust. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
When Benchmarks Decide Who Counts
A benchmark is a standard test dataset used to compare AI systems. Researchers often rely on benchmark scores as evidence that a model works well. High scores can influence scientific publications, commercial adoption and public confidence.
The difficulty arises when benchmark populations differ substantially from the populations that systems will encounter in the real world. If a benchmark contains mostly lighter-skinned faces, a model can achieve impressive overall results by performing well on those faces while performing poorly on others. Because benchmark results are usually summarised into a single headline number, subgroup failures may receive little attention. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
In this way, trust becomes tied not only to algorithm quality but also to benchmark composition. A benchmark effectively determines whose experiences count when accuracy claims are made.
What IJB-A and Adience Overrepresented
The Gender Shades study examined two influential facial-analysis benchmarks: IJB-A, a government-supported facial recognition benchmark, and Adience, a widely used benchmark for age and gender classification. Researchers found that both datasets were heavily skewed towards lighter-skinned subjects. Approximately 79.6% of IJB-A subjects and 86.2% of Adience subjects were classified as lighter-skinned. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
The imbalance became even more striking when gender and skin tone were examined together. In IJB-A, lighter-skinned men represented nearly 60% of subjects, while darker-skinned women accounted for only about 4.4%. Adience also contained very small proportions of some darker-skinned subgroups. [Computer Science Classes]classes.cs.uchicago.eduComputer Science ClassesGender Shades: Intersectional Accuracy Disparities in…by J Buolamwini · Cited by 11378 — Adience has the most…
These numbers mattered because benchmark influence extends beyond the datasets themselves. Researchers trained and evaluated systems against benchmarks that implicitly treated certain faces as typical and others as marginal. When a subgroup represents only a small fraction of test cases, its failures contribute relatively little to the final score. A model can therefore look dependable even when it struggles with that subgroup. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
Why Small Subgroups Were Easy to Miss
The key issue was not merely underrepresentation but statistical invisibility.
Imagine a benchmark in which a subgroup constitutes only a few percent of all examples. Even if performance on that subgroup is poor, the impact on the overall accuracy figure may be small. Developers focusing on aggregate scores could conclude that a system was ready for deployment. Investors, customers and journalists reviewing the published metrics might reach the same conclusion. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
This created several reinforcing effects:
- Aggregate metrics dominated reporting. Many evaluations highlighted a single accuracy number rather than performance broken down by demographic group. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
- Benchmark success became a proxy for reliability. Strong benchmark performance was often interpreted as evidence of broad real-world effectiveness. [cs4fn]cs4fn.blogthe gender shades auditThe gender shades audit – cs4fn5 Jun 2023 — Joy Buolamwini and Timnit Gebru tested three different commercial systems and found that…
- Error patterns remained hidden. When subgroup sizes were small, systematic failures could remain undetected until deployment or specialised audits. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
- Research incentives favoured benchmark optimisation. Teams frequently focused on improving benchmark scores because those scores influenced publication and adoption decisions. If benchmarks underrepresented certain groups, incentives to improve performance for those groups were weaker. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
The result was a feedback loop. Benchmarks suggested systems were reliable, that reliability increased trust, and increased trust reduced pressure to investigate who might be experiencing higher error rates.
The Gender Shades Challenge to Existing Trust
The significance of Gender Shades was not simply that it found errors. Researchers changed the evaluation method itself by introducing a benchmark designed to provide more balanced representation across gender and skin-tone groups. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
Using this more balanced approach, the study found large disparities that earlier benchmark practices had obscured. Darker-skinned women experienced the highest error rates, reaching as high as 34.7% in some commercial systems, while lighter-skinned men experienced error rates as low as 0.8%. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
These findings challenged a widely held assumption: that strong benchmark performance automatically implied equitable performance. The systems had not suddenly become worse. Instead, the measurement framework had become better at revealing weaknesses. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
This distinction is important for understanding AI trust. Trust can be misplaced not because evaluations are absent, but because evaluations are incomplete.
How Balanced Benchmarks Improve Scrutiny
Balanced benchmarks change what researchers are able to see.
The Pilot Parliaments Benchmark (PPB), introduced alongside Gender Shades, was designed to provide much more even representation across lighter-skinned and darker-skinned subjects as well as across gender categories. Compared with IJB-A and Adience, it contained substantially stronger representation of groups that had previously been underrepresented. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
More balanced benchmarks improve scrutiny in several ways:
- They expose disparities earlier. Problems can be identified before systems are widely deployed. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
- They encourage subgroup reporting. Researchers become more likely to publish results for different demographic groups instead of relying solely on aggregate metrics. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
- They improve accountability. Companies cannot rely on a single impressive score if detailed evaluations reveal large performance gaps. [Just Tech]just-tech.ssrc.orgJust TechGender Shades: Intersectional Accuracy Disparities in…We evaluate 3 commercial gender classification systems using our datase…
- They redefine success. A system is judged not only by average accuracy but also by whether performance is consistent across populations. [arXiv]arxiv.orgReview of Demographic Bias in Face Recognition4 Feb 2025 — IJB-A, Adience, GN, ST, Demonstrated lowest classifier performance for da…
The broader lesson extends beyond facial analysis. Benchmarks are not neutral scoreboards. They help determine which failures are visible and which remain hidden. In the case of face datasets and darker-skinned women, unbalanced benchmarks shaped trust by making some weaknesses difficult to detect. More balanced benchmarks did not create those weaknesses; they revealed them. [Proceedings of Machine Learning Research]proceedings.mlr.pressProceedings of Machine Learning ResearchGender Shades: Intersectional Accuracy Disparities in…January 21, 2018 — by J Buolamwini · 201…
Endnotes
-
Source: cs4fn.blog
Title: the gender shades audit
Link: https://cs4fn.blog/2023/06/05/the-gender-shades-audit/Source snippet
The gender shades audit – cs4fn5 Jun 2023 — Joy Buolamwini and Timnit Gebru tested three different commercial systems and found that...
-
Source: arxiv.org
Link: https://arxiv.org/html/2502.02309v1Source snippet
Review of Demographic Bias in Face Recognition4 Feb 2025 — IJB-A, Adience, GN, ST, Demonstrated lowest classifier performance for da...
-
Source: proceedings.mlr.press
Link: https://proceedings.mlr.press/v81/buolamwini18aSource snippet
Proceedings of [Machine Learning]({{ 'machine-learning/' | relative_url }}) ResearchGender Shades: Intersectional Accuracy Disparities in...January 21, 2018 — by J Buolamwini · 201...
Published: January 21, 2018
-
Source: classes.cs.uchicago.edu
Link: https://www.classes.cs.uchicago.edu/archive/2020/winter/20370-1/readings/gendershadesAIbias.pdfSource snippet
Computer Science ClassesGender Shades: Intersectional Accuracy Disparities in...by J Buolamwini · Cited by 11378 — Adience has the most...
-
Source: just-tech.ssrc.org
Link: https://just-tech.ssrc.org/citation/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/Source snippet
Just TechGender Shades: Intersectional Accuracy Disparities in...We evaluate 3 commercial gender classification systems using our datase...
-
Source: proceedings.mlr.press
Title: the world, the categorizations are fairly coarse. Nonetheless,
Link: https://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdfSource snippet
Shades: Intersectional Accuracy Disparities in...by J Buolamwini · 2018 · Cited by 11391 — Only 4.4% of subjects in Adience are darker-s...
Additional References
-
Source: gendershades.org
Link: https://gendershades.org/overview.htmlSource snippet
Gender ShadesThe Gender Shades project evaluates the accuracy of AI powered gender classification products. This evaluation focuses on ge...
-
Source: ars.electronica.art
Link: https://ars.electronica.art/outofthebox/en/gender-shades/Source snippet
Shades – Out of the BoxThe study reveals that popular applications that are already part of the programming display obvious discriminatio...
-
Source: rrapp.spia.princeton.edu
Link: https://rrapp.spia.princeton.edu/algorithmic-bias-in-facial-recognition-technology-on-the-basis-of-gender-and-skin-tone/Source snippet
13 Oct 2020 — Researchers identify discrepancies in classification of gender and skin tone by facial recognition technology indicati...
-
Source: researchgate.net
Link: https://www.researchgate.net/publication/323722163_Gender_shades_intersectional_phenotypic_and_demographic_evaluation_of_face_datasets_and_gender_classifiersSource snippet
IJB-A includes only 24.6% female and 4.4% darker female, and features 59.4% lighter...Read more...
-
Source: semanticscholar.org
Link: https://www.semanticscholar.org/paper/Gender-shades-%3A-intersectional-phenotypic-and-of-Buolamwini/a73bc5398c1ecf9ab8c755ad6af4d7e4774ca7ecSource snippet
group. This thesis (1) characterizes the gender and skin type distribution of IJB-A, a government facial recognition benchmark, and Adien...
-
Source: academia.edu
Link: https://www.academia.edu/117322358/Gender_Shades_Intersectional_Accuracy_Disparities_in_Commercial_Gender_ClassificationSource snippet
Gender Shades: Intersectional Accuracy Disparities in...Existing datasets like IJB-A and Adience are skewed towards lighter-skinned subj...
-
Source: youtube.com
Title: Dr. Joy Buolamwini reflects on [decoding]({{ ‘decoding/’ | relative_url }}) algorithmic bias and the future of AI
Link: https://www.youtube.com/watch?v=6n3zvya2lHsSource snippet
AJL Gender Shades 5th Anniversary Celebration...
-
Source: youtube.com
Title: The Dangers of Supremely White Data and The Coded Gaze
Link: https://www.youtube.com/watch?v=ZSJXKoD6mA8Source snippet
Joy Buolamwini and Sam Altman | Unmasking the Future of AI...
-
Source: youtube.com
Title: Joy Buolamwini and Sam Altman | Unmasking the Future of AI
Link: https://www.youtube.com/watch?v=BpOi5IcizjcSource snippet
A Conversation with Dr. Joy Buolamwini | SXSW 2024...
-
Source: youtube.com
Title: AJL Gender Shades 5th Anniversary Celebration
Link: https://www.youtube.com/watch?v=8JSxbZyivuESource snippet
The Dangers of Supremely White Data and The Coded Gaze...
Topic Tree



