Why AI misses rare but important cases

Introduction

Artificial intelligence systems often learn best from patterns they see repeatedly. When a training dataset contains many examples of one category but very few of another, the result is known as class imbalance. In such datasets, a model may become highly accurate on common cases while consistently missing the rare cases that matter most. This is a widespread challenge in AI because many real-world problems are naturally imbalanced: fraudulent transactions are rarer than legitimate ones, serious diseases are rarer than healthy cases, and safety-critical failures are rarer than normal operation. As a result, a model can appear successful overall while performing poorly where mistakes carry the greatest consequences. [Google for Developers]

How frequency shapes what models notice

Machine-learning models learn from evidence. During training, every example contributes information about which patterns should be associated with which outcomes. When one class dominates the dataset, it contributes far more training signals than the minority class.

Imagine a dataset containing 99,000 normal transactions and 1,000 fraudulent transactions. The model repeatedly encounters normal behaviour and receives strong feedback about how to recognise it. Fraudulent behaviour appears much less often, giving the model fewer opportunities to learn the distinctive characteristics of fraud. Over time, the model may become excellent at recognising the majority class while developing only a weak understanding of the minority class. [Google for Developers]

This imbalance affects how the model allocates its limited learning capacity. Optimisation algorithms generally focus on reducing overall error. Since mistakes on the majority class occur more frequently, correcting those mistakes often produces the largest improvement in the training objective. The rare class can therefore receive less attention during learning even when it is the most important category from a human perspective. [arXiv]
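A rough way to see why the majority class dominates is to count how many errors each class contributes to an objective that treats every example equally. The sketch below uses the class counts from the transaction example; the function name and the 10% error rates are assumptions chosen only for illustration.

```python
def error_share_by_class(class_counts, error_rates):
    """Return each class's share of the total number of training errors.

    class_counts: mapping of class name -> number of training examples
    error_rates:  mapping of class name -> fraction currently misclassified
    """
    errors = {name: class_counts[name] * error_rates[name] for name in class_counts}
    total = sum(errors.values())
    return {name: errors[name] / total for name in errors}


# Counts from the transaction example; the 10% error rates are assumed.
counts = {"normal": 99_000, "fraud": 1_000}
rates = {"normal": 0.10, "fraud": 0.10}

shares = error_share_by_class(counts, rates)
print(shares)  # normal accounts for about 99% of the errors, fraud about 1%
```

Even when the model is equally bad at both classes, almost all of the reducible error sits in the majority class, so that is where an error-minimising procedure gains the most.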

A second problem is variation within the rare class. Common categories often contain many examples showing different conditions, environments, and edge cases. Rare categories may contain only a small sample of possible situations. The model therefore learns a narrower picture of what the minority class looks like and may struggle when confronted with new variations. [arXiv]

Why accuracy can hide weak rare-case performance

One of the most misunderstood effects of class imbalance is that a model can achieve impressive accuracy while effectively failing at the task people care about.

Consider a dataset where only 1% of cases belong to the rare class. A model that simply predicts the majority class every time would be correct 99% of the time. The reported accuracy would look excellent despite the model never identifying a single rare example. Researchers often describe this as the accuracy paradox. [Google for Developers, SAS Support]
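The 1% scenario can be reproduced in a few lines. This is a minimal sketch with made-up labels, not a real model: the "predictions" are simply the majority label repeated for every case.

```python
# 1,000 hypothetical cases, 1% of which belong to the rare class (label 1).
labels = [1] * 10 + [0] * 990

# A "model" that always predicts the majority class (label 0).
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

rare_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
rare_total = sum(labels)
rare_recall = rare_found / rare_total

print(f"accuracy: {accuracy:.2%}")        # accuracy: 99.00%
print(f"rare recall: {rare_recall:.2%}")  # rare recall: 0.00%
```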

This problem appears in many practical applications:

A disease-screening system may correctly identify most healthy patients while missing people who actually have the disease.
A fraud-detection system may approve nearly all legitimate transactions but fail to stop fraudulent ones.
A manufacturing inspection system may pass almost every product while overlooking the defects it was designed to find.

In each case, overall accuracy can remain high even though performance on the rare class is poor. For this reason, AI practitioners often rely on additional measures such as recall (the share of true rare cases that the model finds) and precision (the share of cases flagged as rare that really are rare). For highly imbalanced datasets, these metrics provide a clearer picture than accuracy alone. [Google for Developers, Encord]
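Both measures can be computed directly from counts of true positives, false positives, and false negatives. The sketch below uses hypothetical screening counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the sources prescribe.

```python
def precision(true_pos, false_pos):
    """Fraction of cases flagged as rare that really are rare."""
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


def recall(true_pos, false_neg):
    """Fraction of truly rare cases that the model flagged."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0


# Hypothetical screening results: 100 rare cases among 10,000.
tp, fp, fn, tn = 60, 140, 40, 9_760

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(f"accuracy:  {accuracy:.3f}")           # accuracy:  0.982
print(f"precision: {precision(tp, fp):.3f}")  # precision: 0.300
print(f"recall:    {recall(tp, fn):.3f}")     # recall:    0.600
```

With these assumed counts, accuracy is above 98% even though the model misses 40 of the 100 rare cases and most of its alerts are false alarms.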

The distinction matters because the cost of mistakes is rarely distributed evenly. Missing a rare but dangerous event may be far more consequential than incorrectly flagging a common event. A model optimised only for overall accuracy can therefore appear successful while creating significant real-world risks. [Google for Developers]
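One way to make uneven costs concrete is to price each kind of mistake and compare models on total cost instead of accuracy. Everything in this sketch is hypothetical: the two models, their error counts, and the assumption that a missed rare event costs 100 times as much as a false alarm.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def total_cost(fp, fn, cost_false_alarm, cost_miss):
    """Total cost of errors when misses and false alarms are priced differently."""
    return fp * cost_false_alarm + fn * cost_miss


# Assumed costs: a missed rare event is 100x as costly as a false alarm.
COST_FALSE_ALARM = 5
COST_MISS = 500

# Two hypothetical models on 10,000 cases, 100 of them rare.
always_majority = {"tp": 0, "fp": 0, "fn": 100, "tn": 9_900}
rare_aware = {"tp": 60, "fp": 140, "fn": 40, "tn": 9_760}

for name, m in [("always_majority", always_majority), ("rare_aware", rare_aware)]:
    acc = accuracy(**m)
    cost = total_cost(m["fp"], m["fn"], COST_FALSE_ALARM, COST_MISS)
    print(f"{name}: accuracy={acc:.3f}, cost={cost}")
# always_majority: accuracy=0.990, cost=50000
# rare_aware: accuracy=0.982, cost=20700
```

Under these assumed prices, the model with the lower accuracy is the cheaper one to deploy, which is the pattern the paragraph above describes.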

Why rare examples are often the most important

Class imbalance becomes especially significant because minority classes frequently correspond to the outcomes humans care about most.

In medical diagnosis, the rare cases are often the patients who need urgent treatment. In cybersecurity, the rare cases are the attacks. In aviation safety, the rare cases are the failures. In financial systems, the rare cases may be money laundering or fraud.

From a purely statistical perspective, these events contribute little to overall accuracy because they occur infrequently. From a practical perspective, they may be the entire reason the AI system exists. This mismatch between statistical frequency and human importance is one of the central challenges of dataset design. [Google for Developers]

Class imbalance can also create fairness concerns. If certain populations or situations are underrepresented in training data, the model may perform worse for those groups because it has seen fewer examples from them. Standards and research on AI bias repeatedly identify representation problems in training data as a major source of uneven model performance. [NIST Publications, UNECE]

Ways dataset design can protect rare examples

Because class imbalance is common, machine-learning practitioners have developed several approaches to reduce its effects.

Collecting more minority-class examples

The most direct solution is often to gather additional examples of the rare class. More examples expose the model to a wider range of situations and improve its ability to generalise beyond the limited cases originally available. When feasible, improving representation at the data-collection stage is often preferable to relying solely on algorithmic fixes. [Google for Developers]

Rebalancing the training dataset

A common technique is to change the composition of the training data.

Oversampling increases the representation of minority-class examples.
Undersampling reduces the number of majority-class examples.

Both methods make rare examples more visible during training. Research on image-classification systems has shown that class imbalance can substantially harm performance and that rebalancing strategies often improve results, particularly oversampling approaches. [Google for Developers]
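The two rebalancing strategies can be sketched with the standard library alone. This is a minimal illustration of random oversampling and random undersampling; the function names and the placeholder rows are assumptions, and real pipelines would typically operate on feature vectors.

```python
import random


def oversample(minority, target_size, rng):
    """Add randomly chosen duplicates of minority examples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def undersample(majority, target_size, rng):
    """Keep a random subset of majority examples, drawn without replacement."""
    return rng.sample(majority, target_size)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Placeholder examples; real rows would be feature vectors.
majority = [("normal", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

oversampled = majority + oversample(minority, len(majority), rng)
undersampled = undersample(majority, len(minority), rng) + minority

print(len(oversampled), len(undersampled))  # 1980 20
```

The trade-off is visible in the sizes: oversampling keeps every majority example but repeats the same few minority rows many times, while undersampling balances the classes by discarding most of the majority data.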

Creating additional minority examples

In some domains, developers generate synthetic examples based on existing rare cases. These methods attempt to provide more training signals without requiring large-scale data collection. Research has found that carefully designed augmentation techniques can improve minority-class performance, though their effectiveness depends on the quality and realism of the generated examples. [arXiv]
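As a rough illustration of what generating a synthetic example can mean, the sketch below creates new points by interpolating between pairs of existing minority examples. This is a deliberately simplified stand-in in the spirit of SMOTE-style oversampling (which normally interpolates toward nearest neighbours); it is not the counterfactual method from the cited paper, and it only makes sense for numeric features. All names and values are hypothetical.

```python
import random


def interpolate(a, b, t):
    """Return the point a fraction t of the way from a to b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def synthesize(minority, n_new, rng):
    """Create n_new points, each lying between two randomly chosen minority points."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        synthetic.append(interpolate(a, b, rng.random()))
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical minority-class feature vectors (two numeric features each).
minority = [[1.0, 2.0], [1.5, 1.8], [0.9, 2.4], [1.2, 2.1]]

new_points = synthesize(minority, 5, rng)
print(len(new_points))  # 5
```

Because every synthetic point lies on a line between two real ones, the method cannot invent variation the original examples lack, which is one reason the realism of generated data matters.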

Measuring the right outcomes

Even a well-balanced training strategy can fail if evaluation focuses only on accuracy. Modern machine-learning practice therefore emphasises metrics that reveal performance on rare classes, including recall, precision, F1 score, and precision–recall analysis. These measures help developers detect situations where the model performs well overall but poorly on the cases that matter most. [Google for Developers, Encord]
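The F1 score is the harmonic mean of precision and recall, and precision–recall analysis looks at how the two trade off as the decision threshold moves. The sketch below computes both for a handful of hypothetical model scores; the scores, labels, thresholds, and function names are all assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as rare."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores (higher = more likely rare) and true labels.
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 1, 0, 0, 0, 0]

for threshold in (0.9, 0.5, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}, f1={f1_score(p, r):.2f}")
# threshold=0.9: precision=1.00, recall=0.33, f1=0.50
# threshold=0.5: precision=0.67, recall=0.67, f1=0.67
# threshold=0.25: precision=0.60, recall=1.00, f1=0.75
```

Lowering the threshold finds more of the rare cases at the price of more false alarms, which a single accuracy figure would not reveal.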

The key lesson about rare cases

Class imbalance shows that training data influences not only what a model learns, but also what it learns to ignore. When rare examples appear too infrequently, the model receives weaker evidence about them, develops less reliable decision boundaries around them, and may achieve impressive overall accuracy while failing on the very cases users care about most. Understanding this effect is essential for interpreting AI performance: a model that looks accurate on average is not necessarily good at detecting rare but important events. [Google for Developers]

Endnotes

Source: developers.google.com
Link: https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets
Source snippet
Google for DevelopersClass-imbalanced datasets | Machine LearningAug 28, 2025 — For example, the class-imbalanced dataset shown in Figure...
Source: developers.google.com
Title: accuracy precision recall
Link: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
Source snippet
Google for DevelopersClassification: Accuracy, recall, precision, and related metrics12 Jan 2026 — In an imbalanced dataset where the num...
Source: arxiv.org
Link: https://arxiv.org/abs/1710.05381
Source: arxiv.org
Title: arXiv Striking the Right Balance with Uncertainty
Link: https://arxiv.org/abs/1901.07590
Source snippet
Striking the Right Balance with UncertaintyJanuary 22, 2019...

Published: January 22, 2019
Source: arxiv.org
Title: arXiv Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
Link: https://arxiv.org/abs/2006.15766
Source: support.sas.com
Link: https://support.sas.com/resources/papers/proceedings17/0942-2017.pdf
Source snippet
SAS SupportPredictive Accuracy: A Misleading Performance Measure...ABSTRACT. The most commonly reported model evaluation metric is the a...
Source: encord.com
Title: Accuracy vs
Link: https://encord.com/blog/classification-metrics-accuracy-precision-recall/
Source snippet
Precision vs. Recall in Machine Learning23 Nov 2023 — Precision measures how often predictions for the positive class are correct. Recall...
Source: nvlpubs.nist.gov
Link: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
Source snippet
NIST PublicationsTowards a Standard for Identifying and Managing Bias in...by R Schwartz · 2022 · Cited by 808 — Systemic and implicit b...
Source: unece.org
Link: https://unece.org/sites/default/files/2025-10/Companion%20Paper%20on%20Fairness%20in%20Machine%20Learning_[Responsible
Source snippet
Fairness in Machine LearningRepresentation bias occurs when the training data used is not representative of the population the model will...
Source: arxiv.org
Link: https://arxiv.org/abs/2111.03516
Source snippet
Solving the Class Imbalance Problem Using a Counterfactual Method for Data AugmentationNovember 5, 2021...

Published: November 5, 2021
Source: google.com
Link: https://www.google.com/
Source snippet
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exac...
Source: developers.google.com
Title: crash course
Link: https://developers.google.com/machine-learning/crash-course
Source snippet
google.comGoogle's Machine Learning Crash CourseAn introduction to binary classification models, covering thresholding, confusion matrice...
Source: nist.gov
Link: https://www.nist.gov/
Source snippet
National Institute of Standards and TechnologyNIST promotes U.S. innovation and industrial competitiveness by advancing measurement scien...
Source: nist.gov
Title: There's More to AI Bias Than Biased Data, NIST Report Highlights
Link: https://www.nist.gov/news-events/news/2022/03/theres-more-ai-bias-biased-data-nist-report-highlights
Source snippet
There's More to AI Bias Than Biased Data, NIST Report...16 Mar 2022 — The NIST report acknowledges that a great deal of AI bias stems fr...
Source: encord.com
Link: https://encord.com/blog/an-introduction-to-balanced-and-imbalanced-datasets-in-machine-learning/
Source snippet
Balanced and Imbalanced Datasets in Machine Learning...Nov 11, 2022 — Balancing a dataset makes training a model easier because it helps...
Source: dictionary.cambridge.org
Link: https://dictionary.cambridge.org/dictionary/english-chinese-traditional/classification
Source snippet
in Traditional Chinese - Cambridge Dictionarythe act or process of dividing things into groups according to their type 將...分類,將...歸類;把...

Additional References

Source: researchgate.net
Link: https://www.researchgate.net/publication/340894484_Addressing_Accuracy_Paradox_Using_Enhanched_Weighted_Performance_Metric_in_Machine_Learning
Source snippet
Addressing Accuracy Paradox Using Enhanched Weighted...Jan 26, 2026 — This accuracy paradox [105] occurred for highly imbalanced dataset...
Source: linkedin.com
Link: https://www.linkedin.com/posts/cornellius-yudha-wijaya_python-datascience-machinelearning-activity-7132185791467847680-bmlw
Source snippet
Cornellius Y.'s PostBalance Metrics: Balanced Accuracy: Fairness in imbalanced datasets. F1 Score: Harmonizes precision and recall. F-bet...
Source: medium.com
Link: https://medium.com/%40shreya_g/why-accuracy-fails-on-imbalanced-datasets-8cd21594137b
Source snippet
Why Accuracy Fails on Imbalanced DatasetsImbalanced datasets require careful evaluation. Accuracy as a metric alone can be dangerously mi...
Source: github.com
Link: https://github.com/litaotao/machine-learning-crash-course
Source: medium.com
Link: https://medium.com/%40boutnaru/the-artificial-intelligence-journey-accuracy-a8a3f292ae6f
Source: pmc.ncbi.nlm.nih.gov
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10741524/
Source snippet
by M Ghanem · 2023 · Cited by 63 — AUPRC is a valuable metric when working with imbalanced datasets as it considers precision and reca...
Source: youtube.com
Link: https://www.youtube.com/watch?v=JnlM4yLFNuo
Source snippet
Class Imbalance Explained | Why Your Model Looks Great and Still Fails - YouTube Class Imbalance Explained | Why Your Model Looks Great a...
Source: youtube.com
Link: https://www.youtube.com/watch?v=QM0sYbEQSkM
Source snippet
Machine Learning Crash Course: ClassificationClassification is a machine learning technique for predicting a class (or category)—for exam...
Source: studocu.com
Title: machine learning 125 pm classification accuracy recall precision metrics
Link: https://www.studocu.com/in/document/rajiv-gandhi-university-of-health-sciences/hospital-related-law/machine-learning-125-pm-classification-accuracy-recall-precision-metrics/145476246
Source snippet
Machine Learning 1:25 PM Classification: Accuracy, Recall...9 Oct 2024 — Explore essential metrics for evaluating machine learning model...
Source: pmc.ncbi.nlm.nih.gov
Title: PMC100% Classification Accuracy Considered Harmful
Link: https://pmc.ncbi.nlm.nih.gov/articles/PMC3888391/
Source snippet
nih.gov100% Classification Accuracy Considered Harmful - PMC - NIHby FJ Valverde-Albacete · 2014 · Cited by 322 — Despite optimizing clas...
