Why AI misses rare but important cases
When rare examples appear too little in training data, a model may become accurate on common cases while missing important exceptions.
On this page
- How frequency shapes what models notice
- Why accuracy can hide weak rare-case performance
- Ways dataset design can protect rare examples
Introduction
Artificial intelligence systems often learn best from patterns they see repeatedly. When a training dataset contains many examples of one category but very few of another, the result is known as class imbalance. In such datasets, a model may become highly accurate on common cases while consistently missing the rare cases that matter most. This is a widespread challenge in AI because many real-world problems are naturally imbalanced: fraudulent transactions are rarer than legitimate ones, serious diseases are rarer than healthy cases, and safety-critical failures are rarer than normal operation. As a result, a model can appear successful overall while performing poorly where mistakes carry the greatest consequences. [Google for Developers]
How frequency shapes what models notice
Machine-learning models learn from evidence. During training, every example contributes information about which patterns should be associated with which outcomes. When one class dominates the dataset, it contributes far more training signals than the minority class.
Imagine a dataset containing 99,000 normal transactions and 1,000 fraudulent transactions, a fraud rate of 1%. The model repeatedly encounters normal behaviour and receives strong feedback about how to recognise it. Fraudulent behaviour appears much less often, giving the model fewer opportunities to learn the distinctive characteristics of fraud. Over time, the model may become excellent at recognising the majority class while developing only a weak understanding of the minority class. [Google for Developers]
This imbalance affects where the model's limited learning capacity goes. Optimisation algorithms generally reduce overall error, averaged across all training examples. Since majority-class examples make up most of that average, correcting mistakes on them often produces the largest improvement in the training objective. The rare class can therefore receive less attention during learning even when it is the most important category from a human perspective. [arXiv]
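A rough way to see this effect is to compute how much of the total training loss each class accounts for. The sketch below assumes the 99,000 / 1,000 split from the transaction example and a hypothetical untrained model that outputs a probability of 0.5 for every example, so each example carries the same log loss. The figures are illustrative, not measurements.

```python
import math

# Assumed class counts, taken from the transaction example.
counts = {"normal": 99_000, "fraud": 1_000}

# Hypothetical starting point: the model outputs probability 0.5 for every
# example, so each example has the same log loss regardless of its class.
per_example_loss = -math.log(0.5)

total_loss = sum(n * per_example_loss for n in counts.values())
loss_share = {cls: n * per_example_loss / total_loss for cls, n in counts.items()}

for cls, share in loss_share.items():
    print(f"{cls}: {share:.1%} of the total loss")
```

Under this assumption, 99% of the objective comes from the majority class, so an optimiser that reduces overall loss is rewarded mostly for improvements there.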
A second problem is variation within the rare class. Common categories often contain many examples showing different conditions, environments, and edge cases. Rare categories may contain only a small sample of possible situations. The model therefore learns a narrower picture of what the minority class looks like and may struggle when confronted with new variations. [arXiv]
Why accuracy can hide weak rare-case performance
One of the most misunderstood effects of class imbalance is that a model can achieve impressive accuracy while effectively failing at the task people care about.
Consider a dataset where only 1% of cases belong to the rare class. A model that simply predicts the majority class every time would be correct 99% of the time. The reported accuracy would look excellent despite the model never identifying a single rare example. Researchers often describe this as the accuracy paradox. [Google for Developers, SAS Support]
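The paradox is easy to reproduce. The following minimal sketch uses synthetic labels with a 1% rare class and a stand-in "model" that always predicts the majority class; both are assumptions chosen to mirror the example, not real data or a real classifier.

```python
# Synthetic labels: 1 marks the rare class, 0 the majority class (1% rare).
labels = [1] * 10 + [0] * 990

# A stand-in "model" that ignores its input and always predicts the majority class.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall on the rare class: the share of true rare cases the model found.
rare_found = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
rare_total = sum(labels)
recall = rare_found / rare_total

print(f"accuracy = {accuracy:.2%}, recall on rare class = {recall:.2%}")
```

Accuracy comes out at 99% while recall on the rare class is zero, which is the gap the accuracy figure conceals.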
This problem appears in many practical applications:
- A disease-screening system may correctly identify most healthy patients while missing people who actually have the disease.
- A fraud-detection system may approve nearly all legitimate transactions but fail to stop fraudulent ones.
- A manufacturing inspection system may pass almost every product while overlooking the defects it was designed to find.
In each case, overall accuracy can remain high even though performance on the rare class is poor. For this reason, AI practitioners often rely on additional measures such as recall (the share of true rare cases that the model finds) and precision (the share of cases flagged as rare that really are rare). For highly imbalanced datasets, these metrics provide a clearer picture than accuracy alone. [Google for Developers, Encord]
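Both measures can be computed directly from confusion-matrix counts. The sketch below is a plain-Python illustration; the function name `precision_recall` and the fraud-detector counts are hypothetical.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for the rare (positive) class."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    return precision, recall

# Hypothetical fraud detector on 10,000 transactions, 100 of them fraudulent:
# it flags 80 transactions, 60 of which are truly fraud.
total = 10_000
tp, fp, fn = 60, 20, 40
tn = total - tp - fp - fn

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / total

print(f"accuracy = {accuracy:.1%}, precision = {precision:.0%}, recall = {recall:.0%}")
```

With these made-up counts, accuracy is above 99% even though the detector misses 40% of the fraud, which is why precision and recall are reported alongside it.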
The distinction matters because the cost of mistakes is rarely distributed evenly. Missing a rare but dangerous event may be far more consequential than incorrectly flagging a common event. A model optimised only for overall accuracy can therefore appear successful while creating significant real-world risks. [Google for Developers]
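One way to make uneven costs concrete is to attach a price to each kind of mistake and compare models on total cost as well as accuracy. The sketch below does this with invented numbers: the two cost constants, the `evaluate` helper, and both sets of counts are assumptions for illustration only.

```python
COST_MISSED_RARE = 500   # hypothetical cost of missing one rare event
COST_FALSE_ALARM = 5     # hypothetical cost of one unnecessary alert

def evaluate(name, true_pos, false_pos, false_neg, total):
    """Print and return (accuracy, total cost) for one set of confusion counts."""
    true_neg = total - true_pos - false_pos - false_neg
    accuracy = (true_pos + true_neg) / total
    cost = false_neg * COST_MISSED_RARE + false_pos * COST_FALSE_ALARM
    print(f"{name}: accuracy = {accuracy:.1%}, total cost = {cost}")
    return accuracy, cost

# 10,000 cases, 100 of which are rare events (illustrative numbers).
always_common = evaluate("always predict common", 0, 0, 100, 10_000)
cautious = evaluate("flags more cases", 80, 300, 20, 10_000)
```

Under these assumed costs, the model with the lower accuracy is far cheaper in practice, because each missed rare event outweighs many false alarms.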
Why rare examples are often the most important
Class imbalance becomes especially significant because minority classes frequently correspond to the outcomes humans care about most.
In medical diagnosis, the rare cases are often the patients who need urgent treatment. In cybersecurity, the rare cases are the attacks. In aviation safety, the rare cases are the failures. In financial systems, the rare cases may be money laundering or fraud.
From a purely statistical perspective, these events contribute little to overall accuracy because they occur infrequently. From a practical perspective, they may be the entire reason the AI system exists. This mismatch between statistical frequency and human importance is one of the central challenges of dataset design. [Google for Developers]
Class imbalance can also create fairness concerns. If certain populations or situations are underrepresented in training data, the model may perform worse for those groups because it has seen fewer examples from them. Standards and research on AI bias repeatedly identify representation problems in training data as a major source of uneven model performance. [NIST Publications, UNECE]
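One way to surface this kind of gap is to report recall separately for each group instead of a single pooled figure. The sketch below uses made-up records and hypothetical group names "A" and "B"; it illustrates the bookkeeping only and is not a complete fairness audit.

```python
from collections import defaultdict

def recall_by_group(records):
    """Compute recall per group from (group, true_label, predicted_label) records."""
    found = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted_label in records:
        if true_label == 1:
            total[group] += 1
            if predicted_label == 1:
                found[group] += 1
    return {group: found[group] / total[group] for group in total}

# Invented data: group "A" is well represented, group "B" is not.
records = (
    [("A", 1, 1)] * 36 + [("A", 1, 0)] * 4
    + [("B", 1, 1)] * 2 + [("B", 1, 0)] * 3
)

per_group = recall_by_group(records)
pooled = sum(1 for _, y, p in records if y == 1 and p == 1) / len(records)

print(f"pooled recall = {pooled:.1%}")
for group, value in per_group.items():
    print(f"group {group}: recall = {value:.0%}")
```

In this invented case the pooled recall looks respectable because it is dominated by group "A", while recall for the smaller group "B" is much lower.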
Ways dataset design can protect rare examples
Because class imbalance is common, machine-learning practitioners have developed several approaches to reduce its effects.
Collecting more minority-class examples
The most direct solution is often to gather additional examples of the rare class. More examples expose the model to a wider range of situations and improve its ability to generalise beyond the limited cases originally available. When feasible, improving representation at the data-collection stage is often preferable to relying solely on algorithmic fixes. [Google for Developers]
Rebalancing the training dataset
A common technique is to change the composition of the training data.
- Oversampling increases the representation of minority-class examples.
- Undersampling reduces the number of majority-class examples.
Both methods make rare examples more visible during training. Research on image-classification systems has shown that class imbalance can substantially harm performance and that rebalancing strategies, oversampling in particular, often improve results. [Google for Developers]
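The two strategies can be sketched with the standard library alone. This is a minimal illustration of random oversampling and random undersampling, assuming the majority list is at least as long as the minority list; the function names and the toy data are hypothetical.

```python
import random

def random_oversample(majority, minority, rng):
    """Add randomly duplicated minority examples until both classes are the same size.

    Assumes the majority list is at least as long as the minority list.
    """
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra

def random_undersample(majority, minority, rng):
    """Keep a random subset of majority examples equal in size to the minority class."""
    kept = rng.sample(majority, len(minority))
    return kept + minority

rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("normal", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

oversampled = random_oversample(majority, minority, rng)
undersampled = random_undersample(majority, minority, rng)

print(len(oversampled), sum(1 for label, _ in oversampled if label == "fraud"))
print(len(undersampled), sum(1 for label, _ in undersampled if label == "fraud"))
```

Resampling of this kind is normally applied to the training split only, so that evaluation data keeps the class frequencies the model will meet in use.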
Creating additional minority examples
In some domains, developers generate synthetic examples based on existing rare cases. These methods attempt to provide more training signals without requiring large-scale data collection. Research has found that carefully designed augmentation techniques can improve minority-class performance, though their effectiveness depends on the quality and realism of the generated examples. [arXiv]
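As a simple illustration of the idea, the sketch below creates synthetic points by interpolating between pairs of existing minority examples. This is a simplified, SMOTE-style scheme that picks random pairs instead of nearest neighbours; it is not the counterfactual method from the cited paper. The function name and the two-feature toy points are assumptions.

```python
import random

def interpolate_minority(minority_points, n_new, rng):
    """Create n_new synthetic points, each lying between two existing minority points."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_points, 2)  # two distinct existing examples
        t = rng.random()                       # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical minority examples with two numeric features each.
minority_points = [(120.0, 2.0), (300.0, 3.5), (95.0, 1.0), (410.0, 23.0)]

new_points = interpolate_minority(minority_points, 6, rng)
for point in new_points:
    print(tuple(round(value, 2) for value in point))
```

Every synthetic point stays within the range spanned by the real examples, which is also the limitation: interpolation cannot invent kinds of rare case that the original data never contained.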
Measuring the right outcomes
Even a well-balanced training strategy can fail if evaluation focuses only on accuracy. Modern machine-learning practice therefore emphasises metrics that reveal performance on rare classes, including recall, precision, F1 score, and precision–recall analysis. These measures help developers detect situations where the model performs well overall but poorly on the cases that matter most. [Google for Developers, Encord]
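A basic form of precision–recall analysis is to sweep the decision threshold and watch how precision, recall, and F1 move. The sketch below does this on ten invented scores; the labels, scores, thresholds, and helper names are all assumptions for illustration.

```python
def confusion_counts(labels, scores, threshold):
    """Return (true positives, false positives, false negatives) at a threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp, fp, fn

def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Invented data: 1 marks the rare class; scores are model confidence values.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.8):
    tp, fp, fn = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold {threshold}: precision = {precision:.2f}, "
          f"recall = {recall:.2f}, F1 = {f1_score(tp, fp, fn):.2f}")
```

Lowering the threshold finds more of the rare cases at the price of more false alarms, and raising it does the reverse; which point on that curve is acceptable depends on the relative cost of each kind of error.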
The key lesson about rare cases
Class imbalance shows that training data influences not only what a model learns, but also what it learns to ignore. When rare examples appear too infrequently, the model receives weaker evidence about them, develops less reliable decision boundaries around them, and may achieve impressive overall accuracy while failing on the very cases users care about most. Understanding this effect is essential for interpreting AI performance: a model that looks accurate on average is not necessarily good at detecting rare but important events. [Google for Developers]
Further Reading
Books and field guides related to Why AI misses rare but important cases. Use these as the next step if you want deeper reading beyond the article.
Hands-on Machine Learning with Scikit-Learn, Keras, and Tenso...
Covers model evaluation, imbalanced datasets, metrics, sampling, and real-world machine-learning pitfalls.
Introduction to Machine Learning with Python
Explains classification problems, evaluation metrics, and challenges arising from skewed datasets.
Learning from Imbalanced Data Sets
Focused specifically on class imbalance, sampling methods, and minority-class performance.
The Hundred-page Machine Learning Book
Provides a strong conceptual foundation for understanding why models struggle with rare classes.
Endnotes
-
Source: developers.google.com
Link: https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets
Source snippet: Google for Developers. Class-imbalanced datasets | Machine Learning. Aug 28, 2025. For example, the class-imbalanced dataset shown in Figure...
-
Source: developers.google.com
Title: Classification: Accuracy, recall, precision, and related metrics
Link: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
Source snippet: Google for Developers. 12 Jan 2026. In an imbalanced dataset where the num...
-
Source: arxiv.org
Link: https://arxiv.org/abs/1710.05381
-
Source: arxiv.org
Title: Striking the Right Balance with Uncertainty
Link: https://arxiv.org/abs/1901.07590
Published: January 22, 2019
-
Source: arxiv.org
Title: Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
Link: https://arxiv.org/abs/2006.15766
-
Source: support.sas.com
Link: https://support.sas.com/resources/papers/proceedings17/0942-2017.pdf
Source snippet: SAS Support. Predictive Accuracy: A Misleading Performance Measure... ABSTRACT. The most commonly reported model evaluation metric is the a...
-
Source: encord.com
Title: Accuracy vs. Precision vs. Recall in Machine Learning
Link: https://encord.com/blog/classification-metrics-accuracy-precision-recall/
Source snippet: 23 Nov 2023. Precision measures how often predictions for the positive class are correct. Recall...
-
Source: nvlpubs.nist.gov
Link: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
Source snippet: NIST Publications. Towards a Standard for Identifying and Managing Bias in... by R Schwartz, 2022. Systemic and implicit b...
-
Source: unece.org
Link: https://unece.org/sites/default/files/2025-10/Companion%20Paper%20on%20Fairness%20in%20Machine%20Learning_[Responsible
Source snippet: Fairness in Machine Learning. Representation bias occurs when the training data used is not representative of the population the model will...
-
Source: arxiv.org
Title: Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation
Link: https://arxiv.org/abs/2111.03516
Published: November 5, 2021
-
Source: google.com
Link: https://www.google.com/
Source snippet: Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exac...
-
Source: developers.google.com
Title: Google's Machine Learning Crash Course
Link: https://developers.google.com/machine-learning/crash-course
Source snippet: An introduction to binary classification models, covering thresholding, confusion matrice...
-
Source: nist.gov
Link: https://www.nist.gov/
Source snippet: National Institute of Standards and Technology. NIST promotes U.S. innovation and industrial competitiveness by advancing measurement scien...
-
Source: nist.gov
Title: There's More to AI Bias Than Biased Data, NIST Report Highlights
Link: https://www.nist.gov/news-events/news/2022/03/theres-more-ai-bias-biased-data-nist-report-highlights
Source snippet: 16 Mar 2022. The NIST report acknowledges that a great deal of AI bias stems fr...
-
Source: encord.com
Link: https://encord.com/blog/an-introduction-to-balanced-and-imbalanced-datasets-in-machine-learning/
Source snippet: Balanced and Imbalanced Datasets in Machine Learning... Nov 11, 2022. Balancing a dataset makes training a model easier because it helps...
-
Source: dictionary.cambridge.org
Link: https://dictionary.cambridge.org/dictionary/english-chinese-traditional/classification
Source snippet: in Traditional Chinese - Cambridge Dictionary. the act or process of dividing things into groups according to their type 將...分類,將...歸類;把...