
Why AI misses rare but important cases

When rare examples appear too infrequently in training data, a model can become accurate on common cases while missing important exceptions.

On this page

  • How frequency shapes what models notice
  • Why accuracy can hide weak rare-case performance
  • Ways dataset design can protect rare examples

Introduction

Artificial intelligence systems often learn best from patterns they see repeatedly. When a training dataset contains many examples of one category but very few of another, the result is known as class imbalance. In such datasets, a model may become highly accurate on common cases while consistently missing the rare cases that matter most. This is a widespread challenge in AI because many real-world problems are naturally imbalanced: fraudulent transactions are rarer than legitimate ones, serious diseases are rarer than healthy cases, and safety-critical failures are rarer than normal operation. As a result, a model can appear successful overall while performing poorly where mistakes carry the greatest consequences. [Google for Developers]


How frequency shapes what models notice

Machine-learning models learn from evidence. During training, every example contributes information about which patterns should be associated with which outcomes. When one class dominates the dataset, it contributes far more training signals than the minority class.

Imagine a dataset containing 99,000 normal transactions and 1,000 fraudulent transactions. The model repeatedly encounters normal behaviour and receives strong feedback about how to recognise it. Fraudulent behaviour appears much less often, giving the model fewer opportunities to learn the distinctive characteristics of fraud. Over time, the model may become excellent at recognising the majority class while developing only a weak understanding of the minority class. [Google for Developers]

This imbalance affects what the model learns to prioritise. Training algorithms generally minimise an error measure averaged over all examples. Because the majority class supplies most of the examples, it also supplies most of the mistakes, so correcting those mistakes usually produces the largest improvement in the training objective. The rare class can therefore receive less attention during learning even when it is the most important category from a human perspective. [arXiv]
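The arithmetic behind this can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the model misclassifies the same fraction of each class; the 5% error rate and the function name `error_share` are hypothetical and do not come from the cited sources. It reuses the 99,000 / 1,000 split from the transaction example.

```python
def error_share(class_counts, error_rates):
    """Return each class's share of all misclassified examples.

    class_counts: mapping of class label -> number of training examples
    error_rates:  mapping of class label -> fraction of that class misclassified
    """
    errors = {
        label: class_counts[label] * error_rates[label]
        for label in class_counts
    }
    total_errors = sum(errors.values())
    return {label: count / total_errors for label, count in errors.items()}


# Hypothetical scenario: the model is equally unreliable on both classes.
counts = {"normal": 99_000, "fraud": 1_000}
rates = {"normal": 0.05, "fraud": 0.05}

shares = error_share(counts, rates)
for label, share in shares.items():
    print(f"{label}: {share:.1%} of all errors")
```

Under this assumption, 99% of the mistakes come from normal transactions, so an objective that counts every mistake equally gains far more from fixing majority-class errors than from fixing fraud errors.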

A second problem is variation within the rare class. Common categories often contain many examples showing different conditions, environments, and edge cases. Rare categories may contain only a small sample of possible situations. The model therefore learns a narrower picture of what the minority class looks like and may struggle when confronted with new variations. [arXiv]

Why accuracy can hide weak rare-case performance

One of the most misunderstood effects of class imbalance is that a model can achieve impressive accuracy while effectively failing at the task people care about.

Consider a dataset where only 1% of cases belong to the rare class. A model that simply predicts the majority class every time would be correct 99% of the time. The reported accuracy would look excellent despite the model never identifying a single rare example. Researchers often describe this as the accuracy paradox. [Google for Developers, SAS Support]
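A minimal Python sketch of this paradox follows. The dataset is synthetic (10 rare cases among 1,000 examples, matching the 1% figure), and the helper names `accuracy` and `recall` are illustrative rather than taken from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive (rare) cases that were predicted positive."""
    actual = sum(1 for t in y_true if t == positive)
    found = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return found / actual if actual else 0.0


# Synthetic labels: 1 = rare class (1% of cases), 0 = common class.
y_true = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts the common class.
y_pred = [0] * len(y_true)

print(f"accuracy: {accuracy(y_true, y_pred):.2%}")            # accuracy: 99.00%
print(f"recall on rare class: {recall(y_true, y_pred):.2%}")  # recall on rare class: 0.00%
```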

This problem appears in many practical applications:

  • A disease-screening system may correctly identify most healthy patients while missing people who actually have the disease.
  • A fraud-detection system may approve nearly all legitimate transactions but fail to stop fraudulent ones.
  • A manufacturing inspection system may pass almost every product while overlooking the defects it was designed to find.

In each case, overall accuracy can remain high even though performance on the rare class is poor. For this reason, AI practitioners often rely on additional measures such as recall (the fraction of true rare cases that the model finds) and precision (the fraction of cases flagged as rare that really are rare). For highly imbalanced datasets, these metrics provide a clearer picture than accuracy alone. [Google for Developers, Encord]
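As a worked illustration, the following Python sketch computes both measures from the four confusion-matrix counts. The counts describe a hypothetical screening run (100 true rare cases among 10,000 examples) and are invented for this example.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall for the rare (positive) class.

    tp: rare cases correctly flagged
    fp: common cases wrongly flagged as rare
    fn: rare cases the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical results on 10,000 examples containing 100 rare cases.
tp, fn = 60, 40      # rare cases found / missed
fp, tn = 140, 9_760  # common cases wrongly flagged / correctly passed

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"accuracy:  {accuracy:.1%}")   # accuracy:  98.2%
print(f"precision: {precision:.1%}")  # precision: 30.0%
print(f"recall:    {recall:.1%}")     # recall:    60.0%
```

In this invented scenario accuracy is above 98%, yet the model misses 40% of the rare cases and most of its alerts are false alarms.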

The distinction matters because the cost of mistakes is rarely distributed evenly. Missing a rare but dangerous event may be far more consequential than incorrectly flagging a common event. A model optimised only for overall accuracy can therefore appear successful while creating significant real-world risks. [Google for Developers]
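One way to make unequal costs explicit is to attach a price to each kind of mistake and compare models on total cost instead of accuracy. The Python sketch below does this for two hypothetical models; the 100-to-1 cost ratio, the error counts, and all names are assumptions chosen only to illustrate the idea.

```python
def total_error_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total cost of a model's mistakes, given a cost per error type."""
    return false_negatives * cost_fn + false_positives * cost_fp


def accuracy(n_examples, false_negatives, false_positives):
    """Accuracy implied by the number of errors of each type."""
    return (n_examples - false_negatives - false_positives) / n_examples


N_EXAMPLES = 10_000        # hypothetical: 100 rare events among 10,000 cases
COST_MISSED_EVENT = 100.0  # hypothetical cost of one false negative
COST_FALSE_ALARM = 1.0     # hypothetical cost of one false positive

models = {
    "always predict common": {"false_negatives": 100, "false_positives": 0},
    "flags aggressively": {"false_negatives": 20, "false_positives": 300},
}

results = {}
for name, errors in models.items():
    acc = accuracy(N_EXAMPLES, **errors)
    cost = total_error_cost(
        errors["false_negatives"],
        errors["false_positives"],
        COST_MISSED_EVENT,
        COST_FALSE_ALARM,
    )
    results[name] = (acc, cost)
    print(f"{name}: accuracy {acc:.1%}, total cost {cost:.0f}")
```

With these assumed costs, the model with lower accuracy (96.8% against 99.0%) is the cheaper one (2,300 against 10,000), because it misses far fewer rare events.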


Why rare examples are often the most important

Class imbalance becomes especially significant because minority classes frequently correspond to the outcomes humans care about most.

In medical diagnosis, the rare cases are often the patients who need urgent treatment. In cybersecurity, the rare cases are the attacks. In aviation safety, the rare cases are the failures. In financial systems, the rare cases may be money laundering or fraud.

From a purely statistical perspective, these events contribute little to overall accuracy because they occur infrequently. From a practical perspective, they may be the entire reason the AI system exists. This mismatch between statistical frequency and human importance is one of the central challenges of dataset design. [Google for Developers]

Class imbalance can also create fairness concerns. If certain populations or situations are underrepresented in training data, the model may perform worse for those groups because it has seen fewer examples from them. Standards and research on AI bias repeatedly identify representation problems in training data as a major source of uneven model performance. [NIST Publications, UNECE]

Ways dataset design can protect rare examples

Because class imbalance is common, machine-learning practitioners have developed several approaches to reduce its effects.

Collecting more minority-class examples

The most direct solution is often to gather additional examples of the rare class. More examples expose the model to a wider range of situations and improve its ability to generalise beyond the limited cases originally available. When feasible, improving representation at the data-collection stage is often preferable to relying solely on algorithmic fixes. [Google for Developers]

Rebalancing the training dataset

A common technique is to change the composition of the training data.

  • Oversampling increases the representation of minority-class examples.
  • Undersampling reduces the number of majority-class examples.

Both methods make rare examples more visible during training. Research on image-classification systems has shown that class imbalance can substantially harm performance and that rebalancing strategies, oversampling in particular, often improve results. [Google for Developers]
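The two strategies can be sketched with Python's standard library. This is a minimal illustration of random oversampling and random undersampling, not the specific procedure used in the cited research; the function names and the toy data are hypothetical, and both functions assume the majority list is at least as long as the minority list.

```python
import random


def random_oversample(majority, minority, rng):
    """Duplicate randomly chosen minority examples until both classes match."""
    n_extra = max(0, len(majority) - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    return majority, minority + extra


def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class the size of the minority."""
    kept = rng.sample(majority, k=len(minority))  # sampling without replacement
    return kept, minority


rng = random.Random(0)  # fixed seed so the sketch is repeatable
majority = [("normal", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

over_major, over_minor = random_oversample(majority, minority, rng)
under_major, under_minor = random_undersample(majority, minority, rng)

print(len(over_major), len(over_minor))    # 990 990
print(len(under_major), len(under_minor))  # 10 10
```

Note the trade-off: oversampling by duplication repeats the same few rare examples without adding new variation, while undersampling discards majority-class data that might have been informative.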


Creating additional minority examples

In some domains, developers generate synthetic examples based on existing rare cases. These methods attempt to provide more training signals without requiring large-scale data collection. Research has found that carefully designed augmentation techniques can improve minority-class performance, though their effectiveness depends on the quality and realism of the generated examples. [arXiv]
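One simple family of synthetic-data methods creates new minority points by interpolating between an existing rare example and a nearby one, which is the basic idea behind SMOTE. The Python sketch below is a simplified illustration of that idea for numeric features; it is not the counterfactual method from the cited paper, and the function names and toy points are hypothetical.

```python
import math
import random


def nearest_neighbour(point, candidates):
    """Return the candidate closest to `point` (Euclidean), excluding itself."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: math.dist(point, c))


def interpolate_synthetic(minority, n_new, rng):
    """Create n_new points, each on the line between a minority example
    and its nearest minority neighbour. Needs at least two examples."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbour = nearest_neighbour(base, minority)
        t = rng.random()  # position along the segment, 0 <= t < 1
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (hypothetical values).
minority = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0), (3.2, 1.4)]

new_points = interpolate_synthetic(minority, n_new=5, rng=random.Random(0))
for point in new_points:
    print(tuple(round(value, 3) for value in point))
```

Because every generated point lies between two real examples, this approach cannot invent kinds of rare case that the original data never contained, which is one reason the realism of synthetic examples matters.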

Measuring the right outcomes

Even a well-balanced training strategy can fail if evaluation focuses only on accuracy. Modern machine-learning practice therefore emphasises metrics that reveal performance on rare classes, including recall, precision, F1 score, and precision–recall analysis. These measures help developers detect situations where the model performs well overall but poorly on the cases that matter most. [Google for Developers, Encord]
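The Python sketch below shows how these metrics relate. It computes precision, recall, and F1 (the harmonic mean of precision and recall) at several decision thresholds, which is a small-scale version of precision–recall analysis. The labels, scores, and thresholds are invented for illustration.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 for the rare class (label 1) when every
    example scoring at or above `threshold` is flagged as rare."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical model scores: higher means "more likely to be a rare case".
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.8):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}, F1 {f1:.2f}")
# threshold 0.25: precision 0.60, recall 1.00, F1 0.75
# threshold 0.5: precision 0.67, recall 0.67, F1 0.67
# threshold 0.8: precision 1.00, recall 0.33, F1 0.50
```

Lowering the threshold finds more rare cases at the price of more false alarms, and raising it does the opposite. Looking at the whole range shows that trade-off, which a single accuracy figure hides.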

The key lesson about rare cases

Class imbalance shows that training data influences not only what a model learns, but also what it learns to ignore. When rare examples appear too infrequently, the model receives weaker evidence about them, develops less reliable decision boundaries around them, and may achieve impressive overall accuracy while failing on the very cases users care about most. Understanding this effect is essential for interpreting AI performance: a model that looks accurate on average is not necessarily good at detecting rare but important events. [Google for Developers]

Endnotes

  1. Source: developers.google.com
    Link: https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets
    Source snippet

    Google for DevelopersClass-imbalanced datasets | Machine LearningAug 28, 2025 — For example, the class-imbalanced dataset shown in Figure...

  2. Source: developers.google.com
    Title: accuracy precision recall
    Link: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
    Source snippet

    Google for DevelopersClassification: Accuracy, recall, precision, and related metrics12 Jan 2026 — In an imbalanced dataset where the num...

  3. Source: arxiv.org
    Link: https://arxiv.org/abs/1710.05381

  4. Source: arxiv.org
    Title: arXiv Striking the Right Balance with Uncertainty
    Link: https://arxiv.org/abs/1901.07590
    Source snippet

    Striking the Right Balance with UncertaintyJanuary 22, 2019...

    Published: January 22, 2019

  5. Source: arxiv.org
    Title: arXiv Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization
    Link: https://arxiv.org/abs/2006.15766

  6. Source: support.sas.com
    Link: https://support.sas.com/resources/papers/proceedings17/0942-2017.pdf
    Source snippet

    SAS SupportPredictive Accuracy: A Misleading Performance Measure...ABSTRACT. The most commonly reported model evaluation metric is the a...

  7. Source: encord.com
    Title: Accuracy vs
    Link: https://encord.com/blog/classification-metrics-accuracy-precision-recall/
    Source snippet

    Precision vs. Recall in Machine Learning23 Nov 2023 — Precision measures how often predictions for the positive class are correct. Recall...

  8. Source: nvlpubs.nist.gov
    Link: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
    Source snippet

    NIST PublicationsTowards a Standard for Identifying and Managing Bias in...by R Schwartz · 2022 · Cited by 808 — Systemic and implicit b...

  9. Source: unece.org
    Link: https://unece.org/sites/default/files/2025-10/Companion%20Paper%20on%20Fairness%20in%20Machine%20Learning_[Responsible
    Source snippet

    Fairness in Machine LearningRepresentation bias occurs when the training data used is not representative of the population the model will...

  10. Source: arxiv.org
    Link: https://arxiv.org/abs/2111.03516
    Source snippet

    Solving the Class Imbalance Problem Using a Counterfactual Method for Data AugmentationNovember 5, 2021...

    Published: November 5, 2021

  11. Source: google.com
    Link: https://www.google.com/
    Source snippet

    Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exac...

  12. Source: developers.google.com
    Title: crash course
    Link: https://developers.google.com/machine-learning/crash-course
    Source snippet

    google.comGoogle's Machine Learning Crash CourseAn introduction to binary classification models, covering thresholding, confusion matrice...

  13. Source: nist.gov
    Link: https://www.nist.gov/
    Source snippet

    National Institute of Standards and TechnologyNIST promotes U.S. innovation and industrial competitiveness by advancing measurement scien...

  14. Source: nist.gov
    Title: theres more ai bias biased data nist report highlights
    Link: https://www.nist.gov/news-events/news/2022/03/theres-more-ai-bias-biased-data-nist-report-highlights
    Source snippet

    There's More to AI Bias Than Biased Data, NIST Report...16 Mar 2022 — The NIST report acknowledges that a great deal of AI bias stems fr...

  15. Source: encord.com
    Link: https://encord.com/blog/an-introduction-to-balanced-and-imbalanced-datasets-in-machine-learning/
    Source snippet

    Balanced and Imbalanced Datasets in Machine Learning...Nov 11, 2022 — Balancing a dataset makes training a model easier because it helps...

  16. Source: dictionary.cambridge.org
    Link: https://dictionary.cambridge.org/dictionary/english-chinese-traditional/classification
    Source snippet

    in Traditional Chinese - Cambridge Dictionarythe act or process of dividing things into groups according to their type 將...分類,將...歸類;把...

