Why the data teaches the model

Introduction

Training data is not just background material for an artificial intelligence system. It is the model’s experience of the world. A machine-learning model learns by examining examples and adjusting itself to reduce mistakes, so the patterns present in those examples strongly influence what the model comes to treat as important, normal, unusual, relevant, or predictive. If the data is rich and representative, the model is more likely to perform well on new cases. If the data is incomplete, biased, or misleading, the model can learn the wrong lessons. [Google for Developers]

Understanding training data is therefore essential to understanding artificial intelligence. The behaviour of a model is shaped not only by its algorithms and computing power, but also by the evidence it receives during training. In many practical situations, the data has as much influence on outcomes as the model design itself. [Google for Developers]

What counts as training data?

Training data is the collection of examples used to teach a machine-learning system. The exact form depends on the task.

A spam filter may be trained on emails labelled as “spam” or “not spam”. An image-recognition system may learn from millions of pictures paired with object labels. A language model may learn from large collections of books, articles, websites, and other text. Regardless of the format, the examples provide evidence about what relationships exist between inputs and outputs. [Google for Developers]

Training datasets can contain: [Google for Developers]

Text, such as documents, web pages, and messages.
Images and video.
Audio recordings.
Numerical measurements.
Categories or labels attached by humans.
Behavioural records, such as clicks, purchases, or ratings. [Google for Developers]

The crucial point is that the model does not directly observe reality. It observes the dataset. If the dataset provides a narrow or distorted view of reality, that view can become embedded in the model’s behaviour. [NIST Publications]

How examples become model behaviour

During training, a model repeatedly compares its predictions with known examples and adjusts internal parameters to improve performance. Over time, it becomes better at recognising statistical relationships that help it succeed on the training task. [Google for Developers]
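The loop described above can be sketched in a few lines. This is a minimal illustration, assuming a one-feature linear model trained by gradient descent on squared error; the function name `train_linear`, the learning rate, and the four examples are all hypothetical choices made for the sketch, not something the sources prescribe.

```python
def train_linear(examples, learning_rate=0.05, epochs=2000):
    """Fit y ~ w * x + b by repeatedly nudging w and b to reduce squared error."""
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in examples:
            error = (w * x + b) - y        # compare prediction with the known answer
            grad_w += 2 * error * x / n    # how the average error changes with w
            grad_b += 2 * error / n        # how the average error changes with b
        w -= learning_rate * grad_w        # adjust parameters to reduce the error
        b -= learning_rate * grad_b
    return w, b


# Hypothetical training examples that follow y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear(examples)
print(f"learned w={w:.3f}, b={b:.3f}")
```

The model is never told the rule behind the examples. It only sees the pairs, so different examples would lead it to different parameters.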

This process means the model learns whatever patterns help reduce errors. Sometimes those patterns are meaningful. For example, a medical image model may learn visual signs associated with a disease. At other times, the model may discover shortcuts that happen to work in the dataset but do not reflect genuine cause-and-effect relationships. [arXiv]

Imagine a dataset where nearly every photograph of a dog happens to be taken indoors. A model could learn that indoor backgrounds are a useful clue for identifying dogs. The model may appear accurate during testing on similar data, yet struggle when shown dogs outdoors. Researchers describe these misleading relationships as dataset biases or spurious correlations. [arXiv]

The model therefore learns two things simultaneously:

The intended pattern, such as the appearance of a dog.
Any accidental regularities that happen to exist in the training examples.

Because machine learning is fundamentally pattern matching, the model has no built-in understanding of which correlations humans consider meaningful unless the training process and data encourage it to learn the right ones. [arXiv]
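The indoor-dog scenario can be made concrete with a toy sketch. It assumes a deliberately simple "model" that picks whichever single feature best matches the labels in training; the feature names (`looks_like_dog`, `indoors`) and all of the rows are invented for illustration.

```python
def make_rows(count, looks_like_dog, indoors, is_dog):
    """Build `count` identical hypothetical photo records."""
    return [
        {"looks_like_dog": looks_like_dog, "indoors": indoors, "is_dog": is_dog}
        for _ in range(count)
    ]


def accuracy_of(feature, rows):
    """Fraction of rows where the feature value equals the label."""
    return sum(r[feature] == r["is_dog"] for r in rows) / len(rows)


def best_single_feature(rows, feature_names):
    """Choose the feature that agrees with the label most often in `rows`."""
    return max(feature_names, key=lambda name: accuracy_of(name, rows))


# Training set: every dog is indoors, and one dog photo is too blurry to look like a dog.
train = make_rows(4, 1, 1, 1) + make_rows(1, 0, 1, 1) + make_rows(5, 0, 0, 0)
# New data: dogs photographed outdoors, and other animals photographed indoors.
test = make_rows(4, 1, 0, 1) + make_rows(4, 0, 1, 0)

chosen = best_single_feature(train, ["looks_like_dog", "indoors"])
train_acc = accuracy_of(chosen, train)
test_acc = accuracy_of(chosen, test)
print(chosen, train_acc, test_acc)
```

In this made-up data the background is a better predictor than the animal's appearance during training, so the shortcut wins, and it then fails completely once the background stops lining up with the label.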

Why frequency matters

Models are generally shaped most by the patterns that appear most often. Because training reduces the error averaged over all examples, common examples have a stronger influence on what is learned than rare ones.

If a dataset contains thousands of examples of one category but only a handful of another, the model may become very good at recognising the common category while struggling with the rare one. Researchers studying class imbalance have shown that training-set composition can significantly affect how often a model recognises rare cases correctly. [arXiv]

This is one reason why collecting diverse examples is often as important as collecting large numbers of examples.
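A small sketch shows why imbalance is easy to miss. It assumes a hypothetical dataset of 990 "common" and 10 "rare" labels and a trivial predictor that always answers with the majority class. The numbers are invented to make the effect visible.

```python
from collections import Counter

# Hypothetical, heavily imbalanced labels.
labels = ["common"] * 990 + ["rare"] * 10

# A trivial "model": always predict whichever label was seen most often.
majority_label = Counter(labels).most_common(1)[0][0]
predictions = [majority_label] * len(labels)

# Overall accuracy looks excellent.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the rare class: how many rare cases were actually recognised.
rare_total = labels.count("rare")
rare_found = sum(p == "rare" and y == "rare" for p, y in zip(predictions, labels))
rare_recall = rare_found / rare_total

print(f"accuracy={accuracy:.2f}, rare recall={rare_recall:.2f}")
```

Overall accuracy alone would hide the problem here, which is why the rare class is measured separately.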

Why missing or poor examples matter

A model cannot learn patterns that are absent from its training experience. Missing data creates blind spots.

Suppose a speech-recognition system is trained mostly on recordings from a limited range of accents. The system may perform well for speakers represented in the data while making more mistakes for others. The issue is not necessarily that the algorithm dislikes certain accents; rather, it has had fewer opportunities to learn their characteristics. [UNECE]

Researchers and standards organisations frequently identify representation problems as a major source of AI bias. When important groups, situations, or environments are underrepresented, the resulting model can systematically perform worse for them. [NIST Publications; NIST]
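One common way to surface this kind of gap is to break evaluation results down by group rather than reporting a single number. The sketch below assumes evaluation records of the form (group, predicted, actual); the group names and counts are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, predicted, actual) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical results: accent_a is well represented, accent_b is not.
records = (
    [("accent_a", "yes", "yes")] * 81
    + [("accent_a", "yes", "no")] * 9
    + [("accent_b", "yes", "yes")] * 5
    + [("accent_b", "yes", "no")] * 5
)

overall = sum(pred == actual for _, pred, actual in records) / len(records)
per_group = accuracy_by_group(records)
print(f"overall={overall:.2f}", per_group)
```

With these invented figures the overall score is dominated by the larger group, so the weaker result for the smaller group only appears in the per-group breakdown.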

Poor-quality examples can also cause problems. Incorrect labels, measurement errors, duplicated records, outdated information, or inconsistent data can teach models inaccurate relationships. Since machine learning depends on finding patterns in past examples, flawed evidence often produces flawed behaviour. This principle is sometimes summarised as “garbage in, garbage out.” [Google for Developers]
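Some of these problems can be checked for mechanically before training. The following is a minimal sketch, assuming examples are (text, label) pairs; the helper name `audit` and the sample messages are hypothetical, and it covers only two of the issues listed above (exact duplicates and inconsistent labels).

```python
from collections import Counter, defaultdict


def audit(examples):
    """Report exact duplicate examples and inputs that carry conflicting labels."""
    counts = Counter(examples)
    duplicates = {example: n for example, n in counts.items() if n > 1}

    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[text].add(label)
    conflicts = {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }
    return duplicates, conflicts


# Hypothetical labelled emails.
examples = [
    ("win a prize now", "spam"),
    ("win a prize now", "spam"),      # exact duplicate
    ("meeting at 10", "not spam"),
    ("meeting at 10", "spam"),        # same text, contradictory label
    ("lunch tomorrow?", "not spam"),
]

duplicates, conflicts = audit(examples)
print(duplicates)
print(conflicts)
```

Checks like this do not prove a dataset is good. They only catch a narrow class of defects cheaply.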

Representation shapes outcomes

Representation is not merely a technical detail. It influences what a model considers typical.

For example, if historical data reflects existing social inequalities, a model may learn those patterns and reproduce them in future predictions. NIST notes that human and institutional biases can enter AI systems through the data used for training, while IBM defines data bias as biases in training or fine-tuning datasets that affect model behaviour. [NIST Publications; NIST]

Recent discussions of gender representation in AI similarly emphasise that models trained on incomplete or skewed datasets can generate outputs that underrepresent or disadvantage certain groups because the training data itself does not adequately reflect the population being modelled. [TechRadar]

The difference between memorising and generalising

The ultimate goal of training is not to memorise examples but to learn patterns that apply to new situations.

A model that simply remembers training cases may appear successful during training yet fail when confronted with unfamiliar data. Google’s machine-learning guidance emphasises the importance of datasets and evaluation methods that encourage generalisation rather than overfitting to the training set. [Google for Developers]
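The gap between memorising and generalising shows up when a model is scored on examples it has not seen. The sketch below is an illustration only: the task (deciding whether a number is even), the `Memoriser` class, and the hand-written rule standing in for a model that captured the underlying pattern are all assumptions made for the example.

```python
def is_even(n):
    """The true underlying pattern, also used here as the 'generalising' rule."""
    return n % 2 == 0


class Memoriser:
    """Stores training answers in a lookup table and guesses a default otherwise."""

    def __init__(self, examples, default=False):
        self.table = dict(examples)
        self.default = default

    def predict(self, n):
        return self.table.get(n, self.default)


def accuracy(predict, examples):
    return sum(predict(x) == y for x, y in examples) / len(examples)


train = [(n, is_even(n)) for n in range(10)]      # numbers seen during training
test = [(n, is_even(n)) for n in range(10, 20)]   # unfamiliar numbers

memoriser = Memoriser(train)
memo_train = accuracy(memoriser.predict, train)
memo_test = accuracy(memoriser.predict, test)
rule_test = accuracy(is_even, test)
print(memo_train, memo_test, rule_test)
```

The memoriser is perfect on the training numbers but no better than a constant guess on new ones, which is why evaluation is normally done on data held back from training.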

Good training data helps models discover patterns that remain useful beyond the specific examples they have seen. This requires:

Sufficient variety.
Accurate labels and measurements.
Coverage of important real-world situations.
Representative sampling of the environment in which the model will operate. [Google for Developers; arXiv]

When these conditions are met, the model is more likely to learn durable relationships rather than fragile shortcuts.

Why training data is often the most important ingredient

People often focus on algorithms, neural networks, or model size when discussing artificial intelligence. Yet many AI researchers and practitioners regard data as one of the most influential factors in determining model behaviour.

The model learns from evidence, not from direct experience of the world. The examples selected for training determine which patterns are visible, which are hidden, which groups are represented, and which mistakes are likely. As a result, training data acts as the curriculum from which the model learns. Better data does not guarantee perfect performance, but it strongly shapes what the system can and cannot know. [Google for Developers]

Endnotes

Source: developers.google.com
Title: Google's Machine Learning Crash Course
Link: https://developers.google.com/machine-learning/crash-course
Source: developers.google.com
Title: Datasets: Data characteristics | Machine Learning
Link: https://developers.google.com/machine-learning/crash-course/overfitting/data-characteristics
Source: nvlpubs.nist.gov
Title: Towards a Standard for Identifying and Managing Bias in...
Link: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
Source: ibm.com
Link: https://www.ibm.com/think/topics/data-bias
Source: arxiv.org
Title: Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles
Link: https://arxiv.org/abs/2011.03856
Published: November 7, 2020
Source: developers.google.com
Title: Data quality and interpretation | ML Universal Guides
Link: https://developers.google.com/machine-learning/guides/data-traps/quality
Source: arxiv.org
Title: An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects
Link: https://arxiv.org/abs/2207.03207
Published: July 7, 2022
Source: arxiv.org
Title: Algorithmic Factors Influencing Bias in Machine Learning
Link: https://arxiv.org/abs/2104.14014
Source: unece.org
Title: Fairness in Machine Learning
Link: https://unece.org/sites/default/files/2025-10/Companion%20Paper%20on%20Fairness%20in%20Machine%20Learning_[Responsible
Source: nist.gov
Title: There's More to AI Bias Than Biased Data, NIST Report...
Link: https://www.nist.gov/news-events/news/2022/03/theres-more-ai-bias-biased-data-nist-report-highlights
Source: nist.gov
Title: AI Research
Link: https://www.nist.gov/artificial-intelligence/ai-research-identifying-managing-harmful-bias-ai
Source: techradar.com
Link: https://www.techradar.com/pro/the-gender-data-gap-and-the-need-for-representation-in-ai
Source: developers.google.com
Title: Overfitting: Interpreting loss curves | Machine Learning
Link: https://developers.google.com/machine-learning/crash-course/overfitting/interpreting-loss-curves
Source: arxiv.org
Title: Data Representativity for Machine Learning and AI Systems
Link: https://arxiv.org/abs/2203.04706
Source: nist.gov
Link: https://www.nist.gov/trustworthy-and-responsible-ai
Source: google.com
Link: https://www.google.com/
Source: cloud.google.com
Title: machinelearning ai
Link: https://cloud.google.com/learn/training/machinelearning-ai
Source: nist.gov
Title: National Institute of Standards and Technology
Link: https://www.nist.gov/
Source: arxiv.org
Title: Feature Importance Disparities for Data Bias Investigations
Link: https://arxiv.org/html/2303.01704v4
Source: ibm.com
Link: https://www.ibm.com/think/topics/algorithmic-bias
Source: solytics-partners.com
Title: Training Datasets: Types, Importance & Model Performance
Link: https://www.solytics-partners.com/knowledge-and-training/training-datasets-explained-types-importance-challenges-and-impact-on-model-performance
Source: scribd.com
Title: Machine Learning
Link: https://www.scribd.com/document/904355409/Machine-Learning-Google-for-Developers

Further Reading

Hands-on Machine Learning with Scikit-Learn, Keras, and Tenso...

The Hundred-page Machine Learning Book

An Introduction to Statistical Learning

Pattern Recognition and Machine Learning
