How spam filters add up small clues
Bayesian filters show how learning systems combine word patterns into probabilities instead of relying on one perfect rule.
On this page
- What labelled spam and ham examples teach the filter
- How word probabilities become a spam score
- Why weak clues become stronger together
Introduction
Bayesian spam filters are a classic example of how artificial intelligence can learn patterns rather than follow rigid rules. Instead of asking whether a message contains one forbidden word, a Bayesian filter examines many small hints and estimates how likely the message is to be spam. A single clue may be weak and unreliable, but dozens of weak clues can combine into strong evidence. This ability to accumulate probabilities is what made Bayesian filtering one of the most influential machine-learning techniques in the history of email spam detection. [Paul Graham]
What labelled spam and ham examples teach the filter
A Bayesian filter begins with examples. Messages that users have identified as spam are placed in one group, while legitimate messages (often called “ham”) are placed in another. The system studies these examples and records how often different words, phrases, and other text fragments appear in each category. [SpamAssassin]
Suppose a training set contains thousands of emails. The filter might discover that words such as “discount”, “winner”, or certain unusual spellings appear more frequently in spam than in ordinary correspondence. At the same time, it may learn that words related to a person’s workplace, hobbies, or regular contacts are common in legitimate messages. The filter is not told which words matter in advance. It learns those patterns from the labelled examples. [SpamAssassin]
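The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not how SpamAssassin or any particular filter stores its data: the `tokenize` and `train` helpers and the four sample messages are hypothetical, and real filters use far richer tokenisation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(labelled_messages):
    """Count, for each class, how many messages contain each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in labelled_messages:
        totals[label] += 1
        # set() so a token repeated within one message is counted once
        counts[label].update(set(tokenize(text)))
    return counts, totals

# Hypothetical labelled examples, standing in for thousands of real emails.
examples = [
    ("Winner! Claim your discount now", "spam"),
    ("Discount pills, guaranteed winner", "spam"),
    ("Meeting moved to Thursday", "ham"),
    ("Notes from the budget meeting", "ham"),
]
counts, totals = train(examples)
print(counts["spam"]["discount"], counts["ham"]["discount"])
```

Nothing in the code says that “discount” is suspicious. That conclusion is implicit in the counts, which is the sense in which the filter learns from labelled examples.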
This learning process is one reason Bayesian filtering adapts better than fixed keyword lists. When spammers change their wording, new examples can teach the filter about the new patterns. As Paul Graham noted in his influential essay on Bayesian spam filtering, if spammers disguise a blocked word, for example by writing “c0ck” instead of “cock”, the filter can learn that the new spelling itself has become a strong clue. [Paul Graham]
How word probabilities become a spam score
Once training is complete, the filter has a probability estimate for many words and tokens. A token can be a word, a number, a character sequence, or another text fragment extracted from a message. [Apache Wiki]
Imagine that a new email contains several tokens. For each one, the filter asks a question:
Based on past examples, how strongly does this token suggest spam rather than legitimate mail?
Some tokens may contribute almost no evidence. Others may lean slightly towards spam. A few may be highly suspicious. The filter then combines these probability estimates using Bayes’ theorem, typically treating the tokens as if they were independent of one another (the “naive” assumption), to produce an overall score representing how likely the entire message is to belong to the spam category. [CU Blog Service; Gigamonkeys]
The important idea is that classification is based on probability, not certainty. A word does not automatically make an email spam. Instead, each word nudges the final judgement in one direction or the other.
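One way to turn training counts into a per-token estimate is sketched below. It is a simplified illustration rather than the formula used by Paul Graham or SpamAssassin: the function name, the add-one smoothing, and the assumption that spam and ham are equally likely beforehand are all choices made for this example.

```python
def token_spam_probability(spam_with, ham_with, n_spam, n_ham, smoothing=1.0):
    """Estimate P(spam | token) from message counts, assuming equal priors.

    spam_with / ham_with: number of spam / ham messages containing the token.
    n_spam / n_ham: total number of spam / ham messages seen in training.
    """
    # Smoothed rates of seeing the token in each class; smoothing keeps
    # rare or unseen tokens away from the extremes of 0 and 1.
    p_token_given_spam = (spam_with + smoothing) / (n_spam + 2 * smoothing)
    p_token_given_ham = (ham_with + smoothing) / (n_ham + 2 * smoothing)
    # Bayes' theorem with P(spam) = P(ham) = 0.5, so the priors cancel.
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

# Hypothetical counts from 100 spam and 100 ham training messages.
print(token_spam_probability(40, 2, 100, 100))   # token seen mostly in spam
print(token_spam_probability(10, 10, 100, 100))  # token equally common in both
print(token_spam_probability(0, 0, 100, 100))    # token never seen before
```

A token that is equally common in both classes, or that has never been seen, comes out at 0.5, which means it nudges the final judgement in neither direction.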
For example:
- “Meeting” might slightly favour legitimate mail.
- “Invoice” might be nearly neutral.
- “Guaranteed” might lean towards spam.
- An unusual promotional phrase might strongly favour spam.
The filter combines all of these signals and calculates a final likelihood. [Medium]
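The combination step for the four example tokens might look like the sketch below. The per-token probabilities are invented for illustration, and the `combine` function assumes that tokens are independent and that spam and ham are equally likely beforehand. Production filters differ in the details.

```python
import math

def combine(token_probs):
    """Combine per-token spam probabilities into one score.

    Treats tokens as independent and works in log-odds to avoid underflow
    when many tokens are present. Each probability must lie strictly
    between 0 and 1.
    """
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in token_probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical per-token estimates, P(spam | token).
token_probs = {
    "meeting": 0.35,             # slightly favours legitimate mail
    "invoice": 0.50,             # neutral
    "guaranteed": 0.70,          # leans towards spam
    "promotional phrase": 0.95,  # strongly favours spam
}
score = combine(token_probs.values())
print(f"spam score: {score:.2f}")  # about 0.96
```

The neutral token leaves the score unchanged and “meeting” pulls it down a little, but the two spam-leaning tokens outweigh it.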
Why weak clues become stronger together
The most interesting feature of Bayesian filtering is that it does not need a single decisive indicator. Many weak clues can collectively become powerful evidence.
Consider a simple analogy. Seeing one raindrop does not prove it is raining. Seeing dark clouds alone does not prove it either. But dark clouds, falling raindrops, wet pavement, and people carrying umbrellas together make the conclusion much more convincing.
Bayesian spam filters work in a similar way. A message might contain several mildly suspicious features:
- A marketing-style phrase.
- An unusual punctuation pattern.
- A word common in past spam campaigns.
- A rarely seen sender format.
- A collection of terms that often appear together in spam.
None of these clues alone may justify blocking the message. Combined, however, they can push the spam probability above a threshold where the filter becomes confident enough to classify the email as unwanted. [Gigamonkeys]
This is a key difference from rule-based systems. A rule-based filter often asks, “Did the message contain a forbidden feature?” A Bayesian filter asks, “What is the overall probability after considering all available evidence?” [Paul Graham]
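The reason weak clues add up can be made explicit with the odds form of Bayes’ theorem. Assuming, as naive Bayes classifiers do, that the tokens $t_1, \dots, t_n$ are independent given the class, the posterior odds are the prior odds multiplied by one likelihood ratio per token:

$$
\frac{P(\text{spam} \mid t_1, \dots, t_n)}{P(\text{ham} \mid t_1, \dots, t_n)}
= \frac{P(\text{spam})}{P(\text{ham})}
\prod_{i=1}^{n} \frac{P(t_i \mid \text{spam})}{P(t_i \mid \text{ham})}
$$

As an illustrative calculation with made-up numbers, suppose each of five clues is only 1.5 times as common in spam as in ham. With equal prior odds, the combined odds are $1.5^5 \approx 7.6$ to 1, which corresponds to a spam probability of roughly 0.88, even though no single clue was strong.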
A concrete example
Imagine a filter has learned the following tendencies from past mail:
- The phrase “limited time offer” appears frequently in spam.
- The word “investment” appears somewhat more often in spam than in legitimate mail.
- The recipient rarely receives emails discussing investments.
- Several formatting patterns in the message resemble previous spam.
Individually, none of these observations guarantees that the email is spam. A legitimate financial newsletter could contain all of them.
However, when the filter combines the probabilities associated with each clue, the cumulative evidence may indicate that the message is much more likely to be spam than ham. The final decision emerges from the interaction of many pieces of evidence rather than from any single trigger word. [Apache Wiki; Kågström]
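To see how the evidence accumulates in this example, the sketch below folds in one clue at a time. Every number in it is hypothetical: the article gives no probabilities, and both the per-clue values and the 0.90 threshold are assumptions chosen for illustration, as is the independence of the clues.

```python
# Hypothetical P(spam | clue) for each observation in the example.
clues = [
    ("phrase 'limited time offer'", 0.80),
    ("word 'investment'", 0.65),
    ("topic unusual for this recipient", 0.70),
    ("spam-like formatting", 0.60),
]
THRESHOLD = 0.90  # assumed cut-off for classifying a message as spam

def running_scores(clues, prior=0.5):
    """Return the combined spam probability after each clue is added."""
    odds = prior / (1.0 - prior)
    scores = []
    for _, p in clues:
        odds *= p / (1.0 - p)  # multiply by this clue's odds
        scores.append(odds / (1.0 + odds))
    return scores

scores = running_scores(clues)
for (name, p), s in zip(clues, scores):
    verdict = "spam" if s >= THRESHOLD else "undecided"
    print(f"{name:35s} alone={p:.2f} combined={s:.3f} {verdict}")
```

With these assumed values no clue reaches the threshold on its own, yet the combined score crosses it once the third clue is included and ends at about 0.96.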
Why the approach was so influential
Bayesian filtering became influential because it addressed a fundamental weakness of manual rule writing. Human developers could not easily predict every new trick that spammers would invent. A probabilistic system could learn from examples and continually adjust its understanding of which patterns mattered. [Paul Graham]
Systems such as Apache SpamAssassin incorporated Bayesian learning for exactly this reason. After being trained on sufficient examples of spam and legitimate mail, the classifier could tailor itself to a user’s actual email environment and reduce both missed spam and mistaken blocks of legitimate messages. [SpamAssassin]
The broader lesson for artificial intelligence is that useful decisions often emerge from combining many imperfect signals. Bayesian spam filters demonstrate that learning systems do not always need a perfect rule. By accumulating numerous weak clues and turning them into probabilities, they can make surprisingly accurate judgements in complex, changing environments. [Gigamonkeys; Kågström]
Further Reading
Books for readers who want to go deeper than this article:
- Data Science for Business: explains probabilistic classification concepts central to Bayesian spam filters.
- Pattern Recognition and Machine Learning: strong coverage of Bayesian methods and probability models.
- Machine Learning for Absolute Beginners: helps readers grasp how AI combines clues to make decisions.
- The Hundred-Page Machine Learning Book: summarises probabilistic classification approaches.
Endnotes
- Source: gigamonkeys.com
  Link: https://gigamonkeys.com/book/practical-a-spam-filter
  Source snippet: 23. Practical: A Spam Filter. He called his approach Bayesian filtering after the statistical technique that he used to combine...
- Source: spamassassin.apache.org
  Title: SpamAssassin: sa-learn
  Link: https://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html
  Source snippet: sa-learn - train SpamAssassin's Bayesian classifier. This tool will feed each mail to SpamAssassin, allowing it to 'learn' what...
- Source: wiki.apache.org
  Title: BayesInSpamAssassin (Apache Wiki)
  Link: https://wiki.apache.org/spamassassin/BayesInSpamAssassin
  Source snippet: BayesInSpamAssassin - Apache Software Foundation. The Bayesian classifier in Spamassassin tries to identify spam by looking at w...
- Source: medium.com
  Link: https://medium.com/analytics-vidhya/email-spam-classifier-using-naive-bayes-a51b8c6290d4
  Source snippet: Email Spam Classifier Using Naive Bayes. Naive Bayes classifier technique has become a very popular method in mail filtering Email. E...
- Source: spamassassin.apache.org
  Link: https://spamassassin.apache.org/full/4.0.x/doc/sa-learn.html
  Source snippet: sa-learn. SpamAssassin 2.50 and later supports Bayesian spam analysis, in the form of the BAYES rules. This is a new feature, q...
- Source: medium.com
  Link: https://medium.com/data-science/na%C3%AFve-bayes-spam-filter-from-scratch-12970ad3dae7
- Source: medium.com
  Link: https://medium.com/%40myh809503699/naive-bayes-classifier-for-spam-detection-f4f85783a861
  Source snippet: Naive Bayes Classifier for Spam Detection | by Caitlynccc. Overview. The Naive Bayes Classifier is a fast, probabilistic machine learning a...
- Source: medium.com
  Link: https://medium.com/%40saudhaminiupdated/beyond-the-basics-mastering-text-classification-using-naive-bayes-b096980a4030
  Source snippet: Mastering Text Classification using Naive Bayes. False Positives (FP): 13 instances incorrectly classified as spam. False Negatives (FN): 1...
- Source: paulgraham.com
  Link: https://www.paulgraham.com/spam.html
  Source snippet: Paul Graham, A Plan for Spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual wor...
- Source: paulgraham.com
  Link: https://paulgraham.com/better.html?viewfullsite=1
  Source snippet: Better Bayesian Filtering. I discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot. Spam filtering is a subset of text cl...
- Source: blogs.cornell.edu
  Title: Bayes' Theorem in email spam filtering
  Link: https://blogs.cornell.edu/info2040/2018/10/27/bayes-theorem-in-email-spam-filtering/
  Source snippet: CU Blog Service. Bayes' Theorem in email spam filtering. 27-Oct-2018: Bayes' Theorem describes the conditional probability an event is going...
- Source: jonkagstrom.com
  Link: https://jonkagstrom.com/static/improvingnb.pdf
  Source snippet: IMPROVING NAIVE BAYESIAN SPAM FILTERING, by J Kågström · 2005 · Cited by 29. The Naive Bayesian classifier combines the probabilities of e...
- Source: paulgraham.com
  Link: https://www.paulgraham.com/better.html
  Source snippet: Bayesian Filtering. Jan 10, 2003: It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam...
- Source: paulgraham.com
  Link: https://www.paulgraham.com/antispam.html
- Source: linkedin.com
  Link: https://www.linkedin.com/posts/dhruv-roongta-421a7b214_paul-graham-never-worked-at-google-he-still-activity-7451275855835680768-cAwD
  Source snippet: Paul Graham's Bayesian Spam Filter Revolution. In August 2002 he wrote a 9-page essay called "A Plan for Spam." That essay quietly became t...
  Published: August 2002
- Source: linkedin.com
  Link: https://www.linkedin.com/posts/mailwarmhq_happy-weekend-everyone-did-you-know-activity-7446211306275295232-T-Y-
  Source snippet: t ran incoming messages through an old probability equation.
- Source: cs.ubbcluj.ro
  Link: https://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf
  Source snippet: Paul Graham's approach has become fairly famous[2]. He introduced a new formula for...
- Source: www1.se.cuhk.edu.hk
  Title: Bayesian Spam Filter for Outlook
  Link: https://www1.se.cuhk.edu.hk/~seem5680/lecture/Bayesian-Spam-Filter-for-Outlook.pdf
  Source snippet: approach and spam filtering. Feb 4, 2020: The inventor and the main promoter of the idea to use the Bayesian approach in spam filtering so...
- Source: lingualeo.com
  Title: Paul Graham
  Link: https://lingualeo.com/fa/jungle/paul-graham-a-plan-for-spam-103842
  Source snippet: A Plan for Spam (Persian translation). An improved algorithm is described in Better Bayesian Filtering.) I think it's possible to stop spam, and t...
- Source: classpages.cselabs.umn.edu
  Link: https://classpages.cselabs.umn.edu/Spring-2020/csci4511W/spam.html
  Source snippet: Filter. Jan 9, 2020: A spam filter. Write a simple spam filter based on naive Bayes probability, following the steps outlined in A plan fo...