How spam filters add up small clues

Introduction

Bayesian spam filters are a classic example of how artificial intelligence can learn patterns rather than follow rigid rules. Instead of asking whether a message contains one forbidden word, a Bayesian filter examines many small hints and estimates how likely the message is to be spam. A single clue may be weak and unreliable, but dozens of weak clues can combine into strong evidence. This ability to accumulate probabilities is what made Bayesian filtering one of the most influential machine-learning techniques in the history of email spam detection. [Paul Graham]

What labelled spam and ham examples teach the filter

A Bayesian filter begins with examples. Messages that users have identified as spam are placed in one group, while legitimate messages, often called “ham”, are placed in another. The system studies these examples and records how often different words, phrases, and other text fragments appear in each category. [SpamAssassin]

Suppose a training set contains thousands of emails. The filter might discover that words such as “discount”, “winner”, or certain unusual spellings appear more frequently in spam than in ordinary correspondence. At the same time, it may learn that words related to a person’s workplace, hobbies, or regular contacts are common in legitimate messages. The filter is not told in advance which words matter. It learns those patterns from the labelled examples. [SpamAssassin]
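The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any particular filter: the `tokenize` and `train` helpers, the regular expression, and the four training messages are all assumptions invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count, per class, how many messages contain each token.

    `messages` is an iterable of (text, label) pairs, label "spam" or "ham".
    """
    token_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in messages:
        message_counts[label] += 1
        # Count each token once per message (presence, not repetition).
        token_counts[label].update(set(tokenize(text)))
    return token_counts, message_counts


# Hypothetical, hand-made training set; a real one would hold thousands of mails.
training_set = [
    ("Huge discount, you are a winner", "spam"),
    ("Winner! Claim your discount today", "spam"),
    ("Agenda for the project meeting", "ham"),
    ("Lunch after the meeting?", "ham"),
]

token_counts, message_counts = train(training_set)
for token in ("winner", "discount", "meeting"):
    print(token, token_counts["spam"][token], token_counts["ham"][token])
```

Nothing in the code names “winner” or “meeting” as important. Their significance comes entirely from how unevenly they are distributed across the two labelled groups.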

This learning process is one reason Bayesian filtering adapts better than fixed keyword lists. When spammers change their wording, new examples can teach the filter about the new patterns. As Paul Graham noted in his influential work on Bayesian spam filtering, if spammers replace a blocked word with a disguised version such as “c0ck” instead of “cock”, the filter can learn that the new spelling itself has become a strong clue. [Paul Graham]
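To make the adaptation concrete, here is a small sketch of how a disguised spelling could move from “no evidence” to “suspicious” once a few new spam examples are learned. The `TokenStats` class and the token “fr3e” are hypothetical, and the add-one smoothing used for the estimate is a simple stand-in chosen for the example, not the formula from Graham's essay.

```python
from collections import Counter


class TokenStats:
    """Per-class message counts for each token (hypothetical helper)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = Counter()

    def learn(self, tokens, label):
        """Record one labelled message; label is "spam" or "ham"."""
        self.counts[label].update(set(tokens))
        self.totals[label] += 1

    def spam_probability(self, token):
        """Add-one smoothed estimate of how strongly a token suggests spam."""
        in_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + 2)
        in_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + 2)
        return in_spam / (in_spam + in_ham)


stats = TokenStats()
stats.learn(["free", "prize", "now"], "spam")
stats.learn(["free", "offer"], "spam")
stats.learn(["lunch", "on", "friday"], "ham")
stats.learn(["meeting", "notes"], "ham")

# The disguised spelling has never been seen, so it carries no evidence yet.
before = stats.spam_probability("fr3e")

# Users then report three new spam messages that use the disguised spelling.
for _ in range(3):
    stats.learn(["fr3e", "prize"], "spam")

after = stats.spam_probability("fr3e")
print(f"before: {before:.2f}, after: {after:.2f}")
```

No rule was edited between the two measurements. The estimate for the new spelling rose only because labelled examples containing it were added.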

How word probabilities become a spam score

Once training is complete, the filter has a probability estimate for many words and tokens. A token can be a word, a number, a character sequence, or another text fragment extracted from a message. [Apache Wiki]
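As a rough illustration of what “token” can mean, the sketch below extracts two kinds of fragments from the same text: word-like tokens and overlapping character sequences. Both helper functions and the choice of three-character sequences are assumptions for the example; real filters use their own, usually more elaborate, tokenizers.

```python
import re


def word_tokens(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(text, n=3):
    """Return overlapping character sequences of length n."""
    compact = re.sub(r"\s+", " ", text.lower()).strip()
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


subject = "Win 1000 NOW"
print(word_tokens(subject))
print(char_ngrams("spam"))
```

Character sequences are one way a filter can still recognise fragments of a word after a spammer changes a letter or two, at the cost of tracking many more tokens.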

Imagine that a new email contains several tokens. For each one, the filter asks a question:

Based on past examples, how strongly does this token suggest spam rather than legitimate mail?

Some tokens may contribute almost no evidence. Others may lean slightly towards spam. A few may be highly suspicious. The filter then combines these probability estimates using Bayes’ theorem, usually under the simplifying (“naive”) assumption that tokens occur independently of one another, to produce an overall score representing how likely the entire message is to belong to the spam category. [CU Blog Service, Gigamonkeys]
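Written out, the combination looks as follows. The notation is introduced here for illustration: S and H stand for the spam and ham classes, and t_1 to t_n are the tokens found in the message. With the independence assumption, Bayes' theorem gives:

```latex
P(S \mid t_1, \dots, t_n)
  = \frac{P(S) \prod_{i=1}^{n} P(t_i \mid S)}
         {P(S) \prod_{i=1}^{n} P(t_i \mid S) + P(H) \prod_{i=1}^{n} P(t_i \mid H)}
```

If spam and ham are treated as equally likely beforehand, and p_i denotes the spam probability of token t_i on its own, this reduces to a commonly used compact form:

```latex
p = \frac{\prod_{i=1}^{n} p_i}
         {\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)},
\qquad
p_i = \frac{P(t_i \mid S)}{P(t_i \mid S) + P(t_i \mid H)}
```

A token with p_i = 0.5 multiplies the numerator and both terms of the denominator by the same factor, so it leaves the result unchanged. That is the formal version of “contributes almost no evidence”.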

The important idea is that classification is based on probability, not certainty. A word does not automatically make an email spam. Instead, each word nudges the final judgement in one direction or the other.

For example:

“Meeting” might slightly favour legitimate mail.
“Invoice” might be nearly neutral.
“Guaranteed” might lean towards spam.
An unusual promotional phrase might strongly favour spam.

The filter combines all of these signals and calculates a final likelihood. [Medium]
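A minimal sketch of that combination step is shown below. The per-token probabilities are made-up numbers chosen to match the four examples in the list, and the `combine` function assumes equal prior odds for spam and ham, independent tokens, and probabilities strictly between 0 and 1. It works with logarithms so that long messages do not underflow to zero.

```python
import math


def combine(probabilities):
    """Combine per-token spam probabilities into one message-level score."""
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    diff = log_spam - log_ham  # log-odds in favour of spam
    # Numerically stable logistic function of the log-odds.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical learned values, for illustration only.
token_probabilities = {
    "meeting": 0.35,             # slightly favours legitimate mail
    "invoice": 0.50,             # nearly neutral
    "guaranteed": 0.70,          # leans towards spam
    "act now, risk free": 0.95,  # strongly favours spam
}

score = combine(token_probabilities.values())
print(f"spam score: {score:.3f}")
```

Removing “invoice” from the dictionary leaves the score unchanged, because a probability of 0.5 adds the same amount to both sides of the comparison.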

Why weak clues become stronger together

The most interesting feature of Bayesian filtering is that it does not need a single decisive indicator. Many weak clues can collectively become powerful evidence.

Consider a simple analogy. Seeing one raindrop does not prove it is raining. Seeing dark clouds alone does not prove it either. But dark clouds, falling raindrops, wet pavement, and people carrying umbrellas together make the conclusion much more convincing.

Bayesian spam filters work in a similar way. A message might contain several mildly suspicious features:

A marketing-style phrase.
An unusual punctuation pattern.
A word common in past spam campaigns.
A rarely seen sender format.
A collection of terms that often appear together in spam.

None of these clues alone may justify blocking the message. Combined, however, they can push the spam probability above a threshold where the filter becomes confident enough to classify the email as unwanted. [Gigamonkeys]
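The effect is easy to reproduce numerically. In the sketch below, every clue is equally weak (a spam probability of 0.65, an invented figure), clues are treated as independent with equal prior odds, and the 0.9 threshold is likewise only an assumed cut-off for the example.

```python
import math

THRESHOLD = 0.9  # hypothetical cut-off for calling a message spam


def combine(probabilities):
    """Combine independent per-clue spam probabilities (equal priors)."""
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


weak_clue = 0.65  # each clue on its own is only mildly suspicious
scores = [combine([weak_clue] * n) for n in range(1, 6)]

for n, score in enumerate(scores, start=1):
    verdict = "spam" if score > THRESHOLD else "not yet"
    print(f"{n} clue(s): {score:.3f} -> {verdict}")

# Smallest number of weak clues that pushes the score over the threshold.
first_over = next(n for n, s in enumerate(scores, start=1) if s > THRESHOLD)
```

Under these assumptions the score climbs from 0.65 with one clue to roughly 0.78, 0.86, 0.92 and 0.96 as clues are added, crossing the threshold at the fourth clue even though no single clue comes close to it.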

This is a key difference from rule-based systems. A rule-based filter often asks, “Did the message contain a forbidden feature?” A Bayesian filter asks, “What is the overall probability after considering all available evidence?” [Paul Graham]

A concrete example

Imagine a filter has learned the following tendencies from past mail:

The phrase “limited time offer” appears frequently in spam.
The word “investment” appears somewhat more often in spam than in legitimate mail.
The recipient rarely receives emails discussing investments.
Several formatting patterns in the message resemble previous spam.

Individually, none of these observations guarantees that the email is spam. A legitimate financial newsletter could contain all of them.

However, when the filter combines the probabilities associated with each clue, the cumulative evidence may indicate that the message is much more likely to be spam than ham. The final decision emerges from the interaction of many pieces of evidence rather than from any single trigger word. [Apache Wiki, J. Kågström]
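One way to see how the four observations interact is to express each as log-odds, where positive values push towards spam, negative values push towards ham, and the values simply add up. The probabilities below are invented for the example, and the calculation assumes independent clues and equal prior odds.

```python
import math

# Hypothetical per-clue spam probabilities for the message described above.
clues = {
    "phrase 'limited time offer'": 0.90,
    "word 'investment'": 0.60,
    "topic unusual for recipient": 0.60,
    "spam-like formatting": 0.70,
}


def log_odds(p):
    """Evidence carried by one clue: positive favours spam, negative ham."""
    return math.log(p / (1.0 - p))


contributions = {name: log_odds(p) for name, p in clues.items()}
total = sum(contributions.values())
score = 1.0 / (1.0 + math.exp(-total))  # convert log-odds back to probability

for name, value in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name:30s} {value:+.2f}")
print(f"combined spam probability: {score:.3f}")
```

For a recipient who regularly reads financial newsletters, a filter trained on their own mail would likely assign “investment” a value below 0.5, giving it a negative contribution and lowering the combined score. This is how the same message can be judged differently for different users.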

Why the approach was so influential

Bayesian filtering became influential because it addressed a fundamental weakness of manual rule writing. Human developers could not easily predict every new trick that spammers would invent. A probabilistic system could learn from examples and continually adjust its understanding of which patterns mattered. [Paul Graham]

Systems such as Apache SpamAssassin incorporated Bayesian learning for exactly this reason. After being trained on sufficient examples of spam and legitimate mail, the classifier could tailor itself to a user’s actual email environment and reduce both missed spam and mistaken blocks of legitimate messages. [SpamAssassin]

The broader lesson for artificial intelligence is that useful decisions often emerge from combining many imperfect signals. Bayesian spam filters demonstrate that learning systems do not always need a perfect rule. By accumulating numerous weak clues and turning them into probabilities, they can make surprisingly accurate judgements in complex, changing environments. [Gigamonkeys, J. Kågström]

Endnotes

Source: gigamonkeys.com
Link: https://gigamonkeys.com/book/practical-a-spam-filter
Source snippet
23. Practical: A Spam FilterHe called his approach Bayesian filtering after the statistical technique that he used to combine...
Source: spamassassin.apache.org
Title: SpamAssassin sa-learn
Link: https://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html
Source snippet
sa-learn - train SpamAssassin's Bayesian classifierThis tool will feed each mail to SpamAssassin, allowing it to 'learn' what...
Source: wiki.apache.org
Title: Wiki Bayes In Spam Assassin
Link: https://wiki.apache.org/spamassassin/BayesInSpamAssassin
Source snippet
Apache WikiBayesInSpamAssassin - Apache Software FoundationThe Bayesian classifier in Spamassassin tries to identify spam by looking at w...
Source: medium.com
Link: https://medium.com/analytics-vidhya/email-spam-classifier-using-naive-bayes-a51b8c6290d4
Source snippet
Email Spam Classifier Using Naive BayesNaive Bayes classifier technique has become a very popular method in mail filtering Email. E...
Source: spamassassin.apache.org
Link: https://spamassassin.apache.org/full/4.0.x/doc/sa-learn.html
Source snippet
sa-learnSpamAssassin 2.50 and later supports Bayesian spam analysis, in the form of the BAYES rules. This is a new feature, q...
Source: medium.com
Link: https://medium.com/data-science/na%C3%AFve-bayes-spam-filter-from-scratch-12970ad3dae7
Source: medium.com
Link: https://medium.com/%40myh809503699/naive-bayes-classifier-for-spam-detection-f4f85783a861
Source snippet
Naive Bayes Classifier for Spam Detection | by CaitlyncccOverview. The Naive Bayes Classifier is a fast, probabilistic machine learning a...
Source: medium.com
Link: https://medium.com/%40saudhaminiupdated/beyond-the-basics-mastering-text-classification-using-naive-bayes-b096980a4030
Source snippet
Mastering Text Classification using Naive BayesFalse Positives (FP): 13 instances incorrectly classified as spam. False Negatives (FN): 1...
Source: paulgraham.com
Link: https://www.paulgraham.com/spam.html
Source snippet
Paul GrahamA Plan for SpamSo as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual wor...
Source: paulgraham.com
Link: https://paulgraham.com/better.html?viewfullsite=1
Source snippet
Better Bayesian FilteringI discovered this algorithm after ``A Plan for Spam'' [1] was on Slashdot. Spam filtering is a subset of text cl...
Source: blogs.cornell.edu
Title: bayes theorem in email spam filtering
Link: https://blogs.cornell.edu/info2040/2018/10/27/bayes-theorem-in-email-spam-filtering/
Source snippet
CU Blog ServiceBayes' Theorem in email spam filtering27-Oct-2018 — Bayes' Theorem describes the conditional probability an event is going...
Source: jonkagstrom.com
Link: https://jonkagstrom.com/static/improvingnb.pdf
Source snippet
IMPROVING NAIVE BAYESIAN SPAM FILTERINGby J Kågström · 2005 · Cited by 29 — The Naive Bayesian classifier combines the probabilities of e...
Source: paulgraham.com
Link: https://www.paulgraham.com/better.html
Source snippet
Bayesian FilteringJan 10, 2003 — It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam...
Source: paulgraham.com
Link: https://www.paulgraham.com/antispam.html
Source: linkedin.com
Link: https://www.linkedin.com/posts/dhruv-roongta-421a7b214_paul-graham-never-worked-at-google-he-still-activity-7451275855835680768-cAwD
Source snippet
Paul Graham's Bayesian Spam Filter RevolutionIn August 2002 he wrote a 9-page essay called "A Plan for Spam." That essay quietly became t...

Published: August 2002
Source: linkedin.com
Link: https://www.linkedin.com/posts/mailwarmhq_happy-weekend-everyone-did-you-know-activity-7446211306275295232-T-Y-
Source snippet
t ran incoming messages through an old probability equation.
Source: cs.ubbcluj.ro
Link: https://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf
Source snippet
Paul Graham's approach has become fairly famous[2]. He introduced a new formula for...
Source: www1.se.cuhk.edu.hk
Title: Bayesian Spam Filter for Outlook
Link: https://www1.se.cuhk.edu.hk/~seem5680/lecture/Bayesian-Spam-Filter-for-Outlook.pdf
Source snippet
approach and spam filteringFeb 4, 2020 — The inventor and the main promoter of the idea to use the Bayesian approach in spam filtering so...
Source: lingualeo.com
Title: Paul Graham
Link: https://lingualeo.com/fa/jungle/paul-graham-a-plan-for-spam-103842
Source snippet
A Plan for Spam ترجمه به فارسیAn improved algorithm is described in Better Bayesian Filtering.) I think it's possible to stop spam, and t...
Source: classpages.cselabs.umn.edu
Link: https://classpages.cselabs.umn.edu/Spring-2020/csci4511W/spam.html
Source snippet
FilterJan 9, 2020 — A spam filter. Write a simple spam filter based on naive Bayes probability, following the steps outlined in A plan fo...
