How spam filters add up small clues

Bayesian filters show how learning systems combine word patterns into probabilities instead of relying on one perfect rule.

On this page

  • What labelled spam and ham examples teach the filter
  • How word probabilities become a spam score
  • Why weak clues become stronger together

Introduction

Bayesian spam filters are a classic example of how artificial intelligence can learn patterns rather than follow rigid rules. Instead of asking whether a message contains one forbidden word, a Bayesian filter examines many small hints and estimates how likely the message is to be spam. A single clue may be weak and unreliable, but dozens of weak clues can combine into strong evidence. This ability to accumulate probabilities made Bayesian filtering one of the most influential machine-learning techniques in the history of email spam detection. [Paul Graham]

What labelled spam and ham examples teach the filter

A Bayesian filter begins with examples. Messages that users have identified as spam are placed in one group, while legitimate messages, often called “ham”, are placed in another. The system studies these examples and records how often different words, phrases, and other text fragments appear in each category. [SpamAssassin]

Suppose a training set contains thousands of emails. The filter might discover that words such as “discount”, “winner”, or certain unusual spellings appear more frequently in spam than in ordinary correspondence. At the same time, it may learn that words related to a person’s workplace, hobbies, or regular contacts are common in legitimate messages. The filter is not told in advance which words matter; it learns those patterns from the labelled examples. [SpamAssassin]
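The counting step described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `train`, the whitespace tokenisation, and the four example messages are all made up for this sketch and do not come from any particular filter.

```python
from collections import Counter

def train(labelled_messages):
    """Count how many messages of each class contain each token.

    labelled_messages: iterable of (text, label) pairs,
    where label is either "spam" or "ham".
    """
    token_counts = {"spam": Counter(), "ham": Counter()}
    message_counts = Counter()
    for text, label in labelled_messages:
        message_counts[label] += 1
        # Count each token once per message, so a single repetitive
        # email cannot dominate the statistics.
        token_counts[label].update(set(text.lower().split()))
    return token_counts, message_counts

# Hypothetical training examples, labelled by a user.
examples = [
    ("huge discount for every winner", "spam"),
    ("claim your discount now", "spam"),
    ("project meeting moved to friday", "ham"),
    ("notes from the project meeting", "ham"),
]
token_counts, message_counts = train(examples)
print("discount:", token_counts["spam"]["discount"], "spam /",
      token_counts["ham"]["discount"], "ham")
print("meeting: ", token_counts["spam"]["meeting"], "spam /",
      token_counts["ham"]["meeting"], "ham")
```

Nothing in the code names “discount” or “meeting” as important. Their weight comes entirely from how often they appear in each labelled group.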

This learning process is one reason Bayesian filtering adapts better than fixed keyword lists. When spammers change their wording, new examples can teach the filter about the new patterns. As Paul Graham noted in his influential essay on Bayesian spam filtering, if spammers replace a blocked word with a disguised version such as “c0ck” instead of “cock”, the filter can learn that the new spelling itself has become a strong clue. [Paul Graham]

How word probabilities become a spam score

Once training is complete, the filter holds a probability estimate for each word or token it has seen. A token can be a word, a number, a character sequence, or another text fragment extracted from a message. [Apache Wiki]
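As a rough illustration of token extraction, the sketch below splits a message with a regular expression. The pattern is an assumption chosen for this example (letters, digits, apostrophes, dollar signs, and hyphens count as token characters); real filters use their own, usually richer, tokenisers.

```python
import re

# Hypothetical token pattern: runs of letters, digits, apostrophes,
# dollar signs, and hyphens. Everything else separates tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(text):
    """Return the set of lower-cased tokens found in a message."""
    return {match.lower() for match in TOKEN_PATTERN.findall(text)}

tokens = tokenize("WIN $1000 now!!! Don't miss this limited-time offer")
print(sorted(tokens))
```

Keeping characters such as `$` and `-` inside tokens lets fragments like `$1000` or `limited-time` survive as clues in their own right instead of being broken into less informative pieces.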

Imagine that a new email contains several tokens. For each one, the filter asks a question:

Based on past examples, how strongly does this token suggest spam rather than legitimate mail?

Some tokens may contribute almost no evidence. Others may lean slightly towards spam. A few may be highly suspicious. The filter then combines these probability estimates using Bayes’ theorem, usually under the simplifying (“naive”) assumption that tokens occur independently of one another, to produce an overall score representing how likely the entire message is to belong to the spam category. [CU Blog Service; Gigamonkeys]

The important idea is that classification is based on probability, not certainty. A word does not automatically make an email spam. Instead, each word nudges the final judgement in one direction or the other.
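One way to turn training counts into the per-token estimate described above is sketched below. It assumes spam and ham are equally likely beforehand and applies add-one smoothing; both choices, along with the function name and the counts, are assumptions for this sketch rather than the method of any specific filter.

```python
def token_spam_probability(spam_with, ham_with, n_spam, n_ham):
    """Estimate P(spam | token) from training counts, assuming equal priors.

    spam_with / ham_with: number of spam / ham messages containing the token.
    n_spam / n_ham: total number of spam / ham messages seen in training.
    """
    # Add-one smoothing keeps the estimate away from exactly 0 or 1, so a
    # token seen in only one class is strong evidence but never absolute proof.
    p_token_given_spam = (spam_with + 1) / (n_spam + 2)
    p_token_given_ham = (ham_with + 1) / (n_ham + 2)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

# Hypothetical counts from a corpus of 1000 spam and 1000 ham messages.
mostly_spam = token_spam_probability(300, 20, 1000, 1000)
neutral = token_spam_probability(50, 50, 1000, 1000)
mostly_ham = token_spam_probability(5, 200, 1000, 1000)
print(round(mostly_spam, 3), round(neutral, 3), round(mostly_ham, 3))
```

A value near 0.5 means the token says almost nothing. Values near 1 or 0 nudge the message towards spam or ham respectively, but no single token settles the question.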

For example:

  • “Meeting” might slightly favour legitimate mail.
  • “Invoice” might be nearly neutral.
  • “Guaranteed” might lean towards spam.
  • An unusual promotional phrase might strongly favour spam.

The filter combines all of these signals into a final likelihood. [Medium]
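To make the combination step concrete, here is a minimal sketch that applies Bayes’ theorem to the four examples in the list above. The numeric probabilities and the promotional phrase are invented for illustration, and the formula assumes independent tokens and equal prior odds of spam and ham.

```python
from math import prod

def combine(probabilities):
    """Combine per-token spam probabilities into one message-level probability.

    Assumes tokens are independent and that spam and ham are equally
    likely before any token is examined.
    """
    values = list(probabilities)
    spam_evidence = prod(values)
    ham_evidence = prod(1.0 - p for p in values)
    return spam_evidence / (spam_evidence + ham_evidence)

# Hypothetical per-token spam probabilities for the examples above.
token_probabilities = {
    "meeting": 0.30,              # slightly favours legitimate mail
    "invoice": 0.50,              # nearly neutral
    "guaranteed": 0.70,           # leans towards spam
    "act now, risk free": 0.95,   # made-up promotional phrase
}
score = combine(token_probabilities.values())
print(f"spam probability: {score:.3f}")
```

With these made-up numbers the result is 0.950: the mild ham signal from “meeting” and the mild spam signal from “guaranteed” cancel out, the neutral token changes nothing, and the strongly suspicious phrase decides the outcome.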

Why weak clues become stronger together

The most interesting feature of Bayesian filtering is that it does not need a single decisive indicator. Many weak clues can collectively become powerful evidence.

Consider a simple analogy. Seeing one raindrop does not prove it is raining. Seeing dark clouds alone does not prove it either. But dark clouds, falling raindrops, wet pavement, and people carrying umbrellas together make the conclusion much more convincing.

Bayesian spam filters work in a similar way. A message might contain several mildly suspicious features:

  • A marketing-style phrase.
  • An unusual punctuation pattern.
  • A word common in past spam campaigns.
  • A rarely seen sender format.
  • A collection of terms that often appear together in spam.

None of these clues alone may justify blocking the message. Combined, however, they can push the spam probability above a threshold where the filter becomes confident enough to classify the email as unwanted. [Gigamonkeys]
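The accumulation effect can be seen numerically. The sketch below assumes every clue is independent and equally weak, with a spam probability of 0.65 each, and uses a cut-off of 0.9. Both numbers are hypothetical and chosen only to show the trend.

```python
import math

def combined_probability(p, n_clues):
    """Spam probability after n independent clues of equal strength p."""
    # Independent evidence adds up in log-odds space.
    log_odds = n_clues * math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

THRESHOLD = 0.9   # hypothetical cut-off for marking a message as spam
WEAK_CLUE = 0.65  # hypothetical strength of each mildly suspicious feature

for n in range(1, 7):
    p = combined_probability(WEAK_CLUE, n)
    verdict = "spam" if p >= THRESHOLD else "not yet"
    print(f"{n} clue(s): {p:.3f} -> {verdict}")
```

Under these assumptions a single clue leaves the message at 0.65, well short of the cut-off, while the fourth clue pushes it past 0.9. Real clues are rarely fully independent, so actual filters see a less tidy version of the same effect.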

This is a key difference from rule-based systems. A rule-based filter often asks, “Did the message contain a forbidden feature?” A Bayesian filter asks, “What is the overall probability after considering all available evidence?” [Paul Graham]

A concrete example

Imagine a filter has learned the following tendencies from past mail:

  • The phrase “limited time offer” appears frequently in spam.
  • The word “investment” appears somewhat more often in spam than in legitimate mail.
  • The recipient rarely receives emails discussing investments.
  • Several formatting patterns in the message resemble previous spam.

Individually, none of these observations guarantees that the email is spam. A legitimate financial newsletter could contain all of them.

However, when the filter combines the probabilities associated with each clue, the cumulative evidence may indicate that the message is much more likely to be spam than ham. The final decision emerges from the interaction of many pieces of evidence rather than from any single trigger word. [Apache Wiki; Kågström]

Why the approach was so influential

Bayesian filtering became influential because it addressed a fundamental weakness of manual rule writing. Human developers could not easily predict every new trick that spammers would invent. A probabilistic system could learn from examples and continually adjust its understanding of which patterns mattered. [Paul Graham]

Systems such as Apache SpamAssassin incorporated Bayesian learning for exactly this reason. After being trained on sufficient examples of spam and legitimate mail, the classifier could tailor itself to a user’s actual email environment and reduce both missed spam and mistaken blocks of legitimate messages. [SpamAssassin]

The broader lesson for artificial intelligence is that useful decisions often emerge from combining many imperfect signals. Bayesian spam filters demonstrate that learning systems do not always need a perfect rule. By accumulating numerous weak clues and turning them into probabilities, they can make surprisingly accurate judgements in complex, changing environments. [Gigamonkeys; Kågström]

Endnotes

  1. Source: gigamonkeys.com
    Link: https://gigamonkeys.com/book/practical-a-spam-filter
    Source snippet

    23. Practical: A Spam Filter. He called his approach Bayesian filtering after the statistical technique that he used to combine...

  2. Source: spamassassin.apache.org
    Title: SpamAssassin sa-learn
    Link: https://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn.html
    Source snippet

    sa-learn - train SpamAssassin's Bayesian classifier. This tool will feed each mail to SpamAssassin, allowing it to 'learn' what...

  3. Source: wiki.apache.org
    Title: Wiki Bayes In Spam Assassin
    Link: https://wiki.apache.org/spamassassin/BayesInSpamAssassin
    Source snippet

    BayesInSpamAssassin - Apache Software Foundation. The Bayesian classifier in Spamassassin tries to identify spam by looking at w...

  4. Source: medium.com
    Link: https://medium.com/analytics-vidhya/email-spam-classifier-using-naive-bayes-a51b8c6290d4
    Source snippet

    Email Spam Classifier Using Naive Bayes. Naive Bayes classifier technique has become a very popular method in mail filtering Email. E...

  5. Source: spamassassin.apache.org
    Link: https://spamassassin.apache.org/full/4.0.x/doc/sa-learn.html
    Source snippet

    sa-learn. SpamAssassin 2.50 and later supports Bayesian spam analysis, in the form of the BAYES rules. This is a new feature, q...

  6. Source: medium.com
    Link: https://medium.com/data-science/na%C3%AFve-bayes-spam-filter-from-scratch-12970ad3dae7

  7. Source: medium.com
    Link: https://medium.com/%40myh809503699/naive-bayes-classifier-for-spam-detection-f4f85783a861
    Source snippet

    Naive Bayes Classifier for Spam Detection | by Caitlynccc. Overview. The Naive Bayes Classifier is a fast, probabilistic machine learning a...

  8. Source: medium.com
    Link: https://medium.com/%40saudhaminiupdated/beyond-the-basics-mastering-text-classification-using-naive-bayes-b096980a4030
    Source snippet

    Mastering Text Classification using Naive Bayes. False Positives (FP): 13 instances incorrectly classified as spam. False Negatives (FN): 1...

  9. Source: paulgraham.com
    Link: https://www.paulgraham.com/spam.html
    Source snippet

    Paul Graham, A Plan for Spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual wor...

  10. Source: paulgraham.com
    Link: https://paulgraham.com/better.html?viewfullsite=1
    Source snippet

    Better Bayesian Filtering. I discovered this algorithm after "A Plan for Spam" [1] was on Slashdot. Spam filtering is a subset of text cl...

  11. Source: blogs.cornell.edu
    Title: bayes theorem in email spam filtering
    Link: https://blogs.cornell.edu/info2040/2018/10/27/bayes-theorem-in-email-spam-filtering/
    Source snippet

    Bayes' Theorem in email spam filtering, 27-Oct-2018. Bayes' Theorem describes the conditional probability an event is going...

  12. Source: jonkagstrom.com
    Link: https://jonkagstrom.com/static/improvingnb.pdf
    Source snippet

    Improving Naive Bayesian Spam Filtering, by J Kågström, 2005. The Naive Bayesian classifier combines the probabilities of e...

  13. Source: paulgraham.com
    Link: https://www.paulgraham.com/better.html
    Source snippet

    Bayesian Filtering, Jan 10, 2003. It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam...

  14. Source: paulgraham.com
    Link: https://www.paulgraham.com/antispam.html

  15. Source: linkedin.com
    Link: https://www.linkedin.com/posts/dhruv-roongta-421a7b214_paul-graham-never-worked-at-google-he-still-activity-7451275855835680768-cAwD
    Source snippet

    Paul Graham's Bayesian Spam Filter Revolution. In August 2002 he wrote a 9-page essay called "A Plan for Spam." That essay quietly became t...

    Published: August 2002

  16. Source: linkedin.com
    Link: https://www.linkedin.com/posts/mailwarmhq_happy-weekend-everyone-did-you-know-activity-7446211306275295232-T-Y-
    Source snippet

    t ran incoming messages through an old probability equation.

  17. Source: cs.ubbcluj.ro
    Link: https://www.cs.ubbcluj.ro/~gabis/DocDiplome/Bayesian/000539771r.pdf
    Source snippet

    Paul Graham's approach has become fairly famous[2]. He introduced a new formula for...

  18. Source: www1.se.cuhk.edu.hk
    Title: Bayesian Spam Filter for Outlook
    Link: https://www1.se.cuhk.edu.hk/~seem5680/lecture/Bayesian-Spam-Filter-for-Outlook.pdf
    Source snippet

    approach and spam filtering, Feb 4, 2020. The inventor and the main promoter of the idea to use the Bayesian approach in spam filtering so...

  19. Source: lingualeo.com
    Title: Paul Graham
    Link: https://lingualeo.com/fa/jungle/paul-graham-a-plan-for-spam-103842
    Source snippet

    A Plan for Spam (Persian translation). An improved algorithm is described in Better Bayesian Filtering.) I think it's possible to stop spam, and t...

  20. Source: classpages.cselabs.umn.edu
    Link: https://classpages.cselabs.umn.edu/Spring-2020/csci4511W/spam.html
    Source snippet

    Filter, Jan 9, 2020. A spam filter. Write a simple spam filter based on naive Bayes probability, following the steps outlined in A plan fo...
