Thursday, November 1, 2007

Bayesian Filtering: Why Not?

The earlier pieces I did on Ron Paul spam (here and here) point to some problems with Bayesian filtering. Read on for a closer look at those problems.

Bayesian filtering, in a nutshell, breaks an email down into words and/or phrases and assigns each one a spam probability, based on how often that word or phrase has shown up in spam versus legitimate mail in the past. For example, let's assume that spammers are sending messages advertising a product called "Bradley". Over time, as those messages are categorized as spam, a Bayesian system would give an increasing spam score to the term "Bradley".
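To make the "Bradley" example concrete, here is a minimal Python sketch of per-word scoring. It is only an illustration under stated assumptions, not the formula of any particular product: the `TokenStats` class, the whitespace tokenizer, the add-one smoothing, and the sample messages are all made up for this example.

```python
from collections import Counter


def tokenize(text):
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


class TokenStats:
    """Hypothetical per-word bookkeeping for a toy Bayesian filter."""

    def __init__(self):
        self.spam_counts = Counter()  # number of spam messages containing each word
        self.ham_counts = Counter()   # number of legitimate messages containing each word
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each word once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_msgs += 1

    def spam_probability(self, token):
        # How often the word shows up in spam versus legitimate mail.
        # The +1 / +2 smoothing is an assumption that avoids division by zero.
        spam_rate = (self.spam_counts[token] + 1) / (self.spam_msgs + 2)
        ham_rate = (self.ham_counts[token] + 1) / (self.ham_msgs + 2)
        return spam_rate / (spam_rate + ham_rate)


stats = TokenStats()
stats.train("lunch at noon tomorrow", is_spam=False)
stats.train("meeting notes attached", is_spam=False)

# Each new "Bradley" spam that gets categorized pushes the word's score up.
scores = []
for _ in range(5):
    stats.train("buy bradley now", is_spam=True)
    scores.append(stats.spam_probability("bradley"))

print([round(s, 3) for s in scores])
```

The printed list rises with each training message, which is the "increasing spam score" behavior described above.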

This type of system began to be widely deployed three or four years ago, and was very effective for a couple of years.

Bayesian Poisoning

As always seems to be the case, the spammers switched tactics. They began padding their spam with a large number of (typically incoherent) words lifted from news sources or literature. This throws off the Bayesian system by

  • Adding to the non-spam score (since there are "good" words in the mail), and
  • Putting the good words in the spam list (once the padded message is marked as spam, its good words are counted as spam words too)
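The first of these effects can be sketched in a few lines of Python. This assumes a naive Bayes style combination of per-word probabilities in log-odds form, which is one common way to do it but not necessarily what any given filter uses. The probability table and the `message_spam_probability` helper are hypothetical values and names chosen for illustration.

```python
import math

# Hypothetical per-word spam probabilities that a trained filter might hold.
token_spam_prob = {
    "bradley": 0.95, "buy": 0.90, "now": 0.80,                           # "bad" words
    "senate": 0.10, "weather": 0.15, "committee": 0.10, "river": 0.20,  # "good" words
}


def message_spam_probability(tokens, table, unknown=0.5):
    """Combine per-word probabilities by summing log-odds (naive Bayes assumption)."""
    log_odds = 0.0
    for token in set(tokens):
        p = table.get(token, unknown)  # unseen words are treated as neutral
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


plain = "buy bradley now".split()
padded = plain + "senate weather committee river".split()

plain_score = message_spam_probability(plain, token_spam_prob)
padded_score = message_spam_probability(padded, token_spam_prob)

print(f"plain:  {plain_score:.3f}")
print(f"padded: {padded_score:.3f}")
```

With these made-up numbers, the same advertisement scores well above 0.5 on its own and below 0.5 once the good words are mixed in, so a filter with a 0.5 threshold would let the padded version through.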

Beginning about a year ago, I started receiving spam that had only the good text and no advertisement at all. This was a deliberate attempt to do nothing but poison Bayesian systems.
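A rough Python sketch of why text-only spam works as poison: it assumes the filter keeps learning from whatever gets marked as spam, and it uses a simple smoothed frequency ratio as the word score. The `ToyFilter` class, the scoring formula, and the sample messages are all hypothetical.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical filter that learns word counts from labeled mail."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.msgs = {True: 0, False: 0}

    def train(self, text, is_spam):
        self.counts[is_spam].update(set(text.lower().split()))
        self.msgs[is_spam] += 1

    def spam_probability(self, token):
        # Smoothed share of spam versus legitimate messages containing the word.
        spam_rate = (self.counts[True][token] + 1) / (self.msgs[True] + 2)
        ham_rate = (self.counts[False][token] + 1) / (self.msgs[False] + 2)
        return spam_rate / (spam_rate + ham_rate)


f = ToyFilter()
for _ in range(10):
    f.train("the senate committee met today", is_spam=False)
    f.train("buy bradley now", is_spam=True)

before = f.spam_probability("senate")

# Poison: messages with only borrowed "good" text, which end up marked as spam.
for _ in range(10):
    f.train("senate committee weather river", is_spam=True)

after = f.spam_probability("senate")

print(f"'senate' before poisoning: {before:.3f}")
print(f"'senate' after poisoning:  {after:.3f}")
```

In this toy setup the score for "senate" climbs after the poison messages are learned, so legitimate mail that uses the word becomes more likely to be flagged.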

The ongoing issue with Bayesian systems is that spammers have figured out fairly effectively how to confuse them, so that they either flag acceptable mail as spam or let spam go through. Fortunately, the state of the art in spam detection is being pushed forward as well.
