My earlier pieces on Ron Paul spam point to some problems with Bayesian filtering. Read on for further analysis of those problems.
Bayesian filtering, in a nutshell, breaks an email down into words and/or phrases, and then assigns a spam probability to each, based on how often that word has previously shown up in spam. For example, let's assume that spammers are sending messages advertising a product called "Bradley". Over time, as those messages are categorized as spam, a Bayesian system would give an increasing spam score to the term "Bradley".
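To make the "Bradley" example concrete, here is a minimal sketch of the per-word bookkeeping. It is not any particular product's implementation: the `TokenStats` class, the tokenizer, the add-one smoothing, and the sample messages are all my own assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TokenStats:
    """Hypothetical helper: counts the spam and non-spam messages each token appears in."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.spam_total = 0
        self.ham_total = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam_counts.update(tokens)
            self.spam_total += 1
        else:
            self.ham_counts.update(tokens)
            self.ham_total += 1

    def spam_probability(self, token):
        """Estimate how spammy a token is; add-one smoothing avoids exact 0 or 1."""
        s = (self.spam_counts[token] + 1) / (self.spam_total + 2)
        h = (self.ham_counts[token] + 1) / (self.ham_total + 2)
        return s / (s + h)


stats = TokenStats()
for _ in range(3):
    stats.train("Lunch meeting moved to noon", is_spam=False)

# Each newly categorized spam message nudges the score for "bradley" upward.
history = []
for _ in range(3):
    stats.train("Buy Bradley today", is_spam=True)
    history.append(stats.spam_probability("bradley"))

print(history)
```

With this made-up training data, each pass through the second loop prints a higher value for "bradley" than the one before, which is the "increasing spam score" described above.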
This type of system came into wide use three or four years ago, and was very effective for a couple of years.
Bayesian Poisoning
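As always seems to be the case, the spammers switched tactics. They began sending out their spam with a large number of (typically incoherent) words stolen from news sources or literature. This throws off the Bayesian system by mixing in "good" text, words it associates with legitimate mail, which pulls the message's overall spam score down.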
Beginning about a year ago, I started receiving spam that only had the good text, no advertisements. This was a deliberate attempt just to poison Bayesian systems.
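Both tricks can be sketched in a few lines. This is a toy model under stated assumptions, not how any real filter is built: the `TinyBayes` class, the word lists, the message counts, and the way per-word scores are combined (a naive independence assumption) are all hypothetical. It also assumes the good-text-only messages end up categorized as spam during training.

```python
import math
from collections import Counter


class TinyBayes:
    """Hypothetical toy filter: per-token counts plus a naive combined score."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))
        self.totals[is_spam] += 1

    def token_prob(self, token):
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        s = (self.counts[True][token] + 1) / (self.totals[True] + 2)
        h = (self.counts[False][token] + 1) / (self.totals[False] + 2)
        return s / (s + h)

    def score(self, tokens):
        # Sum the log-odds of each distinct token, then map back to a probability.
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_prob(token)
            log_odds += math.log(p / (1 - p))
        return 1 / (1 + math.exp(-log_odds))


good_words = ["senate", "weather", "chapter", "committee"]
ad_words = ["buy", "bradley", "discount"]

f = TinyBayes()
for _ in range(10):
    f.train(good_words, is_spam=False)
    f.train(ad_words, is_spam=True)

# Tactic 1: pad the advertisement with good text to dilute its score.
plain_score = f.score(ad_words)
padded_score = f.score(ad_words + good_words)

# Tactic 2: send good text only; once categorized as spam, it taints those words.
legit_before = f.score(good_words)
for _ in range(30):
    f.train(good_words, is_spam=True)
legit_after = f.score(good_words)

print(plain_score, padded_score)
print(legit_before, legit_after)
```

In this toy setup the padded advertisement scores well below the plain one, and a legitimate message made of the same good words scores noticeably higher after the poisoning messages have been learned as spam.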
The ongoing issue with Bayesian systems is that spammers have fairly effectively figured out how to confuse them, so that the filter either falsely flags acceptable mail as spam or lets spam go through. Fortunately, the state of the art in spam detection is being pushed forward as well.
This blog is now hosted at consciou.us
Thursday, November 1, 2007
Bayesian Filtering: Why Not?
Posted by Bradley at 3:41 PM
Labels: email, email governance, Ron Paul, spam