Mailing List SIMS@mail.stalker.com Message #15495
From: Christopher Bort <cbort@globalhomes.com>
Subject: Re: spam samples for bayes training
Date: Thu, 31 May 2007 14:04:17 -0700
To: SIMS Discussions <SIMS@mail.stalker.com>
X-Mailer: Mailsmith 2.2

On 05/31/07 11:31, lbutler@covisp.net (Lewis Butler) wrote:

> On 30-May-2007, at 13:16, Christopher Bort wrote:
>> It is highly recommended that you train your Bayes database only with messages that have actually been received at your own installation. Using someone else's spam and ham is likely to skew your database and result in inaccuracies. In other words, SpamAssassin needs to know what _you_ see as spam and ham, not what someone else sees.
>
> Spam AND ham, yes, but there is nothing wrong with using someone else's spamarchive to help train up your bayes.  While definitions of ham vary widely, the same cannot be said for spam.

Sure it can. I've seen a lot of different definitions of 'spam,' 'UCE,' 'UBE,' etc. They don't all agree on everything, starting with what to call it. Even within a given definition, what is or is not spam can be subjective.

As a case in point, I work for a company that publishes real estate advertising magazines. Our advertisers routinely send us ad copy by e-mail. In many contexts, much of it would look quite spammy. In our context it is not, and I need to make certain that such messages are not classified as spam. Also, our sales reps get a lot of e-mail advertisements and 'newsletters' all about the latest, greatest things happening in real estate, mortgages and generally bilking people out of as much money as possible for houses that are several times bigger than what they need. To me, a great deal of it is absolutely spam. To our sales people who are receiving it, it's keeping up with trends in the market to which they sell (i.e., realtors). As much as the stuff might turn my stomach, I cannot have SA classifying a lot of it as spam.

The point is still that the spam you train SpamAssassin with must reflect the nature of spam that your installation receives. Can you get away with using someone else's 'generic' spam archive for training? Sure, but you would be well advised to first review it thoroughly to make sure that it fairly closely matches the spam you need to catch on your own server(s).
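
To make the skew concrete, here is a toy sketch in Python. It is not SpamAssassin's actual Bayes computation, only a simplified per-token ratio, and the four small corpora are hypothetical. It shows how the same token ("mortgage") can look like pure spam when judged against a borrowed corpus and look mostly like ham when judged against mail of the kind a real estate publisher receives.

```python
def token_spam_probability(token, spam_msgs, ham_msgs):
    """Rough estimate of how 'spammy' a token is within one corpus.

    Simplified ratio of per-class message frequencies; this is an
    illustration, NOT the formula SpamAssassin uses internally.
    """
    spam_hits = sum(1 for m in spam_msgs if token in m.lower().split())
    ham_hits = sum(1 for m in ham_msgs if token in m.lower().split())
    spam_rate = spam_hits / len(spam_msgs) if spam_msgs else 0.0
    ham_rate = ham_hits / len(ham_msgs) if ham_msgs else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical corpora, invented for this example only.
borrowed_spam = ["low mortgage rates now", "refinance your mortgage today"]
borrowed_ham = ["meeting moved to friday", "lunch on tuesday"]
local_spam = ["low mortgage rates now", "cheap pills online"]
local_ham = ["ad copy for the mortgage broker listing",
             "new mortgage ad proof attached"]

borrowed = token_spam_probability("mortgage", borrowed_spam, borrowed_ham)
local = token_spam_probability("mortgage", local_spam, local_ham)
print(f"borrowed corpus: {borrowed:.2f}, local corpus: {local:.2f}")
```

With these made-up messages the borrowed corpus treats "mortgage" as a certain spam indicator, while the local corpus leans toward ham, which is the kind of mismatch that reviewing a generic archive before training is meant to catch.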

Another thing that has been implied in this thread, but not stated explicitly, is that Bayes training needs to continue on a regular basis beyond the initial training period. The mix of spam and ham that hits a given server now may not be exactly the same as what that server will see six months from now, so training needs to keep up with what is currently being received.
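
As a rough sketch of what regular retraining could look like, the Python below builds and optionally runs `sa-learn --spam` and `sa-learn --ham` against two directories of recently received, hand-sorted mail. The directory paths and the function names are assumptions for illustration; how you collect and sort the messages depends on your own setup.

```python
import subprocess


def build_sa_learn_commands(spam_dir, ham_dir):
    """Return the sa-learn invocations for one training pass."""
    return [
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]


def retrain(spam_dir, ham_dir, dry_run=True):
    """Run (or, with dry_run=True, only print) the training commands.

    spam_dir and ham_dir are hypothetical locations holding messages
    received at this installation and already sorted by a person.
    """
    commands = build_sa_learn_commands(spam_dir, ham_dir)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if sa-learn fails.
            subprocess.run(argv, check=True)
    return commands
```

Calling `retrain("/path/to/sorted/spam", "/path/to/sorted/ham")` prints the two commands without running them; scheduling a call with `dry_run=False` (from cron, for example) is one way to keep the database in step with current mail.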

> The best thing for accuracy is keeping SpamAssassin up to date (I am right now updating to 3.2), keeping its rules updated, and running some of the other rulesets.  I use RulesDuJour myself, but read up on the wiki:
>
> <http://wiki.apache.org/spamassassin/CustomRulesets>
> <http://www.exit0.us/index.php?pagename=RulesDuJour>

Yes, I recommended using additional rulesets and RulesDuJour in my previous post, as well as regular use of sa-update to keep SA's default rules up to date. Good point about keeping SA itself up to date. Subscribing to the SpamAssassin-Announce list is a good way to know when updates are available.

>> - Don't necessarily just accept the default spam threshold of 5.0. If you're getting too many false results, adjust the threshold in small increments and wait a while between adjustments to make sure that you don't get more false results than you can tolerate.
>
> I have to disagree.  I never adjust my threshold.  I throw things at bayes until they 'stick' and I may, on rare occasions, adjust the score for a rule.

Note that I said 'don't necessarily.' By that I meant that 5.0 may work for a lot of installations, but it is something to keep track of because adjusting it might be beneficial. This just points out once again that everyone's installation is different enough that there's no one-size-fits-all strategy. When I first installed SpamAssassin, before I had a well-trained Bayes database and before adding any custom rulesets, I got a lot of false positives with the threshold at 5.0 points, at least partly due to the circumstances described above. I raised the threshold to, IIRC, 7.5 or so to eliminate the false positives. Of course, that introduced a lot of false negatives. As I trained my Bayes database so that it became more finely tuned to my server's view of both spam and ham, and added a few rulesets, I gradually ratcheted the threshold downward. It's currently at 4.9. I get a fair amount of spam that scores around 5.0 or so, but only the occasional piece that scores at 4.9, so I've kept the threshold steady there for a while now.
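
The trade-off described above can be checked against your own mail before changing anything. The Python sketch below counts false positives and false negatives at several candidate thresholds. The sample scores are hypothetical, and it assumes a message is flagged when its score is greater than or equal to the threshold (the `required_score` setting in SpamAssassin's configuration).

```python
def count_errors(scored_messages, threshold):
    """Return (false_positives, false_negatives) at a given threshold.

    scored_messages: a list of (score, is_spam) pairs, where is_spam is
    the human verdict. Assumes a message is flagged as spam when
    score >= threshold.
    """
    false_positives = sum(
        1 for score, is_spam in scored_messages
        if score >= threshold and not is_spam
    )
    false_negatives = sum(
        1 for score, is_spam in scored_messages
        if score < threshold and is_spam
    )
    return false_positives, false_negatives


# Hypothetical hand-classified sample: (SpamAssassin score, really spam?)
sample = [
    (2.1, False), (5.3, False), (6.8, False),
    (4.9, True), (5.0, True), (7.6, True), (12.4, True),
]

results = {t: count_errors(sample, t) for t in (4.9, 5.0, 7.5)}
for threshold, (fp, fn) in results.items():
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

On this invented sample, raising the threshold to 7.5 removes the false positives at the cost of more missed spam, which is the same pattern as moving from 5.0 up to 7.5 and then ratcheting back down as Bayes training improves.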

--
Christopher Bort
Homes Magazine
email: <cbort@homesmagazine.com>
website: <http://www.homesmagazine.com/>
FAX: 775-284-1298
Phone: 775-284-1294

Real Estate Advertising/ Web Products/ Digital Printing Services

Serving: Wine Country Napa & Sonoma County, Marin County, San Francisco Bay
Area, Santa Cruz County, Monterey County, San Luis Obispo County & Santa
Barbara County, Reno/Sparks & Carson Valley, North Lake Tahoe & Truckee &
South Lake Tahoe
