Mailing List SIMS@mail.stalker.com Message #15491
From: Christopher Bort <cbort@globalhomes.com>
Subject: Re: spam samples for bayes training
Date: Wed, 30 May 2007 12:16:18 -0700
To: SIMS Discussions <SIMS@mail.stalker.com>
X-Mailer: Mailsmith 2.2
On 05/30/07 10:56, option8@option8.com (Charles Mangin) wrote:

> one of them that i'm working on now is bayesian filtering within
> spamassassin. i've got it marking/learning spam and ham, but it's slow
> going. what i'd love to find is a compilation of example spams that i
> can dump into my database so it can start with a critical mass of spam
> to check against. jumpstart the "training" process, so to speak.
>
> do any mail admins on this list know where to get such an archive,
> other than to open up one of my own domains to the floodgates and just
> capture it myself?

It is highly recommended that you train your Bayes database only with messages that have actually been received at your own installation. Using someone else's spam and ham is likely to skew your database and result in inaccuracies. In other words, SpamAssassin needs to know what _you_ see as spam and ham, not what someone else sees.

Obtaining ham messages is fairly easy. Just use known good messages that you and your users have received. For spam, if you have any stocks of old spam that you've received, start with that (as long as they're not too old; they should reflect the character of the spam that you are currently receiving). If you're having SA quarantine messages that it has classified as spam, feed those messages to sa-learn or `spamassassin -r` after reviewing them to weed out false positives. If you haven't done so already, turn on SA's autolearn feature, with the caveat that if you are autolearning, you need to keep an eye on false positives and false negatives so that you can correct SA if it autolearns a message incorrectly. Of course, you need to keep an eye on false results anyway, but it's especially important if you are autolearning.
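
As a rough illustration of the collection step, here is a minimal Python sketch that picks out messages that are not too old and builds the matching sa-learn command line. It assumes one message per file in a folder; the helper names, the folder layout and the 30-day cutoff in the usage note are my own placeholders, not anything SA prescribes. The only SA pieces it relies on are sa-learn's `--spam` and `--ham` options. It only builds the command, so you can still review the file list before running anything.

```python
import time
from pathlib import Path


def recent_messages(folder, max_age_days, now=None):
    """Return the files in `folder` modified within the last `max_age_days` days.

    Assumes one message per file (hypothetical layout; adjust for mbox).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff
    )


def sa_learn_command(kind, files):
    """Build (but do not run) an sa-learn command line for reviewed messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", *map(str, files)]
```

Once you have weeded the false positives out of, say, a quarantine folder, something like `subprocess.run(sa_learn_command("spam", recent_messages("quarantine", 30)), check=True)` would hand the remaining messages to sa-learn (the folder name and the 30 days are examples only).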

As you've noted, training a Bayes database can take a while, depending on how exactly you collect sample messages from which to learn, but if you do it carefully you'll end up with much better accuracy in the long term.

Some other ways to improve SA's accuracy, in no particular order:

- Run sa-update regularly to keep SA's default rulesets up to date.

- Consider using additional rulesets like the ones from <http://www.rulesemporium.com/rules.htm>. Before using any of these rulesets, though, review them to make sure that they are appropriate for your installation. Not all of them are appropriate to all situations.

- If you decide to use rulesets from Rules Emporium, use the RulesDuJour script to keep them up to date.

- If you can afford the computing and network overhead, consider turning on SA's network tests so that you can take advantage of things like the URIBL tests.

- If you can afford the computing and network overhead, consider installing and using Razor|Pyzor|DCC. The great majority of spam in my quarantine folder has hits on Razor and|or URIBL rules.

- Don't necessarily just accept the default spam threshold of 5.0. If you're getting too many false results, adjust the threshold in small increments and wait a while between adjustments to make sure that you don't get more false results than you can tolerate.
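
To make the threshold point concrete, here is a small Python sketch, assuming you have a hand-reviewed sample of messages with their SA scores, that counts false positives and false negatives at a few candidate thresholds. The function names and the sample scores are made up for illustration, and it assumes a message is flagged when its score is at or above the threshold.

```python
def false_results(scored, threshold):
    """Count (false_positives, false_negatives) at `threshold`.

    `scored` is a list of (score, is_spam) pairs from hand-reviewed mail.
    Assumption: a message is flagged when score >= threshold.
    """
    fp = sum(1 for score, is_spam in scored if score >= threshold and not is_spam)
    fn = sum(1 for score, is_spam in scored if score < threshold and is_spam)
    return fp, fn


def sweep(scored, thresholds):
    """Return [(threshold, false_positives, false_negatives), ...]."""
    return [(t, *false_results(scored, t)) for t in thresholds]


# Hypothetical reviewed sample: (SA score, was it really spam?)
sample = [(1.2, False), (4.8, True), (5.3, False), (7.9, True), (12.0, True)]

for threshold, fp, fn in sweep(sample, [4.5, 5.0, 5.5]):
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Which trade-off is tolerable is your call; the point is only to look at both kinds of false result on your own mail before and after each small adjustment.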

--
Christopher Bort
Homes Magazine
email: <cbort@homesmagazine.com>
website: <http://www.homesmagazine.com/>
FAX: 775-284-1298
Phone: 775-284-1294

Real Estate Advertising/ Web Products/ Digital Printing Services

Serving: Wine Country Napa & Sonoma County, Marin County, San Francisco Bay
Area, Santa Cruz County, Monterey County, San Luis Obispo County & Santa
Barbara County, Reno/Sparks & Carson Valley, North Lake Tahoe & Truckee &
South Lake Tahoe
