Mailing List SIMS@mail.stalker.com Message #15494
From: Lewis Butler <lbutler@covisp.net>
Subject: Re: spam samples for bayes training
Date: Thu, 31 May 2007 12:31:25 -0600
To: SIMS Discussions <SIMS@mail.stalker.com>
X-Mailer: Apple Mail (2.752.3)
On 30-May-2007, at 11:56, Charles Mangin wrote:
> sigh. it's been a while now, and i'm actually starting to miss the ease of administrating SIMS.

Been there... lemme tell you.

> now i've got all these advanced features that i must 1> learn how to use 2> learn how to configure and 3> learn how to fix when they break.

Agreed.

> do any mail admins on this list know where to get such an archive, other than to open up one of my own domains to the floodgates and just capture it myself?

That's one way.  I used to keep a large stack of spam (100K messages) for training purposes, but I finally got rid of it a year or so ago when it became obvious that old spam was useless for training: SpamAssassin already scored it so well to start with.

On 30-May-2007, at 13:16, Christopher Bort wrote:
> It is highly recommended that you train your Bayes database only with messages that have actually been received at your own installation. Using someone else's spam and ham is likely to skew your database and result in inaccuracies. In other words, SpamAssassin needs to know what _you_ see as spam and ham, not what someone else sees.

Spam AND ham, yes, but there is nothing wrong with using someone else's spam archive to help train up your Bayes database.  While definitions of ham vary widely, the same cannot be said for spam.
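
For anyone who wants to script the training run, here is a minimal Python sketch. It assumes sa-learn is on the PATH and that each corpus is a directory holding one message per file; the function names and the corpus/spam and corpus/ham paths are made up for illustration, not anything from my own setup.

```python
import shutil
import subprocess
from pathlib import Path


def sa_learn_command(kind: str, folder: Path) -> list[str]:
    """Build the sa-learn command line for one folder of messages."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # sa-learn takes --spam or --ham followed by a file or directory.
    return ["sa-learn", f"--{kind}", str(folder)]


def train(spam_dir: Path, ham_dir: Path) -> None:
    """Feed both corpora to Bayes; do nothing if sa-learn is missing."""
    if shutil.which("sa-learn") is None:
        print("sa-learn not found on PATH; nothing trained")
        return
    for kind, folder in (("spam", spam_dir), ("ham", ham_dir)):
        subprocess.run(sa_learn_command(kind, folder), check=True)


if __name__ == "__main__":
    # Hypothetical locations: a borrowed spam archive and your own ham.
    train(Path("corpus/spam"), Path("corpus/ham"))
```

The ham directory should still be your own mail, even when the spam side comes from someone else's archive.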

The best thing for accuracy is keeping SpamAssassin up to date (I am updating to 3.2 right now), keeping its rules updated, and running some of the other rule sets.  I use RulesDuJour myself, but read up on the wiki:

<http://wiki.apache.org/spamassassin/CustomRulesets>
<http://www.exit0.us/index.php?pagename=RulesDuJour>

> - Don't necessarily just accept the default spam threshold of 5.0. If you're getting too many false results, adjust the threshold in small increments and wait a while between adjustments to make sure that you don't get more false results than you can tolerate.

I have to disagree.  I never adjust my threshold.  I throw things at Bayes until they 'stick', and on rare occasions I may adjust the score for a rule.

My goal is to get ham to score 0 or below and spam to score 8 or above (the auto-delete point on most of my accounts).
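
To make those targets concrete, here is a rough Python sketch that sorts SpamAssassin scores into bands. The 5.0 and 8.0 cut-offs come from this thread; the band names and the bucket() helper are purely illustrative and are not part of SpamAssassin.

```python
TAG_THRESHOLD = 5.0   # default spam threshold quoted above
AUTO_DELETE = 8.0     # auto-delete point mentioned in this message
HAM_TARGET = 0.0      # ham should score at or below this


def bucket(score: float) -> str:
    """Place one message score into an illustrative band."""
    if score >= AUTO_DELETE:
        return "auto-delete"
    if score >= TAG_THRESHOLD:
        return "tagged spam"
    if score <= HAM_TARGET:
        return "clean ham"
    # Between 0 and the threshold: worth feeding to Bayes as ham or spam.
    return "gray zone"


if __name__ == "__main__":
    for s in (-1.2, 3.1, 5.0, 9.4):
        print(s, bucket(s))
```

Anything that lands in the gray zone or the tagged-spam band is a candidate for more training, rather than a reason to move the threshold.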

--
Lewis Butler
303.564.2512

