Mailing List SIMS@mail.stalker.com Message #15492
From: Bill Cole <listbill@scconsult.com>
Subject: Re: spam samples for bayes training
Date: Wed, 30 May 2007 15:54:08 -0400
To: SIMS Discussions <SIMS@mail.stalker.com>
At 12:16 PM -0700 5/30/07, Christopher Bort imposed structure on a stream of electrons, yielding:
On 05/30/07 10:56, option8@option8.com (Charles Mangin) wrote:

one of them that i'm working on now is bayesian filtering within
spamassassin. i've got it marking/learning spam and ham, but it's slow
going. what i'd love to find is a compilation of example spams that i
can dump into my database so it can start with a critical mass of spam
to check against. jumpstart the "training" process, so to speak.

do any mail admins on this list know where to get such an archive,
other than to open up one of my own domains to the floodgates and just
capture it myself?

It is highly recommended that you train your Bayes database only with messages that have actually been received at your own installation. Using someone else's spam and ham is likely to skew your database and result in inaccuracies. In other words, SpamAssassin needs to know what _you_ see as spam and ham, not what someone else sees.

AMEN!

Training a Bayes database from someone else's spam/ham corpus is a path to trouble. Aside from the subjective problem that mail you want may have been reported as spam by someone else, and vice versa, spam has a significant ephemeral quality that causes trouble with any corpus that isn't extremely current. Because Bayes filtering scores a message by the tokens it has already seen in training, you can easily get significantly worse results from a large, aged database than from a small but very current one.
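
To illustrate the point about aged corpora, here is a minimal Python sketch. It is not SpamAssassin's implementation: the function names, the sample messages, and the crude averaging of token scores are all assumptions made for illustration, and real filters combine token probabilities more carefully. It only shows how a message full of tokens that the aged corpus never saw ends up near or below neutral, while a handful of current samples pushes it well above.

```python
from collections import Counter

def train(messages):
    """Count how many messages each token appears in (hypothetical helper)."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts, len(messages)

def token_spamminess(token, spam, ham, strength=1.0, prior=0.5):
    """Smoothed estimate that a token indicates spam.

    Tokens never seen in training fall back to the neutral prior.
    """
    (spam_counts, n_spam), (ham_counts, n_ham) = spam, ham
    s = spam_counts[token] / n_spam if n_spam else 0.0
    h = ham_counts[token] / n_ham if n_ham else 0.0
    n = spam_counts[token] + ham_counts[token]
    p = s / (s + h) if (s + h) else prior
    return (strength * prior + n * p) / (strength + n)

def score(message, spam, ham):
    """Average token spamminess; a simplification, not a real combiner."""
    tokens = set(message.lower().split())
    return sum(token_spamminess(t, spam, ham) for t in tokens) / len(tokens)

# Made-up sample data: your own ham, a large old spam corpus,
# and a small set of spam that arrived recently.
ham = train([
    "meeting agenda for monday",
    "please review the stock inventory report",
    "lunch on friday",
    "server backup report attached",
])
aged_spam = train(["refinance your mortgage today"] * 50
                  + ["cheap meds online pharmacy"] * 50)
current_spam = train([
    "hot stock alert buy now",
    "stock alert huge gains expected",
    "penny stock alert act now",
])

incoming = "urgent stock alert buy this penny stock"
aged_score = score(incoming, aged_spam, ham)
current_score = score(incoming, current_spam, ham)
print(f"aged corpus:    {aged_score:.3f}")
print(f"current corpus: {current_score:.3f}")
```

With the 100-message aged corpus the incoming spam scores below the neutral 0.5, because the only token the database recognizes ("stock") has been seen only in ham. With just three current samples the same message scores above 0.5.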


--
Bill Cole                                  bill@scconsult.com
