Mailing List Message #14330
From: Bill Cole <>
Subject: Re: Spammers, Viruses and Attachments
Date: Fri, 12 Mar 2004 08:54:54 -0500
To: SIMS Discussions <>
At 3:47 PM -0500 3/11/04, Charles Yeomans  imposed structure on a stream of electrons, yielding:
On Mar 11, 2004, at 3:40 PM, Timothy Binder wrote:

On Mar 11, 2004, at 2:59 PM, Neil Herber wrote:

A disadvantage I have found with this approach is that some Java-based email address verifiers on web forms do not accept addresses containing hyphens.

Many web-based email address "checkers" are brain dead -- Java, JavaScript, PHP, etc. (I know, I had to fix a web site that had one.) A lot of people assume that email addresses can contain only letters, numbers, and underscores, and sometimes dashes or periods. This has never been true -- sendmail for a very long time has allowed addresses of the form "", which most of these "checkers" will reject as "invalid".

The fact is that RFCs allow for virtually anything in the local portion of the email address. It was an explicit design choice, since the Internet connected so many types of email systems. (Anyone remember bang (!) addressing?)

In fact, it is not possible to write a regular expression that can check an arbitrary e-mail address, as defined in RFC 822, for validity. Anyone who claims they've done it has more or less proved that they don't really understand the problem.

True, but RFC822 and its successor RFC2822 do not define addresses for e-mail headers in precisely the way that RFC821/2821 do for SMTP. It IS possible to write a regular expression for a legal SMTP address. The full 822 feature set (with embedded whitespace and comments) makes no sense to use unless you are intentionally adapting to a particular idiosyncratic mail system or obfuscating an address from weak parsers. In short, you CAN write an RE that validates the reduced addresses legal in RFC2821 SMTP, provided they use no deprecated routing features. RFC2821 includes a specification that can be translated into an RE. The real hassle in 822 address parsing is the optional embedded whitespace and commenting.
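To make the "reduced address" idea concrete, here is a rough Python sketch of an RE built from the RFC2821 Mailbox grammar (Local-part "@" Domain). It is an illustration under assumptions, not a complete validator: it leaves out address literals such as [192.0.2.1] and source routes, simplifies the quoted-string form, and ignores length limits. The names MAILBOX_RE and is_smtp_mailbox are made up for this example.

```python
import re

# atext as listed in RFC 2822: letters, digits, and this set of punctuation.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
ATOM = ATEXT + r"+"
# Dot-string = Atom *("." Atom): no leading, trailing, or doubled dots.
DOT_STRING = ATOM + r"(?:\." + ATOM + r")*"
# Simplified quoted-string (assumption): printable ASCII other than '"' and
# '\', or a backslash followed by any printable ASCII character.
QUOTED_STRING = r'"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"'
LOCAL_PART = r"(?:" + DOT_STRING + r"|" + QUOTED_STRING + r")"
# sub-domain: starts and ends with a letter or digit, hyphens allowed inside.
SUB_DOMAIN = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# Domain = sub-domain 1*("." sub-domain); address literals are omitted here.
DOMAIN = SUB_DOMAIN + r"(?:\." + SUB_DOMAIN + r")+"

MAILBOX_RE = re.compile(LOCAL_PART + "@" + DOMAIN)


def is_smtp_mailbox(address: str) -> bool:
    """Return True if the whole string matches the reduced Mailbox form."""
    return MAILBOX_RE.fullmatch(address) is not None
```

With this sketch, hyphens, plus signs and quoted local parts are accepted, while an 822-style comment such as user(comment)@example.com is rejected, because comments are not part of the SMTP form.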

As a useful side effect of that, 822 addresses are specified for mailto URLs, which makes it possible to use quite odd-looking strings in them which are perfectly valid and can be parsed to simple SMTP addresses but which will look broken to nearly all spammer tools that harvest addresses (and to Microsoft mailers). For example, I have this on one of my web pages:

<a href="mailto:bill-dnsblfaq(Bill%20Cole-%20DNSBL%20FAQ)">

Perfectly legal. Breaks Outlook. In almost a year with the address there I have had zero spam to the address, while every non-obfuscated address on my website has been scraped and spammed.
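For anyone wondering how a conforming parser gets from a string like that to a plain SMTP address, a minimal Python sketch follows. It percent-decodes the mailto URL and drops the parenthesised 822 comments. The function names are hypothetical, and the domain example.invalid in the usage note is a placeholder, not the real address. The sketch does not fold or remove embedded whitespace outside comments, which a full 822 parser would also have to handle.

```python
from urllib.parse import unquote


def strip_822_comments(addr: str) -> str:
    """Remove parenthesised comments (which may nest) outside quoted strings."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string"; only entered at depth 0
    i = 0
    while i < len(addr):
        ch = addr[i]
        if ch == "\\" and i + 1 < len(addr):
            # A backslash escapes the next character; keep the pair only
            # when we are outside a comment.
            if depth == 0:
                out.append(addr[i:i + 2])
            i += 2
            continue
        if in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            if ch == '"':
                in_quotes = True
            out.append(ch)
        i += 1
    return "".join(out).strip()


def mailto_to_smtp_address(href: str) -> str:
    """Reduce a mailto URL to its address with comments removed."""
    if not href.lower().startswith("mailto:"):
        raise ValueError("not a mailto URL")
    # Drop any "?subject=..." part before decoding, so an encoded "?" survives.
    addr = href[len("mailto:"):].split("?", 1)[0]
    return strip_822_comments(unquote(addr))
```

Calling mailto_to_smtp_address("mailto:bill-dnsblfaq(Bill%20Cole-%20DNSBL%20FAQ)@example.invalid") yields "bill-dnsblfaq@example.invalid". A harvester that just grabs everything after "mailto:" sees the parentheses and percent escapes and gives up.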

Bill Cole                        
