My Plan for Spam

by Billy Biggs <vektor@dumbterm.net>, 10-12 Jun 2004

On email

I have always received tons of email. Since my first email account, I have kept up to date with Internet events by subscribing to lots of mailing lists. Since I cannot read them all, I have tended to leave my incoming mail unfiltered and just scan for senders or subjects that interest me, hoping to notice any personal email.

I have never worried much about spam. I strongly believe that if I make it difficult for people to email me, then they simply will not bother. I value email as a tool, and so my email address is plastered liberally across the Internet. I believe that obfuscated email addresses are incredibly annoying and clumsy, and that challenge-response systems destroy the usefulness of email. My argument has always been that the receiver of email should bear the responsibility of filtering. My inbox has always received spam, but it was usually obvious enough that I could easily follow my lists and personal email.

A spam problem

Recently, the number of email viruses and spam messages I receive has grown so much that my manual "scan and pray" filtering system can no longer cope. In May and June of this year, more than 20 megabytes of email per day went into my inbox, beefed up by the large attachments of email viruses.

  <zer0|work> holy crap
  <zer0|work> you get a ridiculous amount of spam  

Here is a plot of the amount of spam and virus email I received per day for the first 145 days of 2004.

  [Figure: Spam and virus emails for 2004]

I went from about 200-300 total per day in January up to over 1000 per day in May. My spam count increased at a rate of about 7 emails per day. It is time for me to live up to my argument and deal with my spam, without changing my habit of having one, and only one, incredibly public email address.

So what shall I do? Let's see the plan.

My plan

I have selected two tools to help me with my spam problem. First was SpamAssassin, since it was easily available and well respected. However, since it was designed to combat spam and not email viruses, I selected Clam AntiVirus to handle this job on a recommendation from the SpamAssassin website. I use the Debian packages and daemon versions of both software packages.

The final plan I chose was as follows:

  1. Filter all incoming mail into one folder per day.
  2. Filter out of that all traffic from every mailing list I am on.
  3. Run ClamAV on whatever is left.
  4. Run SpamAssassin on whatever is left.
  5. Filter remaining spam and virus mail by hand.
  6. Read email.
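
The ordering of steps 2 to 4 can be sketched in Python as below. This is only an illustration, not the setup described here: the header names in LIST_HEADERS are an assumption about how list traffic might be recognized, and has_virus and is_spam are hypothetical callables standing in for whatever invokes ClamAV and SpamAssassin.

```python
from email import message_from_bytes
from email.message import Message
from typing import Callable

# Headers that mailing-list software commonly adds. Which headers a given
# list actually sets is an assumption; adjust for the lists involved.
LIST_HEADERS = ("List-Id", "List-Post", "Mailing-List")


def is_list_mail(msg: Message) -> bool:
    """Return True if the message carries any of the list headers."""
    return any(msg.get(header) is not None for header in LIST_HEADERS)


def classify(
    raw: bytes,
    has_virus: Callable[[bytes], bool],  # hypothetical wrapper around a virus scanner
    is_spam: Callable[[bytes], bool],    # hypothetical wrapper around a spam filter
) -> str:
    """Apply the plan's order: lists first, then viruses, then spam."""
    msg = message_from_bytes(raw)
    if is_list_mail(msg):
        return "lists"   # skimmed separately, never scanned
    if has_virus(raw):
        return "virus"
    if is_spam(raw):
        return "spam"
    return "inbox"       # left for manual filtering and reading
```

Because list mail is diverted first, the two scanners only ever see what might be personal email, which is the point of step 2.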

Justification

Here are the interesting ideas from my plan:

  1. Daily folders

    This was necessary based on the size of my inboxes. Daily folders allow me to deal with problems in filtering in a sane way. When the month is over, a script can compact the filtered results into a single mailbox.

  2. Mailing lists

    Most mailing lists have their own solutions for spam filtering, and those which don't I can live with skimming. To save both time and energy, I decided to filter out all mailing list traffic into a separate folder so I can skim that in my traditional way. This way the job of the filtering tools is reduced to worrying only about my personal email.
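
As a rough sketch of the daily-folder idea, the following Python uses the standard mailbox module to name one mbox file per day and to merge a finished month into a single mailbox. The directory layout and file names are hypothetical, chosen only so that sorting by name gives chronological order.

```python
import mailbox
from datetime import date
from pathlib import Path


def daily_folder(root: Path, day: date) -> Path:
    """Path of the mbox for one day, e.g. root/2004-06/2004-06-10.mbox.

    The naming scheme is an assumption made for this sketch.
    """
    return root / day.strftime("%Y-%m") / (day.isoformat() + ".mbox")


def compact_month(root: Path, year: int, month: int) -> Path:
    """Copy every daily mbox of a month into one monthly mbox, oldest first."""
    month_dir = root / f"{year:04d}-{month:02d}"
    target = root / f"{year:04d}-{month:02d}.mbox"
    combined = mailbox.mbox(str(target))
    try:
        # ISO dates in the file names sort chronologically.
        for daily in sorted(month_dir.glob("*.mbox")):
            box = mailbox.mbox(str(daily))
            try:
                for message in box:
                    combined.add(message)
            finally:
                box.close()
        combined.flush()
    finally:
        combined.close()
    return target
```

The sketch leaves the daily files in place, so a mistake in filtering can still be corrected per day before the originals are removed.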

Results

It has been a battle to train SpamAssassin on my spam. I have attempted to do this by retroactively filtering mail. I am finding a lot of old personal email that I missed in all the spam!

  Date         Mail  Viruses   Spam            Spam      Percentage
                     (ClamAV)  (SpamAssassin)  (manual)  spam
  -----------  ----  --------  --------------  --------  ----------
  10 Jun 2004     8       353             764       104      99.34%
  9 Jun 2004     16       462             998        59      98.94%
  8 Jun 2004      7       415            1106        45      99.55%
  7 Jun 2004      9       435             556       850      99.51%
  6 Jun 2004      1       388             394       342      99.91%
  5 Jun 2004      1       462             381       483      99.92%
  15 Mar 2004    14       167             349        41      97.49%
  10 Mar 2004    12       152             306        59      97.68%

Some notes on the numbers: SpamAssassin was well trained by the time I filtered June 9th and the two dates from March. June 7th was a fluke: I was hit badly by some sort of virus, and SpamAssassin did not catch it at all.

That all said, even on the 10th of March, the number of spam messages I had to filter manually far exceeded the number of legitimate emails I received.
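
For reference, here is a small Python sketch of how a percentage like the one in the last column can be computed. The formula is an assumption (viruses plus both spam counts, divided by everything received that day); it lands within a few hundredths of a percentage point of the listed figure for the rows tried below, but it may not be exactly how the table was produced.

```python
def spam_percentage(mail: int, viruses: int, spam_auto: int, spam_manual: int) -> float:
    """Share of a day's messages that were unwanted, in percent.

    Assumed formula: (viruses + all spam) / (mail + viruses + all spam).
    """
    unwanted = viruses + spam_auto + spam_manual
    total = mail + unwanted
    return 100.0 * unwanted / total


# Rows for 10 Jun 2004 and 7 Jun 2004, taken from the table.
for row in [(8, 353, 764, 104), (9, 435, 556, 850)]:
    print(f"{spam_percentage(*row):.2f}%")
```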

Conclusion

Unfortunately, my solution requires a lot of manual work before an inbox on any given day is clean. While I have turned a hopeless problem into something manageable, having to manually tag 50 emails per day is a daunting task. I hope that the classifier's accuracy can improve, but given the huge proportion of spam to non-spam that I receive, I am worried.

Going forward, the problems I have to deal with are:

  1. The amount of spam I receive is so large that I have almost no hope of discovering false positives.
  2. There is still too much spam not being correctly detected. I hope this can improve.
  3. I have a few gigabytes of archived mail and spam and therefore must use bzip2 compression on most of it. Managing a large number of mailboxes, many of them compressed, is proving challenging.
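
On the third point, one possible way to inspect a bzip2-compressed mailbox is sketched below in Python, using only the standard bz2 and mailbox modules. It assumes the archives are mbox files compressed whole, and the function name is made up for this example. Since mailbox.mbox needs a real file, the sketch decompresses to a temporary copy first, which costs disk space for large archives.

```python
import bz2
import mailbox
import os
import shutil
import tempfile


def count_messages_bz2(path: str) -> int:
    """Count the messages in a bzip2-compressed mbox file.

    Assumes the archive is a single mbox compressed as a whole.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        plain = os.path.join(tmpdir, "archive.mbox")
        # Stream-decompress so the whole archive is never held in memory.
        with bz2.open(path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(plain, create=False)
        try:
            return len(box)
        finally:
            box.close()
```

The same decompress-then-open pattern would work for searching an archive for a missed personal email, at the cost of one temporary copy per archive.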

If you have any thoughts on my problems with spam, please feel free to drop me an email. Hopefully, I will now actually have an opportunity to read it.