by Billy Biggs <vektor@dumbterm.net>, 10-12 Jun 2004
I have always received tons of email. Since my first email account, I have kept up to date with Internet events by subscribing to lots of mailing lists. Since I cannot read them all, I have tended to leave my incoming mail unfiltered and just scan for senders or subjects that interest me, hoping to notice any personal email.
I have never worried much about spam. I strongly believe that if I make it difficult for people to email me, then they simply will not bother. I value email as a tool, and so my email address is plastered liberally across the Internet. I believe that obfuscated email addresses are incredibly annoying and clumsy, and that challenge-response systems destroy the usefulness of email. My argument has always been that the receiver of email should bear the responsibility of filtering. My inbox has always received spam, but it was usually obvious enough that I could easily follow my lists and personal email.
Recently, the number of email viruses and spam messages I receive has gone up enough that my manual "scan and pray" filtering system can no longer function. In May and June of this year, more than 20 megabytes of email per day has gone into my inbox, beefed up by the large attachments of email viruses.
<zer0|work> holy crap
<zer0|work> you get a ridiculous amount of spam
Here is a plot of the amount of spam and virus email I received per day for the first 145 days of 2004.
I went from about 200-300 total per day in January to over 1000 per day in May. My daily spam count grew by about 7 emails each day. It is time for me to live up to my argument and deal with my spam, without changing my habit of having one, and only one, incredibly public email address.
So what shall I do? Let's see the plan.
I have selected two tools to help me with my spam problem. The first is SpamAssassin, since it was easily available and well respected. However, since it is designed to combat spam and not email viruses, I also selected Clam AntiVirus to handle that job, on a recommendation from the SpamAssassin website. I use the Debian packages and the daemon versions of both.
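To make the division of labour concrete, here is a minimal Python sketch of how the two daemons could be chained for a single message. It assumes the `clamdscan` and `spamc` client programs are installed and that their exit codes follow the documented conventions (`clamdscan` exits with 1 when a virus is found, `spamc -c` exits with 1 when the message scores as spam). The function names and folder labels are invented for illustration.

```python
import subprocess


def classify(clam_rc: int, spam_rc: int) -> str:
    """Map the two exit codes to a folder label (labels are illustrative)."""
    if clam_rc == 1:   # clamdscan: 1 means a virus was found
        return "virus"
    if spam_rc == 1:   # spamc -c: 1 means the message scored as spam
        return "spam"
    return "inbox"     # clean, or a scanner error: keep the mail


def scan(raw_message: bytes) -> str:
    """Run one raw message through clamd first, then spamd."""
    clam = subprocess.run(["clamdscan", "--no-summary", "-"],
                          input=raw_message, capture_output=True)
    if clam.returncode == 1:
        return classify(1, 0)  # no need to ask SpamAssassin about a virus
    spam = subprocess.run(["spamc", "-c"],
                          input=raw_message, capture_output=True)
    return classify(clam.returncode, spam.returncode)
```

Checking for viruses first keeps the large attachments away from SpamAssassin, and treating scanner errors as "inbox" errs on the side of never losing real mail.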
The final plan I chose was as follows:
Here are the interesting ideas from my plan:
This was necessary based on the size of my inboxes. Daily folders allow me to deal with problems in filtering in a sane way. When the month is over, a script can compact the filtered results into a single mailbox.
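A minimal Python sketch of such a compaction script follows. It assumes the daily folders are mbox files named `YYYY-MM-DD` inside one directory; that naming scheme, the function name `compact_month`, and the output path are assumptions made for the example, not a description of the real layout.

```python
import mailbox
from pathlib import Path


def compact_month(daily_dir: str, month: str, out_path: str) -> int:
    """Append every message from the daily mboxes of one month
    (files named like 2004-06-10) to a single monthly mbox.
    Returns the number of messages copied."""
    monthly = mailbox.mbox(out_path)  # created if it does not exist
    copied = 0
    monthly.lock()
    try:
        # Sorted file names keep the messages in day order.
        for day_file in sorted(Path(daily_dir).glob(f"{month}-??")):
            daily = mailbox.mbox(str(day_file), create=False)
            for message in daily:
                monthly.add(message)
                copied += 1
            daily.close()
        monthly.flush()
    finally:
        monthly.unlock()
        monthly.close()
    return copied
```

The sketch only copies; deleting the daily files is left as a separate step so that a mistake in filtering can still be corrected afterwards.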
Most mailing lists have their own solutions for spam filtering, and those which don't I can live with skimming. To save both time and energy, I decided to filter out all mailing list traffic into a separate folder so I can skim that in my traditional way. This way the job of the filtering tools is reduced to worrying only about my personal email.
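One way to recognise mailing list traffic is to look at the headers that list software commonly adds. The Python sketch below is only a heuristic under that assumption: the header set and the function name `is_list_mail` are choices made for the example, and a list that sets none of these headers would need its own rule.

```python
from email import message_from_string
from email.message import Message

# Headers commonly added by mailing list software (not an exhaustive set).
LIST_HEADERS = ("List-Id", "List-Post", "List-Unsubscribe", "Mailing-List")


def is_list_mail(msg: Message) -> bool:
    """Return True if the message looks like mailing list traffic."""
    if any(msg.get(header) for header in LIST_HEADERS):
        return True
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence == "list"


# Usage: list mail goes to the skim folder, everything else to the filters.
sample = message_from_string("List-Id: <dev.example.org>\nSubject: hi\n\nbody")
folder = "lists" if is_list_mail(sample) else "to-filter"
```

Since spam can forge these headers too, the list folder still has to be skimmed by eye, which matches the intent above.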
It has been a battle to train SpamAssassin on my spam. I have attempted to do this by retroactively filtering mail. I am finding a lot of old personal email that I missed in all the spam!
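Retroactive training of this kind can be scripted around SpamAssassin's `sa-learn` tool, which accepts `--spam` or `--ham` together with `--mbox` for mbox files. The Python sketch below assumes the hand-sorted mail already sits in separate spam and non-spam mbox files; the function names and file paths are hypothetical.

```python
import subprocess


def sa_learn_command(mbox_path: str, is_spam: bool) -> list:
    """Build the sa-learn command line for one hand-sorted mbox file."""
    kind = "--spam" if is_spam else "--ham"
    return ["sa-learn", kind, "--mbox", mbox_path]


def train(mbox_path: str, is_spam: bool) -> None:
    """Feed one mbox to the Bayes classifier; raises if sa-learn fails."""
    subprocess.run(sa_learn_command(mbox_path, is_spam), check=True)
```

For example, `train("2004-06-09.spam", True)` followed by `train("2004-06-09.ham", False)` would teach the classifier both sides of one day's mail. With so little non-spam available, every recovered personal email is worth feeding in as ham.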
Date | Non-spam email | Viruses (ClamAV) | Spam (SpamAssassin) | Spam (manual) | Percentage spam |
---|---|---|---|---|---|
10 Jun 2004 | 8 | 353 | 764 | 104 | 99.34% |
9 Jun 2004 | 16 | 462 | 998 | 59 | 98.94% |
8 Jun 2004 | 7 | 415 | 1106 | 45 | 99.55% |
7 Jun 2004 | 9 | 435 | 556 | 850 | 99.51% |
6 Jun 2004 | 1 | 388 | 394 | 342 | 99.91% |
5 Jun 2004 | 1 | 462 | 381 | 483 | 99.92% |
15 Mar 2004 | 14 | 167 | 349 | 41 | 97.49% |
10 Mar 2004 | 12 | 152 | 306 | 59 | 97.68% |
Some notes on the numbers: SpamAssassin was well trained by the time I filtered the 9th and the two dates from March. June 7th was a fluke: I was hit badly by some sort of virus, and SpamAssassin did not catch it at all.
All that said, even on the 10th of March, the amount of spam I had to filter manually far exceeded the amount of email I actually received.
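The comparison can be checked with a few lines of Python. This sketch assumes that each table row reads date, non-spam email, viruses caught by ClamAV, spam caught by SpamAssassin, and spam tagged by hand; that column reading is an interpretation of the table, and the function name is made up for the example.

```python
# (date, non-spam email, viruses, spam caught automatically, spam tagged by hand)
# Column meanings are an assumed reading of the table above.
ROWS = [
    ("10 Jun 2004", 8, 353, 764, 104),
    ("9 Jun 2004", 16, 462, 998, 59),
    ("15 Mar 2004", 14, 167, 349, 41),
    ("10 Mar 2004", 12, 152, 306, 59),
]


def manual_per_real(real: int, manual: int) -> float:
    """Hand-tagged spam messages for every real message received."""
    return manual / real


for date, real, _viruses, _auto, manual in ROWS:
    print(f"{date}: {manual_per_real(real, manual):.1f} hand-tagged per real email")
```

Under that reading, the 10th of March works out to 59 hand-tagged messages against 12 real ones, or roughly five pieces of manually filtered spam for every email worth reading.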
Unfortunately, my solution requires a lot of manual work before the inbox for any given day is clean. While I have turned a hopeless problem into something manageable, having to manually tag 50 emails per day is a daunting task. I hope that the classifier's accuracy can improve, but given the huge proportion of spam to non-spam that I receive, I am worried.
Going forward, the problems I have to deal with are:
If you have any thoughts on my problems with spam, please feel free to drop me an email. Hopefully, I will now actually have an opportunity to read it.