Goals

deliver every ham to the inbox
1. without delay
prevent delivery of every spam
1. during the SMTP exchange
minimize consumption of our resources by malware
1. maximize consumption of malware resources

The first goal can be accomplished by simply delivering everything. Preventing spam delivery is a difficult task that is often constrained by the first goal. Rejecting delivery is preferred over silent discards as preventing delivery blunts the incentives to spam.

Spammer

Assets

bot nets provide a vast supply of bandwidth, CPU, and IP addresses
cheap domains (is tasting still permitted?)
custom software optimized for fast delivery

Liabilities

poorly written software
PCs are turned on and off at various times of the day
bot nets are primarily Windows PCs
no control of rDNS

History

There was a time when spam filtering consisted of installing SpamAssassin and writing a few procmail rules. Ham was routed to the inbox, spam to the bit bucket, and the suspect messages to the spam folder. For quite a long time SA was sufficient to provide users with a clean enough inbox.

A weakness of early filtering was that MTAs didn't filter messages until after the SMTP connection was complete. By the time SA determined the message was rubbish, the spammer had already successfully delivered the message (the basis for charging his customers), obligating the recipient to deliver or bounce it. Bouncing it had the nasty habit of creating backscatter. Silent discards are fraught with technical, social, and often legal implications. To avoid those pitfalls, filtering needed to happen before the message was accepted.

Some MTAs evolved interfaces (milter, QMAILQUEUE, etc.) that enabled filtering during the SMTP conversation. Mail operators without those abilities front-ended their MTAs with software filters like AMAVISD or ASSP. Others deployed commercial hardware products like the Barracuda. Dozens of filtering techniques have been developed, with varying degrees of efficacy. We are going to cover most of them.

Most Effective Anti-Spam Weapons

Identifying bot/malware -vs- legit mail servers

Content Analysis

Bayesian
URIBL

SMTP Phases

connect

remote IP

- IPs in dialup pools, and to a lesser extent DSL and cable, are ephemeral. Tracking abusive IPs for more than a few days rapidly becomes less beneficial and more likely to generate False Positives. This is why many DNSBLs automatically expire listings after 5-20 days.
- connection history. Legitimate mail servers rarely change their IPs. Their DNS information rarely changes. If remote IP history is stored for more than about 30 days, a very significant majority of servers that send ham to your users will have a stored history. After populating a sender history database, connections from IPs without a history are highly likely to be spam.
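
The connection history idea above can be sketched as a small lookup keyed by remote IP. This is only an illustration: the `SenderHistory` class, its method names, and the 45-day retention default are assumptions (the text only says "more than about 30 days"), and a real deployment would use persistent storage rather than a dict.

```python
import time

# Assumption: keep history somewhat longer than the "about 30 days" in the text.
RETENTION_SECONDS = 45 * 24 * 3600


class SenderHistory:
    """Hypothetical in-memory store: remote IP -> last time it delivered ham."""

    def __init__(self, retention=RETENTION_SECONDS):
        self.retention = retention
        self._last_ham = {}

    def record_ham(self, ip, now=None):
        # Call this after a message from `ip` has been accepted as ham.
        self._last_ham[ip] = time.time() if now is None else now

    def has_history(self, ip, now=None):
        # True only if the IP sent ham within the retention window.
        now = time.time() if now is None else now
        seen = self._last_ham.get(ip)
        return seen is not None and now - seen <= self.retention
```

A connection for which `has_history()` returns False is not proof of spam. It is one signal to weigh along with the other connect-phase tests.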

remote OS (p0f)

- Bot nets are primarily composed of older versions of Windows. A Windows email sender that isn't hotmail.com has 98% odds of being a spam source.

geographic location

- 9_% of ham travels less than 4,000 km. _0% of spam travels more than 4,000 km.

DNSBL listing(s)

- Range in efficacy from 45-90%. Tend to have False Positive rates from 1-30%. Using more than a couple tends to amplify the FP rate.
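
For readers unfamiliar with how a DNSBL is queried, here is a minimal sketch for IPv4: the octets of the remote IP are reversed, the list's zone is appended, and any A record in the answer means "listed". The zone `dnsbl.example` and the function names are placeholders, not a real list or API. The resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name for an IPv4 DNSBL lookup: reversed octets + zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the DNSBL returns an A record for this IP."""
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # NXDOMAIN and lookup failures are both treated as "not listed" here.
        return False


def count_listings(ip, zones, resolve=socket.gethostbyname):
    """Number of the given DNSBL zones that list this IP."""
    return sum(is_listed(ip, zone, resolve) for zone in zones)
```

Treating lookup failures as "not listed" is a deliberate choice in this sketch: it errs toward delivering ham when a list is unreachable.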

AS number (network neighborhood)

- Well-managed networks have very little abuse and get it under control quickly.
- Abusive machines tend to be clustered on networks that tolerate abuse.

early talker
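
An early talker is a client that sends SMTP commands before the server has issued its 220 greeting, which well-behaved mail servers do not do. A minimal sketch of the check, assuming the server holds back its banner for a short pause (the 1 second default and the function name are illustrative choices):

```python
import select
import socket


def is_early_talker(conn, wait_seconds=1.0):
    """Return True if the client has data waiting before we send the greeting.

    Call this right after accept() and before writing the 220 banner.
    Note that a client that disconnects during the pause also shows up
    as readable, so this flags it too.
    """
    readable, _, _ = select.select([conn], [], [], wait_seconds)
    return bool(readable)
```

The pause delays every connection, legitimate ones included, so it trades a little delivery latency for the signal.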

HELO / EHLO

hostname
- valid?
- match rDNS?
- TLS
- auth
- relay
- SPF
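
The first two hostname questions above ("valid?" and "match rDNS?") can be sketched as pure string checks. Assumptions: the rDNS (PTR) name has already been looked up elsewhere, the function names are hypothetical, and address literals such as `[192.0.2.1]`, which SMTP permits in HELO, are simply reported as not being a valid FQDN.

```python
import re

# One hostname label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def helo_is_valid_fqdn(helo):
    """Syntactic check only: does the HELO name look like a FQDN?"""
    name = helo.rstrip(".")
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    return (
        len(labels) >= 2                      # "localhost" style names fail
        and all(_LABEL.match(label) for label in labels)
        and not labels[-1].isdigit()          # bare IPs like 192.0.2.1 fail
    )


def helo_matches_rdns(helo, rdns_name):
    """Case-insensitive comparison of the HELO name with the PTR name."""
    return helo.rstrip(".").lower() == rdns_name.rstrip(".").lower()
```

A syntactically valid name says nothing about whether it resolves. A DNS lookup of the HELO name would be a separate test.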

MAIL FROM

black & white lists
validity of domain in from address
SPF

RCPT TO

black & white lists
local user existence / deliverability

DATA

headers
- duplicated singular headers?
- missing any required headers?
- is Return-Path valid?
- is Date reasonable?
- is User-Agent detected?
- is Mailing List detected?
- are there enough Received headers?
- does the From header match the envelope FROM?
- valid bounce?
content
- bayesian
- virus
- spam URLs
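
Two of the header checks listed above (duplicated singular headers, missing required headers) can be sketched with Python's standard `email` package. The `header_problems` function is a hypothetical helper. The header lists follow RFC 5322, which requires Date and From and allows the listed fields at most once.

```python
from email import message_from_string

# Header fields RFC 5322 allows at most once per message.
SINGULAR = (
    "Date", "From", "Sender", "Reply-To", "To", "Cc", "Bcc",
    "Message-ID", "In-Reply-To", "References", "Subject",
)
# Header fields RFC 5322 requires in every message.
REQUIRED = ("Date", "From")


def header_problems(raw_message):
    """Return a list of human-readable header problems (empty if none)."""
    msg = message_from_string(raw_message)
    problems = []
    for name in SINGULAR:
        # Header name matching in the email package is case-insensitive.
        if len(msg.get_all(name, [])) > 1:
            problems.append(f"duplicate {name}")
    for name in REQUIRED:
        if msg.get(name) is None:
            problems.append(f"missing {name}")
    return problems
```

Whether any one problem justifies rejection is a policy decision. A filter could just as well feed these results into a score.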

Mail Filter Best Practices
