A third of Marketo users have a broken SPF record... (Part II: 'include' and DNS limits)

In Part I of this series, I revealed that a scarily large number of Marketo instances have an SPF record that's unusable: you may think you're getting the small-but-important deliverability boost of an SPF PASS result, but you're not.

Note you're unlikely to be penalized for having no SPF result. Domains don't literally get a negative "spammy" mark from lacking an SPF record; rather, they lack the positive "hammy" credit that might have gotten them into the Inbox otherwise (of course, to a marketer, that amounts to the same thing).

And with an unusable SPF record, you're allowing malicious folks to send mail on your behalf. Illicitly impersonated email is what SPF is designed to prevent (when both sender and receiver respect SPF), and it carries possible legal consequences, not to mention definite PR trouble!

In this post, you'll learn the most common way SPF records break: it accounted for 12 of the 16 broken records in my survey.

First, let's dip into what an SPF record represents.

What SPF attempts to communicate

An SPF record is a specially formatted record in your DNS zone (alongside the A or CNAME record for www.example.com, which tells browsers where to find your website, and the MX record for example.com, which tells senders how to send you email).[1]

You have one SPF record per sender domain (remember, me@example.com and me@sales.example.com represent two sender domains).[2] When servers get incoming mail claiming to be from me@example.com, they check example.com's SPF record to see if the computer trying to send it is on your allowed list.[3]

The record itself is one or more short lines of text that contain a list of all the computers in the world that are allowed to send email with a MAIL FROM (envelope sender) address at your domain. Conceptually, it's a list like this:

"our Exchange server", "our backup Exchange server", "that place that hosts our ecommerce site", "our Marketo instance", "that other ESP we sometimes use for untrusted lists"

So many machines, so few characters

OK, I just said "all the computers in the world," but they have to fit in a few lines of text (and usually under 450 characters in total). Obviously that means you don't literally list every individual computer by name and/or IP address. Global businesses can have tens or even hundreds of IP addresses from which they legitimately send mail, so various forms of shorthand are necessary.

One type of shorthand is the age-old CIDR format for specifying a range of IP addresses. For example, ip4:1.2.3.0/24 means "all 256 IP addresses from 1.2.3.0 through 1.2.3.255" but it's a lot shorter than writing out all those IPs.
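To make the CIDR shorthand concrete, here's a minimal Python sketch of the membership test a receiving server performs for an ip4 entry. The helper name ip_in_range and the sample sender addresses are my own illustrations, not anything from a real mail server; the only real API used is Python's standard ipaddress module.

```python
import ipaddress

def ip_in_range(sender_ip: str, cidr: str) -> bool:
    """Return True if sender_ip falls inside the CIDR range (hypothetical helper)."""
    return ipaddress.ip_address(sender_ip) in ipaddress.ip_network(cidr)

# The range from the example above; the sender IPs are made up.
inside = ip_in_range("1.2.3.77", "1.2.3.0/24")   # an address inside the range
outside = ip_in_range("1.2.4.1", "1.2.3.0/24")   # an address just outside it
```

One short entry in the record stands in for every address in the block, which is the whole point of the shorthand.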

Other types of shorthand — and here's where things get hairy, so pay attention! — require the receiving server to do additional DNS lookups to fill out your domain's allowed list.

For example, if your SPF record has the shorthand mx, that means "get the MX record(s) for my domain, then get the A record(s) my MX(s) point to, and whatever IP address(es) you end up with are all allowed." With the 2 characters mx you can thus reference a whole bunch of IP addresses, which is quite efficient. (It also means you don't need to touch your SPF record when you change your MX, since it adjusts dynamically.)
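Here's a rough sketch of that two-step expansion. To keep it runnable without touching the network, the dictionaries MX_RECORDS and A_RECORDS stand in for real DNS answers; every hostname and IP in them is invented for illustration.

```python
# Hypothetical DNS data standing in for real lookups.
MX_RECORDS = {"example.com": ["mail1.example.com", "mail2.example.com"]}
A_RECORDS = {
    "mail1.example.com": ["1.2.3.10"],
    "mail2.example.com": ["1.2.3.11", "1.2.3.12"],
}

def expand_mx(domain):
    """Return the IPs that an `mx` entry would allow for the given domain."""
    allowed = []
    # Step 1: get the MX hosts. Step 2: get the A record(s) for each host.
    for mail_host in MX_RECORDS.get(domain, []):
        allowed.extend(A_RECORDS.get(mail_host, []))
    return allowed

allowed_ips = expand_mx("example.com")
```

Note that each step here would be a DNS query in real life, which is exactly why this kind of shorthand has a cost.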

And one really efficient, blatantly risky, yet necessary shorthand allows a domain owner to defer to somebody else the SPF-related stuff they don't know, can't know, or can't keep track of. include:myfavoriteesp.com in your SPF record means "now look up the SPF record for myfavoriteesp.com and check all the computers they allow."

include is a necessary evil if you're trusting a third party like Marketo or MailChimp to send mail on your behalf, since they'll use servers around the world whose IP addresses will change over time. It isn't feasible for all Marketo's customers to keep independent lists of IP addresses, or even wider IP ranges, up to date. And this isn't just about MAPs/ESPs: if you use Google Apps or Office 365 for corporate email, you'll set up your SPF record to include your cloud provider's record. Without the ability to delegate some SPF maintenance to your vendors and partners, SPF wouldn't work.

With include notation, it's easy for example.com to assemble a concise SPF record like this:

v=spf1 include:anotherexample.com -all

And anotherexample.com can pass off to another domain:

v=spf1 include:yetanotherexample.com -all

Then yetanotherexample.com, which is doing the "heavy lifting," can have the real computers listed:

v=spf1 mx ip4:1.2.3.0/24 ip4:2.3.4.0/24 ip4:3.4.5.6/32 -all
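If it helps to see the chain walked in code, here's a simplified Python sketch that follows the three records above. The SPF_RECORDS dictionary stands in for DNS TXT lookups, and the function names are my own. It only handles include and ip4 (the mx entry and everything else are skipped), so treat it as an illustration of the delegation, not a real SPF evaluator.

```python
import ipaddress

# The three example records, keyed by domain (a stand-in for DNS TXT lookups).
SPF_RECORDS = {
    "example.com": "v=spf1 include:anotherexample.com -all",
    "anotherexample.com": "v=spf1 include:yetanotherexample.com -all",
    "yetanotherexample.com": "v=spf1 mx ip4:1.2.3.0/24 ip4:2.3.4.0/24 ip4:3.4.5.6/32 -all",
}

def collect_ip4_ranges(domain, seen=None):
    """Follow include: entries from domain and gather every ip4 range found."""
    seen = set() if seen is None else seen
    if domain in seen:  # guard against include loops
        return []
    seen.add(domain)
    ranges = []
    for term in SPF_RECORDS.get(domain, "").split()[1:]:  # skip "v=spf1"
        if term.startswith("include:"):
            ranges.extend(collect_ip4_ranges(term[len("include:"):], seen))
        elif term.startswith("ip4:"):
            ranges.append(ipaddress.ip_network(term[len("ip4:"):]))
        # mx, a, -all, and other terms are left out of this sketch
    return ranges

def is_allowed(sender_ip, domain):
    """Return True if sender_ip is in any ip4 range reachable from domain."""
    ip = ipaddress.ip_address(sender_ip)
    return any(ip in network for network in collect_ip4_ranges(domain))
```

Asking about example.com means reading three records to find the three ranges that actually matter: is_allowed("2.3.4.99", "example.com") has to hop through both includes before it can answer.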

But suddenly the potential technical consequences threaten to overwhelm the ethical and legal value of preventing forged email.

That's because allowing any domain to include any other domain, ad nauseam, means a mail server looking up the first domain in a long chain of includes needs to send request, after request, after request to retrieve the whole list. Yes, DNS requests are relatively lightweight (that's how hundreds of billions of them cross the net every day), but they don't place zero demands on infrastructure. Generate enough requests and you'll kill DNS servers, clients, and networks.

Check out how Wikipedia sums up the wheat and chessboard fable, which I think is relevant:

The problem warns of the dangers of treating large but finite resources as infinite, i.e., of ignoring distant but absolute and inevitable constraints.[4]

The inventors of SPF saw the danger of unlimited DNS-based SPF entries, because DNS lookups could be abused by those who hate SPF most — spammers! — to make SPF unusable in general. You see, all a spammer would have to do is set up random domains whose SPF record requires exponential DNS lookups. Receiving servers would overwhelm their own resources, as well as the resources of other innocent victims, by checking those SPF records, so admins would end up disabling SPF entirely. (This attack is of the ever-fascinating... well, to me, at least... amplification attack type and will also remind some veterans of the old Zip Bomb attack.)
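A bit of back-of-the-envelope arithmetic shows how fast this gets out of hand. Suppose (purely as an assumption for illustration) that every record in a malicious chain includes the same number of other records, nested a few levels deep. The function name and the numbers below are mine, not from any real attack.

```python
def lookups_without_limit(includes_per_record, depth):
    """Total DNS lookups if every record includes `includes_per_record`
    others, nested `depth` levels deep, and nothing caps the recursion."""
    return sum(includes_per_record ** level for level in range(1, depth + 1))

three_levels = lookups_without_limit(10, 3)  # 10 + 100 + 1000 = 1110
six_levels = lookups_without_limit(10, 6)    # 1111110
```

One incoming message, one SPF check, and the receiver is on the hook for over a million queries at six levels. That's the chessboard problem in DNS form.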

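So, to defuse this risk, evaluating an SPF record is limited to a maximum of 10 DNS lookups. Every include, a, mx, ptr, exists, and redirect term counts toward that total, including the ones inside records you include. Go over the limit and the receiver treats the record as a permanent error. This isn't a new restriction: it's been in the SPF standard since RFC 4408 was published in 2006.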
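Here's a minimal sketch of how a validator might count lookups against the limit. It assumes the terms that cost a lookup are include, a, mx, ptr, exists, and redirect (as listed in RFC 7208), and it reads records from a plain dictionary instead of real DNS. All the .test domains and the function names are hypothetical.

```python
LOOKUP_LIMIT = 10
LOOKUP_TERMS = {"include", "a", "mx", "ptr", "exists", "redirect"}
QUALIFIERS = "+-~?"

def term_name(term):
    """Strip the qualifier and any argument: '+include:x.com' -> 'include'."""
    term = term.lstrip(QUALIFIERS)
    for separator in (":", "=", "/"):
        term = term.split(separator, 1)[0]
    return term.lower()

def count_lookups(domain, records, seen=None):
    """Count lookup-costing terms reachable from domain's SPF record.
    Simplification: a domain reached twice is only counted once."""
    seen = set() if seen is None else seen
    if domain in seen:
        return 0
    seen.add(domain)
    total = 0
    for term in records.get(domain, "").split()[1:]:  # skip "v=spf1"
        name = term_name(term)
        if name in LOOKUP_TERMS:
            total += 1
        if name in ("include", "redirect"):
            target = term.lstrip(QUALIFIERS)[len(name) + 1:]
            total += count_lookups(target, records, seen)
    return total

# Hypothetical zone: nine vendors, each with a simple ip4-only record.
records = {f"esp{i}.test": "v=spf1 ip4:1.2.3.0/24 -all" for i in range(9)}
includes = " ".join(f"include:esp{i}.test" for i in range(9))
records["ok.test"] = f"v=spf1 {includes} mx -all"        # 9 includes + mx = 10
records["broken.test"] = f"v=spf1 a {includes} mx -all"  # one extra term = 11

ok = count_lookups("ok.test", records) <= LOOKUP_LIMIT
broken = count_lookups("broken.test", records) > LOOKUP_LIMIT
```

The two sample records differ by a single a term, yet one sits exactly at the limit and the other is over it. A syntax-only check would happily pass both.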

A strict standard, but still ignored

Unfortunately, most people who set up SPF records are great at their IT specialty, but they're not SMTP admins. They're not tuned into the history of spam and anti-spam, and with companies putting so many messaging tasks in the cloud, they don't deal with email day-to-day. They'll likely paste in whatever's requested by a senior enough colleague, which is why so many domains have an extraneous include:mktomail.com that they don't need. (No offense to you guys, that's why I'm here.)

Compounding the problem, the DNS control panel (where you actually add and modify SPF records) isn't going to validate what you copy-and-paste into a textbox.[5] (I've never seen a DNS UI that does this, though maybe there's a supercool one somewhere.) It'll let you paste gibberish, basically. So any SPF validation has to be done separately, via a web site or manual validation.

Making matters even worse, there are many web sites that purport to "validate" SPF records, but they don't count the number of DNS lookups. All they do is check basic syntax, giving a completely false sense of usability. Major exceptions are the canonical SPF validator maintained by one of SPF's inventors and Vamsoft's validator (shouts to Peter Karsai). Those two validators really work. If everyone used them when adding or changing SPF records, we wouldn't be in this mess!

And consider that an SPF record that's under the 10-lookup limit one day can break the next, just by pasting in a single-character mechanism like a, which costs one more lookup. Unless SPF records are deep-validated every time a change is made (or even more often, since the records you include can change without notice), they're going to break eventually.

And consider that "break," when it comes to the lookup limit, means "become unusable" rather than "reject all mail." Most of your mail will still get through: it's not like you're going to be deluged by bounces one day. Instead, it'll be a slight increase in bounces, a dip in deliverability, a pesky mystery which no one knows how to approach because "Well, it's not SPF, since we have an SPF record."

It's a perfect opportunity for low-key, undetected technical failure.

Coming up

I know these posts are long, but this is for posterity, guys! In Part III, we'll look at examples of DNS lookup limit failures and how to solve them.


Notes:

  1. SPF-specific (Type 99) records are obsolete, so I'm referring to SPF-tagged TXT records in the post.
  2. Wildcard SPF is discouraged, so assume you need another record for the subdomain.
  3. For simplicity, I am only considering pass entries (with the + qualifier), since those are by far the most widely used and + is the default. SPF records may also include known disallowed computers, "known unknowns," and softfail entries.
  4. https://en.wikipedia.org/wiki/Wheat_and_chessboard_problem#Moral_story
  5. In fairness, since Type 99 was killed, a TXT record is an opaque string, so the control panel would also need an "It's okay, save it anyway" override.