Why customer “branding domains” *were* affected by the Marketo DNS outage

Not that anybody asked me about this yet! But I felt like explaining why branding (i.e. click tracking) domains were indeed affected by Tuesday's outage.

Originally, the Marketo RFO/FAQ indicated these domains weren't involved. That was later softened to “Maybe they were after all.” As of this writing, the wording is:

Hyperlinks using a branded domain were affected. It was originally reported that they would not be, however upon closer investigation, we’ve found that those hyperlinks do still pass through part of Marketo’s domain. Therefore, some customers with branded domains did have links that could not connect properly.

That's still an understatement. As far as I can see, all customers — not just some — had links that frequently functioned improperly.

And it should've been clear from the beginning that branding and landing domains were catastrophically affected, even if — unlike direct requests to marketo.com itself — not every request to those domains would fail.

Let's look at why.

How branding domains work

I explored the ins, outs, and rules for branding domain CNAMEs in this post.

To briefly review, a branding domain requires a CNAME record. Your custom domain is the Alias (left-hand-side). The mkto-{nnnnnnnn}.com hostname from the Marketo Admin UI is the Canonical Name (right-hand-side):

click.example.com.       300     IN      CNAME   mkto-sj010203.com.

So that's pretty simple. An end-user request for click.example.com is thus equivalent to a request for mkto-sj010203.com, and name resolution proceeds on the Marketo-owned domain. You look pretty; Marketo's servers do the actual lifting.

So who's authoritative for mkto-sj010203.com?

mkto-sj010203.com.      86400   IN      NS      ns2.mktdns.com.
mkto-sj010203.com.      86400   IN      NS      ns1.marketo.com.

Maybe I don't need to continue. :)

What you can see here is that the branding domain has an indirect dependency on ns1.marketo.com providing correct answers to DNS requests, if it responds at all.
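To make that dependency chain concrete, here is a minimal offline sketch that works only from the three records quoted above. The helper names (parse_records, nameservers_for, parent_domain) are made up for illustration, and the "last two labels" rule for finding the parent domain is a naive assumption that happens to work for .com names.

```python
# Records copied from the post; nothing here touches the network.
RECORDS = """
click.example.com.       300     IN      CNAME   mkto-sj010203.com.
mkto-sj010203.com.      86400   IN      NS      ns2.mktdns.com.
mkto-sj010203.com.      86400   IN      NS      ns1.marketo.com.
"""


def parse_records(text):
    """Parse 'name ttl class type value' lines into (name, type, value) tuples."""
    records = []
    for line in text.strip().splitlines():
        name, _ttl, _cls, rtype, value = line.split()
        records.append(
            (name.rstrip(".").lower(), rtype.upper(), value.rstrip(".").lower())
        )
    return records


def nameservers_for(hostname, records):
    """Follow CNAMEs from hostname, then return (final name, its NS hosts)."""
    name = hostname.rstrip(".").lower()
    seen = set()
    while name not in seen:  # the 'seen' set guards against CNAME loops
        seen.add(name)
        targets = [v for n, t, v in records if n == name and t == "CNAME"]
        if not targets:
            break
        name = targets[0]
    return name, sorted(v for n, t, v in records if n == name and t == "NS")


def parent_domain(host):
    """Naive assumption: the last two labels. Wrong for suffixes like .co.uk."""
    return ".".join(host.split(".")[-2:])


records = parse_records(RECORDS)
final_name, ns_hosts = nameservers_for("click.example.com", records)
depends_on = sorted({parent_domain(h) for h in ns_hosts})

print(final_name)   # the Marketo-owned name the alias points at
print(ns_hosts)     # the nameservers that answer for it
print(depends_on)   # the domains those nameservers live under
```

The last list is the point: marketo.com shows up as a dependency of click.example.com even though it appears nowhere in the customer's own zone.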

If that NS were fully down (returning no results at all) everything would be fine, because resolvers would simply time out and ask the other NS: that's the magic of DNS's fault tolerance. You can and should have NS records spread across domains and TLDs, but what's critical is that each box is either running correctly (with the most recent version of your zone) or shut down/firewalled off.

When Network Solutions redirects an entire domain, including glue RRs, to their internal, er, “partner servers,” you have something else: an NS that's up and running but providing answers you definitely don't want.

When one NS is right and one NS is wrong, responses will be wrong roughly 50% of the time (that's assuming equal randomization, rotation, and client resolver behavior, so it won't be exactly 50% in reality, but it'll be bad no matter what).
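The difference between "down" and "up but wrong" can be sketched with a toy simulation. This is only a model under loud assumptions: the resolver picks nameservers in uniformly random order, skips any that don't answer, accepts the first answer it gets, and never caches. The IP addresses are placeholders from the documentation ranges, not real Marketo or Network Solutions addresses.

```python
import random

GOOD_IP = "192.0.2.10"    # hypothetical correct answer
BAD_IP = "203.0.113.99"   # hypothetical answer from the redirected server


def resolve(nameservers, rng):
    """Try nameservers in random order; a value of None means 'down, no reply'."""
    order = list(nameservers)
    rng.shuffle(order)
    for name in order:
        answer = nameservers[name]
        if answer is not None:
            return answer  # first server that replies wins, right or wrong
    return None


def wrong_rate(nameservers, trials=10_000, seed=0):
    """Fraction of simulated lookups that do not end with the correct answer."""
    rng = random.Random(seed)
    wrong = sum(resolve(nameservers, rng) != GOOD_IP for _ in range(trials))
    return wrong / trials


# Scenario 1: ns1 is fully down, so every lookup falls through to ns2.
one_down = {"ns1.marketo.com": None, "ns2.mktdns.com": GOOD_IP}

# Scenario 2: ns1 is up but serving someone else's answers.
one_wrong = {"ns1.marketo.com": BAD_IP, "ns2.mktdns.com": GOOD_IP}

print("one NS down: ", wrong_rate(one_down))
print("one NS wrong:", wrong_rate(one_wrong))
```

In this model the "down" scenario never produces a wrong answer, while the "wrong" scenario lands near one half. Real resolvers cache and weight servers by response time, so the real-world number will differ, but the shape of the problem is the same.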

So there's no reason to believe this effect was particularly rare — far from it.

How LP domains work

Basically the same as above, only the Canonical Name is the {nnnn}.mktoweb.com hostname built from the “account string” under Admin.

And there, too:

mktoweb.com.            300     IN      NS      ns2.mktdns.com.
mktoweb.com.            300     IN      NS      ns1.marketo.com.

Could this collateral effect have been avoided?

Eh, not really, given the principal cause that we all know about by now. There's nothing wrong with having a corporate domain be among the NSes for another domain — as long as it's either up-healthy or down-sick, not in-between.