A Marketo/DNS near-disaster

As we all know, in-house IT typically tunes out all things Marketo, so major rollouts must forge forward without a technical hand on deck.

After trying, and failing, to flip the switch to a new primary LP domain the other day, a client had me double-check some DNS work. (Of course IT had said everything was fine before getting back to their real jobs!)

It's a good thing they called. If they'd gone forward with the switch after IT's first attempt at a fix, 20% of users would've gotten a browser error instead of an LP — and 20% is an unfortunate sweet spot, clearly high enough to significantly impact your reputation and marketing success, yet low enough for problem reports to be written off as user error.

When the Marketo admin first attempted the LP switch, the Marketo UI kicked back the error “Invalid Landing Page Domain”. That's because it does you a nice favor and performs a simple DNS lookup before allowing you to change your primary domain. (Interestingly, it doesn't do this lookup for domain aliases.)

Turned out they were in the midst of switching DNS hosts, from self-managed boxes to Google's nameservers (not exactly sure why — I feel that big companies should own their DNS but that's just me).

“In the midst” is a fittingly vague description, since the problem was that they were in the middle of migrating but didn't seem to know exactly where they were. They hadn't yet taken down the 3 old servers, but they were adding new records only to the 4 new servers, not to the old copies of the zone.

As a result, 3 out of 7 queries for the new Landing Page CNAME would fail. The failures would be distributed essentially randomly around the world, but at least with nearly half the lookups failing it would be hard to deny there was a systemic problem. And indeed that high probability was why the Marketo UI's double-check happened to pick it up.

So: once I mentioned that the old servers didn't have the new CNAME record, they attempted to deal with that by shutting down the old servers completely. Which was a nice idea, but they only shut down 2 of the 3 old servers!

That left them with 1 out of 5 queries failing. A perfect DNS disaster, since a 20% failure rate (only a further fraction of which would likely be reported) could well be written off as user error: “It works for me!” says the person testing from a single office where the DNS record is cached for everyone.

Here's a sketch of how we get 1/5 or 20%:

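[Diagram: queries fanning out to the 7 authoritative nameservers. Green lines go to the 4 new servers, dotted lines go to the 2 old servers that are down, and the remaining old server is up but out of date.]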

  • All the queries to the new servers have a successful response (the green lines). That's 4.
  • A query sent to either of the 2 servers that have been brought down completely (the dotted lines) will get no response at all. And that's not an error: DNS resilience is built around the idea that servers don't have to be up all the time, so the resolver just tries another server. But the key is that if they're up, they have to respond with the right records.
  • A query sent to the 1 server that is up but out-of-date will get a bad result, specifically an NXDOMAIN (non-existent domain) result.

So in total, assuming for simplicity that the 5 servers that are up have the same response time, 1/5 of the queries will act like the LP domain doesn't exist. Yikes!
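For anyone who wants to check the arithmetic, here is a minimal Python sketch of both stages of the migration. It leans on the same simplifying assumption as above (every server that is up is equally likely to be the one that answers), and the server labels and the three state names ("ok", "stale", "down") are made up purely for illustration.

```python
from fractions import Fraction


def failure_rate(servers):
    """Fraction of answered queries that come back NXDOMAIN.

    `servers` maps a nameserver label to one of three illustrative states:
    "ok" (has the record), "stale" (up but missing the record), or "down".
    Servers that are down are skipped, because a resolver that gets no
    response simply asks another server.
    """
    answering = [state for state in servers.values() if state != "down"]
    if not answering:
        raise ValueError("no nameserver is answering")
    stale = sum(1 for state in answering if state == "stale")
    return Fraction(stale, len(answering))


# Hypothetical labels: 4 new servers with the record, 3 old ones without it.
new = {f"new{i}": "ok" for i in range(1, 5)}
before = {**new, "old1": "stale", "old2": "stale", "old3": "stale"}
after = {**new, "old1": "down", "old2": "stale", "old3": "down"}

print(failure_rate(before))  # 3/7
print(failure_rate(after))   # 1/5
```

Note that shutting down the last stale server would take the rate to 0, which is the whole point: a server that is down costs you nothing, while a server that is up and wrong costs you its full share of queries.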

Finally, to give a glimpse of how one troubleshoots such things, here are the results from the ubiquitous DNS tool dig. Look for the NXDOMAIN (failure) vs. NOERROR (success) vs. timeout (neutral) results:

> dig +trace @4.2.2.1 uncover.▒▒▒▒▒.com cname
.                       12913   IN      NS      d.root-servers.net.
.                       12913   IN      NS      b.root-servers.net.
.                       12913   IN      NS      g.root-servers.net.
.                       12913   IN      NS      h.root-servers.net.
.                       12913   IN      NS      c.root-servers.net.
.                       12913   IN      NS      m.root-servers.net.
.                       12913   IN      NS      j.root-servers.net.
.                       12913   IN      NS      k.root-servers.net.
.                       12913   IN      NS      a.root-servers.net.
.                       12913   IN      NS      l.root-servers.net.
.                       12913   IN      NS      e.root-servers.net.
.                       12913   IN      NS      f.root-servers.net.
.                       12913   IN      NS      i.root-servers.net.

com.                    172800  IN      NS      a.gtld-servers.net.
com.                    172800  IN      NS      b.gtld-servers.net.
com.                    172800  IN      NS      c.gtld-servers.net.
com.                    172800  IN      NS      d.gtld-servers.net.
com.                    172800  IN      NS      e.gtld-servers.net.
com.                    172800  IN      NS      f.gtld-servers.net.
com.                    172800  IN      NS      g.gtld-servers.net.
com.                    172800  IN      NS      h.gtld-servers.net.
com.                    172800  IN      NS      i.gtld-servers.net.
com.                    172800  IN      NS      j.gtld-servers.net.
com.                    172800  IN      NS      k.gtld-servers.net.
com.                    172800  IN      NS      l.gtld-servers.net.
com.                    172800  IN      NS      m.gtld-servers.net.

▒▒▒▒▒.com.              172800  IN      NS      dns00.▒▒▒▒▒.com.
▒▒▒▒▒.com.              172800  IN      NS      dns02.▒▒▒▒▒.com.
▒▒▒▒▒.com.              172800  IN      NS      dns03.▒▒▒▒▒.com.
▒▒▒▒▒.com.              172800  IN      NS      ns-cloud-b1.googledomains.com.
▒▒▒▒▒.com.              172800  IN      NS      ns-cloud-b2.googledomains.com.
▒▒▒▒▒.com.              172800  IN      NS      ns-cloud-b3.googledomains.com.
▒▒▒▒▒.com.              172800  IN      NS      ns-cloud-b4.googledomains.com.

▒▒▒▒▒.com.              1800    IN      SOA     dns00.▒▒▒▒▒.com. root.▒▒▒▒▒.com. 2018052300 1800 1200 1209600 10800


> dig @dns00.▒▒▒▒▒.com uncover.▒▒▒▒▒.com cname
;; connection timed out; no servers could be reached


> dig @dns03.▒▒▒▒▒.com uncover.▒▒▒▒▒.com cname
;; connection timed out; no servers could be reached


> dig @dns02.▒▒▒▒▒.com uncover.▒▒▒▒▒.com cname
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41739
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; AUTHORITY SECTION:
▒▒▒▒▒.com. 1800    IN      SOA     dns00.▒▒▒▒▒.com. root.▒▒▒▒▒.com. 2018052300 1800 1200 1209600 10800


> dig @ns-cloud-b1.googledomains.com. uncover.▒▒▒▒▒.com cname
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35004
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

uncover.▒▒▒▒▒.com.     300     IN      CNAME   ▒▒▒▒▒.mktoweb.com.


> dig @ns-cloud-b2.googledomains.com. uncover.▒▒▒▒▒.com cname
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 27438
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

uncover.▒▒▒▒▒.com.     300     IN      CNAME   ▒▒▒▒▒.mktoweb.com.


> dig @ns-cloud-b3.googledomains.com. uncover.▒▒▒▒▒.com cname
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58750
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

uncover.▒▒▒▒▒.com.     300     IN      CNAME   ▒▒▒▒▒.mktoweb.com.

> dig @ns-cloud-b4.googledomains.com. uncover.▒▒▒▒▒.com cname
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34669
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

uncover.▒▒▒▒▒.com.     300     IN      CNAME   ▒▒▒▒▒.mktoweb.com.
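
If you'd rather not eyeball the output for each of 7 servers, the three outcomes are easy to pick out mechanically. The following is a rough Python sketch, not a polished tool: it only looks for the two markers visible in the output above (the "connection timed out" line and the `status:` field in the header), and the function name and shortened sample strings are my own placeholders.

```python
import re


def classify_dig_output(text):
    """Return "timeout", or the status from dig's header line (e.g. "NOERROR")."""
    if "connection timed out" in text:
        return "timeout"
    match = re.search(r"status:\s*([A-Z]+)", text)
    return match.group(1) if match else "unknown"


# Shortened sample lines modeled on the dig output above.
samples = {
    "dns00": ";; connection timed out; no servers could be reached",
    "dns02": ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 41739",
    "ns-cloud-b1": ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35004",
}

for server, output in samples.items():
    print(server, classify_dig_output(output))
```

Any server that reports NXDOMAIN while its siblings report NOERROR is the one to fix or shut down. Timeouts on their own are harmless here.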