Have you no sense of DNS latency, sir, at long last?

As primitive as SMTP and DNS may seem, lots of stuff needs to work for an email to go from sender ⮕ recipient’s MX ⮕ recipient’s mailbox. Then even more stuff needs to work for a mail client to parse and render an HTML email, plus its embedded resources and/or remote images.

The good news: modern infrastructure is reliable. We don’t worry much about extended network outages or servers melting down. But that’s also misleading, since fault-tolerant infrastructure, plus the native resilience of SMTP and DNS, can disguise poor performance.

Put another way, stuff stays up, but that doesn’t mean it stays fast.

DNS is everywhere

I recently looked into a problem with email image load performance for a client. Check out the animation below. The left side shows the expected load performance with an unprimed local cache.[1] This was happening most of the time. But sometimes they’d see performance like the right side: a noticeably staggered experience, more than twice as slow.

Quite puzzling. There was no correlation to particular mail clients or domains. Over a huge number of manual tests, we saw the behavior many times, but the Network tab didn’t show anything interesting besides “yep, it’s slow.”[2]

Clearly, we weren’t getting anywhere manually. So I set up a monitoring job to download one of those images every 100ms.
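
Something like this shell loop does the trick (a sketch, not the exact job; the URL and log file are placeholders):

while true; do
  # discard the body; log HTTP status code and total fetch time
  curl -o /dev/null -s \
       -w '%{http_code} %{time_total}\n' \
       'https://images.example.com/test-image.jpg' >> fetch.log
  sleep 0.1   # ~100ms between requests
done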

Next morning I checked the logs and found a critical clue. Response times varied widely, and around 1% of the hits were returning an HTTP 502 Bad Gateway response. That meant their CDN (image cache) couldn’t connect to the origin server (the primary source of their images, a DAM system).

Now that didn’t make much sense: we were already monitoring the origin server, and as far as we knew it never went down, with response times holding steady around 30ms.

But something clicked when I looked at the nameservers for the origin domain (let’s call it dam-images.example.com):

example.com.          86400   IN      NS      ans11.dnshosting.example.
example.com.          86400   IN      NS      ans5.dnshosting.example.
example.com.          86400   IN      NS      ans6.dnshosting.example.
example.com.          86400   IN      NS      ans8.dnshosting.example.

Now bear in mind the NS records are returned in random order. That’s just one example, and a future query could return:

example.com.          86400   IN      NS      ans8.dnshosting.example.
example.com.          86400   IN      NS      ans6.dnshosting.example.
example.com.          86400   IN      NS      ans5.dnshosting.example.
example.com.          86400   IN      NS      ans11.dnshosting.example.

So pay less attention to the order and more to the fact there are 4 different nameservers.

I looked up dam-images.example.com on each nameserver. A dig command like

dig dam-images.example.com @ans11.dnshosting.example

gets results only from the server ans11.dnshosting.example. (If that server is up, that is; otherwise, dig times out after 15 seconds.)
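
To check all four at once, a loop over the NS hostnames does the job (a sketch; +time and +tries cap how long dig waits on a server that never answers):

for ns in ans5 ans6 ans8 ans11; do
  echo "== $ns.dnshosting.example =="
  # wait at most 2s, single attempt, so one silent server can't stall the loop
  dig +time=2 +tries=1 dam-images.example.com @$ns.dnshosting.example
done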

3 of the 4 servers responded immediately with the same set of A records:

dam-images.example.com. 60 IN     A       13.35.10.1
dam-images.example.com. 60 IN     A       13.35.10.4
dam-images.example.com. 60 IN     A       13.35.10.7
dam-images.example.com. 60 IN     A       13.35.10.8

But one server, ans5.dnshosting.example, always timed out. Further investigation revealed ans5 was retired permanently during a network cutover. That is, it was known to be down, which is typically harmless because DNS is resilient.

DNS resilience is a net good, but it has some rough edges

DNS resolvers automatically try every nameserver before giving up. This keeps your domain’s DNS working even if only 1 NS is available. (Most corporate domains publish 4 NS records, some as many as 8.[3])

NSes aren’t tried strictly in (random) sequence, nor all in parallel, but in some combination of the two. Every resolver uses different logic; for example:

  • query NS 1 and wait 1s for a response
  • query NS 2 and wait 1s
  • query NS 2 again and wait 2s
  • query NS 3 and wait 4s
  • query all NSes in parallel and wait 8s
  • etc.

The net effect of this retry logic is that you almost always get a response, but the elapsed time varies. If your nameservers never go down and always respond instantly, great! But if some are unreliable due to spotty networks or sloppy setups, everything “still works” in the grand scheme, yet users see a wide range of response times.
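
You can get a feel for the worst case yourself: timing a dig against a single unresponsive server with an explicit timeout and retry count loosely mimics the dead air a resolver eats before moving on (illustrative only; real resolvers use their own schedules):

# two tries of 2s each: roughly 4 seconds of silence before dig gives up
time dig +time=2 +tries=2 dam-images.example.com @ans5.dnshosting.example

Against one of the healthy nameservers this returns immediately; against ans5 it burns the full ~4 seconds.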

Here’s what happened

What we found was that the CDN (call it global-images.example.com) enforced a 5-second timeout when connecting to the origin (dam-images.example.com). That timeout covered the DNS lookup time plus the time-to-first-byte (TTFB) of the HTTP response.
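
From outside the CDN you can watch the same breakdown against the origin with curl’s timing variables: %{time_namelookup} isolates the DNS portion, and %{time_starttransfer} is the elapsed time to the first response byte, roughly the window that 5-second budget has to cover (a sketch; the image path is a placeholder):

curl -o /dev/null -s \
     -w 'dns=%{time_namelookup}s  ttfb=%{time_starttransfer}s  code=%{http_code}\n' \
     'https://dam-images.example.com/some-image.jpg'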

The vast majority of the time, the DNS (cached or uncached) + HTTP combo would start in under 1s, but ~10% of the time it would creep closer to 5s. And 1% of the time it would time out completely, which we only saw once we set up the additional monitoring.

The origin server was perfectly healthy; the performance hit was entirely due to DNS latency when the “wrong” NS came up first in the shuffle! Also surprising was that the CDN was querying all NSes throughout the day, rather than pinning a high-performance one.[4]

The solution was to remove the known-down NS from DNS completely. Which is too bad, really, because a dead NS should have had no effect in the first place. But you never know what shenanigans intermediate servers, particularly CDNs’ proprietary network stacks, will get up to.
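
At least a change like that is easy to verify once it propagates; +short trims the answer down to just the remaining NS hostnames:

dig NS example.com +short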

Notes

[1] i.e. the image URLs had not previously been fetched by the client.

[2] Browsers implement a quite lengthy timeout for images: it’s 3 minutes in Chrome, for example. These delays were more in the few-seconds range, so they weren’t anywhere close to that timeout.

[3] There’s no hard limit on the number of nameservers for a domain: if you commit to DNS over TCP and very short hostnames, you could stuff thousands of NS RRs into a response. But there’s no practical utility beyond 20 or so: “You still got a response after 10 minutes, that doesn’t count as downtime” isn’t gonna be convincing!

[4] Most resolvers remember which NS was fastest and then try that one first next time around. But this “pinning” isn’t permanent and will be reset after a fixed period and upon reboot. Most examples in this post assume a resolver has no cached info about a domain.