06/27/2012

The Evil Domain Name System (DNS): Why Your Monitoring Solution May Be Defective

Of course, we couldn't live without DNS. It would stink to have to remember addresses like 172.16.254.1 for every website we want to visit. And when IPv6 takes hold, remembering addresses like 2001:0db8:85a3:0000:0000:8a2e:0370:7334 won't even be an option. But for DNS to work, some service somewhere has to look up a name and return an address anytime anyone (or anything) does anything on the Internet. And the first problem with this convenient invention is that, left at that, it would immediately collapse under the load of billions of simultaneous requests.

The solution? Caching. Each DNS record carries a "time to live" (TTL), which tells DNS caches everywhere not to do a new lookup until the cached entry is that many seconds old. Typical configurations specify a TTL anywhere from a few seconds to several hours. This is a beautiful solution that successfully reduces the load of DNS lookup requests to manageable levels.

The beauty of the solution quickly turns ugly, though, if you happen to want to make sure your website is accessible to everyone all the time. All of our customers do happen to want to know exactly that. Here's the problem: there are website monitoring services cropping up everywhere these days that haven't considered this in the least. Their monitoring software (usually a bunch of cobbled-together scripts running as cron jobs) happily relies on the underlying operating system to perform standard, cached DNS lookups every time it checks a site. So if your domain is configured with a TTL of, say, 4 hours and then your DNS server goes down (trust us, it happens all the time!), guess how long it could be before that wannabe monitoring service figures out something's wrong? If you guessed "4 hours," you're exactly right!
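A toy simulation makes the arithmetic concrete. This is a minimal sketch, not the code of any real monitoring product or resolver: `TTLCache`, `seconds_until_detected`, the one-minute check interval, and the fake clock are all assumptions made up for illustration. It models the worst case, where the cache was refreshed just before the DNS server went down.

```python
class TTLCache:
    """Toy resolver cache: reuses an answer until it is `ttl` seconds old."""

    def __init__(self, resolve, ttl, clock):
        self.resolve = resolve   # function(name) -> address; may raise OSError
        self.ttl = ttl           # seconds a cached answer stays valid
        self.clock = clock       # function() -> current time in seconds
        self.entries = {}        # name -> (address, time it was stored)

    def lookup(self, name):
        now = self.clock()
        hit = self.entries.get(name)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                # served from cache; upstream never asked
        address = self.resolve(name)     # missing or expired: do a real lookup
        self.entries[name] = (address, now)
        return address


def seconds_until_detected(ttl, outage_at, check_every):
    """Simulate a monitor that resolves through the cache.

    Returns the number of seconds between the DNS outage and the first
    check that actually fails.
    """
    now = [0]  # simulated clock, in seconds

    def authoritative(name):
        # Hypothetical DNS server that stops answering at `outage_at`.
        if now[0] >= outage_at:
            raise OSError("DNS server down")
        return "172.16.254.1"

    cache = TTLCache(authoritative, ttl, lambda: now[0])
    while True:
        try:
            cache.lookup("example.com")
        except OSError:
            return now[0] - outage_at
        now[0] += check_every


# TTL of 4 hours, server dies one minute in, monitor checks every minute.
print(seconds_until_detected(4 * 3600, 60, 60))  # 14340 seconds, just under 4 hours
# With no caching (TTL of 0), the very next check catches it.
print(seconds_until_detected(0, 60, 60))         # 0
```

The monitor in the first run performs 239 "successful" checks against a name that nobody with a cold cache can resolve.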

In their defense, it's actually non-trivial to avoid cached DNS lookups. It's part of the hard work that goes into building a world-class monitoring solution. Just like delivering reliable phone call alerts and working out the many pains of monitoring SSL connections, getting around the operating system's natural inclination to use cached DNS entries is tricky. We've designed a system that is effective at doing just that and fails less than 0.2% of the time (and we're working to improve on even that).
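One common way to sidestep the operating system's cache is to skip its resolver entirely and send the DNS query straight to the domain's authoritative nameserver. The sketch below shows that general idea using only the Python standard library. It is not a description of our production system: the function names `build_query` and `check_nameserver` are hypothetical, it handles only plain A-record queries over UDP, and it assumes you already know the nameserver's IP address.

```python
import random
import socket
import struct


def build_query(name, query_id):
    """Encode a DNS query for the A record of `name` (RFC 1035 wire format)."""
    # Header: ID, flags (0 = standard query, recursion not requested),
    # 1 question, 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Each label is written as a length byte followed by its ASCII bytes.
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    # Question: name, terminating zero byte, QTYPE=1 (A), QCLASS=1 (IN).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def check_nameserver(name, server_ip, timeout=3.0):
    """Ask `server_ip` directly, bypassing the OS resolver and its cache.

    Returns True only if the server replies to our query with no error
    code and at least one answer record.
    """
    query_id = random.randint(0, 0xFFFF)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name, query_id), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeouts and unreachable hosts both land here
            return False
    if len(reply) < 12:  # shorter than a DNS header, so not a usable reply
        return False
    reply_id, flags, _, answers, _, _ = struct.unpack("!HHHHHH", reply[:12])
    rcode = flags & 0x000F  # low four bits of the flags: 0 means "no error"
    return reply_id == query_id and rcode == 0 and answers > 0
```

A call such as `check_nameserver("example.com", "192.0.2.53")` (a placeholder address) would then report on the nameserver as it is right now, whatever the TTL says. A real checker needs a good deal more than this: retries, TCP fallback for truncated replies, checking every nameserver listed for the domain, and so on.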

What it all means is that if we are monitoring your website and other services, we will detect outages caused by DNS issues as soon as they occur, regardless of your domain's TTL setting. This can cause some confusion, though: we receive a lot of e-mails complaining that we reported an outage incorrectly ("But I just pulled my site up in my browser and it's working fine!"). And for you it is working fine, because your system is dutifully relying on the cached DNS information. But for the millions of other visitors in the world who haven't been to your site in the last few hours, you are down, down, down. I mean hard down.

We couldn't live without DNS. Just make sure you're using a monitoring solution that understands how it works.