11/07/2014

How We Diagnosed GoDaddy's Obscure DNS Outage

Last night the support desk received several support requests, all relating to DNS issues. DNS issues can be particularly difficult to understand, especially when we report that your site is down but you can still reach it in a browser. Other monitoring services may also report that your site is up. Unlike your browser and those other services, we do not cache any results from the lookup process. That means that each time we resolve your domain name we perform an 'authoritative' lookup: we traverse the entire domain resolution path, starting at the root nameservers, to get to the IP address associated with your domain name.

Our monitoring stations are located in data centers all over the world and operate their own resolvers. Since our service only notifies you of an outage when the outage is confirmed by multiple stations, it is unlikely that a problem on our end would result in a false notification. After looking at the support requests and the customer URLs involved, we determined that they were all using the same DNS provider: GoDaddy.

This blog post is fairly long and detailed, so here is the summary: users who did not already have certain GoDaddy nameserver IPs in their DNS resolver's cache were unable to resolve domain names serviced by GoDaddy. We report that as an outage because it means something associated with your device is broken and possibly affecting your customers.

Dig

Once we knew there was a common provider involved, we could have just let it go and told our customers to take it up with GoDaddy. But we wanted to know exactly what the problem was, especially since even services we knew wouldn't have these particular names cached were still able to resolve the name to an IP address. The first thing we did was use the 'dig' tool to trace the domain name lookup:

# dig +trace custdomain.org

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.30.rc1.el6 <<>> +trace custdomain.org
;; global options: +cmd
.           518400  IN  NS  c.root-servers.net.
.           518400  IN  NS  k.root-servers.net.
.           518400  IN  NS  j.root-servers.net.
.           518400  IN  NS  a.root-servers.net.
.           518400  IN  NS  g.root-servers.net.
.           518400  IN  NS  e.root-servers.net.
.           518400  IN  NS  h.root-servers.net.
.           518400  IN  NS  m.root-servers.net.
.           518400  IN  NS  d.root-servers.net.
.           518400  IN  NS  l.root-servers.net.
.           518400  IN  NS  i.root-servers.net.
.           518400  IN  NS  b.root-servers.net.
.           518400  IN  NS  f.root-servers.net.
;; Received 228 bytes from 127.0.0.1#53(127.0.0.1) in 487 ms

org.            172800  IN  NS  a0.org.afilias-nst.info.
org.            172800  IN  NS  a2.org.afilias-nst.info.
org.            172800  IN  NS  d0.org.afilias-nst.org.
org.            172800  IN  NS  b0.org.afilias-nst.org.
org.            172800  IN  NS  b2.org.afilias-nst.org.
org.            172800  IN  NS  c0.org.afilias-nst.info.
;; Received 429 bytes from 192.33.4.12#53(192.33.4.12) in 513 ms

custdomain.org. 86400   IN  NS  ns10.domaincontrol.com.
custdomain.org. 86400   IN  NS  ns09.domaincontrol.com.
dig: couldn't get address for 'ns10.domaincontrol.com': no more

So the resolver has discovered, by interrogating the root nameservers and those responsible for the '.org' domain, that ns09.domaincontrol.com and ns10.domaincontrol.com will know the IP address for our domain. However, it doesn't know the IP address for ns09 or ns10 and seems unable to resolve it. So who is responsible for the ns09 and ns10 IP addresses? According to Wireshark, it is ans01.domaincontrol.com and ans02.domaincontrol.com. So what problem is 'dig' having talking to those nameservers? The record contains 'glue' that gives us the IP addresses for ans01 and ans02, so we don't have to look up their IP addresses. We can go straight to asking them what the IP address is for ns09 and ns10:

# dig @216.69.185.35 ns09.domaincontrol.com

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.30.rc1.el6 <<>> @216.69.185.35 ns09.domaincontrol.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
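
The same check can be scripted. The following is a minimal sketch, assuming only the Python standard library: it hand-builds a DNS query for an A record and reports whether a given server sends back any reply before a timeout. The function names `build_query` and `server_answers` are made up for this illustration, and this is not the code our monitoring stations run.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal, non-recursive DNS query for an A record."""
    # Header: ID, flags (all zero, so recursion is not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server_ip: str, name: str, timeout: float = 3.0) -> bool:
    """Return True if the server replies to the query before the timeout."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # A genuine reply echoes the query ID in its first two bytes
    return reply[:2] == query[:2]
```

Calling `server_answers("216.69.185.35", "ns09.domaincontrol.com")` is the scripted equivalent of the dig command above; during the incident it would presumably have returned False.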

Neither ans01 nor ans02 was responding to requests to resolve that domain name.

To resolve the domain custdomain.org authoritatively we have to complete these steps:

  1. Query the root nameservers to see who can resolve .org names;
  2. Query the .org nameservers to see who is responsible for custdomain.org;
  3. Query ans01 or ans02 to resolve the IP addresses of ns09 and ns10;
  4. Query ns09 or ns10 and resolve custdomain.org to an IP address.

The process was failing at step 3. ans01 and ans02 were not responding to requests. Incidentally, if queried directly, ns09 and ns10 could both resolve custdomain.org, which is a hint as to why no one but us spotted the problem.
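
To make the failure point concrete, here is a toy model of the lookup chain described above. It is only a sketch: the `STEPS` table and the `walk` function are invented for illustration, and the True/False flags simply encode what we observed during the incident rather than anything queried live.

```python
# Each entry: (who we ask, what we ask, whether they responded that night).
# The flags are a hand-written record of the incident, not live data.
STEPS = [
    ("root nameservers", "who can resolve .org names?", True),
    (".org nameservers", "who is responsible for custdomain.org?", True),
    ("ans01 / ans02", "what are the IPs of ns09 and ns10?", False),
    ("ns09 / ns10", "what is the IP of custdomain.org?", True),
]


def walk(steps):
    """Walk the chain in order.

    Returns (steps_completed, failed_step), where failed_step is the
    1-based number of the first step that got no response, or None if
    the whole chain succeeded.
    """
    for number, (_server, _question, responded) in enumerate(steps, start=1):
        if not responded:
            return number - 1, number
    return len(steps), None


completed, failed = walk(STEPS)
if failed is None:
    print("lookup succeeded")
else:
    server, question, _ = STEPS[failed - 1]
    print(f"lookup failed asking {server}: {question}")
```

Note that the last entry is marked as responding: ns09 and ns10 were healthy, but a lookup that starts from the root never reaches them because the walk stops at the ans01/ans02 query.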

Why did other services work?

The big question you might have is: if we reported this device as down, why were you able to access it on computers that hadn't accessed the URL before? Why could other monitoring services access the device? The answer is that they all had GoDaddy nameserver IPs cached. No one was having to look up the IP addresses for ns09 and ns10 because those IP addresses already existed in their resolver's cache. They would have been there because GoDaddy is huge and the resolver probably gets asked 1,000 times a day to resolve domain names hosted by them. Only Alertra was doing authoritative lookups and only Alertra spotted the problem. In fact, GoDaddy didn't know about it.
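
The difference between a warm and a cold cache can be sketched in a few lines. Everything here is hypothetical: the dictionaries, the function names, and the IP addresses (which come from the documentation ranges, not from GoDaddy) exist only to illustrate the behavior described above.

```python
# Hypothetical data for illustration only; the addresses are placeholders.
DELEGATION = {"custdomain.org": "ns09.domaincontrol.com"}
NS_ADDRESS = {"ns09.domaincontrol.com": "198.51.100.9"}
ANSWERS = {"custdomain.org": "192.0.2.10"}


def lookup_ns_address(ns_host, parent_responding):
    """Ask the parent servers (ans01/ans02 in this post) for a nameserver IP."""
    if not parent_responding:
        raise TimeoutError(f"no response while resolving {ns_host}")
    return NS_ADDRESS[ns_host]


def resolve(domain, ns_cache, parent_responding):
    """Resolve a domain, consulting ns_cache before asking the parent servers."""
    ns_host = DELEGATION[domain]
    if ns_host not in ns_cache:
        # Cold cache: the nameserver's own address has to be looked up first
        ns_cache[ns_host] = lookup_ns_address(ns_host, parent_responding)
    # ns09/ns10 were healthy, so once we know where they are we get an answer
    return ANSWERS[domain]


warm = {"ns09.domaincontrol.com": "198.51.100.9"}
print(resolve("custdomain.org", warm, parent_responding=False))

try:
    resolve("custdomain.org", {}, parent_responding=False)
except TimeoutError as err:
    print("cold cache:", err)
```

The resolver with the warm cache never talks to the parent servers at all, so it has no way of noticing that they are down. The cold-cache resolver, which is how our stations behave on every check, hits the timeout.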

GoDaddy Support

After identifying the problem I contacted GoDaddy support through their online chat service. Alertra is not a GoDaddy customer, so the support they could provide was more limited. But I had a very productive chat with Shawn. Here is a partial transcript of our conversation:

You're chatting with Shawn.
Me - Hi Shawn.
Shawn - Hello, my name is Shawn.  What can I help you with today?
Me - I am having problems resolving the domain 'custdomain.org' which is registered with godaddy.
Me - when I use 'dig' it gets to ns09.domaincontrol.com and then times out.
Shawn - OK, Let me look at the account.
Shawn - Do you have your Customer Number and PIN for support access to the account?
Me - It isn't my domain, but one my customer has. I do not have their information.
Me - Just trying to figure out for them why it isn't working.
Shawn - OK, let me see what I can find for us.  The information will be limited though
Shawn - OK, I am getting the site to resolve in my browser
... some text from the website ...
Me - Yes, I do as well as long as a resolver has the godaddy nameserver information cached. But if that information isn't cached, then it seems unable to resolve.
...
Me - ns09 can resolve custdomain.org. But the server responsible for resolving ns09 is not responding.
...

There was some other discussion about nameserver configuration and then we figured out a way he could see the results I was seeing.

Me - Do you have a way to verify the complete lookup? go from custdomain.org through each step until the IP is resolved? A program like 'dig' would do it.
Shawn - I can run the dig web interface
Me - Try running the dig web interface with the "Authoritative" option checked.
Shawn - OK, I am seeing that failure
Me - that is ans01 and ans02 failing to resolve ns09 and ns10.
Shawn - I am also checking and not seeing any events that would cause this being communicated to us either.
Me - Can you contact someone there to have them check the ans01 and ans02 resolvers?
Shawn - I can let my supervisor know so he can contact the right department for that.

The dig web interface at the time showed the same error I was seeing:

dig: couldn't get address for 'ns10.domaincontrol.com': no more

Since I wasn't a customer, Shawn didn't have any way to put in a ticket that would have them email me when the problem was resolved. The problem wasn't severe enough to generate a System Alert from GoDaddy, which is unfortunate. I don't know exactly when the problem was fixed, but this morning 'dig' is happy:

# dig +trace custdomain.org

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.30.rc1.el6 <<>> +trace custdomain.org
;; global options: +cmd
.           518400  IN  NS  k.root-servers.net.
.           518400  IN  NS  e.root-servers.net.
.           518400  IN  NS  b.root-servers.net.
.           518400  IN  NS  j.root-servers.net.
.           518400  IN  NS  l.root-servers.net.
.           518400  IN  NS  f.root-servers.net.
.           518400  IN  NS  d.root-servers.net.
.           518400  IN  NS  i.root-servers.net.
.           518400  IN  NS  h.root-servers.net.
.           518400  IN  NS  m.root-servers.net.
.           518400  IN  NS  c.root-servers.net.
.           518400  IN  NS  g.root-servers.net.
.           518400  IN  NS  a.root-servers.net.
;; Received 228 bytes from 127.0.0.1#53(127.0.0.1) in 48 ms

org.            172800  IN  NS  d0.org.afilias-nst.org.
org.            172800  IN  NS  b2.org.afilias-nst.org.
org.            172800  IN  NS  a2.org.afilias-nst.info.
org.            172800  IN  NS  c0.org.afilias-nst.info.
org.            172800  IN  NS  a0.org.afilias-nst.info.
org.            172800  IN  NS  b0.org.afilias-nst.org.
;; Received 429 bytes from 202.12.27.33#53(202.12.27.33) in 855 ms

custdomain.org.     86400   IN  NS  ns10.domaincontrol.com.
custdomain.org.     86400   IN  NS  ns09.domaincontrol.com.
;; Received 82 bytes from 199.19.54.1#53(199.19.54.1) in 168 ms

custdomain.org.     3600    IN  A   173.231.31.82
custdomain.org.     3600    IN  NS  ns09.domaincontrol.com.
custdomain.org.     3600    IN  NS  ns10.domaincontrol.com.
;; Received 98 bytes from 216.69.185.5#53(216.69.185.5) in 1 ms

Conclusion

DNS problems are sometimes very hard to troubleshoot. Often the URL will work in your browser even when we are telling you there is a problem. In this case we were 100% correct. Something deep in the bowels of the resolver chain was broken. This wasn't affecting our customers' own ability to get to their sites. But it may have been, and probably was, affecting some of their customers' ability to access the site, learn about their products, and place orders.