Web Host Strategies for Server Monitoring

As a hosting provider (or anyone operating a shared server), you probably have at least two reasons for deciding to use server monitoring software.

First, any time your servers are down, your phone will be ringing and your inbox will fill with messages from upset customers. Alertra's network monitoring service can alert you when your servers are unavailable, so you'll know before your customers do.

Second, you've put a lot of work into building your business: you have invested in hardware, spent countless hours configuring and testing, and spent more still keeping things running smoothly. With Alertra's server monitoring tools you can tout your servers' fantastic uptime while you build a valuable reputation.

So your server monitoring likely has two goals: get notified when things go wrong, and publish the highest uptime percentage possible. In this paper I'll give you some tips on how best to use Alertra's network monitoring services to accomplish both. This list of dos and don'ts is largely compiled from our five-year history of responding to support requests.

Do not monitor someone else's domain server.

This is probably the biggest cause of unnecessary outages among the web hosting companies we monitor. If you give Alertra a fully qualified domain name (FQDN) in the device setup, every check first has to resolve that name, so a misbehaving DNS server will cause your site to be marked down. That may be what you want if you provide DNS services to your customers. If you don't, what you are really doing is monitoring someone else's domain server: if it goes down, the downtime gets charged to your devices. Bottom Line: Unless you run your own domain name servers, do not use FQDNs in your device setups.
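To see why an FQDN makes a device depend on DNS, consider this minimal Python sketch of a TCP check. It is not how Alertra's checks are implemented; the function name classify_check and its three return labels are made up for illustration. The point is only that name resolution happens first, so a DNS failure is reported as "down" even when the web server itself is fine.

```python
import socket


def classify_check(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Return "dns-failure", "connect-failure", or "up" for a simple TCP check.

    Hypothetical helper for illustration only.
    """
    try:
        # Step 1: resolve the name. For a literal IP address no DNS lookup
        # is needed, so this step cannot fail because of a name server.
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: try each resolved address until one accepts the connection.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "up"
        except OSError:
            continue
    return "connect-failure"


if __name__ == "__main__":
    # "www.example.com" is a placeholder; substitute your own name or IP.
    print(classify_check("www.example.com", 80))
```

A monitor that only reports up or down cannot distinguish the first outcome from the second, which is why a device set up with an IP address is insulated from name server trouble.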

Do not ignore alerts.

We get e-mail all the time from customers who want to know why, after several hours or even days, we are still reporting that their device is down. As soon as Alertra generates an alert notification, the clock has started ticking on your downtime. Until Alertra determines that the site is back up, you are racking up downtime that will affect your stats for months to come. Unless the outage is caused by an error in the Alertra software, we cannot delete outages. Here is what I do: I use the stopwatch on my cell phone. As soon as I get the alert, I start the stopwatch; I can't look at my phone without seeing the time ticking away. As my team and I work on the problem, I even report the current readout on the stopwatch over conference calls. Bottom Line: Over a month, every hour of downtime takes more than 0.1 percentage points off your uptime percentage. Do not ignore device down alerts.
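To put numbers on that bottom line, here is a minimal Python sketch of the arithmetic. It assumes a 30-day (720-hour) reporting period; the function name uptime_percent is illustrative, and the exact way Alertra computes its published statistics is not described in this paper.

```python
def uptime_percent(downtime_hours: float, period_hours: float = 30 * 24) -> float:
    """Return uptime as a percentage of the reporting period.

    Assumes a 30-day period by default; pass period_hours to change it.
    """
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


if __name__ == "__main__":
    for hours in (1, 4, 14.4):
        print(f"{hours:>5} h down -> {uptime_percent(hours):.2f}% uptime")
```

Under that assumption a single hour of downtime costs about 0.14 percentage points, and 14.4 hours of unattended downtime is enough to pull a month down to 98%.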

Do use the "Check All" feature.

Once you have repaired the server and it is back on-line, your next task should be to make sure that Alertra knows it is on-line. Until Alertra successfully checks the device, the downtime keeps going up. If you've already received the "Device OK" alert for the device, then everything is fine. Otherwise, you should log in to your Alertra account and use the "Check All" feature. The "Check All" button can be found at the bottom of the Devices list on the right side of the page. When you click it, all of your devices will be immediately scheduled for a re-check. If your device is really back up, it should be marked up again very soon after the re-check. If it is still not up, you'll find out right away, instead of a few days later when a customer points out your 98% uptime for the month. Bottom Line: After an outage, use the "Check All" function to test your server and make sure you've really fixed the problem.

Do make DNS changes very carefully.

If you do want to monitor your domain name server, then you'll want to use fully qualified domain names (FQDNs) when setting up your Alertra devices. If you are using FQDNs, though, you'll want to make changes to those domain names very carefully. The issue is the TTL (time-to-live) of each DNS entry. When you set up a DNS entry you also give it a TTL, which allows web browsers and other Internet services to cache the mapping between your domain name and its IP address so they don't have to ask for it each time. Caching drastically speeds up the user experience while cutting down the load on your domain servers. The problem comes when you change the DNS entry for a name to point to a new IP. If it's not done right, this can spawn the dreaded "rolling outage", where Alertra alternately tells you your site is down and then a few minutes later says it is back up again. The cycle can continue for up to an hour (Alertra's servers have a maximum TTL of 60 minutes). With our dispersed network of monitoring stations, it can take a while for your DNS changes to propagate to all of them, because each station has to wait for its cached entry's TTL to, well, time out. This is really no different from what your customers experience: they too will have problems accessing your site after you change servers and update your DNS entries, except that their browsers will wait out the full TTL. Bottom Line: Make DNS changes carefully so they don't cause outages. First, lower the TTL to 60 seconds or so without changing the rest of the entry. Then wait until the previous TTL has expired; if your TTL was 24 hours, you'll need to wait 24 hours before making the final change. Finally, make your DNS changes and set the TTL back to what it was before.
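The timing in that bottom line can be sketched in a few lines of Python. This is only a scheduling aid built on the steps above; the function name plan_dns_change and the wording of the step descriptions are my own, and it does not talk to any DNS server.

```python
from datetime import datetime, timedelta


def plan_dns_change(
    start: datetime, old_ttl_seconds: int, low_ttl_seconds: int = 60
) -> list[tuple[datetime, str]]:
    """Return (when, what) steps for moving a DNS entry to a new IP.

    start is the moment you lower the TTL. Caches may hold the entry for
    the full old TTL, so the IP change must wait at least that long.
    """
    if old_ttl_seconds <= 0 or low_ttl_seconds <= 0:
        raise ValueError("TTL values must be positive")
    change_at = start + timedelta(seconds=old_ttl_seconds)
    return [
        (start, f"Lower the TTL to {low_ttl_seconds} s; leave the rest of the entry unchanged"),
        (change_at, f"Point the entry at the new IP and restore the TTL to {old_ttl_seconds} s"),
    ]


if __name__ == "__main__":
    # Example: a 24-hour TTL, with the TTL lowered right now.
    for when, what in plan_dns_change(datetime.now(), 24 * 60 * 60):
        print(when.strftime("%Y-%m-%d %H:%M"), "-", what)
```

Once the short TTL is in effect everywhere, any cache that picks up the old IP should hold it for only about a minute, which is what keeps the switch from turning into a rolling outage.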

Do not monitor your web site or your customers' web sites on a shared server.

If you want to publish uptime stats for a shared server, you definitely want to monitor the HTTP service on that server. For shared web hosts, monitoring the web service with either an HTTP/S or HTTPb check is really the best way to give your monitoring legitimacy. You can use a PING check to make sure the server is up, but that doesn't tell your customers how good your Apache or IIS server is at serving their pages. That being said, if you use an HTTP/S or HTTPb check, do not point it at your own web site (if it is on the same server) or at any of your customers' sites. Why? Because one day you will take your site down for maintenance, or your customer will, and suddenly an outage will show up on your formerly pure uptime percentage. If you want to provide monitoring for your customers' sites, you can do that with separate devices; the device you use to advertise your uptime should not be your site or a customer's site. Bottom Line: It is a very good idea to use an HTTP/S or HTTPb monitor for your shared server; users should lend your uptime stats more credence because you are showing your actual web server in action. Create a dedicated test page to monitor, though; don't monitor your own web site or your customers'.
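Before you point a device at your test page, it is worth confirming that the page behaves the way a monitor would expect. The following Python sketch fetches a page and looks for a known marker string. It is a generic illustration, not Alertra's HTTP/S or HTTPb check; the function name page_is_up, the URL, and the marker text are all assumptions.

```python
import http.client
from urllib.request import urlopen


def page_is_up(url: str, marker: str, timeout: float = 10.0) -> bool:
    """Return True if url answers with status 200 and the body contains marker.

    Hypothetical helper: any network error, timeout, or HTTP error status
    counts as "down".
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and marker in body
    except (OSError, http.client.HTTPException):
        # urllib's URLError and HTTPError are subclasses of OSError.
        return False


if __name__ == "__main__":
    # Placeholder URL and marker; use your own dedicated test page.
    print(page_is_up("http://192.0.2.10/monitor-test.html", "monitor-ok"))
```

A small static page served by the same web server as your customers' sites exercises the real HTTP service, yet nobody has a reason to take it down for maintenance.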

Do set up multiple contacts and an escalation schedule.

Whenever you have an outage, the stopwatch is ticking on your downtime. Every minute of downtime affects your uptime percentage. In addition, and more importantly, your customers can't access their sites either. Alertra has a wide variety of contact methods and a very sophisticated escalation mechanism. You should set up a variety of contact methods and use the escalation schedule to become more and more persistent the longer the outage lasts. Have you heard the saying, "When it rains, it pours"? It is amazing how many server outages I have experienced where not only was the server down, but for some reason my phone carrier was also having problems delivering SMS messages. One time I got an entire day's worth of messages all at once. My phone's battery also seems to die just before a big outage. When setting up your notification contacts, configure more than one way to get hold of you or your support staff. Ideally you want to use two or more SMS contacts on multiple carriers and at least one voice phone contact. Next, you should set up your notification schedule so that those SMS and voice phone contacts continue to be notified for as long as the outage lasts. You can do that using the "Repeat Device Down notification every X minutes" check box on the Notification Schedule page. Bottom Line: Configure multiple methods of contacting you when an outage occurs. Use multiple carriers for SMS and voice phone contacts if possible (e.g. you use Cingular and someone else on your contact list uses T-Mobile). Set the "Repeat Device Down notification every X minutes" check box.
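If it helps to see what an escalation schedule amounts to, here is a small Python model of one. Everything in it is hypothetical (the Contact fields, the names, and the timings); it does not reflect Alertra's actual settings or data model, only the idea of several contacts being notified repeatedly as an outage drags on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    method: str            # e.g. "sms" or "voice"
    first_after_min: int   # minutes into the outage before the first notice
    repeat_every_min: int  # 0 means notify only once


def notifications(contacts: list[Contact], outage_minutes: int) -> list[tuple[int, str, str]]:
    """Return sorted (minute, name, method) notices for an outage of this length."""
    events = []
    for contact in contacts:
        minute = contact.first_after_min
        while minute <= outage_minutes:
            events.append((minute, contact.name, contact.method))
            if contact.repeat_every_min <= 0:
                break
            minute += contact.repeat_every_min
    return sorted(events)


if __name__ == "__main__":
    schedule = [
        Contact("on-call admin", "sms", 0, 15),     # carrier A
        Contact("backup admin", "sms", 15, 15),     # carrier B
        Contact("support manager", "voice", 30, 30),
    ]
    for minute, name, method in notifications(schedule, 60):
        print(f"t+{minute:>2} min: {method} to {name}")
```

Listing the notices this way makes gaps easy to spot, such as a stretch of the outage during which only a single carrier is being tried.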

Summary

Alertra is an excellent resource both for making your current customers happy and for attracting new ones. Our network monitoring services provide the means for you to be notified when outages occur. You will fix problems sooner because you'll know about them sooner, and that will keep your current customer base happy. Alertra also provides the means to publish your uptime to the rest of the world, allowing you to differentiate the quality of the service you provide from that of your competitors. Using the strategies in this paper, you can maximize your uptime while cutting down or eliminating "false positives" that drain your uptime percentage.