How We Calculate Uptime

How should uptime be calculated? The answer depends a lot on whom you ask and what their vested interest is, since we all have our biases. Early on, we tried to strike the best balance we could in how we record and report uptime and downtime data. Our customers use our data for a variety of purposes (internal reporting, SLAs, advertising, etc.), so we want to be fair while providing data that is consistent, accurate, and meaningful.

For us, there have always been two main issues or points of contention.

The Monitoring Interval

We check whether a device is up or down at the interval you specify. We don't check more frequently while the device is down, because doing so would distort the statistical value of the data. It's true that we overstate downtime when we don't detect the end of an outage immediately, but we also overstate uptime when we don't immediately detect the beginning of an outage. Think of the monitoring interval as a random sampling of whether a system is up or down. If you alter the interval while the system is down, you ruin the randomness of the sample.

Here's an example where the monitoring interval is 60 minutes (which we do not recommend for important systems, but it makes the illustration easier to follow):

  1. We last checked a URL at 01:00 and it was working fine.
  2. At 01:30 the URL started timing out due to a backend database crash.
  3. At 02:00 we checked the URL and generated a Device Down alert.
  4. At 02:30 the database was fixed and the URL came back online.
  5. At 03:00 we checked the URL and generated a Device OK alert.

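In the example, what is the actual uptime percentage during the two-hour period from 01:00 to 03:00? The URL was actually down for 60 minutes (01:30 to 02:30), so it was only available 50% of the time. We also recorded 60 minutes of downtime (02:00 to 03:00), so the reported uptime is 50% even though our timestamps lag the real outage by 30 minutes. If we were to start checking the URL very frequently while it was offline, we would record only 30 minutes of downtime (02:00 to 02:30) and report the uptime as 75%, which would be very inaccurate.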
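The arithmetic in this example can be reproduced with a small Python sketch. It is only an illustration, not our production code: the function names are made up, times are expressed as minutes after midnight, and it assumes that the span between two checks is charged to whatever state the earlier check observed.

```python
def recorded_downtime(check_times, outage_start, outage_end, window_end):
    """Minutes recorded as down, given the times at which checks ran.

    Assumption for this sketch: the span from one check to the next is
    charged to whatever state the earlier check observed.
    """
    times = list(check_times) + [window_end]
    down = 0
    for this_check, next_check in zip(times, times[1:]):
        if outage_start <= this_check < outage_end:  # check fell inside the outage
            down += next_check - this_check
    return down


def uptime_percent(down_minutes, window_minutes):
    """Share of the window that was recorded as up, as a percentage."""
    return 100 * (window_minutes - down_minutes) / window_minutes


# Timeline from the example: window 01:00-03:00, outage 01:30-02:30.
WINDOW_START, WINDOW_END = 60, 180
OUTAGE_START, OUTAGE_END = 90, 150

# Fixed interval: checks at 01:00 and 02:00 (the 03:00 check closes the window).
fixed_checks = [60, 120]
# Hypothetical alternative: re-check every minute once the device is down.
accelerated_checks = [60, 120] + list(range(121, 151))

for label, checks in (("fixed", fixed_checks), ("accelerated", accelerated_checks)):
    down = recorded_downtime(checks, OUTAGE_START, OUTAGE_END, WINDOW_END)
    print(label, down, uptime_percent(down, WINDOW_END - WINDOW_START))
# fixed 60 50.0
# accelerated 30 75.0
```

The fixed schedule records the right amount of downtime (60 minutes) at slightly wrong times, while the accelerated schedule records only the tail of the outage.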

Our goal is to report a statistically accurate picture of uptime over a period of time. To achieve that we need two things:

  1. A fairly short monitoring interval. 60-minute intervals are terrible at providing accurate uptime statistics; 1-minute intervals are of course much better.
  2. A long period of time. Because of the gap between an outage's actual start/stop time and our detection of that start/stop time, it takes time for the errors to average out. In reality, it takes a long time to get an accurate uptime percentage. Shorter monitoring intervals help a lot, but it still takes time.
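To get a feel for how these two factors interact, here is a rough simulation in Python. Everything in it is an assumption made for the sake of the demo: the outage pattern (one outage of 10 to 90 minutes per day at a random time) is invented, the function names are hypothetical, and each check interval is charged in full to the state seen at the check that opens it.

```python
import random

MINUTES_PER_DAY = 24 * 60


def make_timeline(days, rng):
    """Per-minute timeline (1 = down) with one made-up outage per day."""
    timeline = bytearray(days * MINUTES_PER_DAY)
    for day in range(days):
        duration = rng.randint(10, 90)  # minutes, invented for the demo
        start = day * MINUTES_PER_DAY + rng.randrange(MINUTES_PER_DAY - duration)
        timeline[start:start + duration] = b"\x01" * duration
    return timeline


def actual_uptime(timeline):
    """True uptime percentage, counting every down minute."""
    return 100 * (1 - sum(timeline) / len(timeline))


def measured_uptime(timeline, interval):
    """Uptime percentage as seen by checking every `interval` minutes."""
    down = 0
    for t in range(0, len(timeline), interval):
        if timeline[t]:  # device was down at this check
            down += min(interval, len(timeline) - t)
    return 100 * (1 - down / len(timeline))


rng = random.Random(1)  # fixed seed so the demo is repeatable
for days in (1, 30, 365):
    timeline = make_timeline(days, rng)
    errors = {
        interval: round(measured_uptime(timeline, interval) - actual_uptime(timeline), 3)
        for interval in (60, 5, 1)
    }
    print(f"{days:>3} days, error by interval (percentage points): {errors}")
```

In a run like this, each outage can be mismeasured by up to one full interval in either direction, so the 60-minute error is typically the largest over a single day and tends to shrink as more days are included, while the 1-minute error is zero here only because the invented outages start and stop on whole minutes.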

Note: You can always click the Check Now button for a device as soon as you resolve an outage and we'll stop recording downtime. The effect of that over time would be to make your uptime look a little better than it really is, but sometimes you just need to get the Device OK alert right now so you can go back to bed.

By the way, we're not just trying to promote 1-minute monitoring intervals because they cost more. We think customers should choose the interval based on how long an outage they can tolerate before receiving an alert.

Also, keep in mind that our Synapse technology allows us to detect many kinds of outages within seconds, regardless of your monitoring interval. If your outages are mostly caused by network failures, we can develop accurate uptime statistics very quickly.

Maintenance Time

The other question is what to do about maintenance time. One school of thought says that if you take your site down for scheduled maintenance, and you've informed your visitors about it in advance, then it's not really downtime and shouldn't be counted against your uptime percentage. Uptime purists maintain that any time a site is unavailable, it's downtime as far as the site visitors are concerned.

Our belief is that it really depends on the application, the expectations of the particular set of site users, the organization, the organization's stakeholders, and on and on.

In an effort to strike the best balance, our approach is to simply exclude maintenance time from the uptime calculation. Here's an example:

  1. At 01:00 the user puts the site into maintenance mode, which means we stop monitoring.
  2. At 01:05 the user takes the site out of maintenance mode.
  3. At 01:59 we detect a site outage.
  4. At 02:00 we detect the site is okay again.

What is the uptime percentage during the time period of 01:00 - 02:00?

We calculate the uptime percentage using this formula: Uptime / (Uptime + Downtime), expressed as a percentage.

  1. Uptime = 54 minutes.
  2. Downtime = 1 minute.
  3. Maintenance time = 5 minutes.
  4. Uptime percentage = 54 / (54 + 1) = 98.182%

So time spent in maintenance doesn't count for you and it doesn't count against you; it just doesn't count.
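The calculation above can be written as a few lines of Python. This is a minimal sketch of the formula only; the function name is hypothetical, and returning None when there is no monitored time at all is our assumption for the sketch, not documented behavior. The last two lines show, for comparison, what the figure would be under the two other schools of thought.

```python
def uptime_percentage(up_minutes, down_minutes):
    """Uptime / (Uptime + Downtime) as a percentage.

    Maintenance minutes are simply never passed in, so they cannot
    help or hurt the result.
    """
    monitored = up_minutes + down_minutes
    if monitored == 0:
        return None  # assumption: nothing was monitored, so there is no figure to report
    return 100 * up_minutes / monitored


window = 60       # 01:00 - 02:00
maintenance = 5   # 01:00 - 01:05
down = 1          # 01:59 - 02:00
up = window - maintenance - down  # 54

print(round(uptime_percentage(up, down), 3))                # 98.182 (maintenance excluded)
print(round(uptime_percentage(up, down + maintenance), 3))  # 90.0   (maintenance counted as downtime)
print(round(uptime_percentage(up + maintenance, down), 3))  # 98.333 (maintenance counted as uptime)
```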

Publishing Uptime

If you're reading this, it probably means you care more about your uptime statistics than the average user. You may be interested in publishing your statistics for others to see. If so, here's how to do that.