10/03/2018

4 “Facepalm” Moments in Website Downtime History

Even the best websites with the most carefully laid contingency plans don’t always achieve 100 percent uptime. There is, however, a vast difference between properly mitigated website downtime and instances of downtime that leave the world wondering how things could have gone so horribly wrong. For example, if a website goes down for a day due to a power outage, we wonder where the backup servers are and why they haven’t kicked in. Then we find out there were never any backup servers in place to begin with. That is a facepalm moment. This month we will discuss a few true “facepalm” moments in website downtime history to help others avoid making these same brutal mistakes.

Four: Fortnite Update Fail

In July of 2018, the popular Fortnite video game went offline for updates. That in and of itself isn’t unusual, as many online games go offline routinely for scheduled maintenance and updates. When Fortnite went down on July 24, 2018, the company released a message to all its gamers stating, “Downtime for the v5.10 update has begun. Unwrap the Patch Notes to see all the tasty treats we have in store for you.” Once the update was done, Fortnite released another statement letting everyone know they could resume playing the game. The problem, however, was that numerous players still didn’t have access to Fortnite. It took quite a bit of time for access to be restored for everyone, with the likely cause being server overload. Fortnite is no stranger to press and media coverage, but usually the coverage has to do with the game being addictive and people playing it too much. While this instance of downtime is quite the facepalm moment, it may have provided a much-needed forced break for some of its players.

Three: When Gmail Goes Down

If you use Gmail for your business email (as many do), you may want to think twice. Gmail has experienced a rash of outages, one of which resulted in 150,000 users finding completely empty Gmail accounts when they logged in. Inboxes were empty. Sent mail? Missing too. The accounts looked like they were brand new, minus the welcome emails. If someone had an email with crucial information saved to their account, they were unable to access it. While Google did indeed provide regular updates as to the status of the issue and what they were doing to fix it, some people didn’t get their emails back for a harrowing four days. According to Google’s vice president of engineering, it was a software bug that resulted in the data loss. Fortunately, the company was able to restore accounts from physical tape backups. What makes this a true facepalm moment, however, is that it took a full four days for Google to fix the issue, and that multiple layers of its multi-layer protection approach failed, which is why the physical tapes were needed in the first place.

Two: Amazon Shows Us the Dangers of a Cloud-Based Future

Amazon has had its fair share of problems with downtime and outages, but the biggest facepalm moment occurred in February of 2017. In one of its largest outages ever, Amazon’s AWS servers went down for more than five hours, affecting approximately 148,000 websites that depended on AWS services. To make matters worse, Amazon couldn’t even get into its own dashboard to warn any of its customers about the downtime. Why? Because the necessary status icons used to warn customers were hosted on the very same servers that the outage had taken down (and apparently there were no backups on servers in different locations). Some of the websites that went down with AWS included Quora, Slack, Imgur, Twitch, Adobe’s cloud, Expedia, Yahoo! Mail, and more. Even some website downtime monitoring services hosted by Amazon went down, so those who had sites hosted on AWS and used monitoring services hosted on the same platform couldn’t rely on those services to alert them to the downtime.

It wasn’t just websites that went down with this Amazon outage. AWS-enabled devices stopped working as well, including garage doors, gate controls, television remotes, and other connected electronics, which may have given us a glimpse of how serious downtime could become in the future. To make matters worse, the cause of this huge disaster was a simple console typo, and it still took the company more than five hours to fix the issue. The cause behind the outage and the length of time it took Amazon to fix it definitely qualify this as a downtime facepalm moment.
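
The monitoring lesson here is easy to check for in your own setup. Below is a minimal sketch in Python, assuming a hypothetical inventory of where each service is hosted. The `Service` class, the provider and region labels, and the service names are all made up for illustration and are not taken from any real tool. It simply flags any dependency, such as a status page or uptime monitor, that lives in the same place as the site it is supposed to watch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """A hosted service. All field values here are hypothetical labels."""
    name: str
    provider: str
    region: str


def shared_failure_domains(site, dependencies):
    """Return the names of dependencies hosted with the same provider
    and in the same region as the site they are meant to support."""
    return [
        dep.name
        for dep in dependencies
        if dep.provider == site.provider and dep.region == site.region
    ]


# Hypothetical inventory: the status page shares the site's hosting,
# while the uptime monitor runs somewhere else entirely.
site = Service("storefront", "provider-a", "region-1")
dependencies = [
    Service("status-page", "provider-a", "region-1"),
    Service("uptime-monitor", "provider-b", "region-9"),
]

at_risk = shared_failure_domains(site, dependencies)
print("Would go down with the site:", at_risk)
```

Anything the function returns is a service that would likely go dark at the same moment as the site itself, which is exactly when you need it most.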

One: PlayStation Network and Nintendo eShop Both Disappoint Kids on Christmas

Here at Alertra we have frequently discussed the importance of planning for surges in traffic and server resource demands. One would think that networks as large as PlayStation Network and Nintendo eShop would have the proper measures in place to accommodate traffic surges. However, on Christmas of 2017, both networks went down. Kids who had gotten a PS4 or a Nintendo Switch for Christmas couldn’t connect them to the PlayStation or Nintendo networks. Obviously, the children who had received such gifts were disappointed and frustrated (and we imagine their parents weren’t any happier). Even the servers for redeeming game vouchers on the PlayStation Network went down, leaving recipients of those vouchers unable to redeem them. While this is never a good scenario, when it happens on Christmas it becomes a PR nightmare.

What makes this downtime incident even worse is that the issue could have been avoided entirely had the companies allocated enough server bandwidth to handle the Christmas traffic spike. However, because they failed to do just that, a cascade of server crashes plagued both networks. This was a true facepalm moment that turned both PlayStation Network and Nintendo eShop into Grinches on Christmas.

In Closing

Sometimes website downtime is inevitable. In such cases, the key to saving face is to communicate with your customers and maintain transparency. Unfortunately, in facepalm moments like these, transparency may not instill any confidence whatsoever. When an easily avoided problem results in hours or even days of website downtime and frustration, it’s a bit harder to get the public to forgive and forget, especially when the reason for the downtime makes customers put their faces in their palms in exasperation.