03/23/2013

CloudFlare Outage Being Blamed on Juniper Networks

When your company provides website security and optimization services, the last thing you want is an outage of your own. That’s exactly what happened to CloudFlare, which provides websites with optimization, analytics and security services. When a customer signs up for CloudFlare, the customer’s website traffic is routed through CloudFlare’s global network. The company automatically optimizes the delivery of customers’ web pages so that visitors experience the fastest page load times possible. CloudFlare also blocks threats and limits the damage done by abusive bots, protecting clients’ bandwidth and server resources. Unfortunately, CloudFlare experienced an outage last week, and clients weren’t able to benefit from any of these services for about an hour.

Doctor Cure Thyself

When a company is known for optimizing the performance of other websites, it doesn’t look very good when its own website and services go down. That’s why CloudFlare is determined to find out, and explain, what took down its site and its services.

From what the company can tell so far, a router glitch caused all 23 of its data centers to fail at once. All of the company’s services were affected, leaving CloudFlare customers to wonder how a company that is supposed to optimize their websites could let its own site go down.

According to CloudFlare, “When a router goes down, the routes to the network that sits behind the router are withdrawn from the rest of the Internet.” The company added, “We have already reached out to Juniper to see if this is a known bug or something unique to our setup and the kind of traffic we were seeing at the time.”
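The withdrawal CloudFlare describes can be pictured with a toy model. The sketch below is only an illustration in Python, not how real routers or routing protocols are implemented; the class, the router name and the address prefix (taken from a range reserved for documentation) are all hypothetical.

```python
class RoutingTable:
    """Toy view of which networks the rest of the Internet can reach."""

    def __init__(self):
        # Maps a network prefix to the set of routers announcing it.
        self.routes = {}

    def announce(self, router, prefix):
        """Record that `router` offers a path to `prefix`."""
        self.routes.setdefault(prefix, set()).add(router)

    def router_down(self, router):
        """Withdraw every route that was announced by the failed router."""
        for prefix in list(self.routes):
            self.routes[prefix].discard(router)
            if not self.routes[prefix]:
                # No router left announcing this prefix: it disappears.
                del self.routes[prefix]

    def reachable(self, prefix):
        """True while at least one router still announces `prefix`."""
        return prefix in self.routes


# Hypothetical example: one edge router in front of one network.
table = RoutingTable()
table.announce("edge-router-1", "203.0.113.0/24")
before = table.reachable("203.0.113.0/24")  # True: the route is announced
table.router_down("edge-router-1")
after = table.reachable("203.0.113.0/24")   # False: the route was withdrawn
```

In this simplified picture, a network stays reachable only while some working router still announces it, which is why routers failing everywhere at once would take every data center offline together.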

From what CloudFlare knows, some of the Juniper routers failed to reboot automatically, and management ports were not available to the company. This delayed CloudFlare’s efforts to get back online. While the network was fully restored within an hour, it was frustrating for customers to see some data centers come back online only to go down again.

CloudFlare had also detected a DDoS attack targeting one of its customers and pushed a filtering rule to its routers in response. According to CloudFlare, the rule should simply have matched no packets. Instead, the routers that encountered the rule consumed all of their RAM and crashed.
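To see how a filtering rule can be harmless because no packet will ever match it, consider a rule that filters on packet length. The Python sketch below is an illustration only: the length range and all names are hypothetical and are not taken from CloudFlare’s or Juniper’s configuration. The one fixed fact it relies on is that an IPv4 packet cannot exceed 65,535 bytes.

```python
from dataclasses import dataclass

# The IPv4 total-length field is 16 bits, so no packet can be larger.
MAX_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical filter rule that matches packets by length in bytes."""
    min_bytes: int
    max_bytes: int

    def matches(self, packet_length: int) -> bool:
        """True if a packet of this length falls inside the rule's range."""
        return self.min_bytes <= packet_length <= self.max_bytes


def can_ever_match(rule: LengthRule) -> bool:
    """True if at least one valid packet length satisfies the rule."""
    return rule.min_bytes <= rule.max_bytes and rule.min_bytes <= MAX_PACKET_BYTES


# Hypothetical rule asking for packets larger than any packet can be.
impossible_rule = LengthRule(min_bytes=90_000, max_bytes=90_010)
ordinary_rule = LengthRule(min_bytes=40, max_bytes=1_500)
```

A router handling `impossible_rule` correctly would never find a match and would keep forwarding traffic. A check like `can_ever_match` is one way such a rule could be rejected before it is sent to the routers, though nothing in CloudFlare’s account says such a check was or was not in place.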

What is CloudFlare Doing about the Outage?

In addition to demanding answers about what caused the outage, CloudFlare is offering service credits to accounts covered by service level agreements. The company doesn’t seem to be taking the issue lightly. It wants to know what happened and exactly why it happened. CloudFlare customers are likely wondering the very same thing.

Are Cloud Systems Reliable?

It seems that cloud systems still have some glitches that need to be worked out. CloudFlare is not the first to experience an issue. Amazon’s cloud services have failed on more than one occasion in the past. This raises the question: are cloud systems really reliable? For cloud services to meet their full potential, continuity plans must be put in place to avoid interruptions to business. Had CloudFlare had a backup plan, the one-hour outage might have lasted only moments, giving the company time to get things under control and return to normal operations. Cloud systems can be reliable, but they have a long way to go, and some serious continuity issues need to be addressed to make them as failsafe as possible.