Facebook Software Engineering Director Robert Johnson explained why the social network went offline yesterday. He called the mishap “the worst outage we’ve had in over four years.” Facebook went offline around 11:30AM PST yesterday and wasn’t fully functional again until about 3PM PST.
The downtime was caused by “an unfortunate handling of an error condition.” Facebook uses an automated system designed to check configuration values in the cache and replace any invalid ones with updated values from the persistent store.
Below is part of the blog post that Johnson wrote:
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
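The feedback loop Johnson describes can be illustrated with a small simulation. This is only a rough sketch under simplifying assumptions, not Facebook's actual code: the function name `simulate`, its parameters, the `"config"` key, and the "good"/"bad" values are all made up for illustration, and time is modeled as discrete ticks in which every client reads the same cached value before acting. The `delete_on_error=False` case shows one common mitigation (leaving the cache alone when the database returns an error), which is not necessarily what Facebook did.

```python
def simulate(clients, db_capacity, ticks, delete_on_error):
    """Return the number of database queries issued in each tick.

    Hypothetical model: the persistent store has already been fixed,
    but the shared cache still holds the invalid value.
    """
    cache = {"config": "bad"}   # invalid value still sitting in the cache
    persistent_value = "good"   # the original problem has been fixed
    queries_per_tick = []

    for _ in range(ticks):
        seen = cache.get("config")  # every client reads the same snapshot
        queries = 0
        for _ in range(clients):
            if seen == "good":
                continue            # valid cached value, no query needed
            queries += 1            # invalid or missing: ask the database
            if queries <= db_capacity:
                # The database served this query; repair the cache.
                cache["config"] = persistent_value
            elif delete_on_error:
                # Flawed handling: an error is treated like an invalid
                # value, so the (possibly repaired) cache key is deleted.
                cache.pop("config", None)
            # Otherwise: on error, leave the cache untouched.
        queries_per_tick.append(queries)

    return queries_per_tick


# Errors delete the key: the query storm never subsides.
print(simulate(1000, 100, 3, delete_on_error=True))   # [1000, 1000, 1000]
# Errors leave the cache alone: one burst, then recovery.
print(simulate(1000, 100, 3, delete_on_error=False))  # [1000, 0, 0]
```

In the first run the clients that hit an overloaded database keep wiping out the value that the successful clients just repaired, so every tick starts with an empty cache and a full load of queries. In the second run the same initial burst happens, but the repaired value survives and the load drops to zero.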
To fix the problem, Facebook had to be turned off completely. Taking down a website that 500 million users depend on is a hard decision, but the problem kept escalating and something had to be done. Fortunately, Facebook does not go offline as often as Twitter does.