Facebook blames system change for 'worst outage' in 4 years
PC Mag reports: Facebook blamed yesterday's 2.5-hour downtime on a change it made to its system, resulting in the worst outage the social-networking company had seen in four years.
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," Facebook's Robert Johnson wrote in a blog post. "This is the worst outage we've had in over four years, and we wanted to first of all apologize."
Yesterday's outage was the second in as many days for Facebook, which was hit with sporadic downtime on Wednesday because of an "issue with a third-party provider."
Facebook has an automated system that checks for invalid configuration values throughout the site. If it finds an error, it replaces it with an updated value from its persistent store.
"This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid," Johnson wrote.
Unfortunately, the change Facebook made to its persistent store yesterday introduced an invalid configuration value. As a result, the automated system checking for errors kept replacing the cached values with data from the persistent store, which was itself invalid.
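To make the mechanism concrete, here is a minimal sketch of that kind of self-repair logic, assuming a simple key-value cache backed by a persistent store. All names and the validity rule (CACHE, PERSISTENT_STORE, is_valid, get_config) are illustrative assumptions, not Facebook's actual code.

    # Hypothetical illustration of the self-repair mechanism described above.
    CACHE = {"feed_limit": "oops"}          # cached config with a corrupted entry
    PERSISTENT_STORE = {"feed_limit": 25}   # source of truth (here, still valid)

    def is_valid(value):
        # Toy validity check: config values must be positive integers.
        return isinstance(value, int) and value > 0

    def get_config(key):
        value = CACHE.get(key)
        if value is None or not is_valid(value):
            # Cache entry is missing or invalid: repair it from the persistent store.
            # This handles transient cache corruption, but if the stored value is
            # itself invalid, every lookup falls through to the store again.
            value = PERSISTENT_STORE[key]
            CACHE[key] = value
        return value

    print(get_config("feed_limit"))  # 25; the bad cache entry has been repaired

When the persistent store holds a good value, as here, the bad cache entry is quietly fixed; the trouble starts when the store itself is the source of the bad value.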
"Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," Johnson said. "To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key," Johnson continued. "This meant that even after the original problem had been fixed, the stream of queries continued."
The result was a "feedback loop" that didn't allow for database recovery time, he said.
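Continuing the toy model above, the sketch below shows how such a feedback loop can arise when a database error is handled exactly like an invalid value. The names, the simulated overloaded cluster, and the bounded retry loop are assumptions made for illustration only.

    # Hypothetical sketch of the flawed error handling described above: an error
    # from the database is treated like an invalid cached value, so the client
    # deletes the cache key and immediately queries again.
    class DatabaseOverloaded(Exception):
        pass

    def overloaded_db(key):
        # Simulated database cluster that is already failing under load.
        raise DatabaseOverloaded(key)

    def is_valid(value):
        # Same toy validity check as in the earlier sketch.
        return isinstance(value, int) and value > 0

    def lookup(key, cache, db_query, max_attempts=5):
        queries = 0
        for _ in range(max_attempts):   # bounded here only so the demo terminates
            value = cache.get(key)
            if value is not None and is_valid(value):
                return value, queries
            try:
                queries += 1
                value = db_query(key)
            except DatabaseOverloaded:
                # The flaw: an error is handled as if the value were invalid, so
                # the cache key is dropped and the loop re-queries right away,
                # adding still more load and denying the cluster time to recover.
                cache.pop(key, None)
                continue
            cache[key] = value
            return value, queries
        return None, queries

    value, queries = lookup("feed_limit", {}, overloaded_db)
    print(value, queries)   # None 5: every failed attempt produced another query

With no backoff and no distinction between "bad value" and "failed query," each client keeps hammering the database, which is the feedback loop Johnson describes.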
"The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition," Facebook's Robert Johnson wrote in a blog post. "This is the worst outage we've had in over four years, and we wanted to first of all apologize."
Yesterday's outage was the second in as many days for Facebook, which was hit was sporadic downtime on Wednesday because of an "issue with a third-party provider."
Facebook has an automated system that checks for invalid configuration values throughout the site. If it finds an error, it replaces it with an updated value from its persistent store.
"This works well for a transient problem with the cache, but it doesn't work when the persistent store is invalid," Johnson wrote.
Unfortunately, Facebook made a change to its persistent store yesterday that ended up being invalid. As a result, the automated system checking for errors would replace those errors with values from the persistent store - which was also not working.
"Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second," Johnson said. "To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key," Johnson continued. "This meant that even after the original problem had been fixed, the stream of queries continued."
The result was a "feedback loop" that didn't allow for database recovery time, he said.
--> In other words, they goofed. Makes you realize that we're all just hanging on by a thread when it comes to technology.