Google broke its own cloud, again, with dud DB config change
Memcache was gone in 20 seconds and down for nearly two hours
Google's again 'fessed up to cooking its own cloud.
This time the mess was brief – just under two hours last Monday – and took down its Memcache service. The result was “Managed VMs experienced failures of all HTTP requests and App Engine API calls during this incident.”
There's a little upside in the fact that Google now offers a replacement for Managed VMs, the “App Engine Flexible Environment”, and it stayed up,
But there's also some egg to smear on a face because the wound was self-inflicted. Here's Google's root cause explanation:
The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database.
The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.
As you can see, whatever the update to the database did, it broke Memcache. And for all of Google's vaunted site reliability engineering expertise, this part of its cloud could fall over in 20 seconds.
It gets worse: Google's account of the incident also reports that it tried to rollback the database to its previous state, but doing so “required an update to the configuration in the global database, which also failed.” A workaround was eventually found.
The Register knows of at least six other occasions on which Google's own actions crocked its cloud. Those incidents include a typo, human error, automated services going down and missing a BGP problem, another bad configuration change, a bad patch and a failed attempt to patch two problems at once.
None of the outages were lengthy, or as embarrassing as IBM's load balancers disappearing because it mis-registered a domain name or AWS's major S3 outage caused by a typo. But they still happened. And keep happening. ?