Google broke its own cloud, again, with dud DB config change

Memcache was gone in 20 seconds and down for nearly two hours

Google's again 'fessed up to cooking its own cloud.

This time the mess was brief – just under two hours last Monday – and took down its Memcache service. The result, in Google's words: “Managed VMs experienced failures of all HTTP requests and App Engine API calls during this incident.”

There's a little upside in the fact that Google now offers a replacement for Managed VMs, the “App Engine Flexible Environment”, and it stayed up.

But there's also egg on Google's face, because the wound was self-inflicted. Here's Google's root cause explanation:

The App Engine Memcache service requires a globally consistent view of the current serving datacenter for each application in order to guarantee strong consistency when traffic fails over to alternate datacenters. The configuration which maps applications to datacenters is stored in a global database.

The incident occurred when the specific database entity that holds the configuration became unavailable for both reads and writes following a configuration update. App Engine Memcache is designed in such a way that the configuration is considered invalid if it cannot be refreshed within 20 seconds. When the configuration could not be fetched by clients, Memcache became unavailable.

As you can see, whatever the update to the database did, it broke Memcache. And for all of Google's vaunted site reliability engineering expertise, this part of its cloud could fall over in 20 seconds.
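The mechanism Google describes is a classic staleness-based invalidation rule: clients keep serving from a cached copy of the app-to-datacenter mapping, but once that copy is more than 20 seconds old and cannot be refreshed, the whole service fails closed rather than risk an inconsistent view. The sketch below illustrates that pattern only; the names (ConfigCache, fetch_global_config) and structure are hypothetical, not Google's actual implementation.

```python
import time

STALENESS_LIMIT_SECONDS = 20  # the 20-second window Google cites


class ConfigUnavailable(Exception):
    """Raised when the datacenter mapping cannot be refreshed in time."""


class ConfigCache:
    def __init__(self, fetch_global_config):
        # fetch_global_config is a hypothetical callable that reads the
        # app -> serving-datacenter mapping from the global database.
        self._fetch = fetch_global_config
        self._config = None
        self._last_refresh = 0.0

    def refresh(self):
        """Attempt to re-read the mapping; a failed read is tolerated
        until the cached copy becomes too stale."""
        try:
            self._config = self._fetch()
            self._last_refresh = time.monotonic()
        except Exception:
            pass  # keep serving the cached copy, for now

    def get(self):
        age = time.monotonic() - self._last_refresh
        if self._config is None or age > STALENESS_LIMIT_SECONDS:
            # Strong consistency over availability: refuse to serve a
            # possibly outdated mapping, so requests fail instead.
            raise ConfigUnavailable("mapping is stale; failing requests")
        return self._config
```

Under this rule, any fault that blocks reads of the configuration entity for longer than the staleness limit, as happened here, takes the dependent service down with it.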

It gets worse: Google's account of the incident also reports that it tried to roll back the database to its previous state, but doing so “required an update to the configuration in the global database, which also failed.” A workaround was eventually found.

The Register knows of at least six other occasions on which Google's own actions crocked its cloud. Those incidents include a typo, human error, automated services going down and missing a BGP problem, another bad configuration change, a bad patch and a failed attempt to patch two problems at once.

None of the outages were lengthy, or as embarrassing as IBM's load balancers disappearing because it mis-registered a domain name or AWS's major S3 outage caused by a typo. But they still happened. And keep happening. ®

