Kubernetes bug ate my banking app! How code flaw crashed Brit upstart

Monzo engineering chief details exact cause of outage

Monzo, a UK online banking startup, suffered an outage on Friday for over an hour due to a four-month-old Kubernetes bug.

The Fatal Flaw, as the event might be titled by author Lemony Snicket, took down a complete production cluster, according to Oliver Beattie, head of engineering for Monzo, "through a very unfortunate series of events."

During this period, customers saw incoming payments delayed and outgoing payments fail. Monzo essentially operates as an internet-based bank, accessible through its smartphone app, that offers current accounts, budgeting tools, spending warnings, and so on.

On Monday, Beattie posted an analysis of the incident and laid the blame on Kubernetes and its incompatibility with related software.

Monzo's stack, Beattie explained, relies on Kubernetes for cluster orchestration, the distributed database etcd, and linkerd, software that manages cluster routing and load balancing.

Two weeks prior to the outage, Monzo's platform team upgraded its etcd cluster to a new version and expanded its size from three nodes to nine. In so doing, they set the stage for the outage. On Thursday, an engineering team deployed a new feature for account holders, but started seeing issues and scaled the service down so it was not running on any replicas but remained as a Kubernetes service.

On Friday, around 14:10 BST, a change was made to a service used for processing payments. At that point, customers began experiencing payment failures. Two minutes later, the change was rolled back but the problems persisted.

By 14:18, Monzo's engineers traced the problem to linkerd. The software wasn't receiving updates from Kubernetes about where new pods were running on the network and was routing requests to IP addresses that were no longer valid.
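To illustrate the failure mode described here, the following is a minimal sketch of a router that keeps its own copy of where each service lives. It is not linkerd's code: the `EndpointCache` class, its methods, and the injected clock are all hypothetical names chosen for illustration. The point is only that a cache which stops receiving updates will keep handing out whatever addresses it saw last.

```python
import time
from typing import Callable, Dict, List, Tuple


class EndpointCache:
    """Hypothetical in-memory map of service name -> pod IPs held by a router."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def update(self, service: str, ips: List[str]) -> None:
        # Called whenever the orchestrator pushes a fresh endpoint list.
        self._entries[service] = (list(ips), self._clock())

    def resolve(self, service: str) -> List[str]:
        # Returns the last known IPs, whether or not they are still valid.
        ips, _ = self._entries.get(service, ([], 0.0))
        return list(ips)

    def age(self, service: str) -> float:
        # Seconds since the last update; infinite if none was ever received.
        if service not in self._entries:
            return float("inf")
        return self._clock() - self._entries[service][1]
```

If the pods behind a service are rescheduled but `update` is never called, `resolve` continues to return the old addresses. Tracking `age` is one way a router could at least notice that its view has gone quiet, though whether that would have helped in this incident is not something the post-mortem says.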

At 14:26, they decided to restart the several hundred linkerd instances running on the backend in the belief that doing so would fix the issue across the board. But they couldn't, because the Kubelets running on the cluster's nodes were unable to fetch configuration data from the Kubernetes apiservers.

Suspecting additional issues affecting either Kubernetes or etcd, they restarted three apiserver processes. By 15:13, all the linkerd pods had restarted. But the banking app's services were not receiving any requests. It was, by this point, a full platform outage.

At 15:27, the engineers noticed linkerd logging a NullPointerException while trying to read the service discovery response from the apiservers. They realized the failure to parse empty responses was due to an incompatibility between the versions of Kubernetes and linkerd being run.
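The class of bug described above, a parser that trips over an empty service discovery response, can be sketched in a few lines. The example below assumes the response is a Kubernetes Endpoints object serialized as JSON, in which a service with no running pods may carry a `subsets` field that is `null` or missing altogether. The function name `endpoint_ips` is hypothetical, and this is not linkerd's parser (linkerd runs on the JVM); it only shows the defensive pattern.

```python
from typing import Any, Dict, List


def endpoint_ips(endpoints: Dict[str, Any]) -> List[str]:
    """Collect pod IPs from an Endpoints-style object, tolerating empty input."""
    # "subsets" may be absent or null when a service has no replicas,
    # so fall back to an empty list instead of iterating over None.
    subsets = endpoints.get("subsets") or []
    ips: List[str] = []
    for subset in subsets:
        # "addresses" can likewise be missing when no pod is ready.
        for address in subset.get("addresses") or []:
            ip = address.get("ip")
            if ip:
                ips.append(ip)
    return ips
```

A parser written without the `or []` fallbacks would raise on the `null` case, which is the Python analogue of the NullPointerException the engineers saw in the logs.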

To restore service, they turned to an updated version of linkerd being tested in the company's staging environment. After deploying the necessary version upgrade, they recognized that they could avoid the error that arose from trying to parse services with no replicas by deleting them. That allowed linkerd to resume its service discovery and the platform started to recover.
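The workaround hinges on finding services that exist but have nothing running behind them. As a rough sketch, and assuming the cluster's Endpoints objects have already been fetched and decoded into Python dictionaries, the filter could look like the following. The function name `services_without_endpoints` is hypothetical, this is not Monzo's tooling, and the deletion step itself is deliberately left out.

```python
from typing import Any, Dict, List


def services_without_endpoints(items: List[Dict[str, Any]]) -> List[str]:
    """Return names of Endpoints-style objects that list no pod addresses."""
    names: List[str] = []
    for item in items:
        # Treat a missing or null "subsets" field the same as an empty list.
        subsets = item.get("subsets") or []
        has_address = any(subset.get("addresses") for subset in subsets)
        if not has_address:
            metadata = item.get("metadata") or {}
            names.append(metadata.get("name", "<unnamed>"))
    return names
```

Run against a list containing the scaled-down service from Thursday, a filter like this would return that service's name as a candidate for removal.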

Beattie said his team "found a bug in Kubernetes and the etcd client that can cause requests to timeout after cluster reconfiguration of the kind we performed the week prior. Because of these timeouts, when the service was deployed, linkerd failed to receive updates from Kubernetes about where it could be found on the network."

Restarting the linkerd instances compounded the problem, he said, because it revealed an incompatibility between specific versions of linkerd and Kubernetes.

"I want to reassure everyone that we take this incident very seriously; it’s among the worst technical incidents that have happened in our history, and our aim is to run a bank that our customers can always depend on," Beattie concluded. "We know we let you down, and we’re really sorry for that."

The frank mea culpa appears to have been well-received by customers, with a number of them voicing appreciation for the detailed disclosure and explanation.


