Friday's outage post mortem
The following is a post-mortem on the downtime experienced on Friday 3rd July. It was sent around the GoCardless team and is published here for the benefit of our integrators.
Any amount of downtime is completely unacceptable to us. We're sorry to have let you, our customers, down.
Summary
On Friday 3rd July 2015, we experienced three separate outages over a period of 1 hour and 35 minutes. They affected our API and our Pro API, as well as the dashboards powered by these APIs.
The first incident was reported at 16:37 BST, and all our services were fully up and stable by 18:12 BST.
We apologise for the inconvenience caused. More importantly, we are doing everything we can to make sure this can't happen again.
Cause
On Friday 3rd July 2015, at 16:35 BST, we started experiencing network failures on our primary database cluster in SoftLayer’s Amsterdam datacentre. After several minutes of diagnosis it became evident that SoftLayer were having problems within their internal network: while troubleshooting we saw very high latency and heavy packet loss between servers in our infrastructure. We immediately got in touch with SoftLayer’s technical support team, who confirmed they were having issues with one of their backend routers, which was causing connectivity issues across the whole datacentre.
At 16:37 BST, all the interfaces on SoftLayer’s backend router went down. This caused our PostgreSQL cluster to be partitioned since all nodes lost connectivity to one another. As a result, our API and Pro API became unavailable, as well as the dashboards using these APIs. At 16:39 BST the network came back online and we started repairing our database cluster.
By 16:44 BST, we had brought our PostgreSQL cluster back up and normal service resumed.
At 17:07 BST, the interfaces on SoftLayer's router flapped again, making our service unavailable. This time, one of the standby nodes in our cluster became corrupted. While some of the team worked on repairing the cluster, the rest began preparing to fail over our database cluster to another datacentre.
By 17:37 BST, SoftLayer’s internal network was working properly again, and we received confirmation from SoftLayer that the situation was fully mitigated. We concluded that, at that point, failing our database cluster over to a different datacentre would cause unnecessary further disruption.
At 18:03 BST, we saw a third occurrence of the internal network flapping, which made our service unavailable again for a short period.
By 18:12 BST, our API, Pro API and dashboards were fully operational.
Post mortem
We don't yet have any further details from SoftLayer on the cause of the router issues; we are awaiting their root cause analysis.
Currently, our PostgreSQL cluster automatically fails over to a new server within the same datacentre in the event of a single-node failure. Following Friday's events, we have brought forward work to fully automate this failover across datacentres so that we can recover faster from datacentre-wide issues.
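To give a sense of the kind of automation we mean, here is a minimal sketch of a cross-datacentre failover check. This is not our production tooling: the hostnames, thresholds and promotion command are hypothetical stand-ins, and a real setup would also need fencing of the old primary and a quorum decision so that a network partition like Friday's cannot cause split-brain.

```python
# Minimal sketch of a cross-datacentre failover check.
# Hostnames, thresholds and the promotion command below are hypothetical.
import socket
import subprocess
import time

PRIMARY = ("db1.ams.example.internal", 5432)   # current primary (Amsterdam)
PROMOTE_STANDBY_CMD = [
    "ssh", "db1.fra.example.internal",
    "pg_ctl", "promote", "-D", "/var/lib/postgresql/data",
]
FAILURE_THRESHOLD = 6   # consecutive failed checks before we fail over
CHECK_INTERVAL = 5      # seconds between checks


def primary_reachable(host, port, timeout=3):
    """Return True if we can open a TCP connection to the primary's Postgres port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main():
    failures = 0
    while True:
        if primary_reachable(*PRIMARY):
            failures = 0
        else:
            failures += 1
            print(f"primary check failed ({failures}/{FAILURE_THRESHOLD})")
        if failures >= FAILURE_THRESHOLD:
            # In production this step would require fencing the old primary
            # and agreement from multiple observers before promoting.
            print("promoting standby in the other datacentre")
            subprocess.run(PROMOTE_STANDBY_CMD, check=True)
            break
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()
```

In practice a check like this would run from a third location, so that the monitoring itself does not share the network path that is failing.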