Update on the GoCardless service outage
Last editedJan 20202 min read
On Thursday, we experienced an outage of 1 hour and 40 minutes that affected all our services. During that time you may have seen brief service recovery but for the most part our service was unavailable.
On Tuesday 19th, at 22:00 UTC, we received a notification from our infrastructure provider scheduling emergency maintenance in one of their Amsterdam datacentres on the next day at 23:00 UTC.
They were hoping to fix a recurring issue with one of the routers by performing a reboot of the chassis pair. As part of this maintenance they also planned to perform firmware upgrades, for an estimated total downtime of approximately 20 minutes.
Since our outage in July 2015 we have been working on getting each component of our infrastructure to span multiple data centres so that we can gracefully handle this kind of datacentre-wide failure. All the infrastructure on the critical path for our service was, as of January 21st, available in multiple locations, except for our database.
On Wednesday 20th, in the morning, the Platform Engineering team gathered and put together a plan to migrate the database to a new datacentre before the start of the maintenance.
On Wednesday 20th, at 22:00 UTC, we decided that the plan was not viable to complete before the maintenance. During the day several complications prevented us properly testing the plan in our staging environment. We made the call to take the full 20 minutes of downtime rather than cause more disruption by executing a plan that we could not fully prepare.
On Thursday 21st, at 00:25 UTC, our provider performed the reboot of the chassis, which instantly caused the GoCardless API and Pro API to be unavailable.
At 00:49 UTC, they announced that the maintenance was over. Our services started slowly recovering.
At 00:50 UTC, our monitoring system alerted us that some of our services were still unavailable. We immediately started troubleshooting.
At 01:00 UTC, a scheduled email announcing the end of the maintenance window was sent incorrectly.
At 02:00 UTC, we discovered the issue. Our database cluster is setup in such away that any write to the database, for example, creating a payment, needs to be recorded on two servers before we say it’s successful. Unfortunately during the maintenance, the link between our primary and our standby server broke. That meant that no write transactions could go through until that link was restored. Also, since the writes were blocking, we quickly exhausted our connection slots and read requests started failing too. After a few seconds the writes would time out and we would start having successful read requests again. That explains why we were sporadically up during that time. Once we brought our cluster back together, requests started flowing through again.
By 02:05 UTC, all services were fully operational.
We are following up with our infrastructure provider to figure out why they had to perform maintenance in that datacentre at such short notice.
We have now provisioned a standby cluster in another datacentre so that we are able to migrate to a new datacentre with very little downtime in case this happens again. Long term, as we said in our last blog post, we have plans for how we can further minimise disruption in the case of datacentre-wide failure. This project is the main focus of the Platform Engineering team for the next six months.