Incident review: API and Dashboard outage on 10 October 2017
This post represents the collective work of our Core Infrastructure team's investigation into our API and Dashboard outage on 10 October 2017.
As a payments company, we take reliability very seriously. We hope that the transparency in technical write-ups like this reflects that.
We have included a high-level summary of the incident, and a more detailed technical breakdown of what happened, our investigation, and changes we've made since.
On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard, lasting 1 hour and 50 minutes. Every request made during that time failed and returned an error.
The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.
This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours.
All fun and games until you start with GameDays
As a payments company, our APIs need to have as close to 100% availability as possible. We therefore need to ensure we’re ready for whatever comes our way: from losing a server without bringing the API down, to knowing how to react if a company laptop is compromised.
To accomplish this we run GameDay exercises. What you will read below is our version of a GameDay. We hope that by sharing how we do GameDays we can give you a starting point for running your first GameDay.
In search of performance - how we shaved 200ms off every POST request
While doing some work on our Pro dashboard, we noticed that search requests were taking around 300ms. We've got some people in the team who have used Elasticsearch for much larger datasets, and they were surprised by how slow the requests were, so we decided to take a look.
Today, we'll show how that investigation led to a 200ms improvement on all internal POST requests.
What we did
We started by taking a typical search request from the app and measuring how long it took. We tried this both from Ruby, using Net::HTTP, and from the command line, using curl. The latter was visibly faster. Timing the requests showed that the request from Ruby took around 250ms, whereas the one from curl took only 50ms.
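A comparison like this can be reproduced with a small timing helper. The following is a minimal sketch, not the code we actually ran: the helper names time_ms and timed_post are made up for illustration, and the endpoint you pass in is whatever search URL you want to measure.

```ruby
require "net/http"
require "uri"

# Run the given block and return [result, elapsed_ms]. The monotonic clock
# is used so that wall-clock adjustments cannot skew the measurement.
def time_ms
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
  [result, elapsed]
end

# POST a JSON body with Net::HTTP and return [response, elapsed_ms].
# The timing includes connection setup, mirroring a one-off curl call.
def timed_post(url, json_body)
  uri = URI(url)
  time_ms do
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
      request = Net::HTTP::Post.new(uri.request_uri, "Content-Type" => "application/json")
      request.body = json_body
      http.request(request)
    end
  end
end
```

Calling timed_post with a search URL and a JSON query returns the response along with the elapsed milliseconds. On the curl side, a roughly equivalent figure is available from its --write-out option, for example -w '%{time_total}', which reports the total time in seconds. A single request is noisy, so it is worth repeating the measurement a few times before drawing conclusions.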
We were confident that whatever was going on was isolated to Ruby, but we wanted to dig deeper, so we moved over to our staging environment. At that point, the problem disappeared entirely.
For a while, we were stumped. We run the same versions of Ruby and Elasticsearch in staging and production. It didn't make any sense! We took a step back, and looked over our stack, piece by piece. There was something in the middle which we hadn't thought about - HAProxy.
We quickly discovered that, due to an ongoing Ubuntu upgrade, we were using different versions of HAProxy in staging (1.4.24) and production (1.4.18). Something in those 6 patch revisions was responsible, so we turned our eyes to the commit logs. There were a few candidates, but one patch stood out in particular.
We did a custom build of HAProxy 1.4.18, with just that patch added, and saw request times drop by around 200ms. Job done.
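The version mismatch between staging and production is the kind of drift that a simple check can surface early. Below is a rough sketch of such a check. The version_drift helper and the inventory hashes are hypothetical; only the two HAProxy versions come from this investigation, and the Ruby and Elasticsearch version numbers are placeholders.

```ruby
require "rubygems" # provides Gem::Version; usually loaded by default

# Compare two { component => version } hashes and return only the components
# whose versions differ between the two environments.
def version_drift(staging, production)
  (staging.keys & production.keys).each_with_object({}) do |name, drift|
    stg = Gem::Version.new(staging[name])
    prod = Gem::Version.new(production[name])
    drift[name] = { staging: stg.to_s, production: prod.to_s } unless stg == prod
  end
end

# Illustrative inventories. The non-HAProxy versions are placeholders.
STAGING    = { "ruby" => "2.0.0", "elasticsearch" => "1.0.0", "haproxy" => "1.4.24" }
PRODUCTION = { "ruby" => "2.0.0", "elasticsearch" => "1.0.0", "haproxy" => "1.4.18" }

DRIFT = version_drift(STAGING, PRODUCTION)
```

With these inputs, DRIFT contains a single entry for haproxy, which is exactly the component we had overlooked when comparing the two environments by hand.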