Incident review: API and Dashboard outage on 10 October 2017
This post represents the collective work of our Core Infrastructure team's investigation into our API and Dashboard outage on 10 October 2017.
As a payments company, we take reliability very seriously. We hope that the transparency in technical write-ups like this reflects that.
We have included a high-level summary of the incident, and a more detailed technical breakdown of what happened, our investigation, and changes we've made since.
On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard, lasting 1 hour and 50 minutes. Any requests made during that time failed, and returned an error.
The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.
This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours.
From idea to reality: containers in production at GoCardless
As developers, we work on features that our users interact with every day. When you're working on the infrastructure that underpins those features, success is silent to the outside world, and failure looks like this:
Recently, GoCardless moved to a container-based infrastructure. We were lucky, and did so silently. We think that our experiences, and the choices we made along the way, are worth sharing with the wider community. Today, we're going to talk about:
- deploying software reliably
- why you might want a container-based infrastructure
- what it takes to reliably run containers in production
We'll wrap up with a little chat about the container ecosystem as it is today, and where it might go over the next year or two.
Zero-downtime Postgres migrations - a little help
We're pleased to announce the release of ActiveRecord::SaferMigrations, a library to make changing the schema of Postgres databases safer. Interested how? Read on.
Previously, we looked at how seemingly safe schema changes in Postgres can take your site down. We ended that article with some advice, and today we want to make that advice a little easier to follow.
In a nutshell, there are some operations in Postgres that take exclusive locks on tables, causing other queries involving those tables to block until the exclusive lock is released. Typically, this sort of operation is run infrequently as part of a deployment which changes the schema of your database.
For the most part, these operations are fine as long as they execute quickly. As we explored in our last post, there's a caveat to that - if the operation has to wait to acquire its exclusive lock, all queries which arrive after it will queue up behind it.
You typically don't want to block the queries from your app for more than a few hundred milliseconds, maybe a second or two at a push. Achieving that means reading up on locking in Postgres, and being very careful with those schema-altering queries. Make a mistake and, as we found out, your next deployment stops your app from responding to requests.
In search of performance - how we shaved 200ms off every POST request
While doing some work on our Pro dashboard, we noticed that search requests were taking around 300ms. We've got some people in the team who have used Elasticsearch for much larger datasets, and they were surprised by how slow the requests were, so we decided to take a look.
Today, we'll show how that investigation led to a 200ms improvement on all internal POST requests.
What we did
We started by taking a typical search request from the app and measuring how long it took. We tried this both with Ruby's Net::HTTP and with curl on the command line. The latter was visibly faster. Timing the requests showed that the request from Ruby took around 250ms, whereas the one from curl took only 50ms.
We were confident that whatever was going on was isolated to Ruby, but we wanted to dig deeper, so we moved over to our staging environment. At that point, the problem disappeared entirely.
For a while, we were stumped. We run the same versions of Ruby and Elasticsearch in staging and production. It didn't make any sense! We took a step back, and looked over our stack, piece by piece. There was something in the middle which we hadn't thought about - HAProxy.
We quickly discovered that, due to an ongoing Ubuntu upgrade, we were using different versions of HAProxy in staging (1.4.24) and production (1.4.18). Something in those 6 patch revisions was responsible, so we turned our eyes to the commit logs. There were a few candidates, but one patch stood out in particular.
We did a custom build of HAProxy 1.4.18, with just that patch added, and saw request times drop by around 200ms. Job done.
Zero-downtime Postgres migrations - the hard parts
A few months ago, we took around 15 seconds of unexpected API downtime during a planned database migration. We're always careful about deploying schema changes, so we were surprised to see one go so badly wrong. As a payments company, the uptime of our API matters more than most - if we're not accepting requests, our merchants are losing money. It's not in our nature to leave issues like this unexplored, so naturally we set about figuring out what went wrong. This is what we found out.
We're no strangers to zero-downtime schema changes. Having the database stop responding to queries for more than a second or two isn't an option, so there's a bunch of stuff you learn early on. It's well covered in other articles, and it mostly boils down to:
- Don't rename columns/tables which are in use by the app - always copy the data and drop the old one once the app is no longer using it
- Don't rewrite a table while you have an exclusive lock on it (e.g. no ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL)
- Don't perform expensive, synchronous actions while holding an exclusive lock (e.g. adding an index without the
This advice will take you a long way. It may even be all you need to scale this part of your app. For us, it wasn't, and we learned that the hard way.
Syncing Postgres to Elasticsearch: lessons learned
At a high level, the problem is that you have your data in one place (for us, that's Postgres), and you want to keep a copy of it in Elasticsearch. This means every write you make (INSERT, UPDATE and DELETE statements) needs to be replicated to Elasticsearch. At first this sounds easy: just add some code which pushes a document to Elasticsearch after updating Postgres, and you're done.
But what happens if Elasticsearch is slow to acknowledge the update? What if Elasticsearch processes those updates out of order? How do you know Elasticsearch processed every update correctly?
We thought those issues through, and decided our indexes had to be:
- Updated asynchronously - The user's request should be delayed as little as possible.
- Eventually consistent - While it can lag behind slightly, serving stale results indefinitely isn't an option.
- Easy to rebuild - Updates can be lost before reaching Elasticsearch, and Elasticsearch itself is known to lose data under network partitions.
Updating asynchronously is the easy part. Rather than generating and indexing the Elasticsearch document inside the request cycle, we enqueue a job to resync it asynchronously. Those jobs are processed by a pool of workers, either individually or in batches - as you start processing higher volumes, batching makes more and more sense.
Leaving the JSON generation and Elasticsearch API call out of the request cycle helps keep our API response times low and predictable.
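As a minimal sketch of that flow, and not our production code, the request path only records which record needs resyncing, and a worker later drains the queue in batches. The names enqueue_resync and drain_batch, and the in-memory queue.Queue, are hypothetical stand-ins for a real job queue.

```python
import queue

# Hypothetical in-memory stand-in for a real job queue.
resync_queue = queue.Queue()


def enqueue_resync(record_id):
    """Called inside the request cycle: no JSON generation, no Elasticsearch call."""
    resync_queue.put(record_id)


def drain_batch(max_size):
    """Called by a worker: pull up to max_size pending IDs, dropping duplicates.

    A record that was updated several times only needs one resync, because the
    worker reloads its latest state from the database when it runs.
    """
    seen = set()
    batch = []
    while len(batch) < max_size:
        try:
            record_id = resync_queue.get_nowait()
        except queue.Empty:
            break
        if record_id not in seen:
            seen.add(record_id)
            batch.append(record_id)
    return batch
```

Deduplicating inside a batch is a design choice for this sketch: since the worker always reads the current row, repeated IDs would only produce redundant work.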
The easiest way to get data into Elasticsearch is via the update API, setting any fields which were changed. Unfortunately, this offers no safety when it comes to concurrent updates, so you can end up with old or corrupt data in your index.
To handle this, Elasticsearch offers a versioning system with optimistic locking. Every write to a document causes its version to increment by 1. When posting an update, you read the current version of a document, increment it and supply that as the version number in your update. If someone else has written to the document in the meantime, the update will fail. Unfortunately, it's still possible to have an older update win under this scheme. Consider a situation where users Alice and Bob make requests which update some data at the same time:
|Alice|Bob|
|-|-|
|Postgres update commits|-|
|Elasticsearch request delayed|-|
|-|Postgres update commits|
|-|Reads v2 from Elasticsearch|
|-|Writes v3 to Elasticsearch|
|Reads v3 from Elasticsearch|-|
|Writes v4 to Elasticsearch|Changes lost|
This may seem unlikely, but it isn't. If you're making a lot of updates, especially if you're doing them asynchronously, you will end up with bad data in your search cluster. Fortunately, Elasticsearch provides another way of doing versioning. Rather than letting it generate version numbers, you can set the version type to external in your requests, and provide your own version numbers. Elasticsearch will always keep the highest version of a document you send it.
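The difference between the two schemes can be illustrated with a small in-memory model. This is a toy simulation rather than the Elasticsearch client: FakeIndex and its methods are hypothetical, and only mimic the two version checks described above.

```python
class VersionConflict(Exception):
    pass


class FakeIndex:
    """Toy model of a single document in an index (hypothetical, in-memory)."""

    def __init__(self):
        self.version = 0
        self.doc = None

    def write_internal(self, doc, expected_version):
        # Optimistic locking: succeed only if nobody has written since we read.
        if expected_version != self.version:
            raise VersionConflict()
        self.version += 1
        self.doc = doc

    def write_external(self, doc, version):
        # External versioning: accept only a strictly higher version.
        if version <= self.version:
            raise VersionConflict()
        self.version = version
        self.doc = doc


def replay_internal():
    index = FakeIndex()
    # Alice commits first, but her Elasticsearch request is delayed.
    # Bob commits second and syncs straight away.
    index.write_internal("bob (newer)", expected_version=index.version)
    # Alice's delayed job reads the current version, so its stale write succeeds.
    index.write_internal("alice (older)", expected_version=index.version)
    return index.doc


def replay_external():
    index = FakeIndex()
    # Versions are supplied by the caller: Alice's update carries 100, Bob's 101.
    index.write_external("bob (newer)", version=101)
    try:
        index.write_external("alice (older)", version=100)
    except VersionConflict:
        pass  # The stale write is rejected, which is what we want.
    return index.doc
```

Replaying the same interleaving under both schemes shows why the version has to come from outside: with internally generated versions the late writer always sees a fresh version number, so nothing marks its data as stale.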
Since we're using Postgres, we already have a great version number available to us: transaction IDs. They're 64-bit integers, and they always increase on new transactions. Getting hold of the current one is as simple as:
The asynchronous job simply selects the current transaction ID, loads the relevant data from Postgres, and sends it to Elasticsearch with that ID set as the version. Since this all happens after the data is committed in Postgres, the document we send to Elasticsearch is at least as up to date as when we enqueued the asynchronous job. It can be newer (if another transaction has committed in the meantime), but that's fine. We don't need every version of every record to make it to Elasticsearch. All we care about is ending up with the newest one once all our asynchronous jobs have run.
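A rough sketch of such a job follows. It rests on several assumptions: it reads the transaction ID with Postgres's txid_current() function, it passes the version using the version and version_type parameters of the Elasticsearch index API, and the payments table, its columns, and the build_sync_request helper are all hypothetical. The function only builds the request, so it can be read without a running cluster.

```python
import json


def build_sync_request(cursor, index_name, record_id):
    """Build (but do not send) an index request for one record.

    `cursor` is any DB-API style cursor connected to Postgres. The table and
    column names are hypothetical.
    """
    # Read the version first, so the row loaded below is at least as new
    # as the version number we attach to it.
    cursor.execute("SELECT txid_current()")
    (txid,) = cursor.fetchone()

    cursor.execute("SELECT id, status FROM payments WHERE id = %s", (record_id,))
    row = cursor.fetchone()
    document = {"id": row[0], "status": row[1]}

    return {
        "method": "PUT",
        # The _doc path follows recent Elasticsearch versions (an assumption).
        "path": f"/{index_name}/_doc/{record_id}",
        "params": {"version": txid, "version_type": "external"},
        "body": json.dumps(document),
    }
```

Reading the transaction ID before the row is the important ordering here: the document can only be newer than its version number, never older.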
Rebuilding from scratch
The last thing to take care of is to handle any inconsistencies from lost updates. We do so by periodically resyncing all recently written Postgres records, and the same code allows us to easily rebuild our indexes from scratch without downtime.
With the asynchronous approach above, and without a transactional, Postgres-backed queue, it's possible to lose updates. If an app server dies after committing the transaction in Postgres, but before enqueueing the sync job, that update won't make it to Elasticsearch. Even with a transactional, Postgres-backed queue there is a chance of losing updates for other reasons (such as the data loss under network partitions mentioned earlier).
To handle the above, we decided to periodically resync all recently updated records. To do this we use Elasticsearch's Bulk API, and reindex anything which was updated after the last resync (with a small overlap to make sure no records get missed by this catch-up process).
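The catch-up step can be sketched as follows, under stated assumptions: the five-minute overlap, the record fields, and the helper names are hypothetical, and the Bulk API body is built as newline-delimited JSON without being sent anywhere.

```python
import json
from datetime import datetime, timedelta


def resync_window_start(last_resync_at, overlap=timedelta(minutes=5)):
    """Start of the window to reindex: slightly before the previous resync."""
    return last_resync_at - overlap


def records_to_resync(records, last_resync_at, overlap=timedelta(minutes=5)):
    """Select records updated since the (overlapped) start of the window."""
    start = resync_window_start(last_resync_at, overlap)
    return [r for r in records if r["updated_at"] >= start]


def bulk_body(index_name, records):
    """Build a Bulk API body: one action line, then one source line, per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": record["id"]}}))
        lines.append(json.dumps({"id": record["id"], "status": record["status"]}))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"
```

In a real system the selection would be a query against Postgres rather than a filter over a list. The overlap means some records are reindexed twice, which is harmless because reindexing is idempotent.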
The great thing about this approach is that you can use the same code to rebuild the entire index. You'll need to do this routinely when you change your mappings, and it's always nice to know you can recover from disaster.
On the point of rebuilding indexes from scratch, you'll want to do that without downtime. It's worth taking a look at how to do this with aliases right from the start. You'll avoid a bunch of pain later on.
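As an illustration of the alias approach, the sketch below builds the body for Elasticsearch's aliases endpoint (POST /_aliases), which applies its list of actions atomically. The index and alias names are hypothetical, and the helper only constructs the request body. It assumes the app always queries the alias, and that the new index has already been fully built.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints `alias` in a single call."""
    return json.dumps({
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })
```

For example, alias_swap_body("payments", "payments_v1", "payments_v2") would move searches over to the rebuilt index, after which the old one can be dropped.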
There's a lot more to building a great search experience than you can fit in one blog post. Different applications have different constraints, and it's worth thinking yours through before you start writing production code. That said, hopefully you'll find some of the techniques in this post useful.