Zero-downtime Postgres migrations - a little help
By Chris SinjakliDec 20153 min read
We're pleased to announce the release of
ActiveRecord::SaferMigrations, a library to make changing the schema of Postgres databases safer. Interested how? Read on.
Previously, we looked at how seemingly safe schema changes in Postgres can take your site down. We ended that article with some advice, and today we want to make that advice a little easier to follow.
In a nutshell, there are some operations in Postgres that take exclusive locks on tables, causing other queries involving those tables to block until the exclusive lock is released1. Typically, this sort of operation is run infrequently as part of a deployment which changes the schema of your database.
For the most part, these operations are fine as long as they execute quickly. As we explored in our last post, there's a caveat to that - if the operation has to wait to acquire its exclusive lock, all queries which arrive after it will queue up behind it.
You typically don't want to block the queries from your app for more than a few hundred milliseconds, maybe a second or two at a push2. Achieving that means reading up on locking in Postgres, and being very careful with those schema-altering queries. Make a mistake and, as we found out, your next deployment stops your app from responding to requests.
How we reacted
In the months following our lock-related outage, we became extra careful about every schema migration we ran.
We spent longer in code reviews. We pushed most schema changes outside of high traffic hours. We got more reliant on a few people's Postgres knowledge to greenlight even simple schema changes.
This isn't how we like running things. We'd ended up with clunky process, and it was hampering our ability to build the product, so we started looking for a way out.
A better solution
Right at the end of the last blog post, we mentioned a Postgres setting called
lock_timeout, which limits the amount of time Postgres will wait for a query to acquire a lock (and consequently how long other queries will queue up behind it).
As well as
statement_timeout, which sets a cap on how long an individual statement can hold a lock for. For schema migrations, this protects you from queries that take an exclusive lock on the table and hold it for a long time while updating rows (e.g.
ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL).
We decided that we wanted
statement_timeout to automatically be set to reasonable values for every migration3.
We wrote a library which does just that for ActiveRecord (the Rails ORM) migrations being run against a Postgres database4. Today, we're open sourcing it.
ActiveRecord::SaferMigrations automatically sets
statement_timeout before running migrations. The library comes with default values for both settings, but you can change them if your app has different constraints.
What this means is that each statement in your migration will spend no longer than:
750ms waiting to acquire a lock
1500ms running a single query
The default values can be changed globally or overridden per-migration, and we have a simple rule at GoCardless - if you want to override them then you need to explain in your pull request why it's safe to do so5.
If either of those limits are hit, Postgres will return an error, and you'll have to retry your migration later. Inconvenient, but delaying the deployment of a new feature beats taking the service down!
Of course, there's a 'but'
That's all good, but we're left with one problem - transactions. Let's start with a couple of facts about transactions:
When a lock is acquired inside a transaction, it is held until that transaction either commits or rolls back.
By default, ActiveRecord wraps every migration in transaction6.
You can see where this is going. Every statement in a transaction might run faster than the limits we impose, but if the transaction is sufficiently large, the locks acquired by the first statement could be held for much longer than the service can tolerate.
There's no easy way out of this one, no
transaction_timeout for us to fall back on. You've just got to make sure you keep your transactions short.
We've found the easiest solution is to split large schema migrations into multiple files so that ActiveRecord runs each part in a separate transaction. This means it's easy to retry deployments which fail7 and the risk of blocking app queries for too long is minimal.
What happened next
We've been running all our migrations this way for about 2 months now, and it's gone well. On a few occasions, the timeouts have prevented locking issues from taking our API down. Having sensible defaults and letting Postgres enforce them has made routine schema changes feel routine again.
It's a small library, but it's had a big impact for our team. We hope it'll do the same for you.
Everyone's constraints are different. For some products, it's fine to have small periods of unavailability. For a payments API, that's costly. ↩
What's reasonable for us might not be reasonable for you. Maybe a second is too long a pause, or maybe you're fine with 5. ↩
While our library is specific to ActiveRecord, what it does isn't. You can use the same principle with other systems which manage schema migrations. ↩
It's fairly common to override
statement_timeout. There are some operations (e.g.
CREATE INDEX CONCURRENTLY) which are safe, but not fast. ↩
You can disable this behaviour with
disable_ddl_transaction!, but often you don't want to. If a non-transactional migration bails out half way through, your schema is left in an intermediate state, and the migration won't apply unless you go in and revert those partial changes. ↩
disable_ddl_transaction!, this approach can't leave you with a partially-applied migration. ↩