Zero-downtime Postgres migrations - the hard parts

Written by

Last editedJun 2024

A few months ago, we took around 15 seconds of unexpected API downtime during a planned database migration. We're always careful about deploying schema changes, so we were surprised to see one go so badly wrong. As a payments company, the uptime of our API matters more than most - if we're not accepting requests, our merchants are losing money. It's not in our nature to leave issues like this unexplored, so naturally we set about figuring out what went wrong. This is what we found out.

Background

We're no strangers to zero-downtime schema changes. Having the database stop responding to queries for more than a second or two isn't an option, so there's a bunch of stuff you learn early on. It's well covered in other articles¹, and it mostly boils down to:

Don't rename columns/tables which are in use by the app - always copy the data and drop the old one once the app is no longer using it
Don't rewrite a table while you have an exclusive lock on it (e.g. no ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL)
Don't perform expensive, synchronous actions while holding an exclusive lock (e.g. adding an index without the CONCURRENTLY flag)

This advice will take you a long way. It may even be all you need to scale this part of your app. For us, it wasn't, and we learned that the hard way.

The migration

Jump back to late January. At the time, we were building invoicing for our Pro product. We'd been through a couple of iterations, and settled on model/table names. We'd already deployed an earlier revision, so we had to rename the tables. That wasn't a problem though - the tables were empty, and there was no code depending on them in production.

The foreign key constraints on those tables had out of date names after the rename, so we decided to drop and recreate them². Again, we weren't worried. The tables were empty, so there would be no long-held lock taken to validate the constraints.

So what happened?

We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out. These lasted for around 15 seconds, at which point the migration went through and our API came back up. After a few minutes collecting our thoughts, we started digging into what went wrong.

First, we re-ran the migrations against a backup of the database from earlier that day. They went through in a few hundred milliseconds. From there we turned back to the internet for an answer.

Information was scarce. We found lots of blog posts giving the advice from above, but no clues on what happened to us. Eventually, we stumbled on an old thread on the Postgres mailing list, which sounded exactly like the situation we'd ran into. We kept looking, and found a blog post which went into more depth³.

In order to add a foreign key constraint, Postgres takes AccessExclusive locks on both the table with the constraint⁴, and the one it references while it adds the triggers which enforce the constraint. When a lock can't be acquired because of a lock held by another transaction, it goes into a queue. Any locks that conflict with the queued lock will queue up behind it. As AccessExclusive locks conflict with every other type of lock, having one sat in the queue blocks all other operations⁵ on that table.

Here's a worked example using 3 concurrent transactions, started in order:

-- Transaction 1
SELECT DISTINCT(email)     -- Takes an AccessShare lock on "parent"
FROM parent;               -- for duration of slow query.

-- Transaction 2
ALTER TABLE child          -- Needs an AccessExclusive lock on
ADD CONSTRAINT parent_fk   -- "child" /and/ "parent". AccessExclusive
  FOREIGN KEY (parent_id)  -- conflicts with AccessShare, so sits in
  REFERENCES parent        -- a queue.
  NOT VALID;

-- Transaction 3
SELECT *                   -- Normal query also takes an AccessShare,
FROM parent                -- which conflicts with AccessExclusive
WHERE id = 123;            -- so goes to back of queue, and hangs.

While the tables we were adding the constraints to were unused by the app code at that point, the tables they referenced were some of the most heavily used. An unfortunately timed, long-running read query on the parent table collided with the migration which added the foreign key constraint.

The ALTER TABLE statement itself was fast to execute, but the effect of it waiting for an AccessExclusive lock on the referenced table caused the downtime - read/write queries issued by calls to our API piled up behind it, and clients timed out.

Avoiding downtime

Applications vary too much for there to be a "one size fits all" solution to this problem, but there are a few good places to start:

Eliminate long-running queries/transactions from your application.⁶ Run analytics queries against an asynchronously updated replica.
- It's worth setting log_min_duration_statement and log_lock_waits to find these issues in your app before they turn into downtime.
Set lock_timeout in your migration scripts to a pause your app can tolerate. It's better to abort a deploy than take your application down.
Split your schema changes up.
- Problems become easier to diagnose.
- Transactions around DDL are shorter, so locks aren't held so long.
Keep Postgres up to date. The locking code is improved with every release.

Whether this is worth doing comes down to the type of project you're working on. Some sites get by just fine putting up a maintenance page for the 30 seconds it takes to deploy. If that's not an option for you, then hopefully the advice in this post will help you avoid unexpected downtime one day.

Braintree have a really good post on this.↩
At the time, partly as an artefact of using Rails migrations which don't include a method to do it, we didn't realise that Postgres had support for renaming constraints with ALTER TABLE. Using this avoids the AccessExclusive lock on the table being referenced, but still takes one on the referencing table. Either way, we want to be able to add new foreign keys, not just rename the ones we have.↩
It's also worth noting that the Postgres documentation and source code are extremely high quality. Once we had an idea of what was happening, we went straight to the locking code for ALTER TABLE statements.↩
This still applies if you add the constraint with the NOT VALID flag. Postgres will briefly hold an AccessExclusive lock against both tables while it adds constraint triggers to them. 9.4 does make the VALIDATE CONSTRAINT step take a weaker ShareUpdateExclusive lock though, which makes it possible to validate existing data in large tables without downtime.↩
SELECT statements take an AccessShare lock.↩
If developers have access to a console where they can run queries against the production database, they need to be extremely cautious. BEGIN; SELECT * FROM some_table WHERE id = 123; /* Developer goes to make a cup of tea */ will cause downtime if someone deploys a schema change for some_table.↩