Fear-free PostgreSQL migrations for Rails

Written by

Last editedJun 2024

As anyone who reads our engineering posts will quickly learn, we really like Postgres at GoCardless. It’s our database of first resort, and it backs our main API monolith, which means we trust it very much, even when things go wrong - as they inevitably do.

We like Postgres for the same reasons many others do: its rich feature set, its impressive development process, and the strong ecosystem of tools for working with it. There are sharp edges, however; and this article is about how we dealt with one such class of problems: schema migrations that leave the database unresponsive to routine queries.

What’s the problem?

Any long-lived, database-backed application is going to undergo change. When the app changes, sooner or later the data it deals with changes too; and then the database must change. It’s perfectly possible to issue statements directly to Postgres to make these changes (these statements are typically called data definition language (or DDL) queries, to distinguish them from routine data access), but this is error-prone and difficult to audit, so best practices are to use a migration tool instead. This groups related DDL queries into atomic steps, called migrations.

At GoCardless, we primarily use Ruby on Rails, whose ORM provides a simple interface for creating migrations. The developer writes a file:

class MyMigration < ActiveRecord::Migration[6.0]
  def change
    add_column :payments, :reference, :text, null: true
  end
end

Then you only need to run rails db:migrate (for example, at deployment time), and Rails will generate and execute the correct SQL:

ALTER TABLE "payments" ADD COLUMN "reference" TEXT;

The migration, meanwhile, is in source control so is easy to review and manage. So what’s the problem? Here, we have to take a deeper dive into what Postgres is actually doing when it changes the schema.

Migrations: how do they work?

One of Postgres’s main jobs is to keep your data consistent. Unsurprisingly, schema changes pose serious challenges here. It’s like changing a car tyre while it barrels down the motorway.

ALTER TABLE ADD TYRE

If a column is added in the middle of a SELECT statement's execution, for example, we don’t want some rows in the result set to have the new column and others not.

Similarly, if a column is dropped between the issuance of an UPDATE that writes to it and the actual execution of that UPDATE, we certainly don’t want to write that obsolete value to some undefined location in memory or on disk!

One of the most important mechanisms Postgres has for avoiding these problems is locking. Before making a change to a table, the Postgres client will acquire a lock on that table, which will forbid certain other actions on it. For most DDL changes, an ACCESS EXCLUSIVE lock will be required, a very strict lock which prevents all reads and all writes for the duration of the query.

You can already see one potential footgun here - what if the schema change takes a long time? Before Postgres 11, for example, adding a new column with a default value meant writing the new value to every row in the table, which for a very large table could take minutes. Changing the data type of a column can still require a full rewrite (depending on the types), which is a slow business.

But things are actually even hairier than that.

Suppose that, when we run our migration, we can't acquire the lock we need immediately. This is a very common scenario for ACCESS EXCLUSIVE locks, since all queries conflict with it. Let's say that there's a big batch job, running in a transaction, inserting millions of rows into a table in a process that takes hours. These queries will take a ROW EXCLUSIVE lock, which does not prevent any other transaction from reading or writing to the table, but does prevent schema changes.

When we issue our DDL command, however, our query then gets put into a queue. Any other queries that come in will be behind us in the queue, and cannot execute unless they do not conflict with any query ahead of them.

Depending on how long the big batch job has left to run, this can be very bad news. Though we only actually hold the ACCESS EXCLUSIVE lock for a fraction of a second, we make the table completely unavailable for the length of time we have to wait for that lock as well. Out of the box, Postgres will never time out a query. This way, even a trivial change, like adding a column, can cause your app to fall over completely.

Queries not blocked by slow transaction

Figure 1: queries not blocked by slow transaction

Queries blocked briefly by migration

Figure 2: Queries blocked briefly by migration

Queries blocked for a long time by slow transaction and migration

Figure 3: Queries blocked for a long time by slow transaction and migration

Now, recall that we’re using Rails’s built-in migration system, which makes it easier to express your changes than ‘raw’ DDL queries. Too easy, perhaps? Consider the following, harmless-looking migration:

class MyMigration < ActiveRecord::Migration[5.2]
  def change
    add_reference :payments, :customer, foreign_key: true
  end
end

Relational data being what it is, this is a pretty routine thing to do. You’re linking two entities, in strict accordance with Rails conventions. What could go wrong? The problem is a little clearer if we look at the generated SQL (simplified somewhat):

BEGIN;
ALTER TABLE payments ADD COLUMN customer_id INTEGER;
ALTER TABLE payments ADD CONSTRAINT customers_fk
  FOREIGN KEY (customer_id) REFERENCES customers (id);
COMMIT;

The problem is that adding that constraint actually takes an ACCESS EXCLUSIVE lock on both payments and customers. And all this happens in a transaction, so the locks will be held continuously for as long as it takes to apply both statements. In particular, payments will be unreadable by applications for the time taken to lock it, plus the time taken to lock customers, plus the time taken to execute the schema changes (and on a large table, adding that constraint can be very slow, as it happens). This is a recipe for pain, and at GoCardless, we have been bitten by these kinds of scenarios more and more as the sheer size of our database grows by orders of magnitude.

These problems, in the end, are abstraction leaks, which means they rub against the user's understanding of the system; in order to avoid them, you need to have arcane knowledge of how Postgres handles concurrency, which - interesting as it is - you will probably only get by having outages of this sort. Surely we can do better to prevent these situations than merely hoping someone on our team has the correct scar tissue to know the answer?

Solving the problem

We've previously published an open-source gem called activerecord-safer_migrations, which deals with one part of the problem by restricting lock and statement timeouts to low defaults. This prevents accidentally making tables unavailable indefinitely, which is a start; but sometimes things take a while, and you genuinely need to raise the timeouts, or you still stumble upon a combination of locks that causes an availability issue.

So today we're introducing a more ambitious solution: Nandi.

Nandi is an alternative DSL for writing migrations to ActiveRecord::Migration. It is designed to allow only availability-safe migrations to be written.

To get started, add the gem to your bundle:

gem "nandi"

You will now have a handful of new generators at your disposal. Let's start by generating a simple migration:

rails g nandi:migration add_widgets_table
#=> create  db/safe_migrations/20191217170056_add_widgets_table.rb

This will generate an empty migrations file. Adding a table is a familiar experience if you've interacted with Rails before:

# db/safe_migrations/20191217170056_add_widgets_table.rb

class AddWidgetsTable < Nandi::Migration
  def up
    create_table :widgets do |t|
      t.text :name
      t.timestamps
    end
  end

  def down
    drop_table :widgets
  end
end

Now we're done with that, we run another generator, to compile our migration to a plain old ActiveRecord migration:

rails g nandi:compile
#=> create  db/migrate/20191217170056_add_widgets_table.rb

class AddWidgetsTable < ActiveRecord::Migration[5.2]

  set_lock_timeout(5000)
  set_statement_timeout(1500)


  def up
    create_table :widgets do |t|
      t.column "name", :text
      t.timestamps
end

  end
  def down
    drop_table :widgets

  end
end

(As you can see, the formatting of this file is pretty awful - Nandi is still a young project and we have prioritised correct output over attractive output. Think of these files as equivalent to minified JS bundles or suchlike.)

Nandi has not intervened much here - only to borrow the timeout macros from activerecord-safer_migrations, to make sure we exit if something takes forever for some reason. In this case, it really shouldn't.

How about if we try to reproduce one of our troublesome items from above?

class MyMigration < Nandi::Migration
  def change
    add_reference :payments, :customer, foreign_key: true
  end
end

If you shuddered in recognition when reading that, then you know what the safe way to do this is: create the column without foreign_key: true; then use PostgreSQL's ALTER TABLE ADD CONSTRAINT with the NOT VALID flag, which will update the catalogs but not actually attempt to verify that the data in the column are currently valid; then, finally, validate the constraint in a separate transaction. Phew!

In Nandi, this is much easier: just run one generator command, and - assuming you're following Rails conventions - you're done. Three migration files will be generated, for creating the column, creating the constraint and validating the constraint. All for less effort than writing the unsafe Rails migration above!

rails g nandi:foreign_key payments customers

That's just a couple of the things Nandi helps you with: take a look at the project README if you want to know more about using and configuring it. We've been using it in production for a long time now, and have pretty much covered our usage of Postgres - we've written nearly 600 migrations with it. We're proud to open it up to the world, and excited to see what people do with it; and, of course, issues and pull requests are very much welcome!

Appendix: why not use ... ?

Keen followers of the Rails ecosystem may know that this is not the only gem that plays in this space. There are also Andrew Kane's strong_migrations, Braintree's pg_ha_migrations and safe-pg-migrations from Doctolib, to name only three that we have used and evaluated. There's no one right way to solve this problem, and all these gems are the product of real operational experience, and represent defensible sets of tradeoffs.

We used strong_migrations for a long time, but its functionality is limited to scanning the statements issued and throwing if they're obviously unsafe. It's a good option if you want a relatively unintrusive way to get a bit of extra security, but we wanted more.

Both pg_ha_migrations and safe-pg-migrations take the view that it's necessary to disable transactions to ensure safety. This is a justifiable view, since it reduces the amount of time you hold on to locks; also, there are some changes that can only be done safely outside a transaction (principally creating and dropping indexes). However, it also means that migrations are no longer retryable; if you get through half the work in one and then crash, an operator must psql into the database to manually fiddle around with the schema. This is stressful, error-prone and insecure, and minimising this sort of manual remediation was also a goal for us.

We still need to make sure you don't do anything too crazy in a transaction, however, which is why we went for the 'compiler' approach - we can do static analysis and validation of your entire migration before you run any of it, verifying that (for example) you are not looking to access-exclusive lock more than one table.