Debugging the Postgres query planner

At GoCardless, Postgres is our database of choice. Not only does it power our API, it is also used extensively in a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.

One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.

We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level up our understanding of the Postgres query planner. We detail our investigation, then conclude with the actions we took to prevent this happening again.

Continue reading...

Incident review: API and Dashboard outage on 10 October 2017

This post represents the collective work of our Core Infrastructure team's investigation into our API and Dashboard outage on 10 October 2017.

As a payments company, we take reliability very seriously. We hope that the transparency in technical write-ups like this reflects that.

We have included a high-level summary of the incident, and a more detailed technical breakdown of what happened, our investigation, and changes we've made since.


On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard, lasting 1 hour and 50 minutes. Every request made during that time failed and returned an error.

The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.

This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours.

Continue reading...

All fun and games until you start with GameDays

As a payments company, our APIs need to have as close to 100% availability as possible. We therefore need to ensure we’re ready for whatever comes our way: from losing a server without bringing the API down, to knowing how to react if a company laptop is compromised.

To accomplish this we run GameDay exercises. What you will read below is our version of one. We hope that sharing how we run them gives you a starting point for running your own first GameDay.

Continue reading...

(Re)designing the DevOps interview process

Interviewing is hard. Both the company and the candidate have to make an incredibly important decision based on just a few hours’ worth of data, so it’s worth investing the time to make those precious hours as valuable as possible.

We recently made some changes to our DevOps interview process, with the aim of making it fairer, better aligned with the role requirements, and more representative of real work.

We started by defining the basics of our DevOps roles. What makes someone successful in this role and team? What are the skills and experience that we're looking for at different levels of the role?

It was important that the process would work for candidates with varying experience levels, so it needed to be flexible enough, and clear enough, to assess skills at each of these levels.

The skills we’re looking for fall into three broad categories: existing technical knowledge (e.g., programming languages), competency-based skills (e.g., problem solving), and personal characteristics (e.g., passion for the role, teamwork and communication skills). After defining these skills, we mapped out how we would assess them at each stage of the interview process.

Continue reading...

From idea to reality: containers in production at GoCardless

As developers, we work on features that our users interact with every day. When you're working on the infrastructure that underpins those features, success is silent to the outside world, and failure looks like this:

Recently, GoCardless moved to a container-based infrastructure. We were lucky, and did so silently. We think that our experiences, and the choices we made along the way, are worth sharing with the wider community. Today, we're going to talk about:

  • deploying software reliably
  • why you might want a container-based infrastructure
  • what it takes to reliably run containers in production

We'll wrap up with a little chat about the container ecosystem as it is today, and where it might go over the next year or two.

Continue reading...

Zero-downtime Postgres migrations - a little help

We're pleased to announce the release of ActiveRecord::SaferMigrations, a library to make changing the schema of Postgres databases safer. Interested in how? Read on.

Previously, we looked at how seemingly safe schema changes in Postgres can take your site down. We ended that article with some advice, and today we want to make that advice a little easier to follow.

A recap

In a nutshell, there are some operations in Postgres that take exclusive locks on tables, causing other queries involving those tables to block until the exclusive lock is released. Typically, this sort of operation is run infrequently as part of a deployment which changes the schema of your database.

For the most part, these operations are fine as long as they execute quickly. As we explored in our last post, there's a caveat to that - if the operation has to wait to acquire its exclusive lock, all queries which arrive after it will queue up behind it.
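The queueing behaviour described above can be sketched with a deliberately simplified model. This is an illustration only, not Postgres' implementation: it assumes just two lock modes (shared and exclusive) where Postgres has several, and the names `Request` and `blocked_requests` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    exclusive: bool  # True for a schema change, False for an ordinary query


def conflicts(a, b):
    # In this two-mode model, two requests conflict if either is exclusive.
    return a.exclusive or b.exclusive


def blocked_requests(held, queue):
    """Return the names of queued requests that have to wait.

    A request waits if it conflicts with a lock that is already held,
    or with any request that is already waiting ahead of it.
    """
    granted = list(held)
    waiting = []
    for req in queue:
        if any(conflicts(req, other) for other in granted + waiting):
            waiting.append(req)
        else:
            granted.append(req)
    return [r.name for r in waiting]


# A long-running query holds a shared lock on the table.
held = [Request("long_select", exclusive=False)]

# A schema change arrives, followed by two ordinary queries.
queue = [
    Request("alter_table", exclusive=True),
    Request("select_a", exclusive=False),
    Request("select_b", exclusive=False),
]

print(blocked_requests(held, queue))      # ['alter_table', 'select_a', 'select_b']
print(blocked_requests(held, queue[1:]))  # []
```

Without the schema change in the queue, the two ordinary queries run straight away alongside the long-running one. With it, they wait even though the exclusive lock has not been granted yet.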

You typically don't want to block the queries from your app for more than a few hundred milliseconds, maybe a second or two at a push. Achieving that means reading up on locking in Postgres, and being very careful with those schema-altering queries. Make a mistake and, as we found out, your next deployment stops your app from responding to requests.

Continue reading...

The Troubleshooting Tales: issues scaling Postgres connections

After making some changes to our Postgres setup, we started noticing occasional errors coming from deep within ActiveRecord (Rails’ ORM). This post details the process we went through to determine the cause of the issue, and what we did to fix it.

The situation

First, it’s important to understand the changes we made to our Postgres setup. Postgres connections are relatively slow to establish (particularly when using SSL), and on a properly-tuned server they use a significant amount of memory. The amount of memory used limits the number of connections you can feasibly have open at once on a single server, and the slow establishment encourages clients to maintain long-lived connections. Due to these constraints, we recently hit the limit of connections our server could handle, preventing us from spinning up more application servers. To get around this problem, the common advice is to use connection pooling software such as PgBouncer to share a small number of Postgres connections between a larger number of client (application) connections.

When we first deployed PgBouncer, we were running it in “session pooling” mode, which assigns a dedicated Postgres server connection to each connected client. However, with this setup, if you have a large number of idle clients connected to PgBouncer you’ll have to maintain an equal number of (expensive) idle connections on your Postgres server. To combat this, there is an alternative mode: “transaction pooling”, which only uses a Postgres server connection for the duration of each transaction. The downside of transaction pooling is that you can’t use any session-level features (e.g. prepared statements, session-level advisory locks). After combing through our apps to remove all usages of session-level features, we enabled transaction pooling.
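To make the difference between the two modes concrete, here is a rough back-of-the-envelope model. It is a sketch under simplifying assumptions: it ignores PgBouncer's pool size limits and any queueing, and the function name `server_connections_needed` is hypothetical rather than part of PgBouncer.

```python
def server_connections_needed(clients_in_transaction, mode):
    """Estimate how many Postgres server connections are in use.

    clients_in_transaction: one boolean per connected client, True if that
    client is in the middle of a transaction right now.
    mode: "session" or "transaction".
    """
    if mode == "session":
        # Every connected client holds a dedicated server connection,
        # whether it is idle or not.
        return len(clients_in_transaction)
    if mode == "transaction":
        # A server connection is only held for the duration of a transaction.
        return sum(1 for busy in clients_in_transaction if busy)
    raise ValueError(f"unknown pooling mode: {mode}")


# 100 connected clients, of which 10 are mid-transaction.
clients = [True] * 10 + [False] * 90

print(server_connections_needed(clients, "session"))      # 100
print(server_connections_needed(clients, "transaction"))  # 10
```

The numbers are made up for illustration, but they show why a large population of mostly idle clients is expensive under session pooling and cheap under transaction pooling.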

Shortly after making the switch, we started seeing (relatively infrequent) exceptions coming from deep within ActiveRecord: NoMethodError: undefined method 'fields' for nil:NilClass. We also noticed that instances of this exception appeared to be correlated with INSERT queries that violated unique constraints.

Continue reading...

In search of performance - how we shaved 200ms off every POST request

While doing some work on our Pro dashboard, we noticed that search requests were taking around 300ms. We've got some people in the team who have used Elasticsearch for much larger datasets, and they were surprised by how slow the requests were, so we decided to take a look.

Today, we'll show how that investigation led to a 200ms improvement on all internal POST requests.

What we did

We started by taking a typical search request from the app and measuring how long it took. We tried this with both Ruby's Net::HTTP and from the command line using curl. The latter was visibly faster. Timing the requests showed that the request from Ruby took around 250ms, whereas the one from curl took only 50ms.
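A comparison like this is easy to reproduce with a small timing harness. The sketch below is in Python rather than the Ruby and curl used in the investigation, and `median_latency_ms` and `fake_request` are hypothetical names; in practice you would pass in a function that performs the real HTTP request with the client under test.

```python
import statistics
import time


def median_latency_ms(request_fn, runs=20):
    """Call request_fn `runs` times and return the median duration in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        request_fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def fake_request():
    # Stand-in for a real search request; sleeps for roughly 10ms.
    time.sleep(0.01)


print(f"median: {median_latency_ms(fake_request, runs=5):.1f}ms")
```

Taking the median over several runs keeps one slow outlier from skewing the comparison between two clients.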

We were confident that whatever was going on was isolated to Ruby, but we wanted to dig deeper, so we moved over to our staging environment. At that point, the problem disappeared entirely.

For a while, we were stumped. We run the same versions of Ruby and Elasticsearch in staging and production. It didn't make any sense! We took a step back, and looked over our stack, piece by piece. There was something in the middle which we hadn't thought about - HAProxy.

We quickly discovered that, due to an ongoing Ubuntu upgrade, we were using different versions of HAProxy in staging (1.4.24) and production (1.4.18). Something in those 6 patch revisions was responsible, so we turned our eyes to the commit logs. There were a few candidates, but one patch stood out in particular.

We did a custom build of HAProxy 1.4.18, with just that patch added, and saw request times drop by around 200ms. Job done.

Continue reading...

Zero-downtime Postgres migrations - the hard parts

A few months ago, we took around 15 seconds of unexpected API downtime during a planned database migration. We're always careful about deploying schema changes, so we were surprised to see one go so badly wrong. As a payments company, the uptime of our API matters more than most - if we're not accepting requests, our merchants are losing money. It's not in our nature to leave issues like this unexplored, so naturally we set about figuring out what went wrong. This is what we found out.


We're no strangers to zero-downtime schema changes. Having the database stop responding to queries for more than a second or two isn't an option, so there's a bunch of stuff you learn early on. It's well covered in other articles, and it mostly boils down to:

  • Don't rename columns/tables which are in use by the app - always copy the data and drop the old one once the app is no longer using it
  • Don't rewrite a table while you have an exclusive lock on it (e.g. no ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL)
  • Don't perform expensive, synchronous actions while holding an exclusive lock (e.g. adding an index without the CONCURRENTLY flag)

This advice will take you a long way. It may even be all you need to scale this part of your app. For us, it wasn't, and we learned that the hard way.

Continue reading...
