Service outage on 6 April 2018: our response

On Friday 6 April at 16:04 BST, we experienced a service outage. For a period of 27 minutes, the GoCardless API and Dashboard were unavailable, and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.

Submissions to the banks to collect payments and pay out collected funds were unaffected.

We’d like to apologise for any inconvenience caused to you and your customers. As a payments company, we know how important reliability is to our customers, and we take incidents like this extremely seriously. We’re completing a detailed review of the incident and taking the required steps to improve our technology and processes to ensure this doesn’t happen again.

What happened?

All of our most critical data is stored in a PostgreSQL database. When we make changes to certain tables in that database (i.e. create or update a row), we use a trigger to keep a separate record of exactly what changed. We use this “update log” to enable data analysis tasks like fraud detection.

Each entry in the log has an automatically-generated sequential ID (1, 2, 3, and so on). This ID is stored using the serial type in the database, which means it can be a value between 1 and 2147483647.
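To make the limit concrete: PostgreSQL's serial type is backed by a signed 32-bit integer, so the largest ID it can hold is 2^31 - 1. The short Python sketch below works that out; the helper name fits_in_serial is made up for illustration and is not part of our codebase.

```python
# A signed 32-bit integer keeps one bit for the sign, leaving 31 bits
# for the magnitude. The largest positive value is therefore 2**31 - 1.
INT4_BITS = 32
INT4_MAX = 2 ** (INT4_BITS - 1) - 1


def fits_in_serial(value: int) -> bool:
    """Return True if value is a usable ID for a serial (32-bit) column.

    Hypothetical helper for illustration; sequences start at 1 by default.
    """
    return 1 <= value <= INT4_MAX


print(INT4_MAX)
print(fits_in_serial(INT4_MAX))      # the last ID that can be stored
print(fits_in_serial(INT4_MAX + 1))  # one past the limit
```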

At 16:04 on Friday 6 April, we hit this upper limit, meaning we could no longer write to the “update log” table. In PostgreSQL, when a trigger fails, the original database write that triggered it fails too. This caused requests to our application to fail, returning a 500 Internal Server Error.
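The way a failing trigger takes the original write down with it can be reproduced in miniature. The sketch below is only an illustration, not our production setup: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the table and trigger names (payments, update_log, log_payment) are hypothetical, and because SQLite has no 32-bit ID limit the trigger emulates one with an explicit RAISE(ABORT, ...).

```python
import sqlite3

INT4_MAX = 2**31 - 1  # largest ID a 32-bit serial column can hold

conn = sqlite3.connect(":memory:")
conn.executescript(
    f"""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL);
    CREATE TABLE update_log (id INTEGER PRIMARY KEY, payment_id INTEGER NOT NULL);

    -- Hypothetical trigger: log every new payment, but fail once the log
    -- has used the last 32-bit ID (SQLite itself has no such limit).
    CREATE TRIGGER log_payment AFTER INSERT ON payments
    BEGIN
        SELECT RAISE(ABORT, 'update_log id out of range')
        WHERE (SELECT COALESCE(MAX(id), 0) FROM update_log) >= {INT4_MAX};
        INSERT INTO update_log (payment_id) VALUES (NEW.id);
    END;
    """
)


def create_payment(amount: int) -> bool:
    """Insert a payment; return False if the write or its trigger failed."""
    try:
        conn.execute("INSERT INTO payments (amount) VALUES (?)", (amount,))
        return True
    except sqlite3.DatabaseError:
        return False


def count(table: str) -> int:
    """Return the number of rows in one of the two tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


first_ok = create_payment(1000)  # logged normally

# Simulate a log that has already handed out the last 32-bit ID.
conn.execute("INSERT INTO update_log (id, payment_id) VALUES (?, 0)", (INT4_MAX,))

second_ok = create_payment(2500)  # the trigger aborts, so the payment is not stored

print(first_ok, second_ok, count("payments"), count("update_log"))
```

In this sketch the second insert fails and the payments table still holds a single row: the application's write is rejected even though nothing is wrong with the payments table itself.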

This issue also affected API requests (including those from the Dashboard) which only appear to read data (e.g. listing your customers or fetching a specific payment), since authenticated requests update access tokens to record when you last accessed the API.

How we responded

Having identified the root cause of the problem, we disabled the trigger which sends writes to the “update log”, thereby restoring service.

We’ve resolved this problem for the future by storing the IDs for our “update log” using the bigserial type, which allows values up to 9223372036854775807. This is effectively unlimited, and can be expected to provide enough IDs to last millions of years.
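As a rough sanity check on that estimate, the sketch below computes how long each ID range lasts at a fixed write rate. The rate of 1,000 log entries per second is an assumption picked purely for illustration, not a measured figure from our systems.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

INT4_MAX = 2**31 - 1  # upper limit of serial
INT8_MAX = 2**63 - 1  # upper limit of bigserial

# Hypothetical write rate, chosen only to illustrate the scale.
ASSUMED_IDS_PER_SECOND = 1_000


def years_until_exhausted(max_id: int, ids_per_second: float) -> float:
    """Years until a sequence starting at 1 reaches max_id at a constant rate."""
    return max_id / ids_per_second / SECONDS_PER_YEAR


serial_years = years_until_exhausted(INT4_MAX, ASSUMED_IDS_PER_SECOND)
bigserial_years = years_until_exhausted(INT8_MAX, ASSUMED_IDS_PER_SECOND)

print(f"serial:    {serial_years * 365.25:.1f} days")
print(f"bigserial: {bigserial_years:,.0f} years")
```

At this assumed rate a serial column would be exhausted in under a month, while a bigserial column would last on the order of hundreds of millions of years.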

Next steps

In the next few days, we’ll be running a full post-mortem to better understand:

  • how we can reduce the chance of similar errors occurring in future; and
  • how we can respond more effectively when things do go wrong.

We’ll publish the results of this post-mortem in a follow-up post within the next 4 weeks.