Service outage on 6 April 2018: post-mortem results

On Friday 6th April at 16:04 BST, we experienced a service outage. For a period of 27 minutes, the GoCardless API and Dashboard were unavailable, and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.

What happened

The outage was caused by a misconfiguration of our database, which stopped us making changes to the data we have stored.

For those of you wanting the technical details, the error was due to us reaching a limit on the ID given to each entry in our “update log”. Every time we make changes to certain tables in our database (i.e. create or update a row), we make a record of the changes in an “update log” to provide an audit trail. Each entry in the log has an automatically-generated sequential ID (1, 2, 3, and so on). Our database configuration meant that the maximum possible value for this ID was 2,147,483,647 (the largest value a signed 32-bit integer can hold).

We hit this limit, so we were unable to write to the “update log”, which blocked writes to the database. For more details, see our previous blog post.
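To make the failure mode concrete, here is a minimal sketch in Python of a sequential ID generator running out of values. It is an illustration only, not our production code: the `Sequence` class, the `update_row` function and the `payments` table name are hypothetical, and it assumes the ID is stored as a signed 32-bit integer.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit integer


class SequenceExhausted(Exception):
    """Raised when the sequence has no more IDs to hand out."""


class Sequence:
    """Hypothetical stand-in for a database's auto-incrementing ID generator."""

    def __init__(self, start=1, max_value=INT32_MAX):
        self.next_value = start
        self.max_value = max_value

    def next_id(self):
        if self.next_value > self.max_value:
            raise SequenceExhausted(
                f"sequence reached its maximum ({self.max_value})"
            )
        value = self.next_value
        self.next_value += 1
        return value


def update_row(table, row, update_log_ids):
    """Record a row change; it only succeeds if a log entry ID is available."""
    log_id = update_log_ids.next_id()  # raises once the sequence is exhausted
    return {"log_id": log_id, "table": table, "row": row}


# Start one ID short of the limit so the failure shows up quickly.
ids = Sequence(start=INT32_MAX - 1)
print(update_row("payments", {"id": "a"}, ids)["log_id"])  # 2147483646
print(update_row("payments", {"id": "b"}, ids)["log_id"])  # 2147483647
try:
    update_row("payments", {"id": "c"}, ids)
except SequenceExhausted as exc:
    print("write blocked:", exc)
```

Because every change to a logged table needs a new log entry, the third update fails even though nothing is wrong with the row being written.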

Our response

As a payments company, we know how critical our service is to our customers, and we take incidents like this extremely seriously.

As such, once we’ve responded to an incident and restored service for our users, we run “post-mortems” to make sure we understand:

  • exactly what went wrong
  • how we can reduce the chance of similar incidents occurring in future
  • how we can respond more effectively when things do go wrong

Following the post-mortem for this incident, we’ve already taken a number of steps to improve our systems and processes for the future:

  • We’ve improved the robustness of our automated alerting, so that we’re informed within seconds if our service goes down
  • We’ve improved the reliability of our code for turning off the “update log”, allowing us to recover more quickly if we see a similar failure in the future
  • We’ve made our code more resilient, so “read-only” requests to the API will still work even if we’re unable to write to the database
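The last of these changes can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of our actual implementation: the `Store` class, the `handle` function, the status codes and the `mandate_1` key are all hypothetical. The idea is that a read never touches the write path, so it keeps working when writes fail.

```python
class WriteUnavailable(Exception):
    """Raised when the database cannot accept writes."""


class Store:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})
        self.writable = True  # flip to False to simulate blocked writes

    def read(self, key):
        return self.rows.get(key)

    def write(self, key, value):
        if not self.writable:
            raise WriteUnavailable("database is not accepting writes")
        self.rows[key] = value


def handle(store, method, key, value=None):
    """Serve reads without writing anything; report failed writes cleanly."""
    if method == "GET":
        found = store.read(key)
        return (200, found) if found is not None else (404, None)
    try:
        store.write(key, value)
    except WriteUnavailable:
        return (503, "writes are temporarily unavailable")
    return (201, value)


store = Store({"mandate_1": "active"})
store.writable = False  # simulate the failure described in this post
print(handle(store, "GET", "mandate_1"))              # (200, 'active')
print(handle(store, "POST", "mandate_2", "pending"))  # (503, 'writes are temporarily unavailable')
```

In this sketch a blocked write degrades to a clear error for that one request, while read-only requests are unaffected.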

We’d like to apologise again for any inconvenience caused to you and your customers. We will continue to invest in our technology and processes to ensure we guard against any similar incidents in the future.