Service outage on 6 April 2018: post-mortem results
By Tim Rogers | May 2018 | 1 min read
On Friday 6th April at 16:04 BST, we experienced a service outage. For a period of 27 minutes, the GoCardless API and Dashboard were unavailable, and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.
The outage was caused by a misconfiguration of our database, which stopped us making changes to the data we have stored.
For those of you wanting the technical details, the error was due to us reaching a limit on the ID given to each entry in our “update log”. Every time we make changes to certain tables in our database (i.e. create or update a row), we record the changes in an “update log” to provide an audit trail. Each entry in the log has an automatically-generated sequential ID (1, 2, 3, and so on). Our database configuration meant that the maximum possible value for this ID was 2,147,483,647, the largest value a signed 32-bit integer can hold.
We hit this limit, so we were unable to write to the “update log”, which blocked writes to the database. For more details, see our previous blog post.
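To make the failure mode concrete, here is a minimal sketch in Python of how an exhausted ID sequence on an audit log can block otherwise unrelated writes. It is an illustration, not our production code: the `UpdateLog` class, the `update_row` and `headroom` helpers, and the assumption that the ID is a signed 32-bit integer (maximum 2**31 - 1) are all hypothetical stand-ins.

```python
# Illustrative only: models a sequential ID stored in a signed 32-bit column.
INT32_MAX = 2**31 - 1  # 2,147,483,647 (assumed column limit)


class SequenceExhausted(Exception):
    """Raised when the next ID would not fit in the column."""


class UpdateLog:
    """Hypothetical in-memory stand-in for an audit table with a capped ID."""

    def __init__(self, last_id=0, max_id=INT32_MAX):
        self.last_id = last_id
        self.max_id = max_id
        self.entries = []

    def append(self, change):
        next_id = self.last_id + 1
        if next_id > self.max_id:
            raise SequenceExhausted(f"ID {next_id} exceeds maximum {self.max_id}")
        self.last_id = next_id
        self.entries.append((next_id, change))
        return next_id


def update_row(table, key, value, log):
    """Write to the table only if the audit entry can be recorded first."""
    log.append({"key": key, "value": value})  # raises once IDs run out
    table[key] = value


def headroom(log):
    """Fraction of the ID space still unused; a possible alerting signal."""
    return 1 - log.last_id / log.max_id
```

In this sketch, once `last_id` reaches `max_id`, every call to `update_row` fails before the table is touched, which mirrors how a full “update log” stopped all writes. Watching a value like `headroom` would give warning well before the limit is reached.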
As a payments company, we know how critical our service is to our customers, and we take incidents like this extremely seriously.
As such, once we’ve responded to an incident and restored service for our users, we run “post-mortems” to make sure we understand:
exactly what went wrong
how we can reduce the chance of similar incidents occurring in future
how we can respond more effectively when things do go wrong
Following the post-mortem for this incident, we’ve already taken a number of steps to improve our systems and processes for the future:
We’ve improved the robustness of our automated alerting, guaranteeing that we’ll be informed within seconds if our service goes down
We’ve improved the reliability of our code for turning off the “update log”, allowing us to recover more quickly if we see a similar failure in the future
We’ve made our code more resilient, so “read-only” requests to the API will still work even if we’re unable to write to the database
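As a rough illustration of the last point, the sketch below shows one common way to keep read-only requests working while the database rejects writes: treat any bookkeeping write made during a read as best-effort. This post does not describe the actual implementation, so the `Store` class, the `WriteUnavailable` error, the handler names, and the status codes are all assumptions made for the example.

```python
class WriteUnavailable(Exception):
    """Hypothetical error raised when the database rejects writes."""


class Store:
    """In-memory stand-in for a database whose writes can be switched off."""

    def __init__(self):
        self.rows = {}
        self.writable = True

    def read(self, key):
        return self.rows.get(key)

    def write(self, key, value):
        if not self.writable:
            raise WriteUnavailable("database is not accepting writes")
        self.rows[key] = value


def handle_get(store, key):
    """Serve a read; treat bookkeeping writes as best-effort."""
    try:
        store.write(("last_read", key), True)  # non-essential side effect
    except WriteUnavailable:
        pass  # degrade gracefully: the read itself does not need this write
    return 200, store.read(key)


def handle_put(store, key, value):
    """A real write cannot succeed without the database, so report failure."""
    try:
        store.write(key, value)
    except WriteUnavailable:
        return 503, None
    return 200, value
```

With `store.writable` set to `False`, `handle_get` still returns the stored value, while `handle_put` reports that the service is temporarily unable to accept changes.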
We’d like to apologise again for any inconvenience caused to you and your customers. We will continue to invest in our technology and processes to ensure we guard against any similar incidents in the future.