
Track flaky specs automatically using this simple tweak in RSpec builds

GoCardless’ Product Development team recently held a summer Hackathon, during which we developed a simple tweak to our RSpec builds to track flaky specs (that is, intermittently failing tests).

The problem

A lot has been said about flaky specs, because they lead to wasted time and effort. At one point or another, we have probably all been about to merge or deploy a change when a required build check stopped us – only because a flaky spec acted up!

When this happens, there’s no option but to retrigger the build and wait for it to go green to unblock the change. We try to remind ourselves to write up a ticket and come back to the flaky spec, but this isn’t always possible when you have a number of other things on the go.

So we needed an automated way to keep track of flaky specs, so that they stay visible and we are reminded to fix them.
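By way of illustration, here is a minimal sketch of the idea – not our exact implementation, and the log path and format are made up. It uses the rspec-retry gem, which lets you retry failed examples and run a callback between attempts:

```ruby
# spec/support/flaky_specs.rb – a minimal sketch, not our exact
# implementation. Any example that fails and gets retried is logged
# as a flake candidate, so a later job can surface it (for example
# by filing a ticket automatically).
require "rspec/retry"
require "time"

RSpec.configure do |config|
  config.verbose_retry = true    # print a message when an example is retried
  config.default_retry_count = 2 # run each failing example up to twice

  # rspec-retry invokes this callback between attempts
  config.retry_callback = proc do |example|
    File.open("tmp/flaky_specs.log", "a") do |f|
      f.puts "#{Time.now.utc.iso8601} #{example.location} #{example.full_description}"
    end
  end
end
```

Cross-referencing that log against builds that ultimately passed isolates the true flakes: they failed once, then went green on retry.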



Update on service disruption: 27 June, 2019

On Thursday 27 June between 12:41 and 13:21 (BST), we experienced a 40-minute service outage.

Our API and Dashboard were unavailable and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.

Submissions to the banks to collect payments and pay out collected funds were unaffected.

We’d like to apologise for any inconvenience caused. We know how important reliability is to you and we take incidents like this extremely seriously.

We’ll be completing a more detailed internal review of the incident in the coming weeks, but in the meantime we want to provide more detail on exactly what happened and what we’re doing about it.


Debugging the Postgres query planner

At GoCardless, Postgres is our database of choice. Not only does it power our API; it is also used extensively in a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.

One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.

We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level up our understanding of the Postgres query planner. We'll walk through our investigation step by step, then conclude with the actions we took to prevent this happening again.
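As a flavour of the kind of digging involved, a typical first step when the planner misbehaves is to ask Postgres for the plan it actually chose. Here's an illustrative sketch – the connection details and query are made up:

```ruby
# Illustrative only: ask Postgres how it planned (and executed) a query.
# EXPLAIN ANALYZE runs the query and reports the chosen plan, row
# estimates versus reality, and buffer usage.
require "pg"

conn = PG.connect(dbname: "gocardless_development") # hypothetical database
result = conn.exec(<<~SQL)
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM payments WHERE mandate_id = 42; -- hypothetical query
SQL
result.each { |row| puts row["QUERY PLAN"] }
```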



How to integrate with the GoCardless API

What is the GoCardless API?

The GoCardless API is a way for developers to interact with GoCardless programmatically, allowing you to integrate us into your website, mobile app or desktop software. This means you can build your own customised integration to automate payment collection and reconciliation.

What do you need to do to integrate with the GoCardless API?

Integrating with the GoCardless API is simple and can be done in minutes by following the steps below and using our easy-to-use API libraries.

1. Sign up for a sandbox account if you haven’t already – it’ll only take a minute. If you already have one, just log in to your dashboard.
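For example, once you've created an access token in your sandbox dashboard, a minimal sketch using our gocardless_pro Ruby library looks something like this (the token below is a placeholder):

```ruby
# A minimal sketch against the sandbox environment; the access token
# is a placeholder – create a real one in your sandbox dashboard.
require "gocardless_pro"

client = GoCardlessPro::Client.new(
  access_token: "sandbox_token_goes_here",
  environment: :sandbox
)

# List your customers to confirm the credentials work
client.customers.list.records.each do |customer|
  puts "#{customer.id}: #{customer.email}"
end
```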



Service outage on 6 April 2018: post-mortem results

On Friday 6 April at 16:04 BST, we experienced a service outage. For a period of 27 minutes, the GoCardless API and Dashboard were unavailable, and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.

What happened

The outage was caused by a misconfiguration of our database, which stopped us making changes to the data we have stored.

For those of you wanting the technical details, the error was due to us reaching a limit on the ID given to each entry in our “update log”. Every time we make changes to certain tables in our database (i.e. create or update a row), we record the changes in an “update log” to provide an audit trail. Each entry in the log has an automatically-generated sequential ID (1, 2, 3, and so on). Our database configuration meant that the maximum possible value for this ID was 2,147,483,647.

We hit this limit, so we were unable to write to the “update log”, which blocked writes to the database. For more details, see our previous blog post.

Our response

As a payments company, we know how critical our service is to our customers, and we take incidents like this extremely seriously.

As such, once we’ve responded to an incident and restored service for our users, we run “post-mortems” to make sure we understand:

  • exactly what went wrong
  • how we can reduce the chance of similar incidents occurring in future
  • how we can respond more effectively when things do go wrong

Following the post-mortem for this incident, we’ve already taken a number of steps to improve our systems and processes for the future:

  • We’ve improved the robustness of our automated alerting, guaranteeing that we’ll be informed within seconds if our service goes down
  • We’ve improved the reliability of our code for turning off the “update log”, allowing us to recover more quickly if we see a similar failure in the future
  • We’ve made our code more resilient, so “read-only” requests to the API will still work even if we’re unable to write to the database

We’d like to apologise again for any inconvenience caused to you and your customers. We will continue to invest in our technology and processes to ensure we guard against any similar incidents in the future.


Service outage on 6 April 2018: our response

On Friday 6 April at 16:04 BST, we experienced a service outage. For a period of 27 minutes, the GoCardless API and Dashboard were unavailable, and users were unable to set up a Direct Debit via our payment pages or connect their account to a partner integration through our OAuth flow.

Submissions to the banks to collect payments and pay out collected funds were unaffected.

We’d like to apologise for any inconvenience caused to you and your customers. As a payments company, we know how important reliability is to our customers, and we take incidents like this extremely seriously. We’re completing a detailed review of the incident and taking the required steps to improve our technology and processes to ensure this doesn’t happen again.

What happened?

All of our most critical data is stored in a PostgreSQL database. When we make changes to certain tables in that database (i.e. create or update a row), we use a trigger to keep a separate record of exactly what changed. We use this “update log” to enable data analysis tasks like fraud detection.

Each entry in the log has an automatically-generated sequential ID (1, 2, 3, and so on). This ID is stored using the serial type in the database, which means it can be a value between 1 and 2,147,483,647.

At 16:04 on Friday 6 April, we hit this upper limit, meaning we could no longer write to the “update log” table. In PostgreSQL, when a trigger fails, the original database write that triggered it fails too. This caused requests to our application to fail, returning a 500 Internal Server Error.
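To make that failure mode concrete, here is a simplified sketch of an audit trigger along these lines, written as a Rails migration. The table, column and function names are illustrative, not our actual schema:

```ruby
# Illustrative migration, names are hypothetical. The trigger copies
# every change into update_log; if that INSERT fails (for example
# because the log's ID overflows), the triggering write fails too.
class AddUpdateLogTrigger < ActiveRecord::Migration[5.1]
  def up
    execute <<~SQL
      CREATE FUNCTION record_change() RETURNS trigger AS $$
      BEGIN
        INSERT INTO update_log (table_name, row_data, recorded_at)
        VALUES (TG_TABLE_NAME, row_to_json(NEW), now());
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER log_payment_changes
        AFTER INSERT OR UPDATE ON payments
        FOR EACH ROW EXECUTE PROCEDURE record_change();
    SQL
  end
end
```

Because Postgres runs the trigger inside the same transaction as the original write, the failed INSERT rolls back the whole statement – which is exactly how a full audit log became a full outage.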

This issue also affected API requests (including those from the Dashboard) which only appear to read data (e.g. listing your customers or fetching a specific payment), since authenticated requests update access tokens to record when you last accessed the API.

How we responded

Having identified the root cause of the problem, we disabled the trigger which sends writes to the “update log”, thereby restoring service.

We’ve resolved this problem for the future by storing the IDs for our “update log” using the bigserial type, which allows values up to 9,223,372,036,854,775,807. This is effectively unlimited, and can be expected to provide enough IDs to last millions of years.
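For illustration, the fix might look like the following migration (names again illustrative):

```ruby
# Illustrative migration: widen the update_log ID from integer
# (serial) to bigint (bigserial), raising the ceiling to 2^63 - 1.
class WidenUpdateLogId < ActiveRecord::Migration[5.1]
  def up
    execute "ALTER TABLE update_log ALTER COLUMN id TYPE bigint"
    # On Postgres 10+, the backing sequence's MAXVALUE may also need raising:
    #   ALTER SEQUENCE update_log_id_seq MAXVALUE 9223372036854775807;
  end
end
```

Two caveats worth noting: changing a column's type rewrites the table under an exclusive lock, so on a large table it needs careful scheduling; and to put "millions of years" in perspective, even at 100,000 log writes per second, roughly 9.2 × 10^18 IDs would last about 2.9 million years.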

Next steps

In the next few days, we’ll be running a full post-mortem to better understand:

  • how we can reduce the chance of similar errors occurring in future; and
  • how we can respond more effectively when things do go wrong

We’ll publish the results of this post-mortem in a follow-up post within the next 4 weeks.


Moving fast at GoCardless: why we invest in our workflow and processes

At GoCardless, how we work is the secret sauce that allows us to deliver excellent customer experiences, from how we run customer support to how our design and marketing teams collaborate.

In this post, I’ll talk about how we changed the way we work over the last 9 months to build truly global software and introduce a localisation process which allows us to move quickly and deliver real value for customers.

[Image: the GoCardless Dashboard in French]

We wanted to provide a great experience for our users, whatever language they speak – but it was imperative to do so in a way that didn’t slow us down as we continued to build out our product. When we get processes like this wrong, we not only make our team’s work harder than it needs to be, but we also place a drag on what we care about most: delivering value for our users.

There’s a whole other post I could write about the intricacies of that process – how we’ve invested in sourcing skilled translators and in quality assurance (QA) processes to ensure high-quality translations – but in this post, we’ll focus on the developer workflow.



Incident review: API and Dashboard outage on 10 October 2017

This post represents the collective work of our Core Infrastructure team's investigation into our API and Dashboard outage on 10 October 2017.

As a payments company, we take reliability very seriously. We hope that the transparency in technical write-ups like this reflects that.

We have included a high-level summary of the incident, and a more detailed technical breakdown of what happened, our investigation, and changes we've made since.

Summary

On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard lasting 1 hour and 50 minutes. Any requests made during that time failed and returned an error.

The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.

This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours.

