Debugging the Postgres query planner
At GoCardless, Postgres is our database of choice. Not only does it power our API, it is also used extensively in a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.
One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.
We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level up our understanding of the Postgres query planner. We'll detail our investigation by covering:
We conclude with the actions we took to prevent this happening again.
Incident review: API and Dashboard outage on 10 October 2017
This post represents the collective work of our Core Infrastructure team's investigation into our API and Dashboard outage on 10 October 2017.
As a payments company, we take reliability very seriously. We hope that the transparency in technical write-ups like this reflects that.
We have included a high-level summary of the incident, and a more detailed technical breakdown of what happened, our investigation, and changes we've made since.
On the afternoon of 10 October 2017, we experienced an outage of our API and Dashboard, lasting 1 hour and 50 minutes. Any requests made during that time failed, and returned an error.
The cause of the incident was a hardware failure on our primary database node, combined with unusual circumstances that prevented our database cluster automation from promoting one of the replica database nodes to act as the new primary.
This failure to promote a new primary database node extended an outage that would normally last 1 or 2 minutes to one that lasted almost 2 hours.
Coach: An alternative to Rails controllers
Today we're open sourcing Coach, a library that removes the complexity from Rails controllers. Bundle your shared behaviour into highly robust, heavily tested middlewares and rely on Coach to join them together, providing static analysis over the entire chain. Coach ensures you only require a glance to see what's being run on each controller endpoint.
At GoCardless we've replaced all our controller code with Coach middlewares.
Safely retrying API requests
Today we're announcing support for idempotency keys on our Pro API, which make it safe to retry non-idempotent API requests.
Why are they necessary?
Here's an example that illustrates the purpose of idempotency keys.
You submit a POST request to our /payments endpoint to create a payment. If all goes well, you'll receive a 201 Created response. If the request is invalid, you'll receive a 4xx response, and know that the payment wasn't created. But what if something goes wrong on our end and we issue a 500 response? Or what if there's a network issue that means you get no response at all? In these cases you have no way of knowing whether or not the payment was created. This leaves you with two options:
- Hope the request succeeded, and take no further action.
- Assume the request failed, and retry it. However, if the request did succeed you'll end up with a duplicate payment.
Not an ideal situation.
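To make the idea concrete, here is a minimal in-memory sketch of how an idempotency key can make a retry safe. It is a simulation under stated assumptions, not our API or client library: FakePaymentsServer, FlakyNetwork and create_payment_with_retries are hypothetical names invented for illustration, and the "network" simply drops responses after the server has already done the work.

```python
import uuid


class FakePaymentsServer:
    """In-memory stand-in for a payments API (hypothetical, for illustration only)."""

    def __init__(self):
        self.payments = []  # every payment actually created
        self.seen = {}      # idempotency key -> payment created for that key

    def create_payment(self, amount, idempotency_key):
        # A key we have already seen means this is a retry: return the
        # original payment instead of creating a second one.
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]
        payment = {"id": f"PM{len(self.payments) + 1}", "amount": amount}
        self.payments.append(payment)
        self.seen[idempotency_key] = payment
        return payment


class FlakyNetwork:
    """Loses the response of the first `failures` calls, after the server has processed them."""

    def __init__(self, server, failures):
        self.server = server
        self.failures = failures

    def post_payment(self, amount, idempotency_key):
        response = self.server.create_payment(amount, idempotency_key)
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("response lost in transit")
        return response


def create_payment_with_retries(network, amount, max_attempts=3):
    # Generate the key once, then reuse the same key on every retry.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return network.post_payment(amount, key)
        except ConnectionError:
            continue  # we cannot tell whether the payment was created, so retry
    raise RuntimeError("gave up after retries")


server = FakePaymentsServer()
network = FlakyNetwork(server, failures=2)
payment = create_payment_with_retries(network, amount=1000)
```

In this sketch the server processes the request three times but creates only one payment, because the second and third attempts carry a key it has already seen. The important design choice is on the client side: the key is generated once per logical operation, not once per HTTP attempt.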