The Troubleshooting Tales: issues scaling Postgres connections
After making some changes to our Postgres setup, we started noticing occasional errors coming from deep within ActiveRecord (Rails’ ORM). This post details the process we went through to determine the cause of the issue, and what we did to fix it.
First, it’s important to understand the changes we made to our Postgres setup. Postgres connections are relatively slow to establish (particularly when using SSL), and on a properly-tuned server they use a significant amount of memory. The amount of memory used limits the number of connections you can feasibly have open at once on a single server, and the slow establishment encourages clients to maintain long-lived connections. Due to these constraints, we recently hit the limit of connections our server could handle, preventing us from spinning up more application servers. To get around this problem, the common advice is to use connection pooling software such as PgBouncer to share a small number of Postgres connections between a larger number of client (application) connections.
When we first deployed PgBouncer, we were running it in “session pooling” mode, which assigns a dedicated Postgres server connection to each connected client. However, with this setup, if you have a large number of idle clients connected to PgBouncer you’ll have to maintain an equal number of (expensive) idle connections on your Postgres server. To combat this, there is an alternative mode: “transaction pooling”, which only uses a Postgres server connection for the duration of each transaction. The downside of transaction pooling is that you can’t use any session-level features (e.g. prepared statements, session-level advisory locks). After combing through our apps to remove all usages of session-level features, we enabled transaction pooling.
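For reference, the pooling mode is a single setting in pgbouncer.ini; a minimal sketch (host, database name, and pool sizes here are illustrative, not our production values):

```ini
[databases]
; clients connect to "appdb" on PgBouncer, which proxies to the real server
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_port = 6432
; "session" dedicates a server connection to each client for its lifetime;
; "transaction" borrows one only for the duration of each transaction
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
```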
Shortly after making the switch, we started seeing (relatively infrequent) exceptions coming from deep within ActiveRecord:
NoMethodError: undefined method 'fields' for nil:NilClass

We also noticed that instances of this exception appeared to be correlated with INSERT queries that violated unique constraints.
Building APIs: lessons learned the hard way
This post is about the lessons we’ve learned building and maintaining APIs over the last three years. It starts off with some high-level thoughts, and then dives right into the detail of how and why we've made the design decisions we have, while building our second API - GoCardless Pro.
The problem with APIs
The hard thing about APIs is change. Redesign your website and your users adapt; change how timestamps are encoded in your API and all of your customers' integrations break. As a payments API provider, even a single broken integration can leave thousands of people unable to pay.
As a fast-moving startup we're particularly poorly positioned to get things right first time. We're wired to constantly iterate our products: ship early then tweak, adjust, and improve every day. As a startup the uncertainty goes even deeper: we're always learning more about our customers and making course corrections to our business.
Building dependable APIs in this environment is hard.
Different types of change
The most important lesson we've learned is to think about structural changes differently to functionality changes.
Structural changes affect the way the API works, rather than what it does. They include changes to the URL structure, pagination, errors, payload encoding, and levels of abstraction offered. They are the worst kind of changes to have to make because they are typically difficult to introduce gracefully and add little value for existing customers. Fortunately, they're also the decisions you have the best chance of getting right first time - an API's structure isn't tied to constantly-evolving business needs. It just takes time and effort, and is discussed in "Getting your structure right" below.
Functionality changes affect what the API does. They include adding new endpoints and attributes or changing the behaviour of existing ones, and they're necessary as a business changes. Fortunately they can almost always be introduced incrementally, without breaking backwards compatibility.
Getting your structure right
In our first API we made some structural mistakes. None of them are serious issues, but they have led to an API that is difficult to extend due to its quirks. For instance, the pagination scheme relies on limits and offsets, which causes performance issues. We also had frequent discussions about whether resources should be nested in URLs, which resulted in inconsistencies.
Before we started work on GoCardless Pro, we spent a lot of time laying the structural foundations. Thinking about the structure of GoCardless Pro was helpful for several reasons:
- we were forced to think upfront about issues that would affect us down-the-line, such as versioning, rate-limiting, and pagination;
- when implementing the API, we could focus on its functionality, rather than debating the virtues of PUT vs PATCH;
- consistency across the API came for free, as we weren't making ad-hoc structural decisions.
We adopted JSON API as the basis for our framework and put together a document detailing our HTTP API design principles. The decisions made came from three years of experience running and maintaining a payments API, as well as from examining similar efforts by other companies.
Amongst other things, our framework includes:
- Versioning. API versions should be represented as dates, and submitted as headers. This promotes incremental improvement to the API and discourages rewrites. As versions are only present in incoming requests, WebHooks should not contain serialised resources - instead, provide an id to the resource that changed and let the client request the appropriate version.
- Pagination. Pagination is enabled for all endpoints that may respond with multiple records. Cursors are used rather than the typical limit/offset approach to prevent missing or duplicated records when viewing growing collections, and to avoid the database performance penalties associated with large offsets.
- URL structure. Never nest resources in URLs - it enforces relationships that could change, and makes clients harder to write. Use filtering instead (e.g. /payments?subscription=ID rather than /subscriptions/ID/payments).
This list is by no means exhaustive - for more examples, check out our full HTTP API design document on GitHub.
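To make the cursor-based pagination point concrete, here's a minimal sketch in plain Ruby. The names are illustrative, and in a real API the filter would become a `WHERE id > cursor ORDER BY id LIMIT n` query; the point is that a cursor pins a position in the collection, so new records can't shift items into or out of the next page the way an offset can.

```ruby
Record = Struct.new(:id, :name)

# Return up to `limit` records with ids after the cursor, plus a new cursor.
def paginate(records, after: nil, limit: 2)
  page = records.sort_by(&:id)
                .select { |r| after.nil? || r.id > after }
                .first(limit)
  { records: page, cursor: page.last && page.last.id }
end

records = (1..5).map { |i| Record.new(i, "record-#{i}") }
page1 = paginate(records)                         # ids 1 and 2
page2 = paginate(records, after: page1[:cursor])  # ids 3 and 4
```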
When starting work on a new API we encourage you to build your own document rather than just using ours (though feel free to use ours as a template). The exercise of writing it is extremely helpful in itself, and many of these decisions are a question of preference rather than "right or wrong".
You can’t ever break integrations, but you have to keep improving your API. No product is right first time - if you can’t apply what you learn after your API is launched you’ll stagnate.
Many functionality changes can be made without breaking backwards compatibility, but occasionally breaking changes are necessary.
To keep improving our API whilst supporting existing integrations we:
- Use betas extensively to get early feedback from developers who are comfortable with changes to our API. All new endpoints on the GoCardless Pro API go through a public beta.
- Version all breaking changes and continue to support historic versions. By only ever introducing backwards-incompatible changes behind a new API version, we avoid breaking existing integrations. As we keep track of the API version that each customer uses, we can explain exactly what changes they'll need to make to take advantage of improvements we've made.
- Release the minimum API possible. Where we have a choice between taking an API decision and waiting, we choose to wait. This is an unusual mentality in a startup, but when building an API, defaulting to inaction is probably the right approach. As we’re constantly learning, decisions made later are decisions made better.
- Introduce “change management” to slow things down. This is not our typical approach - “change management” is a term that makes many of us shudder! But changes to public APIs need to be introduced carefully, so however uneasy it makes us, putting speed bumps in place can be a good idea. At GoCardless, all public API changes need to be agreed on by at least three senior engineers.
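As an illustration of date-based version headers, a client pins the API behaviour it was built against by sending a date with every request; a sketch using Ruby's Net::HTTP (the endpoint and header name here are hypothetical, not GoCardless's actual API):

```ruby
require "net/http"
require "uri"

# Hypothetical endpoint and header name, for illustration only
uri = URI("https://api.example.com/payments")
request = Net::HTTP::Get.new(uri)

# The version is a date submitted as a header, not a segment of the URL;
# the server dispatches to the behaviour that was current on that date
request["Api-Version"] = "2014-09-15"
```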
Stop thinking like a startup
To sum up: when building an API, you need to throw most advice about building products at startups out the window.
APIs are inflexible, so advice founded on change being easy doesn't apply. You still need to constantly improve your product, but it pays to do work up front, to make changes slowly and cautiously, and to get as much as possible right from the start.
Business: simple business date calculations in Ruby
We just open-sourced business, a simple library for doing business date calculations.
calendar = Business::Calendar.new(
  working_days: %w( mon tue wed thu fri ),
  holidays: ["01/01/2014", "03/01/2014"]
)

calendar.business_day?(Date.parse("Monday, 9 June 2014"))
# => true
calendar.business_day?(Date.parse("Sunday, 8 June 2014"))
# => false

date = Date.parse("Thursday, 12 June 2014")
calendar.add_business_days(date, 4).strftime("%A, %d %B %Y")
# => "Wednesday, 18 June 2014"
calendar.subtract_business_days(date, 4).strftime("%A, %d %B %Y")
# => "Friday, 06 June 2014"

date = Date.parse("Saturday, 14 June 2014")
calendar.business_days_between(date, date + 7)
# => 5
But other libraries already do this
Another gem, business_time, also exists for this purpose. We previously used business_time, but encountered several issues that prompted us to start business.
Firstly, business_time works by monkey-patching Fixnum. While this enables syntax like Time.now + 1.business_day, it means that all configuration has to be global. GoCardless handles payments across several geographies, so being able to work with multiple working-day calendars is essential for us. Business provides a simple Calendar class, which is initialized with a configuration specifying which days of the week are considered working days, and which dates are holidays.
Secondly, business_time supports calculations on times as well as dates. For our purposes, date-based calculations are sufficient. Supporting time-based calculations as well makes the code significantly more complex. We chose to avoid this extra complexity by sticking solely to date-based mathematics.
Earlier this week, Heartbleed - a security vulnerability in the OpenSSL library - was publicly disclosed. GoCardless uses software that depends on OpenSSL, which means we were among the large number of companies affected.
Our engineering team patched our affected software on Tuesday morning (April 8th), and replaced our SSL certificates. This means that we are no longer vulnerable to Heartbleed.
We have no reason to believe that any GoCardless data has been compromised, but given the nature of the vulnerability we recommend taking the following precautions:
- We recommend that GoCardless users reset their passwords.
- We have invalidated any sessions that were in use prior to the resolution of the issue.
- We are adding the ability for API users to reset their API keys; we'll post an update as soon as this is possible.
If you have any questions, don't hesitate to email us at email@example.com.
Hutch: Inter-Service Communication with RabbitMQ
Today we're open-sourcing Hutch, a tool we built internally that has become a crucial part of our infrastructure. So what is Hutch? Hutch is a Ruby library for enabling asynchronous inter-service communication in a service-oriented architecture, using RabbitMQ. First, I'll cover the motivation behind Hutch by outlining some issues we were facing. Next, I'll explain how we used a message queue (RabbitMQ) to solve these issues. Finally, I'll go over what Hutch itself provides.
GoCardless's Architecture Evolution
GoCardless has evolved from a single, overweight Rails application to a suite of services, each with a distinct set of responsibilities. We have a service that takes care of user authentication, another that encapsulates the logic behind Direct Debit payments, another that serves our public API. So, how do these services talk to each other?
The go-to route for getting services communicating is HTTP. We're a web-focussed engineering team, used to building HTTP APIs, and debating the virtues of RESTfulness. So this is where we started. Each service exposed an HTTP API, which would be used via a corresponding client library from the dependent services. However, we soon encountered some issues:
App server availability. There are several situations that cause inter-service communication to spike dramatically. We frequently receive and process information in bulk. For instance, the payment failure notifications we receive from the banks are processed once per day in a large batch. If another service needs to be made aware of these failures, an HTTP request would be sent to each service for each failure. This places our app servers under a significant amount of load. This issue could be mitigated by implementing special "bulk" endpoints, queuing requests as they arrive, or imposing rate limits, but not without the cost of additional complexity.
Client speed. Often when we're sending a message from one service to another, we don't need a response immediately (or sometimes, ever). If a response isn't required, why are we waiting around for the server to finish processing the message? This situation is particularly detrimental if the communication occurs during an end-user's request-response cycle.
Failure handling. When HTTP requests fail, they generally need to be retried. Implementing this retry logic properly can be tricky, and can easily cause further issues (e.g. thundering herds).
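For what it's worth, the usual mitigation for retry storms is exponential backoff with jitter, so that clients that failed together don't all retry in lockstep; a sketch (method name and parameters are illustrative):

```ruby
# Retry a block with exponential backoff and full jitter.
# Randomising the delay spreads retries out over time, which is what
# prevents a thundering herd of simultaneous retries.
def with_retries(max_attempts: 5, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(rand * base_delay * (2**attempts))
    retry
  end
end

# e.g. with_retries { Net::HTTP.get_response(uri) }
```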
- Service coupling. Using HTTP for inter-service communication means that mappings between events and dependent services are required. For example: when a payment fails, services a, b, and c need to know; when a payment succeeds, services b, c, and d need to know; and so on. These dependency graphs become increasingly unwieldy as the system grows.
It quickly became evident that most of these issues would be solved by using a message queue for communication between services. After evaluating a number of options, we settled on RabbitMQ. It's a stable piece of software that has been battle-tested at large organisations around the world, and has some useful features not found in other message brokers, which we can use to our advantage.
How we use RabbitMQ
We run a single RabbitMQ cluster that sits between all of our services, acting as a central communications hub. Inter-service communication happens through a single topic exchange. All messages are assigned routing keys, which typically specify the originating service, the subject (noun) of the message, and an action (verb) - for example, paysvc.payment.chargedback.
Each service in our infrastructure has a set of consumers, which handle messages of a particular type. A consumer is defined by a function, which processes messages as they arrive, and a binding key that indicates which messages the consumer is interested in. For each consumer, we create a queue, which is bound to the central exchange using the consumer binding key.
RabbitMQ messages carry a binary payload: no serialisation format is enforced. We settled on JSON for serialising our messages, as JSON libraries are widely available in all major languages.
This setup provides us with a flexible way of managing communication between services. Whenever an action takes place that may interest another service, a message is sent to the central exchange. Any number of services may have consumers set up, ready to receive the message.
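To illustrate how binding keys select messages: in an AMQP topic exchange, * matches exactly one dot-separated word and # matches any remainder. Here's a deliberately simplified pure-Ruby sketch of that matching rule (RabbitMQ's real matcher also lets # match zero words; this version ignores that edge case):

```ruby
# "*" matches exactly one dot-separated word; "#" matches any remainder.
def binding_matches?(binding_key, routing_key)
  pattern = binding_key.split(".").map do |part|
    case part
    when "*" then "[^.]+"  # one word: no dots allowed
    when "#" then ".*"     # anything, including further dots
    else Regexp.escape(part)
    end
  end.join('\.')
  !!(routing_key =~ /\A#{pattern}\z/)
end

binding_matches?("paysvc.payment.*", "paysvc.payment.chargedback")  # => true
binding_matches?("paysvc.payment.*", "othersvc.payment.created")    # => false
```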
There are several mature Ruby libraries for interfacing with RabbitMQ; however, they're relatively low-level, providing access to the full suite of RabbitMQ's functionality. We use RabbitMQ in a specific, opinionated fashion, which resulted in a lot of repeated boilerplate code. So we set about building our conventions into a library that we could share between all of our services. We called it Hutch. Here's a high-level summary of what it provides:
- A simple way to define consumers (queues are automatically created and bound to the exchange with the appropriate binding keys)
- An executable and CLI for running consumers
- Automatic setup of the central exchange
- Sensible out-of-the-box configuration (e.g. durable messages, persistent queues, message acknowledgements)
- Management of queue subscriptions
- Rails integration
- Configurable exception handling
Here's a brief example demonstrating how consumers are defined and how messages are published:
# Producer in the payments service
Hutch.publish('paysvc.payment.chargedback', payment_id: payment.id)

# Consumer in the notifications service
class ChargebackNotificationConsumer
  include Hutch::Consumer
  consume 'paysvc.payment.chargedback'

  def process(message)
    PaymentMailer.chargeback_email(message[:payment_id]).deliver
  end
end
At its core, Hutch is simply a Ruby implementation of a set of conventions and opinions for using RabbitMQ: subscriber acks, durable messages, topic exchanges, JSON-encoded messages, UUID message ids, and so on. These conventions could easily be ported to another language, enabling the same kind of communication in an environment composed of services written in many programming languages.
Today, we're making Hutch open source. It's available here on GitHub, and any contributions or suggestions are very welcome. For questions and comments, discuss on Hacker News or tweet me at @harrymarr.
Hacking on Side Projects: The Pool Ball Tracker
By day, we're a London-based start-up that spends most of our time making payments simple so that merchants can collect money from their customers online. Occasionally, however, we enjoy hacking on side projects as a way of winding down while continuing to build stuff as a team.
Since we have a pool table at our office, we decided to build a system to automatically score pool games. This post focusses on how we approached the initial version of the ball tracker. It's by no means complete, but it demonstrates the progress we made on it during the first 48 hour hackathon.
The balls would be tracked via a webcam mounted above the pool table (duct taped to the ceiling).
We split the system into three components:
- Ball tracker: this reads the webcam feed (illustrated below), and tracks the positions of the balls.
- Rules engine: accepts the ball positions as input, and applies rules to keep track of the score.
- Web frontend: a web-based interface that shows the state of the game.
The hardware we used:
- Refurbished pub pool table
- Set of pool balls (red and yellow)
- Consumer webcam
The camera was set up directly above the centre of the pool table to avoid spending time fighting with projective transformations.
We chose to write the ball detector and tracker in C, using OpenCV. In retrospect, it may have made more sense to prototype the system in Python first. However, C is a language most people are comfortable with, and many of the online OpenCV resources cover the C API.
We spent some time thinking about different approaches to tracking balls. There are a few main steps to the tracking process:
- Filter the image based on the balls' colours, to consider only the relevant parts of the image.
- Find objects that look roughly ball-shaped.
- Use knowledge of previous ball positions to reduce noise and filter out anomalies.
Colour Range Extraction
We converted the input frames to the HSV colour space, which made selecting areas based on a given hue easier. The image could then be filtered using cvInRangeS, which makes it possible to find pixels that lie between two HSV values. We ran multiple passes of this process - once for each of the ball colours - yellow, red, black, and white.
Finding the Balls
Our initial stab involved using the Hough transform (cvHoughCircles) to locate the circular balls. After spending some time tweaking parameters, we got some promising results.
Tracking Moving Balls
One problem that became immediately apparent was that the tracked balls would frequently pop in and out of existence. One cause was the Hough transform failing to handle the deformation of balls in motion (caused by a relatively slow shutter speed). The colour mask would also occasionally hide balls due to changes in lighting. We needed some kind of tracking.
The first approach was the simplest thing we could do. The positions of the balls were stored in memory, and whenever a ball was detected within a threshold distance of a stored position, it added confidence to that position. Positions that hadn't been detected for a set number of frames were discarded.
Later, we expanded on this approach and mapped the balls onto previous positions with a simple distance heuristic. This meant they would more smoothly track across the table instead of leaving 'ghosts'. This approach can potentially be expanded in interesting ways - for example, using a basic physics simulation to predict where the ball should be based on its past trajectory.
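A simplified sketch of that heuristic follows. The original tracker was written in C; this Ruby version, with illustrative names and thresholds, is just to show the idea:

```ruby
Track = Struct.new(:x, :y, :confidence)

MATCH_RADIUS = 20.0  # px: detections closer than this update an existing track

# Match each detection to its nearest stored track; unmatched detections
# start new tracks, and tracks that go undetected lose confidence until
# discarded. This suppresses both flickering detections and lingering 'ghosts'.
def update_tracks(tracks, detections)
  matched = []
  detections.each do |dx, dy|
    nearest = tracks.min_by { |t| Math.hypot(t.x - dx, t.y - dy) }
    if nearest && Math.hypot(nearest.x - dx, nearest.y - dy) < MATCH_RADIUS
      nearest.x, nearest.y = dx, dy
      nearest.confidence += 1
    else
      nearest = Track.new(dx, dy, 1)
      tracks << nearest
    end
    matched << nearest
  end
  tracks.each { |t| t.confidence -= 1 unless matched.include?(t) }
  tracks.select { |t| t.confidence > 0 }
end
```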
Balls Ain't Round
The approach so far worked for simple cases, where balls didn't touch each other and were sitting still. However, as soon as balls started to move they transformed from sharp, bright circles into blurry, elongated blobs. This made them very hard to track using the Hough transform.
We reimplemented the ball detection code using a generic blob detector, and a bit of morphology. A great deal of parameter tweaking was necessary before we started getting convincing results. In the end, the blob tracker performed much better than the Hough transform did, especially when it came to fast moving balls.
This video shows the ball tracking progress at the end of the hackathon:
Although we didn't get perfect results, we were happy with the progress we made. But this isn't the end - we plan to continue working on it. The main priorities are:
- Speed. The tracker currently runs at about 10 frames / second, which isn't nearly fast enough. We're currently experimenting with moving parts of it to the GPU.
- Frame rate. The new GoPro sports camera streams high-definition video at 60fps. This should make it easier to track moving balls between frames.
- More advanced tracking. The motion tracking we're currently doing is very naïve. We've been discussing how we could use more intelligent approaches to compensate for more of the errors from the detection phase.
This is still a work in progress, so if you're interested in helping out or have any advice to offer us, drop us an email or a tweet. We also had help from the London Ruby community - thanks in particular to Riccardo Cambiassi on this project.
In a future post we'll talk about the other parts of the system - the rules engine and the web interface.
If you find problems like this interesting, get in touch - we're hiring.
Discuss this post on Hacker News