in Engineering

Coach: An alternative to Rails controllers

Today we're open sourcing Coach, a library that removes the complexity from Rails controllers. Bundle your shared behaviour into highly robust, heavily tested middlewares and rely on Coach to join them together, providing static analysis over the entire chain. Coach ensures you only require a glance to see what's being run on each controller endpoint.

At GoCardless we've replaced all our controller code with Coach middlewares.

Why controller code is tricky

Controller code often suffers from hidden behaviour and tangled data dependencies, making it hard to read and difficult to test.

To keep your endpoints performant, you never want to be running a cacheable operation more than once. This leads to memoizing your database queries but the question is then where to store this data? If you want the rest of your controller code to be able to access it, then storing the data on the controller instance is the easiest way to make that happen.

In an attempt to reuse code, controller methods are then split out into controller concerns (mixins), which are included as needed. This leads to controllers that look skinny but have a large amount of included behaviour, all defined far from their call site.

Some of these implicitly defined methods are called in before_actions, making it even more unclear what code is being run when you hit your controllers. Inherited before_actions can lead to a controller that runs several methods before every action without it being clear which these are.

One of the first things we do when a request hits GoCardless is parse authentication details and pull a reference to that access token out of the database. As the request progresses, we make use of that token to scope our database queries, tag our logs and verify permissions. This sharing and reuse of data is what makes writing controller code complex.

So how does Coach help?

Coach rethinks this approach by building your controller code around the data needed for your request. All your controller code is built from Coach::Middlewares, which take a request context and decide on a response. Each middleware can opt to respond itself, or call the next middleware in the chain.

Each middleware can specify the data it requires from those that have ran before it, and can declare what data it will pass to those that come after. This makes the flow of data explicit, and Coach will verify that the requirements have been met before you ever mount an endpoint.

Coach by example

The best way to see the benefits of Coach is with a demonstration...

Mounting an endpoint

class HelloWorld < Coach::Middleware
  def call
    # Middleware return a Rack response
    [ 200, {}, ['hello world'] ]
  end
end

So we've created ourselves a piece of middleware, HelloWorld. As you'd expect, HelloWorld simply outputs the string 'hello world'.

In an example Rails app, called Example, we can mount this route like so...

Example::Application.routes.draw do
  match "/hello_world",
        to: Coach::Handler.new(HelloWorld),
        via: :get
end

Once you've booted Rails locally, the following should return 'hello world':

$ curl -XGET http://localhost:3000/hello_world

Building chains

Suppose we didn't want just anybody to see our HelloWorld endpoint. In fact, we'd like to lock it down behind some authentication.

Our request will now have two stages, one where we check authentication details and another where we respond with our secret greeting to the world. Let's split into two pieces, one for each of the two subtasks, allowing us to reuse this authentication flow in other middlewares.

class Authentication < Coach::Middleware
  def call
    unless User.exists?(login: params[:login])
      return [ 401, {}, ['Access denied'] ]
    end

    next_middleware.call
  end
end

class HelloWorld < Coach::Middleware
  uses Authentication

  def call
    [ 200, {}, ['hello world'] ]
  end
end

Here we detach the authentication logic into its own middleware. HelloWorld now uses Authentication, and will only run if it has been called via next_middleware.call from authentication.

Notice we also use params just like you would in a normal Rails controller. Every middleware class will have access to a request object, which is an instance of ActionDispatch::Request.

Passing data through middleware

So far we've demonstrated how Coach can help you break your controller code into modular pieces. The big innovation with Coach, however, is the ability to explicitly pass your data through the middleware chain.

An example usage here is to create a HelloUser endpoint. We want to protect the route by authentication, as we did before, but this time greet the user that is logged in. Making a small modification to the Authentication middleware we showed above...

class Authentication < Coach::Middleware
  provides :user  # declare that Authentication provides :user

  def call
    return [ 401, {}, ['Access denied'] ] unless user.present?

    provide(user: user)
    next_middleware.call
  end

  def user
    @user ||= User.find_by(login: params[:login])
  end
end

class HelloUser < Coach::Middleware
  uses Authentication
  requires :user  # state that HelloUser requires this data

  def call
    # Can now access `user`, as it's been provided by Authentication
    [ 200, {}, [ "hello #{user.name}" ] ]
  end
end

# Inside config/routes.rb
Example::Application.routes.draw do
  match "/hello_user",
        to: Coach::Handler.new(HelloUser),
        via: :get
end

Coach analyses your middleware chains whenever a new Handler is created. If any middleware requires :x when its chain does not provide :x, we'll error out before the app even starts.

Summary

Our problems with controller code were implicit behaviours, hidden data dependencies and as a consequence of both, difficult testing. Coach has tackled each of these, providing the framework to restructure own controllers into code that is easily understood, and easily maintained.

Coach also hooks into ActiveSupport::Notifications, making monitoring performance of our API really easy. We've written a little adapter that sends detailed performance metrics up to Skylight, where we can keep an eye out for sluggish endpoints.

Coach is on GitHub. As always, we love suggestions, feedback, and pull requests!

We're hiring developers
See job listing
in Engineering

Prius: environmentally-friendly app config

We just open-sourced Prius, a simple library for handling environment variables in Ruby.

Safer environment variables

Environment variables are a convenient way of managing application config, but it's easy to misconfigure or forget them. This can cause big problems:

# If ENCRYPTION_KEY is missing, a nil encryption key will be used
encrypted_data = Cryto.encrypt(really_secret_data, ENV["ENCRYPTION_KEY"])

# If FOO_API_KEY is missing, this code will bomb out at run time
FooApi::Client.new(ENV.fetch("FOO_API_KEY")).make_request

Prius helps you guarantee that your environment variables are:

  • Present - an exception is raised if an environment variable is missing, so you can hear about it as soon as your app boots.
  • Valid - an environment variable can be coerced to a desired type (integer, boolean or string), and an exception will be raised if the value doesn't match the type.

Usage

# Load a required environment variable (GITHUB_TOKEN) into Prius.
Prius.load(:github_token)

# Use the environment variable.
Prius.get(:github_token)

# Load an optional environment variable:
Prius.load(:might_be_here_or_not, required: false)

# Load and alias an environment variable:
Prius.load(:short_name, env_var: "LONG_NAME_WE_HAVE_NO_CONTROL_OVER")

# Load and coerce an environment variable (or raise):
Prius.load(:my_flag, type: :bool)

How we use Prius

All environment variables we use are loaded as our app starts, so we catch config issues at boot time. If an app can't boot, it won't be deployed, meaning we can't release mis-configured apps to production.

We check a file of dummy config values into source control, which makes running the app in development and test environments easier. Dotenv is used to automatically load this in non-production environments.

We're hiring developers
See job listing

Safely retrying API requests

Today we're announcing support for idempotency keys on our Pro API, which make it safe to retry non-idempotent API requests.

Why are they necessary?

Here's an example that illustrates the purpose of idempotency keys.

You submit a POST request to our /payments endpoint to create a payment. If all goes well, you'll receive a 201 Created response. If the request is invalid, you'll receive a 4xx response, and know that the payment wasn't created. But what if something goes wrong our end and we issue a 500 response? Or what if there's a network issue that means you get no response at all? In these cases you have no way of knowing whether or not the payment was created. This leaves you with two options:

  • Hope the request succeeded, and take no further action.
  • Assume the request failed, and retry it. However, if the request did succeed you'll end up with a duplicate payment.

Not an ideal situation.

Idempotency Keys

To solve this, we've now rolled out support for idempotency keys across all our creation endpoints. Idempotency keys are unique tokens that you submit as a request header, that guarantee that only one resource will be created regardless of how many times a request is sent to us.

For example, the following request can be made repeatedly, with only one payment ever being created:

POST https://api.gocardless.com/payments HTTP/1.1
Idempotency-Key: PROCESS-ME-ONCE
{
  "payments": {
    "amount": 100,
    "currency": "GBP",
    "charge_date": "2015-06-20",
    "reference": "DOLLAR01",
    "links": {
      "mandate": "MD00001EKBQ412"
    }
  }
}

If the request fails, then it's perfectly safe to retry as long as you use the same idempotency key. If the original request was successful, then you'll receive the following response:

HTTP/1.1 409 (Conflict)
{
  "error": {
    "code": 409,
    "type": "invalid_state",
    "message": "A resource has already been created with this idempotency key",
    "documentation_url": "https://developer.gocardless.com/pro#idempotent_creation_conflict",
    "request_id": "5f917bf9-df56-460f-a165-15d9e77414cb",
    "errors": [
      {
        "reason": "idempotent_creation_conflict",
        "message": "A resource has already been created with this idempotency key",
        "links": {
          "conflicting_resource_id": "PM00001KKVGTS0"
        }
      }
    ]
  }
}

It's worth noting that we haven't added support for idempotency keys to our update endpoints, as they're idempotent by nature. For example, trying to cancel the same payment multiple times will have no adverse effect.

We're constantly improving our API to provide a better experience to our integrators. If you have any feedback or suggestions then get in touch, we'd love to hear from you!

Wondering whether GoCardless Pro could be for you?
Find out more

Zero-downtime Postgres migrations - the hard parts

A few months ago, we took around 15 seconds of unexpected API downtime during a planned database migration. We're always careful about deploying schema changes, so we were surprised to see one go so badly wrong. As a payments company, the uptime of our API matters more than most - if we're not accepting requests, our merchants are losing money. It's not in our nature to leave issues like this unexplored, so naturally we set about figuring out what went wrong. This is what we found out.

Background

We're no strangers to zero-downtime schema changes. Having the database stop responding to queries for more than a second or two isn't an option, so there's a bunch of stuff you learn early on. It's well covered in other articles1, and it mostly boils down to:

  • Don't rename columns/tables which are in use by the app - always copy the data and drop the old one once the app is no longer using it
  • Don't rewrite a table while you have an exclusive lock on it (e.g. no ALTER TABLE foos ADD COLUMN bar varchar DEFAULT 'baz' NOT NULL)
  • Don't perform expensive, synchronous actions while holding an exclusive lock (e.g. adding an index without the CONCURRENTLY flag)

This advice will take you a long way. It may even be all you need to scale this part of your app. For us, it wasn't, and we learned that the hard way.

The migration

Jump back to late January. At the time, we were building invoicing for our Pro product. We'd been through a couple of iterations, and settled on model/table names. We'd already deployed an earlier revision, so we had to rename the tables. That wasn't a problem though - the tables were empty, and there was no code depending on them in production.

The foreign key constraints on those tables had out of date names after the rename, so we decided to drop and recreate them2. Again, we weren't worried. The tables were empty, so there would be no long-held lock taken to validate the constraints.

So what happened?

We deployed the changes, and all of our assumptions got blown out of the water. Just after the schema migration started, we started getting alerts about API requests timing out. These lasted for around 15 seconds, at which point the migration went through and our API came back up. After a few minutes collecting our thoughts, we started digging into what went wrong.

First, we re-ran the migrations against a backup of the database from earlier that day. They went through in a few hundred milliseconds. From there we turned back to the internet for an answer.

Information was scarce. We found lots of blog posts giving the advice from above, but no clues on what happened to us. Eventually, we stumbled on an old thread on the Postgres mailing list, which sounded exactly like the situation we'd ran into. We kept looking, and found a blog post which went into more depth3.

In order to add a foreign key constraint, Postgres takes AccessExclusive locks on both the table with the constraint4, and the one it references while it adds the triggers which enforce the constraint. When a lock can't be acquired because of a lock held by another transaction, it goes into a queue. Any locks that conflict with the queued lock will queue up behind it. As AccessExclusive locks conflict with every other type of lock, having one sat in the queue blocks all other operations5 on that table.

Here's a worked example using 3 concurrent transactions, started in order:

-- Transaction 1
SELECT DISTINCT(email)     -- Takes an AccessShare lock on "parent"
FROM parent;               -- for duration of slow query.

-- Transaction 2
ALTER TABLE child          -- Needs an AccessExclusive lock on
ADD CONSTRAINT parent_fk   -- "child" /and/ "parent". AccessExclusive
  FOREIGN KEY (parent_id)  -- conflicts with AccessShare, so sits in
  REFERENCES parent        -- a queue.
  NOT VALID;

-- Transaction 3
SELECT *                   -- Normal query also takes an AccessShare,
FROM parent                -- which conflicts with AccessExclusive
WHERE id = 123;            -- so goes to back of queue, and hangs.

While the tables we were adding the constraints to were unused by the app code at that point, the tables they referenced were some of the most heavily used. An unfortunately timed, long-running read query on the parent table collided with the migration which added the foreign key constraint.

The ALTER TABLE statement itself was fast to execute, but the effect of it waiting for an AccessExclusive lock on the referenced table caused the downtime - read/write queries issued by calls to our API piled up behind it, and clients timed out.

Avoiding downtime

Applications vary too much for there to be a "one size fits all" solution to this problem, but there are a few good places to start:

  • Eliminate long-running queries/transactions from your application.6 Run analytics queries against an asynchronously updated replica.
    • It's worth setting log_min_duration_statement and log_lock_waits to find these issues in your app before they turn into downtime.
  • Set lock_timeout in your migration scripts to a pause your app can tolerate. It's better to abort a deploy than take your application down.
  • Split your schema changes up.
    • Problems become easier to diagnose.
    • Transactions around DDL are shorter, so locks aren't held so long.
  • Keep Postgres up to date. The locking code is improved with every release.

Whether this is worth doing comes down to the type of project you're working on. Some sites get by just fine putting up a maintenance page for the 30 seconds it takes to deploy. If that's not an option for you, then hopefully the advice in this post will help you avoid unexpected downtime one day.


  1. Braintree have a really good post on this. 

  2. At the time, partly as an artefact of using Rails migrations which don't include a method to do it, we didn't realise that Postgres had support for renaming constraints with ALTER TABLE. Using this avoids the AccessExclusive lock on the table being referenced, but still takes one on the referencing table. Either way, we want to be able to add new foreign keys, not just rename the ones we have. 

  3. It's also worth noting that the Postgres documentation and source code are extremely high quality. Once we had an idea of what was happening, we went straight to the locking code for ALTER TABLE statements

  4. This still applies if you add the constraint with the NOT VALID flag. Postgres will briefly hold an AccessExclusive lock against both tables while it adds constraint triggers to them. 9.4 does make the VALIDATE CONSTRAINT step take a weaker ShareUpdateExclusive lock though, which makes it possible to validate existing data in large tables without downtime. 

  5. SELECT statements take an AccessShare lock. 

  6. If developers have access to a console where they can run queries against the production database, they need to be extremely cautious. BEGIN; SELECT * FROM some_table WHERE id = 123; /* Developer goes to make a cup of tea */ will cause downtime if someone deploys a schema change for some_table

We’re hiring developers
See job listing

How we built the new gocardless.com

We recently deployed and open sourced a new version of our splash pages. We've had some great feedback on how fast the site is so we wanted to give a high level overview of our technical approach in the hope that others will find it useful.

In this post we'll cover both how we made the site so fast and what our development setup looks like.

Performance

When browsing gocardless.com you'll notice that after the initial page load navigation around the site is super snappy. Until you leave our splash pages, or choose a different locale, all your browser is requesting are new images. If you happen to be on a slow connection, or you have disabled JavaScript, the site still works fine.

React

Using React to render on the server means that we can deliver not only a fully rendered page to the browser, but also the rest of our splash pages too. Once the initial page is displayed, a request is made to fetch the rest of the site as a React application (only 250kB). All subsequent navigation within the splash pages is blindingly fast, not suffering the latency of HTTP requests since the browser already has the whole app.

If the user has disabled JavaScript, or they are on a poor connection and the JavaScript takes a while to load, they still experience the fully rendered page and only miss out on the fast navigation.

Static deploy

While we develop against an Express server, we don't deploy that. Since gocardless.com is entirely static, we're able to deploy and serve static HTML. One huge benefit of this is that we can host the site off an S3 bucket and not have to worry about a running web server and everything that goes along with it, such as exception handling, monitoring, security issues, etc.

In order to generate our static HTML we need to know every available URL. Since we're keeping all of our routing configuration in one place, this is trivial; we're able to easily extract the paths for all pages in every locale. Once we have those URLs, we simply crawl our locally running Express app and write the responses to disk ready for deployment.

Development tools

Babel

The newest version of JavaScript, ES6 (or ES2015 as it's now known), contains numerous improvements that make building applications in JavaScript a much nicer experience. We use Babel which makes it possible to take advantage of language additions now by compiling ES2015 (and even ES7 code) down to ES5, which has much better browser support.

webpack

We use webpack to bundle our CSS and JavaScript. There are plenty of other tools which do that too, however we particularly love the development experience when using the React Hot Loader plugin for webpack. This lets us see our changes live in the browser without losing app state.

Immutable

The main motivation for rebuilding our splash pages was to allow us to more easily manage pages across different locales. We wanted to be able have one place where we could see, for each page, its handler, which locales it was available in, and its route for that locale.

Having one data structure holding all of this information means that you can see the structure of the entire site in one place. A downside of this is that it can get a bit tricky querying such a large and nested structure. Since the whole application relies on this structure, we also need to be really careful not to accidentally mutate it. This is typically dealt with by lots of defensive cloneDeeping. Not doing so would (and initially did) lead to subtle, hard to diagnose bugs across the application.

Our solution to the above problems was to use Facebook's Immutable library. Using immutable data structures, such as Map and List, means that we don't have to worry about mutating the actual routes data structure as it gets passed into functions that operate on it. Immutable also has a great API for dealing with nested properties, including getIn and setIn, which we use extensively.

Conclusion

We're pretty happy with the simplicity of our solution and the user experience that it enables. We hope others can benefit from our experience and, of course, if you have any feedback or suggestions, please get in touch!

We're hiring developers
See job listing