In search of performance - how we shaved 200ms off every POST request

While doing some work on our Pro dashboard, we noticed that search requests were taking around 300ms. We've got some people in the team who have used Elasticsearch for much larger datasets, and they were surprised by how slow the requests were, so we decided to take a look.

Today, we'll show how that investigation led to a 200ms improvement on all internal POST requests.

What we did

We started by taking a typical search request from the app and measuring how long it took. We tried this with both Ruby's Net::HTTP and from the command line using curl. The latter was visibly faster. Timing the requests showed that the request from Ruby took around 250ms, whereas the one from curl took only 50ms.
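
As a rough illustration of the Ruby side of that measurement, the sketch below times a POST made with Net::HTTP and averages several runs. The local throwaway server, the `/search` path, the JSON body and the helper names are all assumptions made so the example runs on its own; they are not our real endpoint or tooling.

```ruby
require "net/http"
require "socket"

# Throwaway local server standing in for the real search endpoint (hypothetical).
def handle(client)
  content_length = 0
  while (line = client.gets) && line != "\r\n"
    content_length = line.split(":", 2)[1].to_i if line.downcase.start_with?("content-length:")
  end
  client.read(content_length) # consume the request body
  client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok")
ensure
  client.close
end

server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
Thread.new { loop { handle(server.accept) } }

# Average wall-clock time of a POST in milliseconds, over several runs.
def average_post_ms(host, port, path, body, runs: 5)
  samples = Array.new(runs) do
    http = Net::HTTP.new(host, port, nil) # nil: ignore any proxy set in the environment
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    http.post(path, body, "Content-Type" => "application/json")
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
  end
  samples.sum / samples.size
end

avg_ms = average_post_ms("127.0.0.1", server.addr[1], "/search", '{"query":"example"}')
puts format("Net::HTTP POST: %.2f ms/req", avg_ms)
```

Against a loopback server the numbers will be far smaller than anything seen in production; the point is only the shape of the measurement, which can then be compared with the same request made from curl.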

We were confident that whatever was going on was isolated to Ruby[1], but we wanted to dig deeper, so we moved over to our staging environment. At that point, the problem disappeared entirely.

For a while, we were stumped. We run the same versions of Ruby and Elasticsearch in staging and production. It didn't make any sense! We took a step back, and looked over our stack, piece by piece. There was something in the middle which we hadn't thought about - HAProxy.

We quickly discovered that, due to an ongoing Ubuntu upgrade[2], we were using different versions of HAProxy in staging (1.4.24) and production (1.4.18). Something in those 6 patch revisions was responsible, so we turned our eyes to the commit logs. There were a few candidates, but one patch stood out in particular.

We did a custom build of HAProxy 1.4.18, with just that patch added, and saw request times drop by around 200ms. Job done.

Under the hood

Since this issue was going to be fixed by the Ubuntu upgrades we were doing, we decided it wasn't worth shipping a custom HAProxy package. Before calling it a day, we decided to take a look at the whole request cycle using tcpdump, to really understand what was going on.

What we found was that Ruby's Net::HTTP splits POST requests across two TCP packets - one for the headers, and another for the body. curl, by contrast, combines the two if they'll fit in a single packet. To make things worse, Net::HTTP doesn't set TCP_NODELAY on the TCP socket it opens, so it waits for acknowledgement of the first packet before sending the second. This behaviour is a consequence of Nagle's algorithm.
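
To make the client-side options concrete, here is a minimal sketch using Ruby's standard socket API. It enables TCP_NODELAY on a plain TCP socket and sends headers and body in a single write. The local listener, the `/search` path and the request body are assumptions for illustration, and whether a given Ruby version's Net::HTTP already sets the flag is worth checking separately.

```ruby
require "socket"

# Local listener so the sketch runs without any external service.
listener = TCPServer.new("127.0.0.1", 0)
socket = TCPSocket.new("127.0.0.1", listener.addr[1])
peer = listener.accept

# Reads the TCP_NODELAY flag back from the socket.
def nodelay?(sock)
  sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int != 0
end

nodelay_before = nodelay?(socket) # off by default, so Nagle's algorithm applies
socket.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
nodelay_after = nodelay?(socket)

# Build headers and body as one string and hand them to the kernel in one write,
# rather than writing the headers and the body separately.
body = '{"query":"example"}'
request = "POST /search HTTP/1.1\r\n" \
          "Host: localhost\r\n" \
          "Content-Type: application/json\r\n" \
          "Content-Length: #{body.bytesize}\r\n\r\n" \
          "#{body}"
socket.write(request)
received = peer.read(request.bytesize)

puts "TCP_NODELAY before: #{nodelay_before}, after: #{nodelay_after}"
```

Either change on its own should avoid the stall described here: with TCP_NODELAY the second packet is not held back, and with a single write there is no second packet to hold back when the request fits in one.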

Moving to the other end of the connection, HAProxy has to choose how to acknowledge those two packets. In version 1.4.18 (the one we were using), it opted to use TCP delayed acknowledgement.

Delayed acknowledgement interacts badly with Nagle's algorithm, and causes the request to pause until the server reaches its delayed acknowledgement timeout[3].
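
The interaction can be captured in a deliberately simplified timeline model. This is a sketch, not a TCP implementation: the function name and parameters are hypothetical, the 1ms round trip is an assumed value, and the 200ms figure is the approximate Linux timeout mentioned in the footnotes.

```ruby
# Time (in ms, from when the headers packet is sent) at which the sender is
# allowed to send the body packet. Simplified model with hypothetical names.
def body_sent_at_ms(rtt_ms:, ack_delay_ms:, nodelay: false, quickack: false)
  return 0.0 if nodelay # no Nagle: the body goes out without waiting for an ACK

  headers_arrive = rtt_ms / 2.0
  # Without quick ACK, the receiver sits on the ACK until its delay timer fires.
  ack_sent = headers_arrive + (quickack ? 0.0 : ack_delay_ms)
  ack_sent + rtt_ms / 2.0 # the ACK travels back and releases the body
end

stalled  = body_sent_at_ms(rtt_ms: 1, ack_delay_ms: 200)
quick    = body_sent_at_ms(rtt_ms: 1, ack_delay_ms: 200, quickack: true)
no_nagle = body_sent_at_ms(rtt_ms: 1, ack_delay_ms: 200, nodelay: true)

puts "Nagle + delayed ACK: #{stalled} ms"
puts "Nagle + quick ACK:   #{quick} ms"
puts "TCP_NODELAY:         #{no_nagle} ms"
```

Under these assumptions the stalled case is dominated entirely by the acknowledgement timer, which is why the delay looks like a fixed cost rather than something that scales with request size.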

HAProxy 1.4.19 adds a special case for incomplete HTTP POST requests - if it receives a packet which only contains the first part of the request, it enables TCP_QUICKACK on the socket, and immediately acknowledges that packet.
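
The idea behind that special case can be sketched in Ruby. This is purely illustrative and is not HAProxy's code: the helper names are hypothetical, the request parsing is far cruder than a real proxy's, and `TCP_QUICKACK` is a Linux-specific socket option, so the sketch checks that the constant exists before using it.

```ruby
require "socket"

# True when the data received so far is a POST whose headers or body
# have not fully arrived yet (hypothetical helper, minimal parsing).
def incomplete_post?(data)
  return false unless data.start_with?("POST ")

  headers, separator, body = data.partition("\r\n\r\n")
  return true if separator.empty? # still waiting for the end of the headers

  expected = headers[/^content-length:\s*(\d+)/i, 1].to_i
  body.bytesize < expected
end

# Asks the kernel to acknowledge immediately when more of the request is expected.
# Returns true if the option was set, false otherwise.
def quick_ack_if_needed(sock, data)
  return false unless incomplete_post?(data)
  return false unless Socket.const_defined?(:TCP_QUICKACK) # not available on every platform

  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_QUICKACK, 1)
  true
end

headers_only = "POST /search HTTP/1.1\r\nContent-Length: 5\r\n\r\n"
puts incomplete_post?(headers_only)           # body still missing
puts incomplete_post?(headers_only + "hello") # whole request present
```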

More than just search

Having understood what was happening, we realised the fix had a far wider reach than our search endpoint. We run all of our services behind HAProxy, and it's no secret that we write a lot of Ruby. This combination meant that almost every POST request made inside our infrastructure incurred a 200ms delay. We took some measurements before and after rolling out the new version of HAProxy:

POST /endpointA
average (ms/req) before HAProxy upgrade: 271.13
average (ms/req) after HAProxy upgrade: 19.08

POST /endpointB
average (ms/req) before HAProxy upgrade: 323.78
average (ms/req) after HAProxy upgrade: 66.47
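
For reference, the per-request saving implied by those averages can be worked out directly. The snippet below only restates the figures above; the hash layout and variable names are illustrative.

```ruby
# Averages in ms/req, copied from the measurements above.
measurements = {
  "POST /endpointA" => { before: 271.13, after: 19.08 },
  "POST /endpointB" => { before: 323.78, after: 66.47 }
}

# Saving per request: average before the upgrade minus average after it.
savings = measurements.transform_values { |m| (m[:before] - m[:after]).round(2) }

savings.each { |endpoint, saved| puts "#{endpoint}: #{saved} ms saved per request" }
```

Both endpoints come out at roughly 250ms saved per request, comfortably above the 200ms delayed acknowledgement timeout.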

Quite the improvement!

Wrap-up

Even though the fix was as simple as upgrading a package, the knowledge we gained along the way is invaluable in the long term.

It would have been easy to say search was fast enough and move on, but by diving into the problem we got to know more about how our applications run in production.

Doing this work really reinforced our belief that it's worth taking time to understand your stack.


  1. We used the same query in each request, and waited for the response time to settle before measuring. We also tried Python's requests library, and it performed similarly to curl.

  2. At the time, we were trialling Ubuntu 14.04 in our staging environment, before rolling it out to production.

  3. On Linux the timeout is around 200ms. The exact value is determined by the kernel, and depends on the round-trip time of the connection.

in Announcements

Optimising SEPA Direct Debit

Here at GoCardless we're always looking for ways to optimise the speed of the Direct Debit process. We recently made our GBP payouts arrive a day sooner than before, but our Euro payouts have been lagging behind, taking 2-3 working days to arrive. Now we're able to bring that down to just one.

At the same time we're making SEPA COR1 available. This is a faster version of SEPA Direct Debit, taking just 2 working days to collect payments. It's currently supported for collecting from most German, Austrian, and Spanish banks. If you don't specify the scheme when setting up a mandate, we'll automatically use COR1 whenever possible.

All new merchants will be able to take advantage of both of these optimisations straight away, and we'll be enabling them for existing merchants over the next few weeks. Get in touch if you're keen to be in the first batch!

in Business, Hiring

The Account Executive Interview Process

One of the most important things as we look to scale from processing $1 billion of payments a year to $10 billion is growing our sales team. Our vision is to create a global payments network, making payments simpler on the internet no matter what country you're from. We’ve previously written about how we train our salespeople. This blog post is aimed at guiding you through the account executive interview process. We want you to succeed, so we’re going to outline what we look for and how the process works.

The process is split into three sections:

  1. Phone screen
  2. First round interview (1.5 hours)
  3. Final round interview (2.5 hours)

We have identified the following characteristics as being most important for successful salespeople here:

  • Smart. Our product is technical. To succeed here you need to be able to learn technical detail so you can communicate potentially confusing things in a simple way.
  • Driven to learn and improve. We believe how good you’ll be a year from now is more dependent on attitude than current ability. How driven you are and how motivated you are to improve are the two most important drivers of success.
  • Coachable. Giving and taking feedback is one of the most important aspects of our company culture. The quicker you are able to act on feedback, the faster you learn.
  • Likeable. People don’t like to buy from people they don’t like. It is essential that anyone we employ is friendly.
  • Communication skills. The ability to explain complex ideas simply is absolutely crucial for our salespeople.

The interview process is designed to test these skills.

1. Phone screen

This interview lasts 10-15 minutes and you will speak to our hiring co-ordinator. We are assessing a few basic things here:

  • Have you done your research on the company?
  • Are you passionate about startups?
  • Do you want to work in sales?

To do well in this interview:

  • Be in a quiet place. Good signal helps too!
  • Be prepared. Make sure you’ve done research on the company. You should have more specific reasons for wanting to join than just ‘interested in startups or fintech.’
  • Be enthusiastic. We want to know you’re excited about joining us.
  • Be positive and engaging. We want to hire people who will make a contribution to our culture, who are respectful, polite, collaborative and energetic.

2. First round interview

This consists of two parts:

  • The background interview. Here we’ll be trying to assess the skills we value in salespeople as well as find out a bit more about your motivations for joining us and why you want to work in sales.
  • The roleplay. For this you will be doing a sales meeting with a mock client. You don’t know anything about them in advance.

In order to succeed in the background interview:

  • Think of examples of where you have demonstrated the skills we care about.
  • Be structured. Good communication is really important in sales. Read about the pyramid principle and apply it in your answers.
  • Be honest. People join a startup for a variety of reasons; tell us what you’re really looking for and what you want to do in your career. It doesn't matter if you’ve not got a history in sales (I didn’t do sales before GC!), but you need to have a good reason to work in sales with us.
  • Do your research. Why is it you want to work for us? Try to be as specific as possible. How did you find out about us? What about us excites you?

In order to succeed in the role play:

  • Read SPIN Selling. This forms a great foundation on how we do sales here. If you don’t have time for the whole book, at a minimum read a summary. Prepare some questions for the role play using what you learn.
  • Know our product and industry. Sign up for a GoCardless account and learn how to use it. At a minimum you should read our Recurring Payments Guide and our Direct Debit Guide.
  • Listen attentively. Good questioning and good listening are two of the most important skills in sales. Listen carefully to what we say and use it to ask intelligent questions.
  • Don’t pitch. You should think of this as a conversation not a ‘pitch.’ The key is to really understand the client’s needs. Only then can you align our solution to what the client wants.

3. Final round interview

This consists of three parts:

  • The roleplay. This is the same format as the previous round. After the first role play we will give you feedback. The second role play uses the same scenario: we want to see if you’ve responded to the feedback.
  • The exec interview. In the final round you get the opportunity to meet our CEO and another member of our management team. The same advice as the background interview applies here.
  • Working with sales. This is a quick challenge. There is no need to prepare anything for this. We’ll give you something you might be faced with on a typical day here.

Good luck and we look forward to meeting you! If you’re unsure about applying, please just do. We'd love to hear from you!

Want to help us hit $10bn? We're hiring!
See job listing
in Engineering

Friday's outage post mortem

Below is a post-mortem on the downtime experienced on Friday 3rd July. It was sent around the GoCardless team and is published here for the benefit of our integrators.

Any amount of downtime is completely unacceptable to us. We're sorry to have let you, our customers, down.

Summary

On Friday 3rd July 2015, we experienced three separate outages over a period of 1 hour and 35 minutes. It affected our API and our Pro API, as well as the dashboards powered by these APIs.

The first incident was reported at 16:37 BST and all our services were fully up and stable at 18:12 BST.

We apologise for the inconvenience caused. More importantly, we are doing everything we can to make sure this can't happen again.

Cause

On Friday 3rd July 2015, at 16:35 BST, we started experiencing network failures on our primary database cluster in SoftLayer’s Amsterdam datacentre. After several minutes of diagnosis it became evident that SoftLayer were having problems within their internal network. We were experiencing very high latency when troubleshooting and seeing high packet loss between servers in our infrastructure. We immediately got in touch with SoftLayer’s technical support team and they confirmed they were having issues with one of their backend routers, which was causing connectivity issues in the whole datacentre.

At 16:37 BST, all the interfaces on SoftLayer’s backend router went down. This caused our PostgreSQL cluster to be partitioned since all nodes lost connectivity to one another. As a result, our API and Pro API became unavailable, as well as the dashboards using these APIs. At 16:39 BST the network came back online and we started repairing our database cluster.

By 16:44 BST, we brought our PostgreSQL cluster up and normal service resumed.

At 17:07 BST, the interfaces on SoftLayer's router flapped again, causing our service to be unavailable. This time one of the standby nodes in our cluster became corrupted. While some of our team worked on repairing the cluster, the remainder started preparing to fail over our database cluster to another datacentre.

By 17:37 BST, all of SoftLayer’s internal network was working properly again. We received confirmation from SoftLayer that the situation was entirely mitigated. We came to the conclusion that at this point, transitioning our database cluster to a different datacentre would cause unnecessary further disruption.

At 18:03 BST, we saw a third occurrence of the internal network flapping, which caused our service to be unavailable again for a short period of time.

By 18:12 BST, our API, Pro API and dashboards were fully operational.

Post mortem

We don't yet have any further details from SoftLayer on the cause of the router issues, but we are awaiting a root cause analysis.

Currently, our PostgreSQL cluster automatically fails over to a new server within the same datacentre in case of any single-node failure. Following these events we have brought forward work to completely automate this failover across datacentres so that we can recover faster from datacentre-wide issues.

Hitting $1bn

We're excited to announce that GoCardless is now processing over $1bn each year. It's a great milestone in our journey to create a new payment network for the Internet.

Over the last year we’ve seen a 250% increase in volume as our customers increasingly recognise the potential of Direct Debit. We've welcomed large new customers, including The Financial Times, Box.com, and Habitat, at the same time as continuing to serve thousands of small businesses. In fact, over half of our volume comes from businesses that have never had access to Direct Debit before.

What's next?

With over 2,000 businesses signing up to collect payment via GoCardless every month it still feels like we are at the beginning of our journey.

European expansion is well under way - we just launched in France, are in open beta in Germany and Ireland, and will be launching across the rest of Europe as soon as we have the team to build and support our business internationally.

We're not planning on stopping at SEPA & UK Direct Debit, either. Our vision is to create an international Direct Debit network to power simple payments for the Internet. Expect more announcements soon!
