In search of performance - how we shaved 200ms off every POST request
Last edited Apr 2023
While doing some work on our Pro dashboard, we noticed that search requests were taking around 300ms. We've got some people in the team who have used Elasticsearch for much larger datasets, and they were surprised by how slow the requests were, so we decided to take a look.
Today, we'll show how that investigation led to a 200ms improvement on all internal POST requests.
What we did
We started by taking a typical search request from the app and measuring how long it took. We tried this both from Ruby, using Net::HTTP, and from the command line, using curl. The latter was visibly faster. Timing the requests showed that the one from Ruby took around 250ms, whereas the one from curl took only 50ms.
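As a rough illustration of the Ruby side of that measurement, here is a minimal sketch. The helper names, the throwaway stub server and the `/_search` path are assumptions made so the example runs on its own; this is not the code from the original investigation.

```ruby
require "net/http"
require "socket"
require "benchmark"

# Hypothetical helper: time one POST made with Net::HTTP, in milliseconds.
def time_post_ms(host, port, path, body)
  # Passing nil as the proxy address stops Net::HTTP picking up a proxy from ENV.
  http = Net::HTTP.new(host, port, nil)
  elapsed = Benchmark.realtime do
    http.post(path, body, "Content-Type" => "application/json")
  end
  (elapsed * 1000).round(1)
end

# Hypothetical stand-in for the real service: accepts a request, reads the
# headers and body, and replies with an empty JSON object.
def start_stub_server
  server = TCPServer.new("127.0.0.1", 0) # port 0 lets the OS pick a free port
  Thread.new do
    begin
      loop do
        client = server.accept
        content_length = 0
        while (line = client.gets) && line != "\r\n"
          if line.downcase.start_with?("content-length:")
            content_length = line.split(":", 2)[1].to_i
          end
        end
        client.read(content_length)
        client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\n{}")
        client.close
      end
    rescue IOError
      # The listening socket was closed; let the thread finish quietly.
    end
  end
  server
end

server = start_stub_server
ms = time_post_ms("127.0.0.1", server.addr[1], "/_search", '{"query":{"match_all":{}}}')
puts "POST took #{ms} ms"
server.close
```

Against a stub on the loopback interface the number will be tiny. The comparison only becomes meaningful when the same request is sent through the real proxy and service, and repeated until the timings settle.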
We were confident that whatever was going on was isolated to Ruby[1], but we wanted to dig deeper, so we moved over to our staging environment. At that point, the problem disappeared entirely.
For a while, we were stumped. We run the same versions of Ruby and Elasticsearch in staging and production. It didn't make any sense! We took a step back, and looked over our stack, piece by piece. There was something in the middle which we hadn't thought about - HAProxy.
We quickly discovered that, due to an ongoing Ubuntu upgrade[2], we were using different versions of HAProxy in staging (1.4.24) and production (1.4.18). Something in those 6 patch revisions was responsible, so we turned our eyes to the commit logs. There were a few candidates, but one patch stood out in particular.
We did a custom build of HAProxy 1.4.18, with just that patch added, and saw request times drop by around 200ms. Job done.
Under the hood
Since this issue was going to be fixed by the Ubuntu upgrades we were doing, we decided it wasn't worth shipping a custom HAProxy package. Before calling it a day, we decided to take a look at the whole request cycle using tcpdump, to really understand what was going on.
What we found was that Ruby's Net::HTTP splits POST requests across two TCP packets - one for the headers, and another for the body. curl, by contrast, combines the two if they'll fit in a single packet. To make things worse, Net::HTTP doesn't set TCP_NODELAY on the TCP socket it opens, so it waits for acknowledgement of the first packet before sending the second. This behaviour is a consequence of Nagle's algorithm.
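To make the client-side behaviour concrete, here is a minimal sketch of the two things that avoid the stall when you control the socket yourself: sending headers and body in a single write, and setting TCP_NODELAY so Nagle's algorithm doesn't hold back a small second packet. The helper names are hypothetical, and this uses a raw TCPSocket rather than changing anything inside Net::HTTP.

```ruby
require "socket"

# Hypothetical helper: build headers and body as one string, so a single
# write can hand both to the kernel together.
def build_post(host, path, body)
  "POST #{path} HTTP/1.1\r\n" \
  "Host: #{host}\r\n" \
  "Content-Type: application/json\r\n" \
  "Content-Length: #{body.bytesize}\r\n" \
  "Connection: close\r\n" \
  "\r\n" \
  "#{body}"
end

# Hypothetical helper: open a TCP connection with Nagle's algorithm disabled.
def open_nodelay_socket(host, port)
  sock = TCPSocket.new(host, port)
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  sock
end

# Hypothetical helper: report whether TCP_NODELAY is currently set.
def nodelay?(sock)
  sock.getsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY).int != 0
end
```

Either measure on its own should be enough to sidestep the pause described here: a request that fits in one packet has no second packet to hold back, and with TCP_NODELAY set the second packet is sent without waiting for an acknowledgement.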
Moving to the other end of the connection, HAProxy has to choose how to acknowledge those two packets. In version 1.4.18 (the one we were using), it opted to use TCP delayed acknowledgement.
Delayed acknowledgement interacts badly with Nagle's algorithm, and causes the request to pause until the server reaches its delayed acknowledgement timeout[3].
HAProxy 1.4.19 adds a special case for incomplete HTTP POST requests - if it receives a packet which only contains the first part of the request, it enables TCP_QUICKACK on the socket, and immediately acknowledges that packet.
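The idea behind that special case can be sketched in Ruby. This is an illustration of the technique only, not HAProxy's implementation (which is written in C); both helper names are hypothetical, and TCP_QUICKACK is a Linux socket option, so the sketch checks that the constant exists before using it.

```ruby
require "socket"

# Hypothetical helper: true if the buffer holds only part of an HTTP request,
# i.e. the headers are unfinished or fewer body bytes have arrived than
# Content-Length promises.
def request_incomplete?(buffer)
  head, separator, body = buffer.partition("\r\n\r\n")
  return true if separator.empty? # blank line ending the headers not seen yet

  expected = head[/^content-length:\s*(\d+)/i, 1].to_i # 0 if header is absent
  body.bytesize < expected
end

# Hypothetical helper: ask the kernel to acknowledge received data immediately.
# Returns false on platforms that don't define TCP_QUICKACK.
def enable_quickack(sock)
  return false unless Socket.const_defined?(:TCP_QUICKACK)

  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_QUICKACK, 1)
  true
end
```

A server loop would call `enable_quickack(conn) if request_incomplete?(buffer)` after each read. On Linux the flag is not permanent, so it has to be set again whenever an immediate acknowledgement is wanted.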
More than just search
Having understood what was happening, we realised the fix had a far wider reach than our search endpoint. We run all of our services behind HAProxy, and it's no secret that we write a lot of Ruby. This combination meant that almost every POST request made inside our infrastructure incurred a 200ms delay. We took some measurements before and after rolling out the new version of HAProxy:
POST /endpointA
- average (ms/req) before HAProxy upgrade: 271.13
- average (ms/req) after HAProxy upgrade: 19.08

POST /endpointB
- average (ms/req) before HAProxy upgrade: 323.78
- average (ms/req) after HAProxy upgrade: 66.47
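For readers who want the differences spelled out, the arithmetic on the figures above can be sketched as follows. The constant and method names are hypothetical; the numbers are the ones reported in the measurements.

```ruby
# Averages in ms/req, copied from the measurements above.
MEASUREMENTS = {
  "POST /endpointA" => { before: 271.13, after: 19.08 },
  "POST /endpointB" => { before: 323.78, after: 66.47 }
}.freeze

# Hypothetical helper: milliseconds saved per request, rounded to 2 places.
def saved_ms(before, after)
  (before - after).round(2)
end

MEASUREMENTS.each do |endpoint, m|
  saved = saved_ms(m[:before], m[:after])
  share = (100.0 * saved / m[:before]).round(1)
  puts "#{endpoint}: #{saved} ms saved per request (#{share}% of the original time)"
end
```

That works out to roughly 252ms saved per request on the first endpoint and 257ms on the second.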
Quite the improvement!
Even though the fix was as simple as upgrading a package, the knowledge we gained along the way is invaluable in the long term.
It would have been easy to say search was fast enough and move on, but by diving into the problem we got to know more about how our applications run in production.
Doing this work really reinforced our belief that it's worth taking time to understand your stack.
1. We used the same query in each request, and waited for the response time to settle before measuring. We also tried Python's requests library, and it performed similarly to curl.
2. At the time, we were trialling Ubuntu 14.04 in our staging environment, before rolling it out to production.
3. On Linux the timeout is around 200ms. The exact value is determined by the kernel, and depends on the round-trip-time of the connection.