The Troubleshooting Tales: issues scaling Postgres connections
By Harry MarrDec 20155 min read
After making some changes to our Postgres setup, we started noticing occasional errors coming from deep within ActiveRecord (Rails’ ORM). This post details the process we went through to determine the cause of the issue, and what we did to fix it.
First, it’s important to understand the changes we made to our Postgres setup. Postgres connections are relatively slow to establish (particularly when using SSL), and on a properly-tuned server they use a significant amount of memory. The amount of memory used limits the number of connections you can feasibly have open at once on a single server, and the slow establishment encourages clients to maintain long-lived connections. Due to these constraints, we recently hit the limit of connections our server could handle, preventing us from spinning up more application servers. To get around this problem, the common advice is to use connection pooling software such as PgBouncer to share a small number of Postgres connections between a larger number of client (application) connections.
When we first deployed PgBouncer, we were running it in “session pooling” mode, which assigns a dedicated Postgres server connection to each connected client. However, with this setup, if you have a large number of idle clients connected to PgBouncer you’ll have to maintain an equal number of (expensive) idle connections on your Postgres server. To combat this, there is an alternative mode: “transaction pooling”, which only uses a Postgres server connection for the duration of each transaction. The downside of transaction pooling is that you can’t use any session-level features (e.g. prepared statements, session-level advisory locks). After combing through our apps to remove all usages of session-level features, we enabled transaction pooling.
Shortly after making the switch, we started seeing (relatively infrequent) exceptions coming from deep within ActiveRecord:
NoMethodError: undefined method 'fields' for nil:NilClass. We also noticed that instances of this exception appeared to be correlated with INSERT queries that violated unique constraints.
Investigating the problem
Some initial digging indicated that on executing certain queries, the
async_exec method in the Ruby Postgres driver was returning
nil, rather than
PG::Result as ActiveRecord was expecting. To get a better sense of what could be causing this, we set about finding a way to reliably reproduce the exception.
We set up a test database cluster that matched our production setup (see image above), and wrote a script that used the Ruby Postgres driver to issue lots of unique-constraint-violating queries in parallel, using one connection per thread. No dice - we didn't see any exceptions. Next, we tried introducing a generic connection pooling library, so we started sharing connections between threads. Again, it all worked as expected. Finally, we swapped out the connection pooling library for ActiveRecord, and we were immediately able to reproduce the exception.
Given that we had just switched over to transaction-pooling mode, we became curious about whether wrapping the INSERT in a transaction would change anything. We tried issuing a
BEGIN, followed by the constraint-violating insert, then a
COMMIT (sent to the database server one at a time), and the error persisted. However, when we wrapped it all up into one string:
BEGIN; INSERT …; COMMIT, the error suddenly stopped occurring. Something didn’t seem right - wrapping a single statement in a transaction should have no effect at all. We then tried running
SELECT 1; INSERT …and found that it had the same effect: the error went away. So this “fix” actually generalised to command strings that included multiple statements. Most peculiar.
To shed more light on why the issue happened with ActiveRecord but not other libraries, we turned to tcpdump to get a more complete view over what was going on. We quickly noticed that when using ActiveRecord, we’d see a load of extra queries were being sent over the wire. These queries all seemed to be changing session-level settings, and session-level settings don’t play nice with transaction pooling. The reason that transaction pooling doesn’t work with session-level settings is each transaction (or query, when not in a transaction) may be sent to a different Postgres connection. So, modifying session-level settings will change the setting on a random connection, and not necessarily affect successive queries, which may be sent to different connections entirely.
Looking over the settings one-by-one, most seemed pretty innocuous1 - ensuring that the right timezone was set, making sure “standard_conforming_strings” were being used, etc. Then we spotted a query that was setting
client_min_messages determines which messages are reported back to the client. Usually it’s set to
NOTICE. Postgres considers unique constraint violations to be
ERRORs, which are below
PANIC so were not being reported. The Ruby Postgres driver was issuing a query, and expecting either a normal result or an error. However, because errors were disabled, it was getting nothing back, causing it to return
nil. Finally, we found the issue!
But why was ActiveRecord disabling errors? It turns out that in Postgres 8.1
standard_conforming_strings was read-only, and any attempt to set it would result in an error. ActiveRecord enabled the setting if it was available, but didn’t want to show the error if it wasn’t available or was read-only. The solution was to set
PANIC, then try to set
standard_conforming_strings, then reset
client_min_messagesback to its original value. Fortunately, Rails has since dropped support for Postgres 8.1, so the fix was easy: simply remove the queries that modify
client_min_messages and assume that
standard_conforming_strings isn't read-only. Our patch that changes this has been present in Rails since v4.2.5.
Even though most of the settings appear innocuous, they could still cause issues when combined with transaction pooling. For instance, you could end up with different connections in different timezones. We've brought this up on the rubyonrails-core list, but as of Dec '15 it's still an unresolved issue. ↩