Debugging the Postgres query planner

Written by

Last editedSep 2022

At GoCardless Postgres is our database of choice. Not only does it power our API, Postgres is used extensively in a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.

One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.

We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level-up our understanding of the Postgres query planner. We'll detail our investigation by covering:

What we saw
What went wrong
How we fixed it

We conclude with the actions we took to prevent this happening again.

What we saw

One day we started seeing unusual query performance from several sources: both async workers and API requests seemed to take longer to respond than was normal. Examining the problematic database queries, the common factor was that each query touched our payment transitions table (almost half a billion rows) and associated relations.

Looking further, we saw a concerning number of queries that were very long-running that normally execute in under 50ms. Now suspicious that we'd hit a poor query plan, it was time to find an exemplar query and dig in:

  SELECT *
    FROM payment_transitions
    JOIN payments
      ON payments.id = payment_transitions.payment_id
   WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
   LIMIT 1;

How many payments, how many transitions?

Debugging a query plan almost always follows the same pattern: take time to understand the query, identify why the plan you received is bad, then hypothesise an ideal plan that would be fast. That new plan often requires an index that is yet to be created. On the other hand, perhaps a fast plan doesn't exist for this query. Whatever the outcome, it's key to every step that you understand the shape of the data you're querying.

Our query references two tables, payments and payment_transitions. In this system every payment has states it can transition through¹, and each of those states is represented as a row in the payment_transitions table. We'll be filtering on a foreign key of payment_transitions called payout_id which marks that transition as having been included in a payout.

Approximately 20% of our payment transitions will be marked with a payout_id, and there are approximately 20 payment transitions per payout. We can reasonably expect the number of payout_id values to grow linearly with the size of our payment_transitions table.

Using approximate figures, if we have 350m payment transitions, we can expect 70m to be marked with a payout_id, and there would be almost 3.5m distinct payout_id values in the payment_transitions table. Finally, we have an index on the payout_id column that looks like this:

CREATE INDEX index_payment_transitions_on_payout_id
          ON payment_transitions
       USING btree (payout_id);

This should provide enough context for us to properly evaluate each potential plan for this query.

EXPLAIN the plan

Using the query we'd pinpointed as problematic, we ran an EXPLAIN in a Postgres prompt to display the selected query plan.

 EXPLAIN
  SELECT *
    FROM payment_transitions
    JOIN payments
      ON payments.id = payment_transitions.payment_id
   WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
   LIMIT 1;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.14..21700.47 rows=1 width=262)
   ->  Nested Loop  (cost=1.14..58045700.29 rows=2675 width=262)
         ->  Index Scan using payment_transitions_pkey on payment_transitions  (cost=0.57..58022604.77 rows=2688 width=262)
               Filter: (payout_id = 'PO00123456789Z'::text)
         ->  Index Scan using payments_pkey on payments  (cost=0.57..8.58 rows=1 width=14)
               Index Cond: ((id)::text = (payment_transitions.payment_id)::text)
(6 rows)

This query includes a join operation between our payments and payment_transitions table. Formally, a relational join is an operation on two sets - R and S - which will produce a result consisting of all combinations of tuples (tuple means a single row result) in R and S that are equal under a particular matching condition.

When joining two tables, Postgres employs one of three strategies: merge, hash or nested loop. The strategy Postgres has chosen for our query is nested loop, the most naive of join strategies. Nested loop joins will iterate over every tuple in payment_transitions and for each tuple scan the payments table for tuples that match the join condition, which in this case is payments.id = payment_transitions.payment_id. Our result will be all the tuples that satisfied our condition.

Looking at our plan, we're using the payment_transitions_pkey to select each transition tuple and for each transition that has a matching payout_id, we'll use an index lookup into the payments table to perform the join. The advantage of this query plan is that the first matching row we find using the payment_transitions_pkey index is guaranteed to match our query ordering² constraint (ORDER BY payment_transitions.id), so we can halt execution at this point as we only require a single tuple (LIMIT 1).

Sadly, this query plan is not going to work well for us. Recalling the underlying distribution of this data, for every payout we expect there to be approximately 20 matching transitions. If we assume that these matches are evenly distributed³ throughout the payment_transitions_pkey index (a fair assumption for query planning purposes) then we expect to scan 1/20th of the table before we find our first match.

Our real world example happens to be queries for recently created payout_id values, and given the payment_transitions.id is a monotonically increasing sequence, we can expect our matching transitions to be right at the end of our scan.

This is an example of how reasonable assumptions in theory can lead to pathological data access patterns in practice.

At 350m rows, this amounts to 17.5m rows we need to scan, or about 20GB⁴ of data. This plan will never match the performance we expect from this query, so something has gone deeply wrong.

What did Postgres expect

The calculations we just performed are very similar to how Postgres evaluates query plans. In fact, the statistics we've been quoting are tracked and updated regularly by Postgres through the auto-analyze process, and are known as the statistic values n_distinct and null_frac.

Having a measure for each column of the number of distinct values (n_distinct) and fraction of rows for which the column is null (null_frac) enables Postgres to compute the expected number of rows returned for a given query as approximately row_count * (1 - null_frac) / n_distinct⁵.

Looking at the explained output of our plan:

-> Index Scan using payment_transitions_pkey on payment_transitions  (cost=0.57..58022604.77 rows=2688 width=262)
     Filter: (payout_id = 'PO00123456789Z'::text)

We see that Postgres expected that 2688 payment transitions would match our filter condition on payout_id. Assuming this is a typical payout (it doesn't appear in Postgres' most common values⁵) then we've way over-estimated the number of transitions attached to the average payout, which should be about 20. When we look at our statistics for this column, we start to see some concerning numbers:

postgres=# select attname, n_distinct, null_frac from pg_stats where tablename='payment_transitions' and attname='payout_id';
  attname  | n_distinct | null_frac
-----------+------------+-----------
 payout_id |      25650 |      0.81

Plug these values into our formula from above and they imply that the number of payment_transitions we'll find that match our payout_id is 350m * (1 - 0.81) / 25650 = 2592. We know from our manual calculations that the average payout is associated with 20 payment_transitions, so Postgres' estimate of the number of distinct payouts is incorrect by two orders of magnitude (25k vs 3.5m). Such an error will prevent Postgres from making sane decisions when comparing plans.

Our ideal plan

Our ideal plan would be to fetch all matching payment transitions for our payout, then (knowing this will be a small number) perform an in-memory sort on the results, returning the transition with minimum ID. The initial fetching of matching transitions would be fast due to our index_payment_transitions_on_payout_id index.

The plan (with correct row estimations) would look something like this:

                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit  (rows=1)
  -> Sort  (rows=20)                                                        Sort Key: payment_transitions.id
      ->  Nested Loop  (rows=20)
          ->  Index Scan using index_payment_transitions_on_payout_id on payment_transitions  (rows=20)
                Index Cond: (payout_id = 'PO00123456789Z'::text)
          ->  Index Scan using payments_pkey on payments  (rows=1)
                Index Cond: ((id)::text = (payment_transitions.payment_id)::text)

Materializing all matching transitions for most payouts will be quick and the subsequent sort cheap, as on average there will be so few of them. This is what we want the query plan to produce but our statistics meant Postgres vastly overestimated the cost of our sort, opting for the much more expensive primary key scan.

What went wrong?

Postgres' planner has made a choice to use a query plan that could potentially require far more data that the alternative, given it believes the chance of an early exit - finding a payment_transition with a matching payout_id - will be high. We can see in the planner code exactly why this has happened and how the decision was made:

/* src/backend/optimizer/util/pathnode.c:3414
 * 'count_est' is the estimated value of the LIMIT expression
 */
LimitPath *
create_limit_path(
  PlannerInfo *root, RelOptInfo *rel, Path *subpath,
  Node *limitOffset, Node *limitCount,
  int64 offset_est, int64 count_est)
{
  ...

  if (count_est != 0)
  {
    if (subpath->rows > 0)
      // subpath->rows is an estimated row count of our
      // source query, while count_est will be the LIMIT
      // value when known. This means we're discounting our
      // cost by the fraction of the source query we expect
      // to use.
      pathnode->path.total_cost = pathnode->path.startup_cost +
        (subpath->total_cost - subpath->startup_cost)
        * count_est / subpath->rows;
    ...
  }
}

This code is taken from src/backend/optimizer/util/pathnode.c which contains utilities to prune and optimise possible query plans (paths). This extract is where we calculate the cost of a plan that has a limit, as this will mean some optimisations might be possible.

The most impactful optimisation for limit queries is to exit early. If your query has a very small limit, then Postgres should stop execution as soon as the limit is satisfied rather than continue scanning all your data. When a limit is combined with an ordering, exiting early becomes possible for candidate plans provided they find rows an the order that matches the sort clause.

Recall that our query is asking for a single (LIMIT 1) payment_transitions row sorted by id. If Postgres finds our payment_transitions by searching the payment_transitions_pkey index then we'll find our results in order of their id, meaning the first tuple we return that matches our query's WHERE clause will satisfy our query and we can exit early.

The possibility of early exit means we can discount the total cost of any already-sorted plans by count_est / subpath->rows, as - assuming our matching tuples are found evenly distributed across our search space - this is the fraction of the plan we'll need to execute before we've produced all tuples we require.

In our case, our count_est is 1, and our subpath->rows is high (2592) due to our underestimation of n_distinct for payout_id. Small numerator and large denominator means the discount is huge and is why the nested join through the payment_transitions_pkey index was chosen as the best plan.

How we fixed it

As soon as we realised our statistics were causing such a poor query plan we re-ran an analyze to cause Postgres to resample. We had to do this a few times before our plans got better, which hints at the more concerning root cause of this problem.

When Postgres runs an analyze, it takes a sample of the table to use for generating statistics. There are several subtleties around how Postgres samples the table that can impact the accuracy of the tables statistics, and at present it's not possible to solve all of them.

Sample size

The Postgres GUC (Grand Unified Configuration) variable default_statistics_target defines the default sample size Postgres uses for computing statistics, as well as setting the number of most common values to track for each column. The default value is 100, which means "take samples of 100 * 300 (magic number) pages when running an analyze", then sample randomly from amongst the rows included in these pages.

But how large a sample is large enough? The n-distinct estimator used by Postgres is from IBM Research Report RJ 10025 (Haas and Stokes⁶), where the authors discuss the bias and error that is expected from the estimator given various sample sizes and underlying data characteristics.

In their analysis of the estimator, they note that it has been proven by Bunge and Fitzpatrick (Estimating the Number of Species: A Review⁷) that unbiased estimators do not exist when the sample size is smaller than the count of the most frequently occurring value in the population. The bias in these estimators is significant (anywhere up-to 80%) and small sample sizes will cause the bias to increase.

The estimator bias is always negative, meaning we estimate fewer distinct values than are actually present - this could explain the underestimation leading to our query plan malfunction. There can be anywhere up to 100k payment_transitions with the same payout_id value, so at minimum we should sample as many transition rows to guarantee we'll find more than one distinct payout_id value. As ~80% of payout_ids are NULL, we require 100k / (1 - 0.8) = 500k rows, or 1666 as a statistics target (recalling that Postgres multiplies your statistic target by 300 to find the desired number of samples).

We bumped the statistics target for this column like so:

ALTER TABLE payment_transitions
ALTER COLUMN payout_id SET STATISTICS 1666;

And repeatedly ran analyzes, checking the n_distinct value at each run. While the values were slightly better than what we'd seen before we continued to see large variation and massive underestimation. It wasn't until we bumped our target to 5000 that the value became stable.

Not all samples are created equal

We expected that a moderate bump (say to 500) of our sample size would produce markedly better results than the default 100, but this turned out not to be the case. Careful thought about how Postgres generates our random sample lead to the conclusion that we were unduly biasing our estimator by taking a fair, random sample from a statistically biased selection of pages.

Postgres generates its samples in a two stage process⁸: if we want to collect a sample of 100k rows, we'll first gather 100k pages and then collect our sample from those pages. It is not the case that every table tuple has the same probability of appearing in our sample, as we're confined to the pages we selected in our first pass. Ideally this shouldn't be a problem, assuming column values are distributed independently amongst pages, but in our case (and we suspect many others) this is not true.

Our system creates all payment_transitions for the same payout_id in one sweep. The payment_transitions table is mostly append-only, so Postgres is prone to place all those new transitions physically adjacent to one another, sharing the same pages. If we take our random sample from a restricted set of pages we've vastly increased the probability of sampling a value multiple times in comparison to selecting from the entire table.

We can confirm this bias by using Postgres table samples to compute our statistics with a system strategy (approximates our analyze process) vs statistically fair sampling with bernoulli. A statistics target of 500 means we'd sample 150k (300 * 500) rows, which is 0.043% of our table. Using this sample size and comparing the two sampling methods we can see a stark difference:

postgres=# select count(*) as sampled,
                  count(distinct(payout_id)) as unique
           from payment_transitions tablesample system(0.043);
 sampled | unique
---------+--------
  153667 |  11029

postgres=# select count(*) as sampled,
                  count(distinct(payout_id)) as unique
           from payment_transitions tablesample bernoulli(0.043);
 sampled | unique
---------+--------
  153667 |  25351

Sample everything?

Everything we've explained so far might have you asking why you would ever not use the maximum sample size. If our sample is the whole table then we can't have statistical bias, right?

Higher statistic targets will increase the amount of time required to perform an analyze on a table. The Postgres auto-vacuum daemon is constantly triggering ANALYZE's in response to database activity, and the ACCESS SHARE locks they take on their table can block incoming ALTER TABLE commands. This can be a problem if you frequently want to change table structure, or don't take care to timeout the lock to avoid blocking other incoming queries⁹.

One non-obvious consequence is how this affects upgrading your database. Postgres major upgrades will reset all statistic values. If your database depends on good statistics to perform (hint: it probably does!) then a full ANALYZE will be required before you can accept production traffic. The longer your ANALYZE takes, the longer you'll be down.

For this and other reasons you may want to hold off from dialing the sample size to max. With the exception of the n-distinct estimator we've covered in this post, our experience has shown Postgres to make very good decisions with normal sample sizes. More varied data shapes than our payment_transitions table may be better suited to restructuring¹⁰ than hacking the statistics engine.

Conclusions

Postgres has a fantastic query planner that can help scale a database far beyond the size of the average organisation. It also provides great tools and documentation that can help you deal with most performance issues that arise with growth, and the flexibility via configuration values to handle most use cases.

That said, the heuristics powering the query planner can cause Postgres to make decisions that flip performance on its head. In this production issue we saw a normally fast query degrade in a spectacular fashion, and it was only after peeling back a few layers that we began to understand why it happened.

We've covered some ways you can configure Postgres to adapt beyond the size of the average database, and explained some trade-offs involved. We choose to increase our statistics target to provide better information to the query planner, but we're also pushing for more careful thought around data access patterns to reduce the complexity of each query, making it less likely for the planner to make a mistake.

Hopefully this is a useful case-study for those who want to learn more about the query planner. If you have feedback or questions about this article we'd love to hear from you @GoCardlessEng.

We use Statesman to model state machines in Postgres, where each transition is a new row.↩
The payment_transitions_pkey index contains references to payment_transitions tuples in order of payment_transitions.id. This is why the first result from scanning our transitions using this index is guaranteed to have the minimum id value.↩
In our case, the assumption that payout_id values are evenly distributed with respect to the payment_transitions.id is going to be terrible for us.↩
Scanning such a large amount of data is not only going to make the current query slow but will have a large performance impact on the rest of your database. Scans like these are likely to read old data that is not currently in our page cache, causing eviction of pages that are needed for other on-going queries. It's worth bearing this in mind when your database has strict performance requirements and depends on hot data being cached to meet them.↩
In practice, Postgres also adjusts for the known most common values and their histogram bounds, allowing the computation to take into account statistical outliers. In this example, we can safely ignore these histogram bounds because the most common values cover only a small percentage of the table.↩
http://almaden.ibm.com/cs/people/peterh/jasa3rj.pdf ↩
https://www.jstor.org/stable/2290733 ↩
See src/backend/commands/analyze.c for an explanation of Postgres' two-phase sample strategy. The comment also mentions some flaws of this approach.↩
See our post around lock timeouts to avoid causing disruption when taking locks in Postgres↩
Table partitioning- either declarative or by inheritance- can help manage large, heterogenous datasets. When all the rows look similar there will be fewer candidate plans and less chance of a disasterous mis-step.↩