# Debugging the Postgres query planner

Last editedSep 2022

At GoCardless Postgres is our database of choice. Not only does it power our API, Postgres is used extensively in a number of internal services. We love it so much we just won't shut up about it, even when things go wrong.

One of the reasons we love it so much is how easily you can dig into Postgres' implementation when there are issues. The docs are great and the code exceptionally readable. These resources have been invaluable while scaling our primary database to the ~2TB we now run; no doubt they will continue to provide value as our organisation grows.

We at GoCardless believe that failure can be a great learning opportunity, and nothing proves that more than the amount we've learned from Postgres issues. This post shares a specific issue we encountered that helped us level-up our understanding of the Postgres query planner. We'll detail our investigation by covering:

We conclude with the actions we took to prevent this happening again.

## What we saw

One day we started seeing unusual query performance from several sources: both async workers and API requests seemed to take longer to respond than was normal. Examining the problematic database queries, the common factor was that each query touched our payment transitions table (almost half a billion rows) and associated relations.

Looking further, we saw a concerning number of queries that were very long-running that normally execute in under 50ms. Now suspicious that we'd hit a poor query plan, it was time to find an exemplar query and dig in:

```
SELECT *
FROM payment_transitions
JOIN payments
ON payments.id = payment_transitions.payment_id
WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
LIMIT 1;
```

### How many payments, how many transitions?

Debugging a query plan almost always follows the same pattern: take time to understand the query, identify why the plan you received is bad, then hypothesise an ideal plan that would be fast. That new plan often requires an index that is yet to be created. On the other hand, perhaps a fast plan doesn't exist for this query. Whatever the outcome, it's key to every step that you understand the shape of the data you're querying.

Our query references two tables, `payments`

and `payment_transitions`

. In this
system every payment has states it can transition through^{1}, and each of those
states is represented as a row in the `payment_transitions`

table. We'll be
filtering on a foreign key of `payment_transitions`

called `payout_id`

which
marks that transition as having been included in a payout.

Approximately 20% of our payment transitions will be marked with a `payout_id`

,
and there are approximately 20 payment transitions per payout. We can reasonably
expect the number of `payout_id`

values to grow linearly with the size of our
`payment_transitions`

table.

Using approximate figures, if we have 350m payment transitions, we can expect
70m to be marked with a `payout_id`

, and there would be almost 3.5m distinct
`payout_id`

values in the `payment_transitions`

table. Finally, we have an index
on the `payout_id`

column that looks like this:

```
CREATE INDEX index_payment_transitions_on_payout_id
ON payment_transitions
USING btree (payout_id);
```

This should provide enough context for us to properly evaluate each potential plan for this query.

### EXPLAIN the plan

Using the query we'd pinpointed as problematic, we ran an `EXPLAIN`

in a
Postgres prompt to display the selected query plan.

```
EXPLAIN
SELECT *
FROM payment_transitions
JOIN payments
ON payments.id = payment_transitions.payment_id
WHERE payment_transitions.payout_id = 'PO00123456789Z'
ORDER BY payment_transitions.id ASC
LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=1.14..21700.47 rows=1 width=262)
-> Nested Loop (cost=1.14..58045700.29 rows=2675 width=262)
-> Index Scan using payment_transitions_pkey on payment_transitions (cost=0.57..58022604.77 rows=2688 width=262)
Filter: (payout_id = 'PO00123456789Z'::text)
-> Index Scan using payments_pkey on payments (cost=0.57..8.58 rows=1 width=14)
Index Cond: ((id)::text = (payment_transitions.payment_id)::text)
(6 rows)
```

This query includes a join operation between our `payments`

and
`payment_transitions`

table. Formally, a relational join is an operation on two
sets - R and S - which will produce a result consisting of all combinations of
tuples (tuple means a single row result) in R and S that are equal under a
particular matching condition.

When joining two tables, Postgres employs one of three
strategies:
merge, hash or nested loop. The strategy Postgres has chosen for our query is
nested loop, the most naive of join strategies. Nested loop joins will iterate
over every tuple in `payment_transitions`

and for each tuple scan the `payments`

table for tuples that match the join condition, which in this case is
`payments.id = payment_transitions.payment_id`

. Our result will be all the
tuples that satisfied our condition.

Looking at our plan, we're using the `payment_transitions_pkey`

to select each
transition tuple and for each transition that has a matching `payout_id`

, we'll
use an index lookup into the `payments`

table to perform the join. The advantage
of this query plan is that the first matching row we find using the
`payment_transitions_pkey`

index is guaranteed to match our query ordering^{2}
constraint (`ORDER BY payment_transitions.id`

), so we can halt execution at this
point as we only require a single tuple (`LIMIT 1`

).

Sadly, this query plan is not going to work well for us. Recalling the
underlying distribution of this data, for every payout we expect there to be
approximately 20 matching transitions. If we assume that these matches are
evenly distributed^{3} throughout the `payment_transitions_pkey`

index (a fair
assumption for query planning purposes) then we expect to scan 1/20th of the
table before we find our first match.

Our real world example happens to be queries for recently created `payout_id`

values, and given the `payment_transitions.id`

is a monotonically increasing
sequence, we can expect our matching transitions to be right at the end of our
scan.

This is an example of how reasonable assumptions in theory can lead to pathological data access patterns in practice.

At 350m rows, this amounts to 17.5m rows we need to scan, or about 20GB^{4} of
data. This plan will never match the performance we expect from this query, so
something has gone deeply wrong.

### What did Postgres expect

The calculations we just performed are very similar to how Postgres evaluates
query plans. In fact, the statistics we've been quoting are tracked and updated
regularly by Postgres through the auto-analyze process, and are
known as the statistic values `n_distinct`

and `null_frac`

.

Having a measure for each column of the number of distinct values (`n_distinct`

)
and fraction of rows for which the column is null (`null_frac`

) enables Postgres
to compute the expected number of rows returned for a given query as
approximately `row_count * (1 - null_frac) / n_distinct`

^{5}.

Looking at the explained output of our plan:

```
-> Index Scan using payment_transitions_pkey on payment_transitions (cost=0.57..58022604.77 rows=2688 width=262)
Filter: (payout_id = 'PO00123456789Z'::text)
```

We see that Postgres expected that 2688 payment transitions would match our
filter condition on `payout_id`

. Assuming this is a typical payout (it doesn't
appear in Postgres' most common values^{5}) then we've way over-estimated the
number of transitions attached to the average payout, which should be about 20.
When we look at our statistics for this column, we start to see some concerning
numbers:

```
postgres=# select attname, n_distinct, null_frac from pg_stats where tablename='payment_transitions' and attname='payout_id';
attname | n_distinct | null_frac
-----------+------------+-----------
payout_id | 25650 | 0.81
```

Plug these values into our formula from above and they imply that the number of
`payment_transitions`

we'll find that match our `payout_id`

is 350m * (1 - 0.81)
/ 25650 = 2592. We know from our manual calculations that the average payout is
associated with 20 `payment_transitions`

, so Postgres' estimate of the number of
distinct payouts is incorrect by two orders of magnitude (25k vs 3.5m). Such an
error will prevent Postgres from making sane decisions when comparing plans.

### Our ideal plan

Our ideal plan would be to fetch all matching payment transitions for our
payout, then (knowing this will be a small number) perform an in-memory sort on
the results, returning the transition with minimum ID. The initial fetching of
matching transitions would be fast due to our
`index_payment_transitions_on_payout_id`

index.

The plan (with correct row estimations) would look something like this:

```
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (rows=1)
-> Sort (rows=20) Sort Key: payment_transitions.id
-> Nested Loop (rows=20)
-> Index Scan using index_payment_transitions_on_payout_id on payment_transitions (rows=20)
Index Cond: (payout_id = 'PO00123456789Z'::text)
-> Index Scan using payments_pkey on payments (rows=1)
Index Cond: ((id)::text = (payment_transitions.payment_id)::text)
```

Materializing all matching transitions for most payouts will be quick and the subsequent sort cheap, as on average there will be so few of them. This is what we want the query plan to produce but our statistics meant Postgres vastly overestimated the cost of our sort, opting for the much more expensive primary key scan.

## What went wrong?

Postgres' planner has made a choice to use a query plan that could potentially
require far more data that the alternative, given it believes the chance of an
early exit - finding a `payment_transition`

with a matching `payout_id`

- will
be high. We can see in the planner code exactly why this has happened and how
the decision was made:

```
/* src/backend/optimizer/util/pathnode.c:3414
* 'count_est' is the estimated value of the LIMIT expression
*/
LimitPath *
create_limit_path(
PlannerInfo *root, RelOptInfo *rel, Path *subpath,
Node *limitOffset, Node *limitCount,
int64 offset_est, int64 count_est)
{
...
if (count_est != 0)
{
if (subpath->rows > 0)
// subpath->rows is an estimated row count of our
// source query, while count_est will be the LIMIT
// value when known. This means we're discounting our
// cost by the fraction of the source query we expect
// to use.
pathnode->path.total_cost = pathnode->path.startup_cost +
(subpath->total_cost - subpath->startup_cost)
* count_est / subpath->rows;
...
}
}
```

This code is taken from `src/backend/optimizer/util/pathnode.c`

which contains utilities to prune and optimise possible query plans (paths).
This extract is where we calculate the cost of a plan that has a limit, as this
will mean some optimisations might be possible.

The most impactful optimisation for limit queries is to exit early. If your query has a very small limit, then Postgres should stop execution as soon as the limit is satisfied rather than continue scanning all your data. When a limit is combined with an ordering, exiting early becomes possible for candidate plans provided they find rows an the order that matches the sort clause.

Recall that our query is asking for a single (`LIMIT 1`

) `payment_transitions`

row sorted by `id`

. If Postgres finds our `payment_transitions`

by searching the
`payment_transitions_pkey`

index then we'll find our results in order of their
`id`

, meaning the first tuple we return that matches our query's `WHERE`

clause
will satisfy our query and we can exit early.

The possibility of early exit means we can discount the total cost of any
already-sorted plans by `count_est / subpath->rows`

, as - assuming our matching
tuples are found evenly distributed across our search space - this is the
fraction of the plan we'll need to execute before we've produced all tuples we
require.

In our case, our `count_est`

is 1, and our `subpath->rows`

is high (2592) due to
our underestimation of `n_distinct`

for `payout_id`

. Small numerator and large
denominator means the discount is huge and is why the nested join through the
`payment_transitions_pkey`

index was chosen as the best plan.

## How we fixed it

As soon as we realised our statistics were causing such a poor query plan we re-ran an analyze to cause Postgres to resample. We had to do this a few times before our plans got better, which hints at the more concerning root cause of this problem.

When Postgres runs an analyze, it takes a sample of the table to use for generating statistics. There are several subtleties around how Postgres samples the table that can impact the accuracy of the tables statistics, and at present it's not possible to solve all of them.

### Sample size

The Postgres GUC (Grand Unified Configuration) variable
`default_statistics_target`

defines the default sample size Postgres uses for
computing statistics, as well as setting the number of most common values to
track for each column. The default value is 100, which means "take samples of
100 * 300 (magic number) pages when running an analyze", then sample randomly
from amongst the rows included in these pages.

But how large a sample is large enough? The n-distinct estimator used by
Postgres is from IBM Research Report RJ 10025 (Haas and Stokes^{6}), where the
authors discuss the bias and error that is expected from the estimator given
various sample sizes and underlying data characteristics.

In their analysis of the estimator, they note that it has been proven by Bunge
and Fitzpatrick (Estimating the Number of Species: A Review^{7}) that unbiased
estimators do not exist when the sample size is smaller than the count of the
most frequently occurring value in the population. The bias in these estimators
is significant (anywhere up-to 80%) and small sample sizes will cause the bias
to increase.

The estimator bias is always negative, meaning we estimate fewer distinct values
than are actually present - this could explain the underestimation leading to
our query plan malfunction. There can be anywhere up to 100k
`payment_transitions`

with the same `payout_id`

value, so at minimum we should
sample as many transition rows to guarantee we'll find more than one distinct
`payout_id`

value. As ~80% of `payout_id`

s are NULL, we require 100k / (1 - 0.8)
= 500k rows, or 1666 as a statistics target (recalling that Postgres multiplies
your statistic target by 300 to find the desired number of samples).

We bumped the statistics target for this column like so:

```
ALTER TABLE payment_transitions
ALTER COLUMN payout_id SET STATISTICS 1666;
```

And repeatedly ran analyzes, checking the `n_distinct`

value at each run. While
the values were slightly better than what we'd seen before we continued to see
large variation and massive underestimation. It wasn't until we bumped our
target to 5000 that the value became stable.

### Not all samples are created equal

We expected that a moderate bump (say to 500) of our sample size would produce markedly better results than the default 100, but this turned out not to be the case. Careful thought about how Postgres generates our random sample lead to the conclusion that we were unduly biasing our estimator by taking a fair, random sample from a statistically biased selection of pages.

Postgres generates its samples in a two stage process^{8}: if we want to collect
a sample of 100k rows, we'll first gather 100k pages and then collect our sample
from those pages. It is not the case that every table tuple has the same
probability of appearing in our sample, as we're confined to the pages we
selected in our first pass. Ideally this shouldn't be a problem, assuming column
values are distributed independently amongst pages, but in our case (and we
suspect many others) this is not true.

Our system creates all `payment_transitions`

for the same `payout_id`

in one
sweep. The `payment_transitions`

table is mostly append-only, so Postgres is
prone to place all those new transitions physically adjacent to one another,
sharing the same pages. If we take our random sample from a restricted set of
pages we've vastly increased the probability of sampling a value multiple times
in comparison to selecting from the entire table.

We can confirm this bias by using Postgres table samples to compute our statistics with a system strategy (approximates our analyze process) vs statistically fair sampling with bernoulli. A statistics target of 500 means we'd sample 150k (300 * 500) rows, which is 0.043% of our table. Using this sample size and comparing the two sampling methods we can see a stark difference:

```
postgres=# select count(*) as sampled,
count(distinct(payout_id)) as unique
from payment_transitions tablesample system(0.043);
sampled | unique
---------+--------
153667 | 11029
postgres=# select count(*) as sampled,
count(distinct(payout_id)) as unique
from payment_transitions tablesample bernoulli(0.043);
sampled | unique
---------+--------
153667 | 25351
```

### Sample everything?

Everything we've explained so far might have you asking why you would ever not use the maximum sample size. If our sample is the whole table then we can't have statistical bias, right?

Higher statistic targets will increase the amount of time required to perform an
analyze on a table. The Postgres auto-vacuum daemon is constantly triggering
`ANALYZE`

's in response to database activity, and the `ACCESS SHARE`

locks they
take on their table can block incoming `ALTER TABLE`

commands. This can be a
problem if you frequently want to change table structure, or don't take care to
timeout the lock to avoid blocking other incoming queries^{9}.

One non-obvious consequence is how this affects upgrading your database.
Postgres major upgrades will reset all statistic values. If your database
depends on good statistics to perform (hint: it probably does!) then a full
`ANALYZE`

will be required before you can accept production traffic. The longer
your `ANALYZE`

takes, the longer you'll be down.

For this and other reasons you may want to hold off from dialing the sample size
to max. With the exception of the n-distinct estimator we've covered in this
post, our experience has shown Postgres to make very good decisions with normal
sample sizes. More varied data shapes than our `payment_transitions`

table may
be better suited to restructuring^{10} than hacking the statistics engine.

## Conclusions

Postgres has a fantastic query planner that can help scale a database far beyond the size of the average organisation. It also provides great tools and documentation that can help you deal with most performance issues that arise with growth, and the flexibility via configuration values to handle most use cases.

That said, the heuristics powering the query planner can cause Postgres to make decisions that flip performance on its head. In this production issue we saw a normally fast query degrade in a spectacular fashion, and it was only after peeling back a few layers that we began to understand why it happened.

We've covered some ways you can configure Postgres to adapt beyond the size of the average database, and explained some trade-offs involved. We choose to increase our statistics target to provide better information to the query planner, but we're also pushing for more careful thought around data access patterns to reduce the complexity of each query, making it less likely for the planner to make a mistake.

Hopefully this is a useful case-study for those who want to learn more about the query planner. If you have feedback or questions about this article we'd love to hear from you @GoCardlessEng.

- We use Statesman to model state machines in Postgres, where each transition is a new row.↩
- The
`payment_transitions_pkey`

index contains references to`payment_transitions`

tuples in order of`payment_transitions.id`

. This is why the first result from scanning our transitions using this index is guaranteed to have the minimum`id`

value.↩ - In our case, the assumption that
`payout_id`

values are evenly distributed with respect to the`payment_transitions.id`

is going to be terrible for us.↩ - Scanning such a large amount of data is not only going to make the current query slow but will have a large performance impact on the rest of your database. Scans like these are likely to read old data that is not currently in our page cache, causing eviction of pages that are needed for other on-going queries. It's worth bearing this in mind when your database has strict performance requirements and depends on hot data being cached to meet them.↩
- In practice, Postgres also adjusts for the known most common values and their histogram bounds, allowing the computation to take into account statistical outliers. In this example, we can safely ignore these histogram bounds because the most common values cover only a small percentage of the table.↩
- http://almaden.ibm.com/cs/people/peterh/jasa3rj.pdf↩
- https://www.jstor.org/stable/2290733↩
- See src/backend/commands/analyze.c for an explanation of Postgres' two-phase sample strategy. The comment also mentions some flaws of this approach.↩
- See our post around lock timeouts to avoid causing disruption when taking locks in Postgres↩
- Table partitioning- either declarative or by inheritance- can help manage large, heterogenous datasets. When all the rows look similar there will be fewer candidate plans and less chance of a disasterous mis-step.↩