Load Sorted Data
Assume we have a large sales fact table:
db=# CREATE TABLE sale_fact (id serial, username text, sold_at date);
CREATE TABLE
Let's say we load data to the table every night:
db=# INSERT INTO sale_fact (username, sold_at)
db-#     SELECT
db-#         md5(random()::text) AS username,
db-#         '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db-#     FROM
db-#         generate_series(1, 100000);
INSERT 0 100000
db=# VACUUM ANALYZE sale_fact;
VACUUM
We index for fast querying:
db=# CREATE INDEX sale_fact_sold_at_ix ON sale_fact(sold_at);
CREATE INDEX
And then we use the index in a query:
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';

                                     QUERY PLAN
------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sale_fact  (cost=108.30..1107.69 rows=4293 width=41)
   Recheck Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
   Heap Blocks: exact=927
   ->  Bitmap Index Scan on sale_fact_sold_at_ix  (cost=0.00..107.22 rows=4293 width=0)
         Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
 Planning Time: 0.191 ms
 Execution Time: 5.906 ms
This query does a Bitmap Heap Scan, which has two parts:

- Bitmap Index Scan: goes through the entire index and maps all the table pages that contain relevant rows.
- Bitmap Heap Scan: reads the pages that contain relevant rows, and checks each row to make sure it satisfies the condition.

This takes ~6ms to execute.
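To make the two phases concrete, here is a toy model of a bitmap scan (a sketch only; PostgreSQL's real implementation is in C and far more involved; the table, index, and function here are invented for illustration):

```python
# A "table" stored as pages of (sold_at, username) rows.
table = [
    [(1, "a"), (9, "b"), (2, "c"), (8, "d")],   # page 0
    [(3, "e"), (7, "f"), (1, "g"), (9, "h")],   # page 1
    [(2, "i"), (8, "j"), (3, "k"), (7, "l")],   # page 2
]

# An "index" mapping sold_at -> list of (page, offset) pointers.
index = {}
for page_no, page in enumerate(table):
    for offset, (sold_at, _) in enumerate(page):
        index.setdefault(sold_at, []).append((page_no, offset))

def bitmap_scan(lo, hi):
    # Phase 1: Bitmap Index Scan - walk the index and collect the
    # set of pages that contain at least one matching row.
    pages = {page_no
             for key, pointers in index.items() if lo <= key <= hi
             for page_no, _ in pointers}
    # Phase 2: Bitmap Heap Scan - read each marked page and recheck
    # the condition on every row in it.
    return [row for page_no in sorted(pages)
            for row in table[page_no]
            if lo <= row[0] <= hi]

print(bitmap_scan(1, 3))
```

Note that because the data is unsorted, even a narrow range forces the scan to mark and read almost every page.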
We can make this better by sorting the data after loading it:
db=# TRUNCATE sale_fact;
TRUNCATE TABLE
db=# INSERT INTO sale_fact (username, sold_at)
db-#     SELECT
db-#         md5(random()::text) AS username,
db-#         '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db-#     FROM
db-#         generate_series(1, 100000)
db-#     ORDER BY sold_at;
INSERT 0 100000

db=# VACUUM ANALYZE sale_fact;
VACUUM
We include an ORDER BY on the column we query, so rows are inserted into the table already sorted by sold_at.
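Why does the physical order matter? A quick simulation (a sketch, not from the original example; the page size and "month" range here are invented) shows how many "pages" a range query must visit when the same rows are stored unsorted versus sorted:

```python
import random

random.seed(0)
PAGE_SIZE = 100  # rows per "page" in this toy model

# 100,000 day-numbers over two years, like sold_at above.
values = [random.randint(1, 730) for _ in range(100_000)]

def pages_touched(rows, lo, hi):
    # A row at position i lives on page i // PAGE_SIZE;
    # collect the distinct pages that hold a matching row.
    return len({i // PAGE_SIZE for i, v in enumerate(rows) if lo <= v <= hi})

unsorted_pages = pages_touched(values, 182, 212)         # ~one month, unsorted
sorted_pages = pages_touched(sorted(values), 182, 212)   # same month, sorted

print(unsorted_pages, sorted_pages)
```

With unsorted data the matching rows are smeared across nearly every page; after sorting they are packed into a few dozen consecutive pages.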
When querying now we see this:
db=# EXPLAIN (ANALYZE)
db-# SELECT *
db-# FROM sale_fact
db-# WHERE sold_at BETWEEN '2020-07-01' AND '2020-07-31';

                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Index Scan using sale_fact_sold_at_ix on sale_fact  (cost=0.29..184.73 rows=4272 width=41)
   Index Cond: ((sold_at >= '2020-07-01'::date) AND (sold_at <= '2020-07-31'::date))
 Planning Time: 0.145 ms
 Execution Time: 2.294 ms
This query now uses only an Index Scan and trusts that the data is in order on disk. Reading data in order is almost always faster: think of a range query over a sorted vector versus an unsorted one.
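The sorted-vector intuition can be sketched with Python's bisect module (an illustration, not part of the original example): over sorted data, a range query jumps straight to the first match and stops at the last, instead of checking every element.

```python
import bisect
import random

random.seed(1)
data = sorted(random.randint(1, 730) for _ in range(100_000))

lo, hi = 182, 212
start = bisect.bisect_left(data, lo)   # first index with value >= lo
stop = bisect.bisect_right(data, hi)   # first index with value > hi
in_range = data[start:stop]            # one contiguous slice

# Same result as a full scan, but without touching the other rows.
print(len(in_range))
```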
This works because of correlation:
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk.
A correlation of 1 means rows on disk are stored in the same order as the column's values; a correlation of 0 means matching values are scattered across the table's pages.
We can query the correlation of sold_at by looking at pg_stats for the table:
Before inserting it sorted:
db=# SELECT tablename, attname, correlation
db-# FROM pg_stats
db-# WHERE tablename = 'sale_fact';

 tablename | attname  |  correlation
-----------+----------+--------------
 sale_fact | id       |            1
 sale_fact | username | -0.005344716
 sale_fact | sold_at  | -0.011389783
Afterwards:
db=# SELECT tablename, attname, correlation
db-# FROM pg_stats
db-# WHERE tablename = 'sale_fact';

 tablename | attname  |   correlation
-----------+----------+----------------
 sale_fact | id       |              1
 sale_fact | username | -0.00041992788
 sale_fact | sold_at  |              1
We can see that after the sorted load, the correlation of sold_at went from near 0 to 1. When a column's correlation is close to 1 or -1, the planner is more likely to choose an Index Scan over a Bitmap Heap Scan for range queries on that column, and that plan is likely to be faster.