Cluster allows you to sort a table by a specific index:
Imagine this table:
= TRUNCATE sale_fact;
dbTRUNCATE TABLE
-- Insert rows without sorting
= INSERT INTO sale_fact (username, sold_at)
db- SELECT
db- md5(random()::text) AS username,
db- '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db- FROM
db- generate_series(1, 100000)
dbINSERT 0 100000
= ANALYZE sale_fact;
dbANALYZE
= SELECT tablename, attname, correlation
db- FROM pg_stats
db- WHERE tablename = 'sale_fact';
db
tablename | attname | correlation-----------+-----------+----------------
-5.9702674e-05
sale_fact | sold_at | id | 1
sale_fact | 0.010033822 sale_fact | username |
We have 0 correlation on sold_at
, so if we query using a
WHERE
on sold_at
, we'll get a
Bitmap Heap Scan
, which will be much slower than an
Index Scan
.
Let's cluster on our sold_at
index.
= CLUSTER sale_fact USING sale_fact_sold_at_ix;
dbCLUSTER
= ANALYZE sale_fact;
dbANALYZE
= SELECT tablename, attname, correlation
db- FROM pg_stats
db- WHERE tablename = 'sale_fact';
db
tablename | attname | correlation-----------+----------+--------------
1
sale_fact | sold_at | id | -0.002239401
sale_fact | 0.013389298 sale_fact | username |
Clustering affects the correlatio of other columns, like
id
, which went from being a correlation of 1
to 0
. Try to avoid this by inserting data sorted, by
inserting the data with an ORDER BY
.
Compare that table above with a correlation table using an
ORDER BY
on an insert:
= SELECT tablename, attname, correlation
db- FROM pg_stats
db= WHERE tablename = 'sale_fact';
db
tablename | attname | correlation-----------+----------+----------------
id | 1
sale_fact | -0.00041992788
sale_fact | username | 1 sale_fact | sold_at |
Queries that use the id
index and sold_at
index will be much faster, whereas with a cluster, only queries on
sold_at
will be fast.