Cluster
Cluster allows you to sort a table by a specific index:
Imagine this table:
db= TRUNCATE sale_fact;
TRUNCATE TABLE
-- Insert rows without sorting
db= INSERT INTO sale_fact (username, sold_at)
db- SELECT
db- md5(random()::text) AS username,
db- '2020-01-01'::date + (interval '1 day') * round(random() * 365 * 2) AS sold_at
db- FROM
db- generate_series(1, 100000)
INSERT 0 100000
db= ANALYZE sale_fact;
ANALYZE
db= SELECT tablename, attname, correlation
db- FROM pg_stats
db- WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+-----------+----------------
sale_fact | sold_at | -5.9702674e-05
sale_fact | id | 1
sale_fact | username | 0.010033822
We have 0 correlation on sold_at
, so if we query using a WHERE
on
sold_at
, we’ll get a Bitmap Heap Scan
, which will be much slower
than an Index Scan
.
Let’s cluster on our sold_at
index.
db= CLUSTER sale_fact USING sale_fact_sold_at_ix;
CLUSTER
db= ANALYZE sale_fact;
ANALYZE
db= SELECT tablename, attname, correlation
db- FROM pg_stats
db- WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+----------+--------------
sale_fact | sold_at | 1
sale_fact | id | -0.002239401
sale_fact | username | 0.013389298
Clustering affects the correlatio of other columns, like id
, which
went from being a correlation of 1
to 0
. Try to avoid this by
inserting data sorted, by inserting the data with an ORDER BY
.
Compare that table above with a correlation table using an ORDER BY
on
an insert:
db= SELECT tablename, attname, correlation
db- FROM pg_stats
db= WHERE tablename = 'sale_fact';
tablename | attname | correlation
-----------+----------+----------------
sale_fact | id | 1
sale_fact | username | -0.00041992788
sale_fact | sold_at | 1
Queries that use the id
index and sold_at
index will be much faster,
whereas with a cluster, only queries on sold_at
will be fast.