Indexes On Columns With Low Selectivity
Assume you have a users
table:
= CREATE TABLE users (
db- id serial,
db- username text,
db- activated boolean
db-);
dbCREATE TABLE
Seed the table:
= INSERT INTO users (username, activated)
db- SELECT
db- md5(random()::text) AS username,
db- random() < 0.9 AS activated
db- FROM
db- generate_series(1, 1000000);
dbINSERT 0 1000000
= SELECT activated, count(*) FROM users GROUP BY activated;
dbcount
activated | -----------+--------
102567
f | 897433
t |
= VACUUM ANALYZE users;
db VACUUM
And let's index it:
= CREATE INDEX users_activated_ix ON users(activated);
dbCREATE INDEX
When querying unactivated users the database uses the index:
= EXPLAIN SELECT * FROM users WHERE NOT activated;
dbQUERY PLAN
--------------
Bitmap Heap Scan on users (cost=1923.32..11282.99 rows=102567 width=38)
Filter: (NOT activated)
-> Bitmap Index Scan on users_activated_ix (cost=0.00..1897.68 rows=102567 width=0)
Index Cond: (activated = false)
When you query for activated users, it doesn't use the index:
= EXPLAIN SELECT * FROM users WHERE activated;
dbQUERY PLAN
---------------------------------------------------------------
Scan on users (cost=0.00..18334.00 rows=897433 width=38)
Seq Filter: activated
This is because if Postgres deems that reading the index + the table is worse than reading the whole table, it won't use the index.
If you have a table that is 10MB and an index that is 1MB, and 90% of users are signed up, (10MB * .9) = 9MB + 1MB index vs. 10MB table read is about the same. So Postgres may choose to read the whole table instead.