Covering Indexes in Postgres
Covering indexes are where all the columns needed are in the index and, in theory, only the index needs to be accessed. This seemingly straight forward strategy is much tricker on Postgres.
Let’s take TPC-H query 6:
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= date '1996-01-01' and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date) and l_discount between 0.06 - 0.01 and 0.06 + 0.01 and l_quantity < 24;
we’ve got 3 predicate filter columns
None of them are that selective but all three are pretty good
select count(*) from lineitem where l_shipdate >= date '1996-01-01' and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
select count(*) from lineitem where l_discount between 0.06 - 0.01 and 0.06 + 0.01;
select count(*) from lineitem where l_quantity < 24;
select count(*) from lineitem where l_shipdate >= date '1996-01-01' and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date) and l_discount between 0.06 - 0.01 and 0.06 + 0.01 and l_quantity < 24;
select count(*) from lineitem;
228,160/11,997,996 = 1.9%
With the index added on all 3 fields, the optimizer doesn’t even use it!
I forced the optimizer to use the index (set enable_seqscan=off;) , then yes, the improvement is small, from ~800+ ms to 600+ ms mainly because the index results have to go back to the table.
I made a covering index with all the fields in the query and it still goes back to the table!
Now, if I do a vacuum and sure enough, the covering index works and get a 6x speedup! ( 870ms to 140ms )
What does the vacuum have to do with the covering index?
From Jim Nasby:
The issue with covering indexes is that you still have to visit the heap page, unless that page happens to be marked as being all visible. Pages can be marked as all visible by autovacuum (by default runs once 10% of rows in a table have been updated or deleted) or a manual vacuum. They can only be marked visible if all live rows on the page are older than the oldest running transaction in the system.
In other words, effectiveness of this technique is highly dependent on the workload.
TL;DR postgres keeps undo in the data blocks. Vacuum clears out old unnecessary rows that are “dead”
“Because there is no information for version in an index tuple, it is impossible to judge whether a tuple is readable in the current transaction or not”
Replica + hot_standby_feedback can affect index only scan