Covering indexes are indexes that contain every column a query needs, so that, in theory, only the index has to be accessed. This seemingly straightforward strategy is much trickier on Postgres.
Let’s take TPC-H query 6:
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= date '1996-01-01'
and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
We’ve got 3 predicate filter columns:
l_shipdate
l_discount
l_quantity
None of them is especially selective on its own, but the three combined are pretty good:
select count(*) from
lineitem
where
l_shipdate >= date '1996-01-01'
and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
→ 1,823,373
select count(*) from
lineitem
where l_discount between 0.06 - 0.01 and 0.06 + 0.01;
→ 3,272,056
select count(*) from
lineitem
where l_quantity < 24;
→ 5,517,389
select count(*) from
lineitem
where
l_shipdate >= date '1996-01-01'
and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
→ 228,160
select count(*) from lineitem;
→ 11,997,996
228,160/11,997,996 = 1.9%
With an index added on all 3 fields, the optimizer doesn’t even use it!
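For reference, the index looked roughly like this; the index name and column order here are my own choices, since the exact DDL isn’t shown above:

-- three-column index over the predicate columns of Q6
-- (name and column order are illustrative)
create index lineitem_shipdate_discount_qty
    on lineitem (l_shipdate, l_discount, l_quantity);
-- running EXPLAIN on Q6 still picks a sequential scan over this index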
If I force the optimizer to use the index (set enable_seqscan = off;), then yes, there is an improvement, but it is small: from ~800+ ms to ~600+ ms, mainly because the index results still have to go back to the table.
I made a covering index with all the fields in the query and it still goes back to the table!
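A sketch of what such a covering index could look like, assuming the INCLUDE syntax available since Postgres 11 (the index name is made up); l_extendedprice is the only column Q6 reads that isn’t already a filter column:

-- predicate columns as key columns, the remaining output column as payload,
-- so the query could in principle be answered from the index alone
create index lineitem_q6_covering
    on lineitem (l_shipdate, l_discount, l_quantity)
    include (l_extendedprice);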
Now, if I do a vacuum, sure enough the covering index works and I get a 6x speedup (870 ms to 140 ms)!
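The sequence that produces the index-only plan is roughly the following (the EXPLAIN options are my choice, and timings will obviously vary):

vacuum lineitem;  -- sets visibility-map bits for pages whose rows are all visible

explain (analyze, buffers)
select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
    l_shipdate >= date '1996-01-01'
    and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 24;
-- the plan should now show an Index Only Scan with few (ideally zero) Heap Fetches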
What does the vacuum have to do with the covering index?
From Jim Nasby:
The issue with covering indexes is that you still have to visit the heap page, unless that page happens to be marked as being all visible. Pages can be marked as all visible by autovacuum (by default runs once 10% of rows in a table have been updated or deleted) or a manual vacuum. They can only be marked visible if all live rows on the page are older than the oldest running transaction in the system.
In other words, effectiveness of this technique is highly dependent on the workload.
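A rough way to check how much of the table is already marked all-visible (relallvisible is only an estimate, refreshed by vacuum and analyze):

select relpages,
       relallvisible,
       round(100.0 * relallvisible / nullif(relpages, 0), 1) as pct_all_visible
from pg_class
where relname = 'lineitem';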
Reference
TL;DR: Postgres keeps undo (old row versions) in the data blocks themselves; vacuum clears out the unnecessary “dead” rows.
“Because there is no information for version in an index tuple, it is impossible to judge whether a tuple is readable in the current transaction or not”
Replicas and hot_standby_feedback can also affect index-only scans.
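For completeness, hot_standby_feedback is set on the standby; a minimal sketch of turning it on (it makes the primary hold back vacuum cleanup for rows the standby may still need, which in turn delays pages becoming all-visible):

-- run on the standby; takes effect on config reload
alter system set hot_standby_feedback = on;
select pg_reload_conf();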