Summary
Goal in this post is to layout a methodology that can be implemented to scan the explain plans and identify opportunities to optimize the sql execution with additional indexes.
The missing index opportunities outlined here are
Indexes that would avoid full table scan with a predicate filter that filters out most of the returned rows
Joins that are missing index on the join field
Index range scans that could be made more efficient by addition of a field to existing index
Also for future consideration, but not yet outline here
Recommending covering index where the addition of fields to an index avoids table access all together
Indexes that support the order by clause
Basic procedure is to take the output from “explain” command. In the explain output only look at nodes that are at least say 10-30% or more of total cost of the whole query. Look for full table scans , these are nodes with the following entry : “Node Type”: “Seq Scan” If it’s in the second node of a nested loop, suggest an index on the join field If it’s is the most expensive node of a Hash join suggest an index on the join field If it’s the most expense node of Merge Join suggest an index on the join field. If is followed by a filter, suggest an index on the filter and join column.
If if the filter is not the most expensive node in a HJ or it’s the first node in a NL, and that node is expensive and the filter is expected to return X% ( for example 10% ) or less of the table after applying the filter, then suggest an index on that filter column. Additionally Index range scans that could be improved by adding fields to the index will require using “explain analyze”. With explain analyze we can see how many rows were filtered out after the index access.
Get plan with “Explain” and text from SQL statement
EXPLAIN (COSTS true, FORMAT json, verbose true)
select id from hypg1 where id = 2006;
QUERY PLAN
-------------------------------------------
[ +
{ +
"Plan": { +
"Node Type": "Gather", +
"Parallel Aware": false, +
"Startup Cost": 1000.00, +
"Total Cost": 10633.43, +
"Plan Rows": 1, +
"Plan Width": 4, +
"Output": ["id"], +
"Workers Planned": 2, +
"Single Copy": false, +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": true, +
"Relation Name": "hypg1", +
"Schema": "public", +
"Alias": "hypg1", +
"Startup Cost": 0.00, +
"Total Cost": 9633.33, +
"Plan Rows": 1, +
"Plan Width": 4, +
"Output": ["id"], +
"Filter": "(hypg1.id = 2006)" +
} +
] +
} +
} +
]
Cost in Explain Plan
In all cases we should sort nodes by cost and only look at nodes that take up a certain percentage or more of the cost. Each node has a “cost”. We need to extract the “total cost” at each node in the plan and compare that to the “total cost” of the whole query (top node) and calculate % of the “total cost” at a node compared whole query and only look at nodes that represent some value say 10% or more of total cost. Costs include the sum of all indented nodes, so to get the per node time or cost we have subtract out the child nodes.
If we can run “explain analyze” we can get the actual time per node. The command “explain analyze” actually runs the query . I would say we should plan on “explain analyze” with certain constraints. If the query is already being run multiple times a minute, then running it once more for tuning information is probably worth it. Time is just like cost where time includes times of child nodes, so we have subtract out the child nodes to see the time elapsed at every node.
Here is an example of the costs at nodes and showing the child nodes are subcomponents of the parent nodes cost
[ +
{ +
"Plan": { +
"Total Cost": 22308.54, +
"Plans": [ +
{ +
"Total Cost": 11675.10, +
{ +
"Total Cost": 10675.00, +
} +
}, +
{ +
"Total Cost": 10633.43, +
{ +
"Total Cost": 9633.33, +
} +
] +
} +
] +
} +
} +
]
for example, subtracting out the child node costs to get the per node costs
[ +
{ +
"Plan": { +
"Total Cost": 22308.54, -- node cost 22308.54 - (11675.10+10633.43) = 0.01
"Plans": [ +
{ +
"Total Cost": 11675.10, -- node cost 11675.10 - 10675.00 = 1000.1
{ +
"Total Cost": 10675.00, +
} +
}, +
{ +
"Total Cost": 10633.43, -- node cost 10633.43 - 9633.33 = 1000.1
{ +
"Total Cost": 9633.33, +
} +
] +
} +
] +
} +
} +
]
Predicate filters missing indexes
For predicate filters that can use indexes scan the explain plan for the following string
"Node Type": "Seq Scan", +
and see if they are followed by a “Filter”.
"Total Cost": 9633.33, +
"Plan Rows": 1, +
"Filter": "(hypg1.id = 2006)" +
Verify that the node is a significant cost of overall cost of query. This node takes up most to the cost of the plan 9633.33 out of 10633.43 so it is the most important node to optimize. We should only look at nodes that take up a certain percentage of total cost, say 10% or more. ( value should be configurable in our code, as it is debatable where the cut off point is)
We can compare rows returned (“Plan Rows”) to total rows in the table statistics If the filter eliminates a large amount of rows, then it is a candidate. The following query had to be run in the database that the table is in, thus have to collect database as part of the sampling of the load data.
SELECT reltuples FROM pg_class WHERE relname = 'hypg1';
reltuples
-----------
1e+06
% filtered = 100 * ( 1e+06 – 1 ) / 1e+06 = 99.99% rows are filtered out, so it’s the perfect candidate.
We should only recommend index the filter returns 10% or less of the table (i.e. 90% of rows returned from full table scanned are filtered out) and in the advisor recommendation for the index we should indicate much of the table is returned via the filter. We can give some recommendations based on the % for example low recommendation 5%-10% of table returned after apply filter medium recommendation 1%-5% of table returned after apply filter high recommendation < 1% of table returned after apply filter
The filter column is hypg1.id thus we suggest an index on hypg1.id
Joins missing indexes
Using the following query in the examples below
explain (FORMAT json)
select max (t1.data)
from t1, t2
where t1.id = t2.id
and t1.clus = 10
and t2.val > 1001
;
HASH JOIN
Look for nodes of type “Hash Cond”: ” that are followed by a “Seq Scan” If we find a “Seq Scan” is preceded by a Hash JOIN and cost is a significant amount of total query, then suggest an index on the join column If both nodes of the hash join have Seq Scans , suggest the index join column from the more expensive node. If that node also contains a filter, suggest including the filter in the index.
There is “Node Type”: “Seq Scan” table “Relation Name”: “t2″, cost “Total Cost”: 13512.67″ looking the hash join above the Seq Scan node, the join condition is
"Hash Cond": "(t2.id = t1.id)",
suggest index on t2.id
"Plans": [ +
{ +
"Node Type": "Hash Join", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Join Type": "Inner", +
"Startup Cost": 11.40, +
"Total Cost": 15086.97, +
"Plan Rows": 41, +
"Plan Width": 33, +
"Output": ["t1.data"], +
"Inner Unique": false, +
"Hash Cond": "(t2.id = t1.id)", +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer", +
"Parallel Aware": true, +
"Relation Name": "t2", +
"Schema": "public", +
"Alias": "t2", +
"Startup Cost": 0.00, +
"Total Cost": 13512.67, +
"Plan Rows": 416667, +
"Plan Width": 4, +
"Output": ["t2.id", "t2.clus", "t2.val", "t2.data"]+
},
Full explain
total cost of query is 41534.23 HJ node cost is 40532.27 on condition “Hash Cond”: “(t2.id = t1.id)”, Most expensive node is second node of hash join, “Total Cost”: 24346.00 Second node , the expensive node, is doing full scan on “Relation Name”: “t1″ Thus suggest an index on t1.id, the join condition of the hash join. The T1 node also contains a filter that returns “Plan Rows”: 5000,” i.e. 5000 rows of 1M in the table ( we look up total rows in pg_class) . The filter is on “Filter”: “((clus) = 10)” so we suggest an index on the combine id & clus
recommend index on t1 ( id, clus)
+
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Finalize", +
"Parallel Aware": false, +
"Startup Cost": 41534.23, +
"Total Cost": 41534.24, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Gather", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 41534.01, +
"Total Cost": 41534.22, +
"Plan Rows": 2, +
"Plan Width": 32, +
"Workers Planned": 2, +
"Single Copy": false, +
"Plans": [ +
{ +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Partial", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 40534.01, +
"Total Cost": 40534.02, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Hash Join", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Join Type": "Inner", +
"Startup Cost": 24408.50, +
"Total Cost": 40532.27, +
"Plan Rows": 695, +
"Plan Width": 33, +
"Inner Unique": false, +
"Hash Cond": "(t2.id = t1.id)", +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer", +
"Parallel Aware": true, +
"Relation Name": "t2", +
"Alias": "t2", +
"Startup Cost": 0.00, +
"Total Cost": 15596.00, +
"Plan Rows": 138889, +
"Plan Width": 4, +
"Filter": "((val + 0) > 1001)" +
}, +
{ +
"Node Type": "Hash", +
"Parent Relationship": "Inner", +
"Parallel Aware": false, +
"Startup Cost": 24346.00, +
"Total Cost": 24346.00, +
"Plan Rows": 5000, +
"Plan Width": 37, +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": false, +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.00, +
"Total Cost": 24346.00, +
"Plan Rows": 5000, +
"Plan Width": 37, +
"Filter": "((clus) = 10)" +
} +
] +
} +
] +
} +
] +
} +
] +
} +
] +
} +
} + +
Nested Loops
Similar we find a Nest Loops join where the second node has a Seq Scan, look to see if an index can be added to support the join. Only look at Nest Loop nodes that are a significant % of total query cost.
Most of the cost of the query is the next loops:
"Node Type": "Nested Loop", +
...
"Total Cost": 1320240.33, +
The first node in the NL select rows that are looked up in the second node. The second node is doing a full table scan:
"Node Type": "Seq Scan", +
"Relation Name": "t2", +
"Plan Rows": 1000000, +
We want to avoid Full table scans on the second node of NL. Every row in the first node will do a full table scan on the second node. The join is on
"Join Filter": "(t1.id = t2.id)", +
So we suggest an index on t2.id
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Finalize", +
"Parallel Aware": false, +
"Startup Cost": 1321240.65, +
"Total Cost": 1321240.66, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Gather", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 1321240.44, +
"Total Cost": 1321240.65, +
"Plan Rows": 2, +
"Plan Width": 32, +
"Workers Planned": 2, +
"Single Copy": false, +
"Plans": [ +
{ +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Partial", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 1320240.44, +
"Total Cost": 1320240.45, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Nested Loop", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Join Type": "Inner", +
"Startup Cost": 0.00, +
"Total Cost": 1320240.33, +
"Plan Rows": 41, +
"Plan Width": 33, +
"Inner Unique": false, +
"Join Filter": "(t1.id = t2.id)", +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": true, +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.00, +
"Total Cost": 14554.33, +
"Plan Rows": 41, +
"Plan Width": 37, +
"Filter": "(clus = 10)" +
}, +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Inner",+
"Parallel Aware": false, +
"Relation Name": "t2", +
"Alias": "t2", +
"Startup Cost": 0.00, +
"Total Cost": 19346.00, +
"Plan Rows": 1000000, +
"Plan Width": 4 +
} +
] +
} +
] +
} +
] +
} +
] +
} +
} +
Merge Join
similar for a merge join
-- to create a merge join example
set enable_nestloop to 'off';
SET enable_hashjoin TO off;
explain (FORMAT json)
select max (t1.data)
from t1, t2
where t1.id = t2.id
and t1.clus = 10
and t2.val > 1001
;
We want to turn the merge join into a NL or HJ. In order to do so efficiently we need an index on the join on one table. We want to have the index on the table with the more rows to analyze.
t1 returns 5000 rows
"Relation Name": "t1", +
"Plan Rows": 5000, +
t2 returns 138889
"Relation Name": "t2", + "Plan Rows": 138889, +
and join on T2 is on
"Merge Cond": "(t2.id = t1.id)", +
thus suggest index on t2.id
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Finalize", +
"Parallel Aware": false, +
"Startup Cost": 55741.63, +
"Total Cost": 55741.64, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Gather", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 55741.42, +
"Total Cost": 55741.63, +
"Plan Rows": 2, +
"Plan Width": 32, +
"Workers Planned": 2, +
"Single Copy": false, +
"Plans": [ +
{ +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Partial", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 54741.42, +
"Total Cost": 54741.43, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Merge Join", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Join Type": "Inner", +
"Startup Cost": 54013.29, +
"Total Cost": 54739.68, +
"Plan Rows": 695, +
"Plan Width": 33, +
"Inner Unique": false, +
"Merge Cond": "(t2.id = t1.id)", +
"Plans": [ +
{ +
"Node Type": "Sort", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 29360.10, +
"Total Cost": 29707.32, +
"Plan Rows": 138889, +
"Plan Width": 4, +
"Sort Key": ["t2.id"], +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": true, +
"Relation Name": "t2", +
"Alias": "t2", +
"Startup Cost": 0.00, +
"Total Cost": 15596.00, + "Plan Rows": 138889, +
"Plan Width": 4, +
"Filter": "((val ) > 1001)" +
} +
] +
}, +
{ +
"Node Type": "Sort", +
"Parent Relationship": "Inner", +
"Parallel Aware": false, +
"Startup Cost": 24653.19, +
"Total Cost": 24665.69, +
"Plan Rows": 5000, +
"Plan Width": 37, +
"Sort Key": ["t1.id"], +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": false, +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.00, +
"Total Cost": 24346.00, +
"Plan Rows": 5000, +
"Plan Width": 37, +
"Filter": "((clus + 0) = 10)" +
} +
] +
} +
] +
} +
] +
} +
] +
} +
] +
} +
} +
3. Mixed cases – filter and join indexing opportunities
If a filter is missing an index causing a full scan (i.e. “Seq Scan”) and the table being scanned is also used in a join (hash join, merge join, nested loops join), suggest an index on both the filter column and In the following there is a Seq Scan on T2 followed by a filter. The filter returns an estimated 375K rows or about 1/3 of the table , i.e. not a good candidate for an index The Seq Scan is part of a HASH JOIN that represents most of the cost of the query. The hash join is on t2.id.
cost total query 16973.03 t2 access for the hash join cost: 14554.33 t2 has 1M rows ( have to collect this from table stats) access is on t2.id , “Hash Cond”: “(t2.id = t1.id)”,
suggest index on t2 id and val to avoid the full table scan.
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Finalize", +
"Parallel Aware": false, +
"Startup Cost": 16973.02, +
"Total Cost": 16973.03, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Gather", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 16972.81, +
"Total Cost": 16973.02, +
"Plan Rows": 2, +
"Plan Width": 32, +
"Workers Planned": 2, +
"Single Copy": false, +
"Plans": [ +
{ +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Partial", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Startup Cost": 15972.81, +
"Total Cost": 15972.82, +
"Plan Rows": 1, +
"Plan Width": 32, +
"Plans": [ +
{ +
"Node Type": "Hash Join", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Join Type": "Inner", +
"Startup Cost": 11.40, +
"Total Cost": 15972.72, +
"Plan Rows": 37, +
"Plan Width": 33, +
"Inner Unique": false, +
"Hash Cond": "(t2.id = t1.id)", +
"Plans": [ +
{ +
"Node Type": "Seq Scan", +
"Parent Relationship": "Outer", +
"Parallel Aware": true, +
"Relation Name": "t2", +
"Alias": "t2", +
"Startup Cost": 0.00, +
"Total Cost": 14554.33, + "Plan Rows": 375098, +
"Plan Width": 4, +
"Filter": "(val > 1001)" +
}, +
{ +
"Node Type": "Hash", +
"Parent Relationship": "Inner", +
"Parallel Aware": false, +
"Startup Cost": 10.16, +
"Total Cost": 10.16, +
"Plan Rows": 99, +
"Plan Width": 37, +
"Plans": [ +
{ +
"Node Type": "Index Scan", +
"Parent Relationship": "Outer",+
"Parallel Aware": false, +
"Scan Direction": "Forward", +
"Index Name": "t1_clus", +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.42, +
"Total Cost": 10.16, +
"Plan Rows": 99, +
"Plan Width": 37, +
"Index Cond": "(clus = 10)" +
} +
] +
} +
] +
} +
] +
} +
] +
} +
] +
} +
} +
4. Index Range scans
To see if an index needs more columns to be efficient looks like ti will require “explain analyze” and not just “explain”
select count(*)
from t1
where t1.clus >= 10 and t1.clus < 15
and t1.val = 1001
;
There is an index on t1.clus and the explain shows it is used but we don’t know how many rows were retrieved by the index vs filtered out by the Filter
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Simple", +
"Parallel Aware": false, +
"Startup Cost": 27.14, +
"Total Cost": 27.15, +
"Plan Rows": 1, +
"Plan Width": 8, +
"Plans": [ +
{ +
"Node Type": "Index Scan", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Scan Direction": "Forward", +
"Index Name": "t1_clus", +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.42, +
"Total Cost": 27.13, +
"Plan Rows": 1, +
"Plan Width": 0, +
"Index Cond": "((clus >= 10) AND (clus < 15))",+
"Filter": "(val = 1001)" +
} +
] +
} +
} +
but it doesn’t show how many rows are retrieved by the index and how many are post processed filtered out by the Filter. Using explain analyze does. In the following output from “explain analyze” Only 1 run is returned after index access and filter but filter takes out 499 rows
"Filter": "(val = 1001)", +
"Rows Removed by Filter": 499 +
Thus suggest to add val to the existing index
"Index Name": "t1_clus", +
[ +
{ +
"Plan": { +
"Node Type": "Aggregate", +
"Strategy": "Plain", +
"Partial Mode": "Simple", +
"Parallel Aware": false, +
"Startup Cost": 27.14, +
"Total Cost": 27.15, +
"Plan Rows": 1, +
"Plan Width": 8, +
"Actual Startup Time": 0.279, +
"Actual Total Time": 0.279, +
"Actual Rows": 1, +
"Actual Loops": 1, +
"Plans": [ +
{ +
"Node Type": "Index Scan", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Scan Direction": "Forward", +
"Index Name": "t1_clus", +
"Relation Name": "t1", +
"Alias": "t1", +
"Startup Cost": 0.42, +
"Total Cost": 27.13, +
"Plan Rows": 1, +
"Plan Width": 0, +
"Actual Startup Time": 0.053, +
"Actual Total Time": 0.273, +
"Actual Rows": 1, +
"Actual Loops": 1, +
"Index Cond": "((clus >= 10) AND (clus < 15))",+
"Rows Removed by Index Recheck": 0, +
"Filter": "(val = 1001)", +
"Rows Removed by Filter": 499 +
} +
] +
}, +
"Planning Time": 0.270, +
"Triggers": [ +
], +
"Execution Time": 0.388 +
} +
Check if INDEX we suggest already exists
look up existence of index ( https://stackoverflow.com/questions/2204058/list-columns-with-indexes-in-postgresql)
select
t.relname as table_name,
i.relname as index_name,
array_to_string(array_agg(a.attname), ', ') as column_names
from
pg_class t,
pg_class i,
pg_index ix,
pg_attribute a
where
t.oid = ix.indrelid
and i.oid = ix.indexrelid
and a.attrelid = t.oid
and a.attnum = ANY(ix.indkey)
and t.relkind = 'r'
and t.relname like 'auth%'
group by
t.relname,
i.relname
order by
t.relname,
i.relname;
table_name | index_name | column_names
------------+------------+--------------
authors | authors_id | id
Creating data for test cases above
drop table t1;
create table t1 (
id int,
clus int,
val int,
data VARCHAR(40)
);
insert into t1 (
id, clus , val, data
)
select
i,
trunc(i/100),
mod(i,10000),
left(md5(random()::text), 40)
from generate_series(1, 1000000) s(i);
select count(*) from t1 where clus = 1 ;
100
select count(*) from t1 where val =1 ;
100
create table t2 as select * from t1;
explain (analyze,verbose,buffers)
select max(t1.data) from t1, t2
where t1.id = t2.id
and t1.clus = 1
;
Parameters that affect the explain plan
SET max_parallel_workers_per_gather = 0;
set enable_mergejoin to 'off';
set enable_nestloop to 'off';
SET enable_hashjoin TO off;
Comentarii