top of page
Writer's picture: Kyle Hailey

Indexes on Joins for MySQL

MySQL is simpler than the Oracle example because MySQL (prior to 8.0.18) only has Nested Loops joins and doesn't have Hash Join.

Here is the query, like the one in the Oracle example from the last blog post:

-- Example query: max of t1.data over the rows where the filter
-- predicate t1.clus = 1 holds, joined to t2 on the id column.
select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;

So there are only 4 examples below which all do NL joins

  1. full table scan on T1 and T2

  2. 7.83 secs

  3. index on T1 predicate filter column

  4. 7.39 secs

  5. index on T2 join column

  6. 0.49 secs

  7. index on both T2 join column and T1 predicate filter column

  8. 0.06 secs

There is no "EXPLAIN ANALYZE" in MySQL until MySQL 8.0.18, and I did my testing on 8.0.17, so the explain plan costs are estimates, not actual values, but the elapsed time of the query is an actual value.

-- Build a ~1M row seed table by repeated doubling: each
-- INSERT ... SELECT copies the whole table back into itself.
-- IF EXISTS keeps the script runnable on a fresh schema.
drop table if exists seed;
CREATE TABLE seed ( id INT AUTO_INCREMENT PRIMARY KEY, val INT);
insert into seed(val) values (1);
insert into seed(val) select val from seed;  /*  2 */
insert into seed(val) select val from seed;  /*  4 */
insert into seed(val) select val from seed;  /*  8 */
insert into seed(val) select val from seed;  /*  16 */
insert into seed(val) select val from seed;  /*  32 */
insert into seed(val) select val from seed;  /*  64 */
insert into seed(val) select val from seed;  /*  128 */
insert into seed(val) select val from seed;  /*  256 */
insert into seed(val) select val from seed;  /*  512 */
insert into seed(val) select val from seed;  /*  1024 */
insert into seed(val) select val from seed;  /*  2048 */
insert into seed(val) select val from seed;  /*  4096 */
insert into seed(val) select val from seed;  /*  8192 */
insert into seed(val) select val from seed;  /*  16384 */
insert into seed(val) select val from seed;  /*  32768 */
insert into seed(val) select val from seed;  /*  65536 */
insert into seed(val) select val from seed;  /*  131072 */
insert into seed(val) select val from seed;  /*  262144 */
insert into seed(val) select val from seed;  /*  524288 */
insert into seed(val) select val from seed;  /*  1048576 */
-- Sanity check: should report 1048576 rows (2^20).
select count(*) from seed;

 drop table dummy;
 
  create table dummy ( 
    id  int,
    clus int,
    val int,
    data VARCHAR(40)
 );
 
 insert into dummy (
    id, clus , val, data
)
 select 
 id,
 truncate(id/169,0),
 mod(id,10000), 
 CONV(FLOOR(RAND() * 99999999999999), 10, 36) 
from seed
;

select count(*) from dummy where clus = 1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+
select count(*) from dummy where val =1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+

-- T1 and T2 are identical copies of dummy so the join examples are
-- symmetric.  IF EXISTS keeps the script runnable on a fresh schema.
drop table if exists t1;
drop table if exists t2;

create table t1 as select * from dummy;
create table t2 as select * from dummy; 



The strange thing is that there are gaps in the id sequence in the seed table (presumably from the bulk INSERT ... SELECT doubling burning auto-increment values), hence the odd divisor in the truncate used to get 106 matching rows, i.e. truncate(id/169,0).

First we run the query with no indexes ( I already created the indexes so force them off in the example with “USE INDEX()” )

both nodes have access type

"access_type": "ALL"

so they are doing full table scans.

-- Baseline plan: the "USE INDEX ()" hint on each table forbids all
-- index use, forcing full table scans on both T1 and T2.
explain format=json select max(t1.data) 
from 
t1 USE INDEX ()
, t2 USE INDEX ()
where t1.id = t2.id 
and t1.clus = 1
;

1 row in set (7.83 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "18102538.86"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`clus` = 1)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "917.35",
            "eval_cost": "17.21",
            "prefix_cost": "18102538.86",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}

We can see that the filter on T1 should filter most of the table (172 rows out of 1045682) so now add index on t1.clus the predicate filter ( I remove the “USE INDEX () ” on t1 so the index on T1 is now used in the explain ) . The execution time is a little faster but not that much 7.83 vs 7.39. We see that T1 is now accessed by index

-- Second plan: no hint on T1, so the t1_clus filter index can be used;
-- the "USE INDEX ()" hint still disables index use on T2.
explain format=json select max(t1.data) 
from  t1, t2  
USE INDEX ()
where t1.id = t2.id 
and t1.clus    = 1
;
1 row in set (7.39 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "11085148.28"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ]
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 105,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "881.98",
            "eval_cost": "10.60",
            "prefix_cost": "11085148.28",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}

On T2 we see that we access 105 rows after examining 1045682, thus a good opportunity for an index. Now add an index on the join column and we get a significant improvement in execution time. We go from 7.39 secs to 0.49, i.e. 7.39/0.49, or a 15x improvement.

-- Third plan: "USE INDEX ()" disables indexes on T1 only; the t2_id
-- join-column index is now available for the nested-loop probe.
explain format=json select max(t1.data) 
from  t1 USE INDEX () , t2  
where t1.id = t2.id 
and t1.clus   = 1
;
1 row in set (0.49 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "105582.64"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "((`kyle`.`t1`.`clus` = 1) and (`kyle`.`t1`.`id` is not null))"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 172,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "171.98",
            "eval_cost": "17.21",
            "prefix_cost": "105582.64",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}

Now use both indexes, which gives us a small 0.43 sec reduction in execution time, but as a ratio from the previous run, 0.49/0.06, i.e. an 8x improvement — still pretty good.

-- Fourth plan: no hints, so both the t1_clus filter index and the
-- t2_id join index can be used.
explain format=json select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;
1 row in set (0.06 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "153.62"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`id` is not null)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "105.93",
            "eval_cost": "10.60",
            "prefix_cost": "153.63",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}

5 views0 comments

Comments


bottom of page