Saturday, 24 January 2015

optimizer in MySQL (2)

this is a continuation of the topic of the previou post. In the previous post I showed how the optimizer in MySQL handled a special SQL-statement and chooses the wrong path (I also showed this for MariaDB). In the case of MariaDB I played a bit with the tables and added some indexes and showed how this changed the behaviour of the optimizer in MariaDB. In this post I want to do the same and show how the SQL-statement will be treated in MySQL. The environment will be the same as before.

index on Hersteller

As I did in MariaDB I will add an index on the column Hersteller on both tables:
mysql> use TestOpt;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> create index Hersteller on TestSmall(Hersteller);
Query OK, 301036 rows affected (4,00 sec)
Records: 301036  Duplicates: 0  Warnings: 0

mysql> show index from TestSmall;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestSmall |          1 | PZN        |            1 | PZN         | A         |      301036 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestSmall |          1 | Hersteller |            1 | Hersteller  | A         |        2813 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0,00 sec)

mysql> create index Hersteller on TestBig(Hersteller);
Query OK, 10000000 rows affected (2 min 58,16 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

mysql> show index from TestBig;
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN        |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0,00 sec)

mysql> 

Now the database-server has some more information for finding a good path and indeed MySQL uses these indexes for executing the statement (same as in MariaDB). Here is the result (including the EXPLAIN):
mysql> select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0,02 sec)

mysql> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN,Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 33
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN,Hersteller
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0,00 sec)

mysql>

As you can see the execution-time is very short. The execution starts with the table TestSmall (=A) and the server fetches the records with the condition Hersteller = '00020' using the index on Hersteller. For every record found it searches the corresponding record in the table TestBig (=B) using the index PZN on TestBig and searches for all records with the value given in the column PZN of the record of TestSmall. On the records found in TestBig the WHERE-condition is applied.

The result looks as expected. So why and how has the server chosen this plan?

function-hierarchy

Let's look into the code of MySQL 5.6.22 and restrict our view to the optimizer-stage:
mysql_execute_command()
    execute_sqlcom_select()
        handle_select()
            mysql_select()
                mysql_prepare_select()
                    JOIN::prepare()
                        setup_tables_and_check_access()
                            setup_tables()
                                setup_natural_join_row_types()
                        setup_fields()
                mysql_execute_select()
                    JOIN::optimize()
                        simplify_joins()
                            simplify_joins()
                        optimize_cond()
                            build_equal_items()
                            propagate_cond_constants()
                            remove_eq_conds()
                                internal_remove_eq_conds()
                       opt_sum_query()
                       make_join_statistics()
                           update_ref_and_keys()
                               add_key_fields()
                                   add_key_field()
                               add_key_part()
                           make_select()                                              twice: TestSmall, TestBig
                           get_quick_record_count()                                   twice: TestSmall, TestBig
                               SQL_SELECT::test_quick_select()
                                   get_best_group_min_max()
                                   get_key_scans_params()
                                       check_quick_select()
                                           eq_ranges_exceeds_limit()
                                           ha_myisam::multi_range_read_info_const()
                                               DsMrr_impl::dsmrr_info_const()
                                                   handler::multi_range_read_info_const()
                                                       ha_myisam::records_in_range()
                                                           mi_records_in_range()         for min_key and max_key
                                                               _mi_record_pos()
                                                                   _mi_pack_key()
                                                                   _mi_search_pos()      recursively
                                                                       _mi_fetch_keypage()
                                                                           key_cache_read()
                                                                               find_key_block()
                           optimize_keyuse()
                           Optimize_table_order::Optimize_table_order()
                           Optimize_table_order::choose_table_order()
                               Optimize_table_order::greedy_search()
                                   Optimize_table_order::best_extension_by_limited_search()
                                       Optimize_table_order::best_access_path()
                                       Optimize_table_order::best_extension_by_limited_search()
                                           Optimize_table_order::best_access_path()
                                           Optimize_table_order::consider_plan()
                                       Optimize_table_order::best_access_path()
                           JOIN::refine_best_rowcount()
                           JOIN::get_best_combination()
                       substitute_for_best_equal_field()
                           eliminate_item_equal()
                       Item_cond::update_used_tables()
                       JOIN::set_access_methods()
                       update_depend_map()
                       make_join_select()
                           make_cond_for_table()
                               make_cond_for_table_from_pred()
                           pushdown_on_conditions()
                       make_join_readinfo()
                           push_index_cond()
                               make_cond_for_index()
                           setup_join_buffering()
                       ha_make_pushed_joins()
                           plugin_foreach_with_mask()
                       pick_table_access_method()
                    JOIN::exec()           
                        do_select()
                            sub_select()
                                join_read_always_key()
                                     JOIN_TAB::use_order()
                                    handler::ha_index_init()
                                        ha_myisam::index_init()
                                    handler::ha_index_read_map()
                                        ha_myisam::index_read_map()
                                join_read_next_same()
                                    handler::ha_index_next_same()
                                        ha_myisam::index_next_same()
and so on record after record

some hints

The hierarchy shown above is valid for the SQL-statement used in this post. For other SQL-statements this can differ and if you look of the last post you will indeed see that this hierarchy is similar but not identical.

I hope the hierarchy shown is correct. I omitted some functions e.g. those who were entered and after a single check the code returns to the caller. In the case of a recursion I did include only the first call (for example _mi_search_pos()). Some functions are called twice, because the SQL-statement references two tables so this function and every function beneath it is called once for each table, I didn't include the second call.
As humans make errors maybe I made an error or overlooked a call. If you find such an error please drop me a line.

This looks similar to the code I presented for MariaDB 10.0.10 except for some rearrangement of code, some classes introduced in this version of MySQL and some minor modification which I already mentioned in the former post.

calculations

The calculation of the cost associated to a plan is almost identical to the calculation in MariaDB. So I will only give some hints, you can find the details in the other post.

For the plan currently under consideration the cost will be calculated in Optimize_table_order::best_extension_by_limited_search():
      /* Compute the cost of extending the plan with 's' */
      current_record_count= record_count * position->records_read;
      current_read_time=    read_time
                            + position->read_time
                            + current_record_count * ROW_EVALUATE_COST;

In the first iteration the optimizer starts with the table TestSmall, computes the cost for accessing this table and then extends this plan by accessing the table TestBig. So the cost for this plan is evaluated via:1)
first try: TestSmall --> TestBig
current_record_count= 14,784 = 33 * 448
current_read_time=    17,780 =  39 + 14,784 + 14,784 * 0.20                            

The next possible plan starts with the table TestBig, the cost for accessing this table is computed and the plan is extended by the table TestSmall. Now these costs are:
second try: TestBig --> TestSmall
current_record_count= 455,132 = 1 * 455,132
current_read_time=    546,158 = 0 + 455,132 + 455,132 * 0.20

Let me omit the comparison of the costs of both plans for some seconds. The numbers shown here are almost identical to the numbers in the post about the optimizer in MariaDB. There are some minor differences which I would like to explain. You have seen the number 39 some lines above and if you look at the numbers in MariaDB you will find a number 40 in the same position. In the case of MySQL this value is computed via 0 + 33 + 33*0.20, in the case of MariaDB this value is computed via the formula 0 + (33+1) + 33/5, they differ simply by a value of 1.
The same statement is valid for the value 17,780 in MySQL and 17,814 in MariaDB. In MySQL this value is computed via 39 + 14,784 + 14,784 * 0.20, in MariaDB as 40 + 14,817 + 14,784 / 5.
Other values like 33 and 455,132 are computed identical and have the same meaning as described in the other post (please look for the topics numbers and more numbers).

decision

After calculating the cost of the first plan this value and the plan is given to the function Optimize_table_order::consider_plan(). There the code looks like this:
  const bool chosen= read_time < join->best_read;
  trace_obj->add("chosen", chosen);
  if (chosen)
  {
As the best plan so far has not been computed (value join->best_read contains the initial value DBL_MAX) and the cost just computed is 17,780 so this plan is taken.

In the next iteration the cost of the other plan is computed and compared against the best plan so far. You will find the comparison in Optimize_table_order::best_extension_by_limited_search():
      /* Expand only partial plans with lower cost than the best QEP so far */
      if (current_read_time >= join->best_read)
      {
which translates into
      if ( 546,158 >= 17,780 )
      {
so the code in the braces is executed which means that this plan is no longer inspected.

Normally the next iteration will start with a different plan but there are no mor possibilities so the search for the best plan finishes here.

the wrong path

In the older post I showed why the optimizer has chosen the wrong path, in this case it has chosen a fast path. But what about the other path? Let's use some force and look at the result by executing this path:
mysql> select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (11,53 sec)

mysql> explain select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN,Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 455132
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN,Hersteller
          key: PZN
      key_len: 8
          ref: TestOpt.B.PZN
         rows: 1
        Extra: Using where
2 rows in set (0,00 sec)

mysql> 
The number is convincing, the optimizer has chosen the better path.


correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.


some notes:
1) as I did in the other posts: most of the calculations are done as floating-point numbers, but I will present only the integer-part.