Thursday 1 January 2015

optimizer (2) aka JOIN (5)

This is my fifth post about the various aspects of this topic: JOIN.

In the former posts I described:
  • what's happening in the storage-engine (in this case: MyISAM) by executing the query: JOIN
  • what's happening in the level above the storage-engine: JOIN (2)
  • after reading a record from a table the WHERE-clause is applied to this record: JOIN (3)
  • where are different paths evaluated and one path chosen: optimizer

Be careful when you generalize the results presented in my posts. A lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets, distribution of keys etc.).

index on Hersteller

Let me continue playing with the statement I've used before so I added indexes on the column Hersteller:
MariaDB [TestOpt]> create index Hersteller on TestSmall(Hersteller);
Query OK, 301036 rows affected (2.85 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index Hersteller on TestBig(Hersteller);
Query OK, 10000000 rows affected (2 min 18.44 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show index from TestSmall;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestSmall |          1 | PZN        |            1 | PZN         | A         |      301036 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestSmall |          1 | Hersteller |            1 | Hersteller  | A         |        2813 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.00 sec)

MariaDB [TestOpt]> show index from TestBig;  
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN        |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 

With these additional information available the database can execute the statement in a different way (I included the explain):
MariaDB [TestOpt]>MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.01 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN,Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 33
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN,Hersteller
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
As you can see the execution of this statement is much faster, even faster than the execution of the statement in the last post (20 secs., 0.4 secs.).

If you look at the explain-statement you can see that the optimizer has chosen the table TestSmall (=A) as the driving table and uses the index on Hersteller to search for entries with the value '00020' in this column. For every record found it looks for the corresponding record in the table TestBig (=B) using the value of the column PZN and searching through the index on the column PZN.

The last post inspected the 2 possible paths for executing this query: TestSmall -> TestBig and TestBig -> TestSmall. With these alternatives the optimizer had chosen the path TestBig -> TestSmall because the internal cost-calculations showed this as the bet path (only the index on PZN did exist). By adding the additional index the optimizer chooses the path that starts with TestSmall and I will show how the corresponding costs are calculated.

In the last post I had to use some force to start the execution with the table TestSmall, now let me use the same force to start the execution with the table TestBig:
MariaDB [TestOpt]> select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (12.68 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN,Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 455132
        Extra: Using index condition; Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN,Hersteller
          key: PZN
      key_len: 8
          ref: TestOpt.B.PZN
         rows: 1
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
As you can see I've changed the SQL-statement a bit to let the table TestBig appear at the first position and for this to happen I used the keyword STRAIGHT_JOIN (without this it will start the execution with the table TestSmall). The result is clear: this way is sloooow.

It looks like the optimizer has chosen the right ordering of the tables to execute this statement. Let me now describe how this is done. The explanations are valid for MariaDB in the version 10.0.10, other versions can handle these things slightly diffrent. And MySQL may handle this different too.

function hierarchy

So where does the server calculate the cost of the SQL-statement given above? Let me directly jump into the code and show the function-hierarchy in MariaDB:
mysql_execute_command()
    execute_sqlcom_select()
        handle_select()
            mysql_select()
                JOIN::optimize()
                    JOIN::optimize_inner()
                        make_join_statistics()
                            JOIN_TAB::scan_time()
                                handler::scan_time()
                            make_select()
                            update_ref_and_keys()      
                            get_quick_record_count()             // executed twice: TestSmall and TestBig
                                SQL_SELECT::test_quick_select()
                                    get_key_scans_params()
                                        check_quick_select()
                                            ha_myisam::multi_range_read_info_const()            // starting from here it's specific to MyISAM
                                                DsMrr_impl::dsmrr_info_const()
                                                    handler::multi_range_read_info_const()
                                                        ha_myisam::records_in_range()
                                                             mi_records_in_range()
                                                                 _mi_record_pos()
                                                                     _mi_search_pos()           // with pos=0x7a2c00, key='00020'
                                                                         _mi_fetch_keypage()    // with page=root
                                                                 _mi_record_pos()
                                                                     _mi_search_pos()           // with pos=0x7a2c00, key='00020'
                                                                         _mi_fetch_keypage()    // with page=root
                            optimize_semijoin_nests()
                            choose_plan()
                                greedy_search()
                                    best_extension_by_limited_search()
                                        best_access_path()
                                        best_extension_by_limited_search()
                                            best_access_path()
                                        best_access_path()
                        make_join_readinfo()                
                        JOIN::save_explain_data()
                JOIN::exec()
                    JOIN::exec_inner()
                    JOIN::save_explain_data()
a hint: the function get_quick_record_count() (and all functions beneath it) is executed twice, but I didn't include this in the hierarchy.

best plan

I would like to omit all the intermediate steps in calculating the cost of a path but give an overview of the steps involved. The description is valid for the standard SQL-statement, the fast one, not for the statement using the STRAIGHT_JOIN keyword.

The calculation starts with the table TestSmall. The cost for reading this table is calculated and the result is a value of 40. Then this (partial) plan is extended by inspecting the table TestBig. The cost for reading both tables is calculated to the number 17,814. As this plan is completely inspected the computations stops and this value and the plan is stored because it's the best plan so far.

Next the calculation starts from the beginning again but with the table TestBig. Here the cost-value is calculated and the number is a value 546,159. As this number is greater than the value for the plan before the inspection of this plan ist stopped.

Here is the code that does the comparison/decision (in best_extension_by_limited_search() in sql/sql_select.cc):
      /* Expand only partial plans with lower cost than the best QEP so far */
      if (current_read_time >= join->best_read)
      { 
        DBUG_EXECUTE("opt", print_plan(join, idx+1,
                                       current_record_count,
                                       read_time,
                                       current_read_time,
                                       "prune_by_cost"););
        restore_prev_nj_state(s);
        restore_prev_sj_state(remaining_tables, s, idx);
        continue;
      }
join->best_read contains the cost of the best plan so far, current_read_time is the cost of the plan actually inspected, so the expression is:
      if ( 546,159 >= 17,814 )
      { 
        // stop extending this plan!
      }
and this decision is easy.

numbers

Let me do some refinement and explain where these numbers come from.

Here is the code that computes these numbers (in sql/sql_select.cc some lines above the if-statement shown):
      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;

Please keep in mind that the numbers I will show you soon are estimations, as the calculation of these numbers should be done in a short time.

Here are the numbers for the first step (inspecting table TestSmall alone):
      current_record_count= record_count * position->records_read;
           current_record_count:            33       (1 * 33)              number of records in TestSmall
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE;
           current_read_time:               40       (0 + 34 + 33 / 5)     
Next we add the table TestBig to our plan and now we get these numbers:
      current_record_count= record_count * position->records_read;
           current_record_count:            14,784   (33 * 448)           number of records in TestSmall * number of records in TestBig
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE;
           current_read_time:               17,814   (40 + 14,817 + 14,784 / 5)
So the cost for our first plan is computed to a value of 17,814.

Let's switch over to the alternate plan: we will start with the table TestBig and will get these numbers:
      current_record_count= record_count * position->records_read;
           current_record_count:            455,132    (1 * 455,132)    number of records in TestBig
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE;
           current_read_time:               546,159    (0 + 455,133 + 455,132 / 5)
and the computation stops here, because: 546,159 > 17,814.

By comparing these numbers it's easy to find the best plan. And the optimizer chooses the correct plan as I showed above.

more numbers

There are still some numbers in these computations which I didn't explain until now. These are estimations made from the index-file and as this example is using the MyISAM-engine you will find the indexes in the files TestSmall.MYI and TestBig.MYI. You can find an introduction into the structure of such an index-file here.

The numbers 33 is the estimated number of records in TestSmall with the property Hersteller = '00020', and 455,132 is the number of records in TestBig with the property Hersteller = '36367'. So how are these numbers computed?

If you look at the function-hierarchy above you will see a call of a function in the engine, in this case it's ha_myisam::multi_range_read_info_const(). This code is an engine-specific part in the optimizer, in this case it's MyISAM-specific. In this function 2 values are computed for each table: one value for the first and one value for the last entry in the tree with a given value, and the (estimated) number of records is the diffferenc of these numbers. To dive a bit deeper into the code let's begin with the table TestSmall: the function _mi_search_pos() starts at the root of the tree and it searches for the first occurrence of the value '00020'. This is done by recursively calling _mi_search_pos() again inspecting node after node. When the leaf is reached and the entry is found the code takes the position of the entry within this leaf and divides it by the total number of entries in this leaf. When it returns it does this again and again. In the end a number x is computed with the property: 0 <= x <= 1.0. In the function _mi_record_pos() this number is multiplied with the number of records in this table and this is the estimated position of the entry in the tree.

More details? OK, let's look in my example:
for min_key:
calling _mi_search_pos() with search_flag = HA_READ_KEY_EXACT
  0x7a2c001) (=root)            1 entry,   found 0 (the value in this node is > '00020')
  0x749400                     55 entries, found 0
  0x42b000                     55 entries, found 0
  0x41cc00                     returns 84, found 0 (in this case this describes the first entry)
  _mi_search_pos:    (0 + 1) / (84 + 1)                      ==> 0.011764705882352941
                     (0 + 0.011764705882352941) / (55 + 1)   ==> 0.00021008403361344536
                     (0 + 0.00021008403361344536) / (55 + 1) ==> 3.7515006002400958e-06
                     (0 + 3.7515006002400958e-06) / (1 + 1)  ==> 1.8757503001200479e-06
  _mi_record_pos:    1.8757503001200479e-06 * 301036 + 0.5   ==> 1
So the value for the starting-position is 1 (variable start_pos). Now we need the last occurrence of this value:
for max key:
calling _mi_search_pos() with search_flag = HA_READ_AFTER_KEY
  0x7a2c00 (=root)              1 entry,   found 0
  0x749400                     55 entries, found 0
  0x42b000                     55 entries, found 0
  0x41cc00                     84 entries, found the last value '00020' on position 60
  _mi_search_pos:    (60 + 1) / (84 + 1)                     ==> 0.71764705882352942
                     (0 + 0.71764705882352942) / (55 + 1)    ==> 0.012815126050420168 
                     (0 + 0.012815126050420168) / (55 + 1)   ==> 0.00022884153661464584
                     (0 + 0.00022884153661464584) / (55 + 1) ==> 0.00011442076830732292
  _mi_record_pos:    0.00011442076830732292 * 301036 + 0.5   ==> 34
the value end_point is 34. In the function mi_records_in_range() the estimation results in end_point - start_point = 34 -1 = 33 = estimated number of records in TestSmall with Hersteller = '00020', which is the number you've seen above.

So let's switch over to TestBig and do the comparable computation again:
for min key:
calling _mi_search_pos() with search_flag = HA_READ_KEY_EXACT
  0xfd7d4002) (=root)          37 entries, found '36367' on position 26
  0xdcc1800                    55 entries, found on 25
  0xdb07800                    55 entries, found on 49
  0xdb05800                    84 entries, found on 28
  _mi_search_pos:     (28 + 1) / (84 + 1)                   ==> 0.3411764705882353
                      (49 + 0.3411764705882353) / (55 + 1)  ==> 0.88109243697479001
                      (25 + 0.88109243697479001) / (55 + 1) ==> 0.4621623649459784
                      (26 + 0.4621623649459784) / (37 + 1)  ==> 0.69637269381436784
  _mi_record_pos:     0.69637269381436784 * 10 mio + 0,5    ==> 6,963,727

and for max_key:     
calling _mi_search_pos() with search_flag = HA_READ_AFTER_KEY
  0xfd7d400 (=root)            37 entries, found on 28
  0xe2fe000                    55 entries, found on 10
  0xe06e400                    55 entries, found on 41
  0xe06a400                    84 entries, found the last entry '36367' on position 4
  _mi_search_pos:      (4 + 1) / (84 + 1)                      ==> 0.058823529411764705
                       (41 + 0.058823529411764705) / (55 + 1)  ==> 0.73319327731092443
                       (10 + 0.73319327731092443) / (55 + 1)   ==> 0.19166416566626651
                       (28 + 0.19166416566626651) / (37 + 1)   ==> 0.74188589909648073
  _mi_record_pos:      0.74188589909648073 * 10 mio + 0,5      ==> 7,418,859
These values result in 7,418,859 - 6,963,727 = 455,132 = estimated number of records in TestBig with the property Hersteller = '36367'.

As written before these values are estimations, the exact numbers are 60 (for TestSmall) and 461,016 (for TestBig), so the estimation is not bad. Please keep in mind that the computations should be done fast, getting better estimations means to inspect much more nodes/leafs of the tree which will consume additional time.

The estimated values are stored in the struct TABLE at the end of the function check_quick_select(), there you can find this expression param->table->quick_rows[keynr], which stores the values computed. Later in the function best_access_path() these values are read from table->quick_rows[key] and at the end of this function stored in pos->records_read. Later they appear as shown in the computations above as position->records_read.

Almost done. But there are 3 numbers left in the computation: 34, 455,133 and 14,817. For the first two numbers it is easy to detect: as you guessed it these are the estimated values just explained but incremented by one. This incrementation is don in the function handler::read_time() (found in handler.h). The result of this addition is stored in the variable position->read_time.
The remaining number follows the same schema: 14,817 = 33 * (448 + 1). For every record found in TestSmall there exist (estimated) 448 records in TestBig. These 448 records are read via the index but an additional access is needed for returning HA_ERR_END_OF_FILE. At least this is my interpretation of the incrementation, and thi interpretation also applies to the other 2 numbers. As MyISAM has not implemented this function the function of the base-class hanlder is called, other engines implement this and therefore handle it with their own interpretation.

Now we are through. This is the algorithm that computes the cost-values for the statement given above.

MyISAM

A lot of my comments here in this post are specific for the MyISAM-engine because the tables used for the tests are of type MyISAM. Other storage-engines can handle things differently so don't apply the results presented here schematically to other engines.

And as a second argument a lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets, distribution of keys etc.).

The explanations given here are valid for MariaDB in the version 10.0.10, other versions can handle these things slightly different. And MySQL may handle this different too.

index-files

It looks like adding one or more indexes to a table is a good thing, so why don't we create an index for every column of a table? Please keep in mind that maintaining an index does consume time: inserting a new record means updating all existing indexes for this table, this is also valid for deletes and maybe for updates too. So what you win for (some) selects you may loose in the other operations. Look carefully at what you need.


correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.


some notes:
1) the value 0x7a2c00 is an offset in the file TestSmall.MYI. At this position the index-tree on the column PZN starts, all numbers under this value are the file-offsets where the corresponding nodes or finally a leaf can be found.
2) the value 0xfd7d400 is an offset in the file TestBig.MYI, analog to 1)