an unsystematic journey through the source-code of MySQL and MariaDB: optimizer continued

I would like to continue my playing with the optimizer. For this and coming postings I will go back to MariaDB.

environment

The environment is the same as before, you will find the details here: JOIN. The database, tables, data in the tables etc. are the same as before. The only modifications here will be the indexes on the tables.

modifications

In the tests before I had an index on the column PZN of both tables and presented the results of a SQL-statement, in particular I was interested in the decision of the optimizer for one special execution-plan. In another posst I added an index on the column Hersteller on both tables and looked at the optimizer, how this additional information influences the decision.
Now I want to continue playing with indexes on these tables and look what the server does with my SQL-statement. So let me create some indexes:

MariaDB [(none)]> use TestOpt;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [TestOpt]> create index Hersteller on TestSmall(Hersteller);
Query OK, 301036 rows affected (2.12 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index Hersteller on TestBig(Hersteller);
Query OK, 10000000 rows affected (1 min 28.10 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show index from TestSmall;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestSmall |          1 | Hersteller |            1 | Hersteller  | A         |        2813 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> show index from TestBig;  
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.01 sec)

MariaDB [TestOpt]>

Please do not search for a meaning in this environment or statement or ...., it's just a game in a sandbox.

results

With this data let's execute our standard SQL-statement.
Here is the statement using the JOIN-keyword:

MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (2 min 0.25 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 30
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 458262
        Extra: Using index condition; Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

Looking at the EXPLAIN it tells me that the server starts executing this statement with the table TestSmall and uses the index Hersteller to fetch records with the condition A.Hersteller = '00020'. It expects to find 30 records¹⁾ in this table. For each record fetched it switches over to the table TestBig and reads records via the index Hersteller with the condition B.Hersteller = '36367'. The optimizer expects to find approx. 450,000 records¹⁾ in this table. It then tries to match the JOIN-condition.

What about the other way? In the statement above the server starts the execution with the table TestSmall and from this it switches over to TestBig. Let's try it the other way, start with TestBig and switch over to TestSmall, so here is the statement using STRAIGHT_JOIN:

MariaDB [TestOpt]> select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (1 min 54.42 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(B.PZN) from TestBig B straight_join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 458262
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 30
        Extra: Using index condition; Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

Looking at the EXPLAIN you can see that it indeed started with table TestBig (= B).

The difference in execution-time is about 5%, but starting with the much bigger table is indeed faster. So let's look into the decision-making code, aka the optimizer.

optimizer

The question is: when the optimizer analyzes the statement and the possible plans where and why does it choose the wrong path?.

I've already posted some descriptions of the code of the optimizer²⁾ and I don't want to repeat this here. And in this case the difference is not so big³⁾ so I omit a lot of code-excerpts here. Let me jump into the calculations:

      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;

You will find this code-snippet in the file sql/sql_select.cc in the function best_extension_by_limited_search().

Here are the results of this computation for the first plan inspected:

Testmall:
current_record_count= 30
current_read_time   = 37
TestSmall -> TestBig:
current_record_count= 13,747,860
current_read_time   = 16,497,499

The optimizer starts with the table Testsmall and estimates the number of records as 30. The cost for accessing these records is calculated to 37. Then it extends the plan by adding the table TestBig and computes the cost for this plan as 16,497,499. Done with this plan. It then keeps this plan as this plan is the best plan so far:

        if (current_read_time < join->best_read)
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions,
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

join->best_positions contains the best plan so far; after all computations and comparisons are done it contains the best plan.

Now it switches to the second plan which starts with the table TestBig:

TestBig:
current_record_count= 458,262
current_read_time   = 549,915

In the first step it's a big number: 549,915. So the decision is to stop further examining this plan:

      /* Expand only partial plans with lower cost than the best QEP so far */
      if (current_read_time >= join->best_read)

The if() expands to: if ( 549915 >= 30 ) and this expression is true so the code following in curly braces is executed: further inspection of this plan is stopped.

As no more possibilities exist the optimizer stops and the execution of the plan found starts.

So we've seen the cost of the complete plan starting with TestSmall, but not the cost of the complete plan starting with TestBig. Let's rewrite the statement so that the order of the tables is switched:

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(B.PZN) from TestBig B join TestSmall A on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 30
        Extra: Using index condition
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: const
         rows: 458262
        Extra: Using index condition; Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

I've showed directly the EXPLAIN. In this listing you will see that it still starts the execution with the table TestSmall (= A), as before. Stepping into the code you will see that the calculation of costs indeed start with TestSmall and the plan that starts with TestBig is only partially inspected, same as before.

So why this?

ordering of tables

Before starting with the inspection of possible plans the optimizer-code rearranges the tables it has to inspect: it sorts them. By looking into the function choose_plan() you will see this function-call:

  my_qsort2(join->best_ref + join->const_tables,
            join->table_count - join->const_tables, sizeof(JOIN_TAB*),
            jtab_sort_func, (void*)join->emb_sjm_nest);

The function my_qsort2() is an extended version of the qsort-algorithm with some treatment of special cases. You will find the code in mysys/mf_qsort.c.

The variable jtab_sort_func contains the address of the comparison-function. In the case of the SQL-statement used here it's the address of the function join_tab_cmp()⁴⁾. In this function the comparison is done based on the estimated number of records in the table, so the table TestSmall with it's number of (estimated) 30 records to fetch will come first. After modifying the function join_tab_cmp() a bit to return a different (=wrong) answer the table-order will not be changed. Now we can look into the cost-calculation again and see the costs for both possible paths computed up to the end. These are the values:

plan #1:
TestBig:
current_record_count= 458,262
current_read_time   = 549,915
TestBig -> TestSmall:
current_record_count= 13,747,860
current_read_time   = 17,505,609

plan #2:
TestSmall:
current_record_count= 30
current_read_time   = 37
TestSmall -> TestBig:
current_record_count= 13,747,860
current_read_time   = 16,497,499

After the computation of the second plan is finished the resulting valuess are compared. Some lines above I presented some code-lines:

        if (current_read_time < join->best_read)

With the values given above this translates into

        if ( 16497499 < 17505609 )

This computes to true and explains why the plan starting with table TestSmall is chosen.

verifying

All the comments above are based on inspecting the code during execution-time. Let's look at it from a different point of view for verifying these statements.

First I want to tell the facts:


TestSmall:      60 records with Hersteller = '00020'
TestBig:   461,016 records with Hersteller = '36367'

In the post table-scan or index-scan I introduced some counter-variables and I want to use them to verify what I've written in this post so far.

So let's start with the statement using a JOIN. With the counter-variables added to the MyISAM-engine and the statement executed the log-file contained these lines:

case JOIN:
table = <testsmall> counter table-scan: 0   index-scan: 61         = 60 + 1
table = <testbig>   counter table-scan: 0   index-scan: 27,661,020 = 60 * (461,016+1)

In the case of the statement with the STRAIGHT_JOIN I found these lines:

case STRAIGHT_JOIN:
table = <testbig>   counter table-scan: 0   index-scan: 461,017    = 461,016 + 1
table = <testsmall> counter table-scan: 0   index-scan: 28,121,976 = 461,016 * (60+1)

I've modified the output a bit to show where the numbers come from.

You see that no table-scan was done, the index is used instead. In the JOIN-case the server fetched 60 records from TestSmall and for every record fetched it fetches a lot of records from the table TestBig (including the fetch that detects the end-condition). For the STRAIGHT_JOIN-case it fetches record after record from TestBig and for every record fetched it fetches 60 records (plus one fetch for the end-condition) from TestSmall.

decision

Was the decision of the optimizer correct? Looking at the amount of time needed for executing the two statements one could clearly say: no. But if you look at the number of records fetched in both cases and compare these you will see that these numbers prefer the JOIN-case (the plan TestSmall -> TestBig).

The decision is based on estimations (to reduce the time needed for making the decision). And it's always the same code that will be executed on a variety of machines with different CPUs and hard-disk(s) of different speed. So one cannot expect to find the best execution-plan on all machines of this world. In the case presented here the 'not-so-good-plan' is chosen but the difference to the 'good plan' is negligible.

correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

some notes:

1) you will find a description on how this number is estimated here. The number differs a bit because the index was dropped and recreated and dropped and recreated and .....

2) you will find the descriptions here and here.

3) here you will find a much bigger discrepancy: optimizer aka JOIN (4).

4) there are other compariosn-functions too. If you look at the code of the given function you will find the other functions beneath it.

Saturday, 24 January 2015

optimizer continued