an unsystematic journey through the source-code of MySQL and MariaDB: December 2014

This is my fourth post about the various aspects of this topic: JOIN.

In the former posts I described:

what's happening in the storage-engine (in this case: MyISAM) by executing the query: JOIN
what's happening in the level above the storage-engine: JOIN (2)
after reading a record from a table this record is inspected: does it fulfill the conditions given in the WHERE-clause? Here you'll find more: JOIN (3)

Everything started with an SQL-statement and an observation I made: why does the database-server choose the wrong path? I will soon explain what "wrong path" means. And all posts describe what's going on in the database-server by executing this statement. Be careful when you generalize the results presented in my posts.

testbed

For this series of posts I've created the tables TestSmall and TestBig of identical structure and put some data in it.
TestSmall:

MariaDB [TestOpt]> select count(*) from TestSmall;
+----------+
| count(*) |
+----------+
|   301036 |
+----------+
1 row in set (0.04 sec)

MariaDB [TestOpt]> desc TestSmall;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| Id          | char(8)      | YES  |     | NULL    |       |
| PZN         | char(7)      | YES  | MUL | NULL    |       |
| EVP         | decimal(7,2) | YES  |     | NULL    |       |
| HAP         | decimal(7,2) | YES  |     | NULL    |       |
| ArtikelBez  | varchar(40)  | YES  |     | NULL    |       |
| ArtikelText | varchar(26)  | YES  |     | NULL    |       |
| Hersteller  | char(5)      | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.01 sec)

MariaDB [TestOpt]> show index from TestSmall;
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestSmall |          1 | PZN      |            1 | PZN         | A         |      301036 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

TestBig:

MariaDB [TestOpt]> select count(*) from TestBig;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> desc TestBig;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| Id          | char(8)      | YES  |     | NULL    |       |
| PZN         | char(7)      | YES  | MUL | NULL    |       |
| EVP         | decimal(7,2) | YES  |     | NULL    |       |
| HAP         | decimal(7,2) | YES  |     | NULL    |       |
| ArtikelBez  | varchar(40)  | YES  |     | NULL    |       |
| ArtikelText | varchar(26)  | YES  |     | NULL    |       |
| Hersteller  | char(5)      | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.00 sec)

MariaDB [TestOpt]> show index from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

observation

The investigation started with an SQL-statement with this result:

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+----------+
| count(*) |
+----------+
|       14 |
+----------+
1 row in set (21.35 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+----------+
| count(*) |
+----------+
|       14 |
+----------+
1 row in set (0.38 sec)

MariaDB [TestOpt]>

You can see the difference in the execution-time of these two statements: 21 secs. vs. 0.4 secs. And you can also see that these two statements are identical except for the usage of the keywords JOIN and STRAIGHT_JOIN.

STRAIGHT_JOIN forces the database to use the tables in this order: TestSmall -> TestBig, whereas by using JOIN in the statement the database does the ordering by itself and it chooses this order: TestBig -> TestSmall.
I verified this by the use of the counter-variables I've introduced here. And also the database tells us this by using EXPLAIN:

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(*) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.B.PZN
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

As you can see the optimizer chooses the table B (=TestBig) as the leading table and scans the full table (10 mio. records). If a match is found it switches over to A (=TestSmall) and does an access via the index using the value found in the column PZN of the record from TestBig.

And here is the EXPLAIN of the second statement:

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(*) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

In this case it starts with the table A (=TestSmall) and scans the full table (300K records). If a match is found it switches over to table B (=TestBig) and in it searches for the given PZN using the index.

the question

Without some force to use the tables in a special order the optimizer chooses an order, which leads to a result taking much more time then the other way. But why? Why does the optimizer think it's plan for executing the query is faster?

optimizer

Simplified this happens in a database-server: a user sends such an SQL-statement to the database-server. The database-server accepts the stream of characters and does the following steps:

checks the syntax
checks the access-rights of the user who sent this
searches for a good path to execute the query
and finally executes the query following the path found

This is a very short and rough description of what's going on inside a database-server when it handles our statement.

In this post I would like to describe what happens in the optimizer-stage of the server when it inspects this statement. This is not a general description of this optimizer nor a discussion of possible ways an optimizer can work. I will follow the route of this statement in the optimizer-stage in this server (I did this in MariaDB). And I want to come to the point where the decision for the wrong path is made and also show why this decision is made this way by the existing software.

book

To repeat myself: it would help us if we can find an introductory text about the optimizer used in MySQL/MariaDB. Unfortunately I didn't find much except this book: here and here. You will find a short introduction into this topic on page 170 of Sasha's book. The book was published in 2007 so keep in mind that the code in MySQL/MariaDB has evolved in the meantime but the book is still helpful (and this is valid not only for this topic).

possible paths

In our statement there are two paths possible for executing the query: start with the table TestSmall and for all records found search for the corresponding records in TestBig Or you can start with the Table TestBig and look for all corresponding records in TestSmall. Sounds easy to find a good path because we have only 2 paths to choose from. But our SQL-statement is simple, think of a statement with 3 tables (6 possible paths) or 4 tables (24 possible paths) in it. And I've worked with statements including 30 tables (not all different) in it. The number of possible orderings of the tables can be calculated by this formula: n!, nicely the page contains some values to show how this number explodes .

So it can be a bit of work to do the calculations and comparisons. And we have to keep in mind what this little drawing tells us.

costs

The server computes a value for a path. This value is called a cost but this is not what we have to pay for executing the statement. This value is simply a number; it is assumed that a higher workload for a path is equivalent to a higher cost. Let's say α and β are possible paths then we assume that this statement is true:

cost( path α ) < cost( path β ) is equivalent to execution( path α ) is faster then execution( path β )
I will soon show how costs are calculated in MariaDB.

functions

Let's look at the code doing this job. You will find most of the functions in the file sql/sql_select.cc (and sql_select.h). Here is the hierarchy:

mysql_execute_command()
    execute_sqlcom_select()
        handle_select()
            mysql_select()
                JOIN::optimize()
                    JOIN::optimize_inner()
                        make_join_statistics()
                            choose_plan()
                                greedy_search()                         // old (or alternate) way: find_best()
                                    best_extension_by_limited_search()
                                        best_access_path()
                                            matching_candidates_in_table
                                        best_extension_by_limited_search()
                                            best_access_path()
                                                matching_candidates_in_table
                                        best_access_path()
                                            matching_candidates_in_table
                                        best_extension_by_limited_search()
                                            best_access_path()
                                                matching_candidates_in_table

In our case most of the work was done in the function-tree starting with best_extension_by_limited_search(). Here the calculation of the cost of a path started with the table TestSmall. As the SQL-statement contains a restriction on the column Hersteller and we do not have an index on this column only a table-scan can be used for this part of the job and therefore the cost of reading the table this way is calculated. Then this path will be extended by adding the table TestBig to the calculations. Here we have to compare the cost of a full table-scan on TestBig with the cost of using the existing index on the column PZN for searching in TestBig for all the records with the given value (from the record in TestSmall) in the column PZN. The final value will be stored (as it's the first path this is the cost of the best path, currently). Then the computation starts with the table TestBig and calculates the cost for reading this table. Then this path will be extended by the table TestSmall and the costs of this (extended) path will be calculated.
As we now have two values we can do a comparison of these two values and find that the last value computed is lower than the first value so the last path will be stored as the better one. And as no other possible paths exist this will be the path executed.

class JOIN

We need some variables for storing intermediate and final results like costs or paths or tables involved in the query and so on. You can find these in the class JOIN defined in sql/sql_select.h. Some important variables in this class are:

table_count: number of tables in the join
table: the tables in the join
join: the tables in the order of the executed QEP (=query execution plan)
best_positions: the best complete QEP so far
best_ref: (partial) QEP, micht be reordered during the search
best_positions: the best QEP so far
positions: current join optimization state
best_read: the cost of the current best QEP (DBL_MAX at start)

There are a lot more in this class. For every table in the SQL-statement there is a struct JOIN_TAB which contains among others the variable read_time which contains the cost of reading a table sequentially.

best plan

I would like to omit all the intermediate steps in calculating the cost of a path so let me jump right to the end of the calculations and show where the comparison/decision is made. It's in the function best_extension_by_limited_search() (in sql/sql_select.cc):

        if (current_read_time < join->best_read)
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions,
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

Let's start with the if-statement. In inspecting our SQL-statement current_read_time has a value of approx. 24 mio, join->best_read a value of approx. 162 mio. (=the cost of the best plan so far). As the first value is less then the second we've found a better plan and therefore the following code-block is executed: it copies the QEP to join->best_positions and stores the corresponding cost in join->best_read.

numbers

Most of the computations are done in double-precision, but I will show only approximate values.

Here is the code that computes these numbers (in sql/sql_select.cc some lines above the if-statement):

      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;

The old plan started with the table TestSmall and added TestBig to it. These are the computations:

      current_record_count= record_count * position->records_read;
           record_count:            301,036              TestSmall: number of records
           position->records_read:  448                  TestBig: number of records / number of different keys in index = 10 mio. / 22298
           current_record_count =   134,864,128          result: 301,036 * 448
                                                         for every record in TestSmall we have to read 448 records in TestBig via the index
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE; 
           read_time:               66,334               current_read_time of TestSmall
           position->read_time:     135,165,164          301,036 * (448 + 1)
           current_record_count:    134,864,128          see above
           TIME_FOR_COMPARE:        5                    a constant
           current_read_time=       162,204,324          result: 66,334 + 13,516,5164 + 134,864,128 / 5

So we have the cost of this path calculated to a value of approx. 162 mio.

Now we check the second possibility: start with TestBig and add TestSmall to it:

      current_record_count= record_count * position->records_read;
           record_count:            1                    TestSmall: number of records / number of different keys in index
           position->records_read:  10,000,000           TestBig: number of records
           current_record_count =   10,000,000           result
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE; 
           read_time:               2,211,613            current_read_time of TestBig
           position->read_time:     20,000,000           10.000.000 * (1 + 1)
           current_record_count:    10,000,000           see above
           TIME_FOR_COMPARE:        5                    a constant
           current_read_time=       24,211,613           result: 2,211,613 + 20,000,000 + 10,000,000 / 5

So this path has an asscociated cost-valuet of approx. 24 mio.

criticize the numbers

Something is wrong with these calculations. Calculating the costs resulted in numbers 162mio. and 24mio. whereas the execution results in a time-consumption of 0.4sec and 21sec., the costs differ in a factor of approx. 8 and the execution of the paths differ in a factor of 40 but in the opposite direction. The difference in the factors is not the problem, but the faster path has the higher costs (at least for the optimizer).

It looks like the optimizer in MariaDB weights the cost of a table-scan much lower then using an index. Or I found a counter example without any relevance in practice.

conclusion

Finding a good cost-function is not an easy task. I've taken one query as an example and found that in this case the optimizer choose the wrong path (by a factor). I don't want to generalize this example, we must be content to find a good compromise. But the optimizer is a central point in the database-software so we should focus our interest on this topic.

MySQL

I did the same tests with MySQL (version 5.5.8) and got similar results: the statement using the JOIN-keyword was much slower then the statement using the STRAIGHT_JOIN-keyword. And the EXPLAINs looked identical to the lists I posted above. But the MySQL-server was a bit faster then the MariaDB-server, which could have been caused by different compile-options (didn't check this).

I didn't dive into the code of the optimizer in MySQL but the code looks similar (but not identical) to the code in MariaDB. The computations of the values representing costs as shown above look a bit different in MySQL. Here is the line where the decision to change the order of the tables is made:

      if ((search_depth == 1) || (current_read_time < join->best_read))
with these values:
search_depth:                  61
current_read_time:     12,211,613
join->best_read:      161,843,081

As you can see: the values differ but the relation is the same as above.

correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

update

ups, an error happened: as a thousands separator I used a dot, e.g. the number 301036 was written as 301.036. But it's also possible to write the same number in this form: 301,036 using a comma as the separator.

The usage of a character as a separator differs from country to country. As this blog is written in English I had to use the separator-character of this language for the numbers. By mistake I used the conventions used in Germany (the dot as a thousands-separator). I corrected this to use a comma.

Monday 15 December 2014

optimizer aka JOIN (4)