an unsystematic journey through the source-code of MySQL and MariaDB: 2014

Monday, 15 December 2014

optimizer aka JOIN (4)

This is my fourth post about the various aspects of this topic: JOIN.

In the former posts I described:

what's happening in the storage-engine (in this case: MyISAM) by executing the query: JOIN
what's happening in the level above the storage-engine: JOIN (2)
after reading a record from a table this record is inspected: does it fulfill the conditions given in the WHERE-clause? Here you'll find more: JOIN (3)

Everything started with an SQL-statement and an observation I made: why does the database-server choose the wrong path? I will soon explain what "wrong path" means. And all posts describe what's going on in the database-server by executing this statement. Be careful when you generalize the results presented in my posts.

testbed

For this series of posts I've created the tables TestSmall and TestBig of identical structure and put some data in it.
TestSmall:

MariaDB [TestOpt]> select count(*) from TestSmall;
+----------+
| count(*) |
+----------+
|   301036 |
+----------+
1 row in set (0.04 sec)

MariaDB [TestOpt]> desc TestSmall;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| Id          | char(8)      | YES  |     | NULL    |       |
| PZN         | char(7)      | YES  | MUL | NULL    |       |
| EVP         | decimal(7,2) | YES  |     | NULL    |       |
| HAP         | decimal(7,2) | YES  |     | NULL    |       |
| ArtikelBez  | varchar(40)  | YES  |     | NULL    |       |
| ArtikelText | varchar(26)  | YES  |     | NULL    |       |
| Hersteller  | char(5)      | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.01 sec)

MariaDB [TestOpt]> show index from TestSmall;
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestSmall |          1 | PZN      |            1 | PZN         | A         |      301036 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

TestBig:

MariaDB [TestOpt]> select count(*) from TestBig;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> desc TestBig;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| Id          | char(8)      | YES  |     | NULL    |       |
| PZN         | char(7)      | YES  | MUL | NULL    |       |
| EVP         | decimal(7,2) | YES  |     | NULL    |       |
| HAP         | decimal(7,2) | YES  |     | NULL    |       |
| ArtikelBez  | varchar(40)  | YES  |     | NULL    |       |
| ArtikelText | varchar(26)  | YES  |     | NULL    |       |
| Hersteller  | char(5)      | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.00 sec)

MariaDB [TestOpt]> show index from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

observation

The investigation started with an SQL-statement with this result:

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+----------+
| count(*) |
+----------+
|       14 |
+----------+
1 row in set (21.35 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+----------+
| count(*) |
+----------+
|       14 |
+----------+
1 row in set (0.38 sec)

MariaDB [TestOpt]>

You can see the difference in the execution-time of these two statements: 21 secs. vs. 0.4 secs. And you can also see that these two statements are identical except for the usage of the keywords JOIN and STRAIGHT_JOIN.

STRAIGHT_JOIN forces the database to use the tables in this order: TestSmall -> TestBig, whereas by using JOIN in the statement the database does the ordering by itself and it chooses this order: TestBig -> TestSmall.
I verified this by the use of the counter-variables I've introduced here. And also the database tells us this by using EXPLAIN:

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(*) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.B.PZN
         rows: 1
        Extra: Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

As you can see the optimizer chooses the table B (=TestBig) as the leading table and scans the full table (10 mio. records). If a match is found it switches over to A (=TestSmall) and does an access via the index using the value found in the column PZN of the record from TestBig.

And here is the EXPLAIN of the second statement:

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(*) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

In this case it starts with the table A (=TestSmall) and scans the full table (300K records). If a match is found it switches over to table B (=TestBig) and in it searches for the given PZN using the index.

the question

Without some force to use the tables in a special order the optimizer chooses an order, which leads to a result taking much more time then the other way. But why? Why does the optimizer think it's plan for executing the query is faster?

optimizer

Simplified this happens in a database-server: a user sends such an SQL-statement to the database-server. The database-server accepts the stream of characters and does the following steps:

checks the syntax
checks the access-rights of the user who sent this
searches for a good path to execute the query
and finally executes the query following the path found

This is a very short and rough description of what's going on inside a database-server when it handles our statement.

In this post I would like to describe what happens in the optimizer-stage of the server when it inspects this statement. This is not a general description of this optimizer nor a discussion of possible ways an optimizer can work. I will follow the route of this statement in the optimizer-stage in this server (I did this in MariaDB). And I want to come to the point where the decision for the wrong path is made and also show why this decision is made this way by the existing software.

book

To repeat myself: it would help us if we can find an introductory text about the optimizer used in MySQL/MariaDB. Unfortunately I didn't find much except this book: here and here. You will find a short introduction into this topic on page 170 of Sasha's book. The book was published in 2007 so keep in mind that the code in MySQL/MariaDB has evolved in the meantime but the book is still helpful (and this is valid not only for this topic).

possible paths

In our statement there are two paths possible for executing the query: start with the table TestSmall and for all records found search for the corresponding records in TestBig Or you can start with the Table TestBig and look for all corresponding records in TestSmall. Sounds easy to find a good path because we have only 2 paths to choose from. But our SQL-statement is simple, think of a statement with 3 tables (6 possible paths) or 4 tables (24 possible paths) in it. And I've worked with statements including 30 tables (not all different) in it. The number of possible orderings of the tables can be calculated by this formula: n!, nicely the page contains some values to show how this number explodes .

So it can be a bit of work to do the calculations and comparisons. And we have to keep in mind what this little drawing tells us.

costs

The server computes a value for a path. This value is called a cost but this is not what we have to pay for executing the statement. This value is simply a number; it is assumed that a higher workload for a path is equivalent to a higher cost. Let's say α and β are possible paths then we assume that this statement is true:

cost( path α ) < cost( path β ) is equivalent to execution( path α ) is faster then execution( path β )
I will soon show how costs are calculated in MariaDB.

functions

Let's look at the code doing this job. You will find most of the functions in the file sql/sql_select.cc (and sql_select.h). Here is the hierarchy:

mysql_execute_command()
    execute_sqlcom_select()
        handle_select()
            mysql_select()
                JOIN::optimize()
                    JOIN::optimize_inner()
                        make_join_statistics()
                            choose_plan()
                                greedy_search()                         // old (or alternate) way: find_best()
                                    best_extension_by_limited_search()
                                        best_access_path()
                                            matching_candidates_in_table
                                        best_extension_by_limited_search()
                                            best_access_path()
                                                matching_candidates_in_table
                                        best_access_path()
                                            matching_candidates_in_table
                                        best_extension_by_limited_search()
                                            best_access_path()
                                                matching_candidates_in_table

In our case most of the work was done in the function-tree starting with best_extension_by_limited_search(). Here the calculation of the cost of a path started with the table TestSmall. As the SQL-statement contains a restriction on the column Hersteller and we do not have an index on this column only a table-scan can be used for this part of the job and therefore the cost of reading the table this way is calculated. Then this path will be extended by adding the table TestBig to the calculations. Here we have to compare the cost of a full table-scan on TestBig with the cost of using the existing index on the column PZN for searching in TestBig for all the records with the given value (from the record in TestSmall) in the column PZN. The final value will be stored (as it's the first path this is the cost of the best path, currently). Then the computation starts with the table TestBig and calculates the cost for reading this table. Then this path will be extended by the table TestSmall and the costs of this (extended) path will be calculated.
As we now have two values we can do a comparison of these two values and find that the last value computed is lower than the first value so the last path will be stored as the better one. And as no other possible paths exist this will be the path executed.

class JOIN

We need some variables for storing intermediate and final results like costs or paths or tables involved in the query and so on. You can find these in the class JOIN defined in sql/sql_select.h. Some important variables in this class are:

table_count: number of tables in the join
table: the tables in the join
join: the tables in the order of the executed QEP (=query execution plan)
best_positions: the best complete QEP so far
best_ref: (partial) QEP, micht be reordered during the search
best_positions: the best QEP so far
positions: current join optimization state
best_read: the cost of the current best QEP (DBL_MAX at start)

There are a lot more in this class. For every table in the SQL-statement there is a struct JOIN_TAB which contains among others the variable read_time which contains the cost of reading a table sequentially.

best plan

I would like to omit all the intermediate steps in calculating the cost of a path so let me jump right to the end of the calculations and show where the comparison/decision is made. It's in the function best_extension_by_limited_search() (in sql/sql_select.cc):

        if (current_read_time < join->best_read)
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions,
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

Let's start with the if-statement. In inspecting our SQL-statement current_read_time has a value of approx. 24 mio, join->best_read a value of approx. 162 mio. (=the cost of the best plan so far). As the first value is less then the second we've found a better plan and therefore the following code-block is executed: it copies the QEP to join->best_positions and stores the corresponding cost in join->best_read.

numbers

Most of the computations are done in double-precision, but I will show only approximate values.

Here is the code that computes these numbers (in sql/sql_select.cc some lines above the if-statement):

      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;

The old plan started with the table TestSmall and added TestBig to it. These are the computations:

      current_record_count= record_count * position->records_read;
           record_count:            301,036              TestSmall: number of records
           position->records_read:  448                  TestBig: number of records / number of different keys in index = 10 mio. / 22298
           current_record_count =   134,864,128          result: 301,036 * 448
                                                         for every record in TestSmall we have to read 448 records in TestBig via the index
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE; 
           read_time:               66,334               current_read_time of TestSmall
           position->read_time:     135,165,164          301,036 * (448 + 1)
           current_record_count:    134,864,128          see above
           TIME_FOR_COMPARE:        5                    a constant
           current_read_time=       162,204,324          result: 66,334 + 13,516,5164 + 134,864,128 / 5

So we have the cost of this path calculated to a value of approx. 162 mio.

Now we check the second possibility: start with TestBig and add TestSmall to it:

      current_record_count= record_count * position->records_read;
           record_count:            1                    TestSmall: number of records / number of different keys in index
           position->records_read:  10,000,000           TestBig: number of records
           current_record_count =   10,000,000           result
      current_read_time=read_time + position->read_time + current_record_count / (double) TIME_FOR_COMPARE; 
           read_time:               2,211,613            current_read_time of TestBig
           position->read_time:     20,000,000           10.000.000 * (1 + 1)
           current_record_count:    10,000,000           see above
           TIME_FOR_COMPARE:        5                    a constant
           current_read_time=       24,211,613           result: 2,211,613 + 20,000,000 + 10,000,000 / 5

So this path has an asscociated cost-valuet of approx. 24 mio.

criticize the numbers

Something is wrong with these calculations. Calculating the costs resulted in numbers 162mio. and 24mio. whereas the execution results in a time-consumption of 0.4sec and 21sec., the costs differ in a factor of approx. 8 and the execution of the paths differ in a factor of 40 but in the opposite direction. The difference in the factors is not the problem, but the faster path has the higher costs (at least for the optimizer).

It looks like the optimizer in MariaDB weights the cost of a table-scan much lower then using an index. Or I found a counter example without any relevance in practice.

conclusion

Finding a good cost-function is not an easy task. I've taken one query as an example and found that in this case the optimizer choose the wrong path (by a factor). I don't want to generalize this example, we must be content to find a good compromise. But the optimizer is a central point in the database-software so we should focus our interest on this topic.

MySQL

I did the same tests with MySQL (version 5.5.8) and got similar results: the statement using the JOIN-keyword was much slower then the statement using the STRAIGHT_JOIN-keyword. And the EXPLAINs looked identical to the lists I posted above. But the MySQL-server was a bit faster then the MariaDB-server, which could have been caused by different compile-options (didn't check this).

I didn't dive into the code of the optimizer in MySQL but the code looks similar (but not identical) to the code in MariaDB. The computations of the values representing costs as shown above look a bit different in MySQL. Here is the line where the decision to change the order of the tables is made:

      if ((search_depth == 1) || (current_read_time < join->best_read))
with these values:
search_depth:                  61
current_read_time:     12,211,613
join->best_read:      161,843,081

As you can see: the values differ but the relation is the same as above.

correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

update

ups, an error happened: as a thousands separator I used a dot, e.g. the number 301036 was written as 301.036. But it's also possible to write the same number in this form: 301,036 using a comma as the separator.

The usage of a character as a separator differs from country to country. As this blog is written in English I had to use the separator-character of this language for the numbers. By mistake I used the conventions used in Germany (the dot as a thousands-separator). I corrected this to use a comma.

Friday, 14 November 2014

JOIN (3)

this is the third post with my observations when I inspect a simple SQL-statement. You can find the other posts here: JOIN and JOIN (2).
And this is the SQL-statement I'm looking at:

MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.48 sec)

MariaDB [TestOpt]>

The first post described what is happening in the storage-engine (in my case this is of type MyISAM), the second post described what happened at the level(s) above. And in this post I will narrow my focus and look at a very tiny point: applying the WHERE-clause.

This is a specialized description, it's only valid for the SQL-statement given above. Don't generalize this description, but it should give you a point to start with your own examinations.

applying WHERE

As I've written in my last post you will find the code for handling this SQL-statement in the function sub_select() (in sql/sql_select.cc). The WHERE-clause is inspected in the function evaluate_join_record() (called from sub_select()). The whole WHERE-clause is handled by this single line in this function:


    select_cond_result= MY_TEST(select_cond->val_int());

As written before the execution of this query starts with the table TestSmall, reading record by record and applying the WHERE-clause for each record. When a match is found it switches over to the table TestBig and searches the corresponding records in this table. For each record found in TestBig the WHERE-clause is applied too.

So let's look at this code-line.

MY_TEST is simply a macro defined in my_global.h:


#define MY_TEST(a) ((a) ? 1 : 0)

it converts the result into a boolean-type.

select_cond is a pointer to an object of type COND, so let's look at this.

code

If we want to know what this COND contains I suggest to modify the code of MariaDB a bit (I did this in MariaDB but you can do this in MySQL too, it should be a similar task). You have to add these functions to the code in ha_myisam.cc:

display_functype()
ha_myisam::cond_push()
handle_Item()

These functions are described here: WHERE. Please do not forget to modify the header-file accordingly, and add the declaration of the 2 internal functions in the top of the cc-file.
Additionally I modified the function evaluate_join_record() (located in sql/sql_select.cc) and added the following 3 lines (marked in bold):

  DBUG_ENTER("evaluate_join_record");
  fprintf(stderr, "evaluate_join_record() for table: %s\n", join_tab->table->s->table_name.str);
  join_tab->table->file->cond_push( select_cond) ;
  fprintf(stderr, "end of COND.\n"); 
  DBUG_PRINT("enter", .............

The DBUG-statements should show you where to add these lines. The action happens in the call of the function cond_push() which writes the WHERE-clause to the file /var/log/mysql/error.log.
So please add these lines and functions, compile everything and start the server. We will now look what's happening when we execute the SQL-statement given above.

TestSmall

Let's inspect the first record from the table TestSmall. This is what this record looks like:

MariaDB [TestOpt]> select *  from TestSmall limit 1;
+---------+---------+------+------+--------------------------+---------------------+------------+
| Id      | PZN     | EVP  | HAP  | ArtikelBez               | ArtikelText         | Hersteller |
+---------+---------+------+------+--------------------------+---------------------+------------+
| 1000001 | 4972059 | 9.70 | 4.32 | PASCOSEDON TROPFEN 20 ML | PASCOSEDON TROPFEN  | 36350      |
+---------+---------+------+------+--------------------------+---------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

When the SQL-statement from the entry of this post is executed the function evaluate_join_record() is called and in the case of the first record evaluated the code added produces this output::

table_name = <TestSmall>    <select SQL_NO_CACHE count(*) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'>
COND-ITEM   args: 0 type=[COND_AND_FUNC]    
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [A] [Hersteller]  str_value=<36350> name=<Hersteller>
STRING-ITEM     str_value=<00020>   name=<00020>
FUNC-ITEM   [isnotnull] args: 1 type=[ISNOTNULL_FUNC]   
FIELD-ITEM  [TestOpt] [A] [PZN] name=<PZN>
end of COND.

Maybe this form is more intuitv for you:

As you can see the WHERE-condition was modified by the server:

it contains only the part relevant for checking the table TestSmall
the check for PZN IS NOT NULL was added

Let's look into the function-hierarchy of the evaluation of this record:

TestSmall: check record with Id = 1000001 (compare: 36350 and  00020)
evaluate_join_record(): select_cond_result= MY_TEST(select_cond->val_int());
    Item_cond_and::val_int()
        Item::val_bool()
            Item_func_eq::val_int()
                Arg_comparator::compare()
                    Arg_comparator::compare_string()
                        Item_field::val_str()
                        Item_string::val_str()
                        sortcmp()
                            my_strnncollsp_simple()

The last function returns a non-zero-value which says that the two arguments compared are not equal so the whole expression is false (at the top-level there is an AND-operation. When one of the arguments is false then the whole AND results in a false and therefore the right branch of the tree is not evaluated).

With this result the record is thrown away and the next record from TestSmall is read. This process will be repeated until this record is read:

MariaDB [TestOpt]> select * from TestSmall where Hersteller = '00020' limit 1;
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN    | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 178324 | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.01 sec)

MariaDB [TestOpt]>

When the WHERE-clause is applied to this record this is the function-hierarchy in this case:

TestSmall: check the record with Id = 1002100
evaluate_join_record(): select_cond_result= MY_TEST(select_cond->val_int());
    Item_cond_and::val_int()
        Item::val_bool()
            Item_func_eq::val_int()
                Arg_comparator::compare()
                    Arg_comparator::compare_string()
                        Item_field::val_str()
                        Item_string::val_str()
                        sortcmp()
                            my_strnncollsp_simple()
            Item::val_bool()
                Item_func_isnotnull::val_int()
                    Item_field::is_null()
                        Field::is_null()

As you can see the right branch of the tree is also checked for this record as the column Hersteller contains the correct value. So the server switches over to the table TestBig and searchess for records with the value '178324' in the column PZN (the value is taken from the current record from TestSmall).

TestBig

Again I will only look at the code that handles the WHERE-clause. So here is the first record from TestBig to evaluate:

MariaDB [TestOpt]> select * from TestBig where PZN = 178324 limit 1;
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| 01046189 | 178324 |  10.45 | 7.99 | MARLENE MS K1 AD K BLACK M      2 ST | BASILIKUM AETHERISCHES OEL | 20860      |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
1 rows in set (0.00 sec)

MariaDB [TestOpt]>

This is the WHERE-statement as printed by the code shown above in the case of this record from the table TestBig:

table_name = <TestBig>  <select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'>
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [B] [Hersteller]  str_value=<20860>   name=<Hersteller>
STRING-ITEM     str_value=<36367>   name=<36367>
end of COND.

And here it is in a different form:

This hierarchy shows how this WHERE-clause is applied:

TestBig: check record with Id = 01046189 (compare: 20860 and 36367)
evaluate_join_record()
    Item_func_eq::val_int()
        Arg_comparator::compare()
            Arg_comparator::compare_string()
                sortcmp()
                    my_strnncollsp_simple()

The last function returns a non-zero value so this record is thrown away and the next record is read from the table TestBig (the next record with the value '178324' in the column PZN).

COND

On entering the function evaluate_join_record() the (partial) WHERE-clause is extracted, a variable of type COND * points to this object.

COND is nothing more than an alias for the class Item which is defined in the file sql/item.h. Item is the base-class for a lot of classes used for bringing the WHERE-clause into an internal form. You can find a class-hierrachy here: Item Class Reference. Don't let this page confuse you. As you've seen in the hierarchies above the server used some of them to construct our WHERE-clause. And thanks to inheritance and object-orientation it was fairly easy to apply this WHERE-clause to a record.

correctness

Wednesday, 5 November 2014

JOIN (2)

here I'm presentiung some more information about how MySQL/MariaDB handles a JOIN.

In my post JOIN I described what the storage-engine does in executing a simple (but non-trivial) SQL-statement. Here I take the same statement and describe what the database-server does. So in my last post my description started when the server calls a function in the engine, in this post my description will end when such a function is called. The last post described function-hierarchies in the MyISAM-engine, this post will describe hierarchies above the engine-level and therefore valid for all storage-engines.

environment

The environment is the same as in the post JOIN. The are two tables: TestSmall with approx. 300K records in it and TestBig with 10mio. records in it. The structure of the tables is the same as before. And on both tables there is an index on the column PZN.

statement

And here is the statement I want to inspect:

MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.48 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

Also included is the EXPLAIN for this SQL-statement.

pseudocode

Here is what the server does (in pseudocode): the server scans the table TestSmall and fetches record after record (table-scan). For each record the WHERE-clause will be applied and if a match is found the server extracts from the current record the value of the column PZN, switches over to the table TestBig and searches all records with this value. For this operation it uses the index on the column PZN of TestBig. Again for each record found in TestBig the WHERE-clause is applied, in the case of a match this record is handled.
This process is continued in TestBig until all records with this PZN-value are examined. It then switches back to the table TestSmall and continues scanning through this table.
The whole process stops when all records in TestSmall are read.

hierarchy

Instead of long explanations here is the function-hierarchy for executing this statement:

 JOIN::exec()
     JOIN::exec_inner()
         do_select()
             sub_select()
                 // access the first record:
                 (*join_tab->read_first_record)() -> join_init_read_record()
                     init_read_record()
                         table->file->ha_rnd_init_with_error() -> handler::ha_rnd_init_with_error()
                             ha_rnd_init() -> handler::ha_rnd_init(
                                 rnd_init() -> ha_myisam::rnd_init
                         (*tab->read_record.read_record)() -> rr_sequential()
                              info->table->file->ha_rnd_next() -> handler::ha_rnd_next()
                                  rnd_next() -> ha_myisam::rnd_next()
                 evaluate_join_record()

                 // iterate: read all records until match
                info->read_record() -> rr_sequential()
                    info->table->file->ha_rnd_next() -> handler::ha_rnd_next()
                        rnd_next() -> ha_myisam::rnd_next()
                evaluate_join_record()
                // until END_OF_FILE

The whole work is done in the function sub_select(). The function is entered with 2 objects plus a boolean, let's look at the objects: an object of class JOIN and an object of struct JOIN_TAB. The class JOIN contains information about all tables to be accessed in executing this query, the struct JOIN_TAB contains information about one table, the table that is inspected by this call of sub_select().

If you want to look into the data of these parameters please add the following lines in the code of the function sub_select() (found in the file sql/sql_select.cc) after entrance into this function:

  TABLE ** ptrTables = join->table;
  char *ptrTable ;
  fprintf(stderr, "entering sub_select:\n\tstructure JOIN:\n");
  while ( *ptrTables != NULL )
  {
      ptrTable = (*ptrTables)->s->table_name.str;
      fprintf(stderr, "\t\ttable: <%s>\n", ptrTable);
      ++ptrTables;
  }
  char *ptr2ndTable = join_tab->table->s->table_name.str;
  fprintf(stderr, "\tstructure JOIN_TAB:\n\t\ttable: <%s>\n", ptr2ndTable);

And here is the output found in the log-file (on my machine it's in /var/log/mysql/error.log):

entering sub_select: 
    structure JOIN:
        table: <TestSmall>
        table: <TestBig>
    structure JOIN_TAB:
        table: <TestSmall>

As you can see the JOIN contains information about the tables TestSmall and TestBig (in this order), JOIN_TAB contains information about TestSmall and we will start with this table.

Coming back to the function-call-hierarchy given above let's break it into pieces and look at it. As reading the first record may need an extra step for initialization this access is handled separately:

in sub_select():
                 (*join_tab->read_first_record)() -> join_init_read_record()
                     init_read_record()
                         table->file->ha_rnd_init_with_error() -> handler::ha_rnd_init_with_error()
                             ha_rnd_init() -> handler::ha_rnd_init()
                                 rnd_init() -> ha_myisam::rnd_init()
                         (*tab->read_record.read_record)() -> rr_sequential()
                              info->table->file->ha_rnd_next() -> handler::ha_rnd_next()
                                  rnd_next() -> ha_myisam::rnd_next()
                 evaluate_join_record()

The last function in this hierarchy is evaluate_join_record(). In this function the record read is examined (= the WHERE-clause will be applied).

After the first record is read and processed all following records are read by this loop (again I show you the hierarchy of function-calls, not the full code executed):

in sub_select():
                // iterate: read all records until match
                info->read_record() -> rr_sequential()
                    info->table->file->ha_rnd_next() -> handler::ha_rnd_next()
                        rnd_next() -> ha_myisam::rnd_next()
                evaluate_join_record()
                // until END_OF_FILE

That's all.

But I've only shown how the table TestSmall is accessed. So where does the code jump over and accesses TestBig? This happens in the function evaluate_join_record(). Here the WHERE-clause will be applied to the current record (from TestSmall) and in case of a match the function sub_select() is called (a recursion). This is the line of code in evaluate_join_record() that calls sub_select():

      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0);

On entering the function sub_select() again the parameters now look like:

entering sub_select: 
    structure JOIN:
        table: <TestSmall>
        table: <TestBig>
    structure JOIN_TAB:
        table: <TestBig>

You see that the data in JOIN is identical to the first entry but the value of the JOIN_TAB now points to data regarding TestBig.

As described above the code in sub_select() checks for the first record, now in the table TestBig:

in sub_select():
                        // search for the first record:
                        (*join_tab->read_first_record)() -> join_read_always_key()
                             table->file->ha_index_init() -> handler::ha_index_init()
                                 index_init() -> ha_myisam::index_init()
                             table->file->ha_index_read_map() -> handler::ha_index_read_map()
                                 index_read_map() -> ha_myisam::index_read_map()
                         evaluate_join_record()

Seems like an identical hierarchy but the functions called differ because of a different JOIN_TAB.

If there are more records in TestBig these are fetched by these function-calls (still in sub_select()):

in sub_select():
                         // iterate: search all records until a no-match
                         info->read_record() -> join_read_next_same()
                             table->file->ha_index_next_same() -> handler::ha_index_next_same()
                                 index_next_same() -> ha_myisam::index_next_same()
                         evaluate_join_record()
                         // until NESTED_LOOP_NO_MORE_ROWS

And this finishes our journey through the handling of the given SQL-statement.

correctness

Wednesday, 22 October 2014

JOIN

By playing with an SQL-statement two questions arised. After diving into the code I was able to answer at least one question and I would like to show my results here. The question was: how does the database-server handle a JOIN? The other question was: among all possible ways for handling this query why does the database-server choose this way? But this is another story.

Let's come back to the first question and look into the results of my observations.

MyISAM

A lot of my statements here are specific for the MyISAM-engine because the tables used for the tests are of type MyISAM. Other storage-engines can handle things differently so don't apply the results presented here schematically to other engines.

And as a second argument a lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets etc.).

preparations

Let's create some tables, put some data into them and start with the test.

MariaDB [(none)]> create database TestOpt;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestOpt;
Database changed
MariaDB [TestOpt]> create table TestBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> create table TestSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (2.21 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (1 min 12.97 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestSmall(PZN);
Query OK, 301036 rows affected (2.08 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 28.40 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]>

And this is the statement I would like to inspect:

MariaDB [TestOpt]>  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.41 sec)

MariaDB [TestOpt]>

This statement is silly, it has no special meaning except to be used for this demonstration here.

Let the database-server tell us how it handles this statement:

MariaDB [TestOpt]> explain  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 6
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

Looks like it takes the table TestSmall (aka A) to scans all records in it, applies WHERE to the records found and for those records meeting the WHERE-condition it searches the corresponding record(s) from TestBig (aka B) via the index. It then puts everything together and presents the result.

code

To look further into the handling of this statement I used the modifications presented here and counted the number of calls of some routines. Here is the result:

select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36035';
table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 311

The assumption above seems to be correct: the table TestSmall has approx. 300K records in it as you can see from the INSERT-statement above, these are read via a table-scan (counted by the variable counterTableScan).
For the table TestBig no table-scan is made (var. counterTableScan is 0) but instead index-related-functions are called, only 311times these functions are called. As this table contains 10mio records this is a wise decision to check only this small amount of records.

Let's inspect the data in these two tables:

MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.38 sec)

MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
|   461016 |
+----------+
1 row in set (12.80 sec)

MariaDB [TestOpt]>

In the table TestSmall we have 60 records, in the table TestBig we have approx. 460K records fulfilling the WHERE-clause.

scanning TestSmall

There are three functions in the storage-engine for handling a table-scan:

rnd_init() is called once for starting a table-scan
rnd_next() returns one record. This function is called again and again until all records are read. In this case it returns MHA_ERR_END_OF_FILE.
rnd_end() is called once after all records are scanned

Using these functions the table TestSmall is read. Record after record is fetched and the column Hersteller is checked. After 2100 records a record with Hersteller = '00020' is encountered.

MariaDB [TestOpt]> select SQL_NO_CACHE * from TestSmall where Hersteller = '00020' limit 1;
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN    | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 178324 | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.06 sec)

MariaDB [TestOpt]>

This is the first record with Hersteller = '00020' in TestSmall. In this record the value for PZN is '173824'. Let's look into TestBig for records with this value in the column PZN:

MariaDB [TestOpt]> select SQL_NO_CACHE * from TestBig where PZN = '178324' limit 3;
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| 01046189 | 178324 |  10.45 | 7.99 | MARLENE MS K1 AD K BLACK M      2 ST | BASILIKUM AETHERISCHES OEL | 20860      |
| 01089558 | 178324 | 141.72 | 0.00 | HEVERTOPLEX NR148N POPULUS     30 ML | NOVO HELISEN DEP BIR/ER/HA | 01250      |
| 01240196 | 178324 |  17.90 | 5.22 | BIOCHEMIE 10 NATR SULF D12     80 ST | SUPRIMA BETTUN M EG 50X70  | 36245      |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
3 rows in set (0.08 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestBig where PZN = '178324';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestOpt]>

As you can see there are 60 records in it and I presented only the first 3 records.

TestBig via index

There are a lot of functions for accessing an index, their names start with index_xxxxx(). But only two of them are of interest for our test-case:

index_read_map() starts the search. It looks for the first occurrence of a key (in var. key) in the index. If this key is found it returns the corresponding record via the var. buf, otherwise HA_ERR_KEY_NOT_FOUND is returned.
index_next_same() searches for the next entry in the index and returns the corresponding record in the var. buf. If there is no entry identical to the key given it returns HA_ERR_END_OF_FILE.

But we need more information: how is the index organized?

BTREE

MyISAM stores all index-related data in the file with the extension MYI. In our test-case only one index is created but the following statements are also valid in the case of multiple indexes.

The index in MyISAM is of type BTREE, you will find a description of this here. And you will find the description of the structure of the MYI-file here: 20.2 The .MYI File.

tree

Let's dive a bit deeper into the handling of an index and into our test-case.
The tree is organized in pages of size 1K. There are 2 types of pages: nodes and leafs. A node contains one ore more entries with this structure:

a pointer to another node or leaf containing entries with keys less than the following key (5 bytes)
the key = the value of the column of one record (1+7 bytes)
the offset in MYD of the corresponding record (6 bytes)

After the last entry in a node follow a pointer (5 bytes) that contains the number of a page with key-values greater than the last key in this node.

An entry conists of 19 bytes so up to 53 entries can be found in a node ( because 2 + 53*19 + 5 < 1024).

For a leaf the structure differs a bit:

the key = the value of the column(s) which build the index
the offset in MYD of the corresponding record

A leaf is the end of a path in the tree, nodes occur on the way to the leaf. Or said differently: nodes have successors (nodes or leafs), leafs do not. This explains why a leaf contains no references to other pages. As an entry in a leaf is 14 bytes long up to 73 entries fit into a page.

In MyISAM you can easily detect if a page is a node: in the first byte of the page the MSB ist set (so the first byte is >= 0x80).

You can find a nice picture demonstrating this in the Wikipedia-article.

I will give you examples for this structure soon. As written before these examples are taken from my tests and depend on my data.

unique

In the tree the entries are unique and they are sorted, usually ascending.

In the index the entries are unique meaning that for every record in the table there is one entry in the index. As the entry contains the file-offset from the data-file the uniqueness is guaranteed when you look at the full 14 bytes of an entry. This has nothing to do with the statement CREATE UNIQUE INDEX which guarantees uniqueness in the first 8 bytes (in our test-case).

function-calls

Coming back to the query: a record is found in TestSmall, the PZN is extracted and has the value '178324'. Now the server switches over to the table TestBig and searches for records with this value in the column PZN. As TestBig has an index on the column PZN it will use it. Here is the function-hierarchy:

ha_myisam::index_read_map(() 
    mi_rkey()
        _mi_search() with pos = 0x08859800 (aka: root of tree)
            _mi_search() with pos = 0x5a5400
                _mi_search() with pos = 0x30f800
                    _mi_search() with pos = 0x305000
        (*info->read_record) calls _mi_read_dynamic_record()

index_read_map() is the function the database-server calls for reading the first record to the given PZN-value. From the moment of entering this function up to the moment this function is left everything is the responsibility of the storage-engine. The PZN-value to search for is given in the parameter key and in our case it looks like this: 0x00 (a flag) + '178324 '. As the column PZN is created with a width of 7 bytes the key also contains a trailing space. So the length for the comparisons will be 8. index_read_map() calls mi_rkey() for further processing.

mi_rkey() now searches through the tree and it calls the function _mi_search() to do the walk-through recursively. It starts the search at 0x08859800 which is a file-offset in the MYI-file. As written before this is valid for my test-case, your data may differ. One can find the start of the tree (=the root) in the MYI-file (output is shortened):

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00    --> 0x08859800

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01 00 03 01 7e 00 b0 00 64 00 c4 00 01
000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00
000020 00 98 96 80 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 98 96 80 ff ff ff ff ff ff ff ff 00 00 00 00
000040 08 85 9c 00 00 00 00 00 33 a9 ba 30 00 00 00 00
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 53 36
000070 00 00 00 28 00 00 00 00 00 00 00 01 00 00 00 00      the offset of the root of the tree
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00      (8 bytes)
000090 00 00 00 00 54 32 ad c8 00 00 00 00 00 00 00 01
       ........

As I've created only one index there is only one entry for the file-offset, there can be multiple offsets. You can also find this value at runtime in the var. info->s->state.key_root[0].

Here is what I found in the MYI-file in the root of the tree (a bit shortened):

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x08859800 --read-bytes=1024 TestBig.MYI
8859800 83 97                               this is a node: 0x80 is set
                                            this page is filled up to 0x0397 bytes

        00 00 00 0b 66                      page-number of another node/leaf
        01 31 37 32 33 31 30 20             PZN = 0x01 + '172310 '
        00 00 0e c1 4a 38                   offset in MYD for this record

        00 00 00 16 95                      page 0x1695 --> offset = 0x1695 * 0x400 = 0x5A5400
        01 32 34 38 32 39 35 20             PZN = 0x01 + '248295 '
        00 00 12 20 9b 6c                   offset in MYD-for this record
........................

We are looking for a value of 0x01 (changed already from 0x00) + '178324 ' and found that it is between the first and the second entry in this page. So from the second entry we take the page-number 0x1695, multiply it with the length of a page (=0x400) and get a new offset: 0x5A5400. Let's look what we will find in MYI at this place:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x5A5400 --read-bytes=1024 TestBig.MYI
5a5400 83 e3                                this is a node
                                            this page contains up to 0x03e3 bytes

       00 00 00 0b 65                       
       01 31 37 33 35 34 35 20              PZN = 0x01 + '173545 '
       00 00 0e 7d 12 9c 

       00 00 00 0b 9c 
       01 31 37 34 37 38 36 20              PZN = 0x01 + '174786 '
       00 00 2b ce 20 40 

       00 00 00 0b d2 
       01 31 37 36 32 35 39 20              PZN = 0x01 + '176259 '
       00 00 2a 32 2a f8 

       00 00 00 0c 08
5a5440 01 31 37 37 39 38 30 20              PZN = 0x01 + '177980 '
       00 00 2b 12 d9 d0                    

       00 00 00 0c 3e                       page 0x0c3e --> file offset = 0x0c3e * 0x400 = 0x30f800
       01 31 37 39 32 30 30 20              PZN = 0x01 + '179200 '
       00 00 11 bc 9b 24 
.........................

Again the software walks through this page comparing the keys. The first key greater than our search key is the fifth entry so we continue our search at 0x30f800. This is what we find there:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x30F800 --read-bytes=1024 TestBig.MYI
30f800 83 e3                                another node

       00 00 00 0c 07 
       01 31 37 38 30 30 35 20              PZN = 0x01 + '178005 '
       00 00 31 10 b1 74 
.................
30f8c0 00 00 00 0c 12 
       01 31 37 38 32 36 34 20              PZN = 0x01 + '178264 '
       00 00 01 a1 a8 b4 

       00 00 00 0c 13 
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
30f8e0 00 00 09 7d 8d f0 

       00 00 00 0c 14                       page = 0x0c14 --> offset 0x305000
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '    
       00 00 0d 23 d0 60                    file-offset = 0x0d23d060

       00 00 00 0c 15                       page = 0x0c15 --> fiile-offset = 0x305400
       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 ' 
       00 00 12 c9 f6 9c 
..........................

Again the software compares the keys and stops when it finds an entry greater or equal to our search-key. As we are interested in the first record with our key we continue our search at file-offset 0x305000:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305000 --read-bytes=1024 TestBig.MYI
305000 03 f2                                this is a leaf!

       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 09 fd 3b 88 

305010 01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 0a 9f ea 2c 
.........
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 33 58 5e e4 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '       
3052f0 00 00 00 3d 17 c4                    file-offset = 0x003d17c4

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '
       00 00 00 76 74 20                    file-offset = 0x00767420
.........

Starting the comparison from the first entry in this page the code in _mi_search() finds an entry (marked in bold above). As this is a leaf we can stop here and return (and return and return and return, as this was a recursion).
The record can be read from the MYD-file at offset 0x003d17c4:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x0000003d17c4 --read-bytes=100 TestBig.MYD
3d17c4 03 
       00 5e                                len record = 94d
       02                                   no. of filler-bytes
       00 80                                flags
       30 31 30 34 36 31 38 39              Id = '01046189'
       31 37 38 33 32 34 20                 PZN = '178324 '
       80 00 0a 2d                          EVP = 10 + 45 (0x0a, 0x2d) 
       80 00 07 63                          HAP =  7 + 99 (0x07, 0x63)
       24 4d 41 52 4c 45 4e 45              ArtikelBez (len 0x24)
       20 4d 53 20 4b 31 20 41 
       44 20 4b 20 42 4c 41 43 
       4b 20 4d 20 20 20 20 20 
       20 32 20 53 54 
       1a 42 41 53 49 4c 49 4b              ArtikelText (len 0x1a)
       55 4d 20 41 45 54 48 45 
       52 49 53 43 48 45 53 20 
       4f 45 4c 
       32 30 38 36 30                       Hersteller = 20860'
       00 00                                filler-bytes

You can find an explanation of this structure in my post with the title od.

Finito, in TestBig we got the first record which is given back to the database-server in the var. buf. Some internal values are stored, among others the entry found in MYI is kept for further processing (the key + offset, now 14 bytes).

Record #1 is found but it does not match to the condition of the WHERE-clause. Nevertheless step 1 is done.

So let's go on. The database-server continues to search for more records in TestBig for the record from TestSmall. Here is the function-hierarchy for the next step(s):

ha_myisam::index_next_same()
    mi_rnext_same()
        _mi_search_next()
             check page: info->int_keypos < info->int_maxpos
               --> fetch next entry from this page
        compare the first 8 bytes of this entry
        identical: (*info->read_record) calls _mi_read_dynamic_record()

Now the entry-point into the storage-engine is the function index_next_same(). As the name suggests it searches for the next entry in the tree with the same key (8 bytes). So the last page was checked and as it contains more entries so the next entry is taken from this page.

This step will be repeated until a key is found greater than the search-key. Here this happens:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305400 --read-bytes=1024 TestBig.MYI
305400 03 f2                                this is a leaf
.....................
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #59
       00 00 32 52 38 f0 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #60
       00 00 32 6c 11 30                    file-offset = 0x326c1130

       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 '        greater then '178324 '
       00 00 00 6c b5 0c                    file-offset = 0x6cb50c
.....................

After record #60 the entry with the key '178353 ' is returned but the function mi_rnext_same() detects that this value differs from the previous one and returns HA_ERR_END_OF_FILE to index_next_same() which returns this value to the database-server.
The server now knows that there are no more records with the key '178324 ' in the table TestBig. It switches over to the table TestSmall and continues with the table-scan. For the next record from TestSmall with a value of Hersteller = '00020' it switches over to TestBig and calls ha_myisam::index_read_map(() giving it as a parameter the value of PZN from the then current record from TestSmall.

more explanations
I'm sure you've detected the fact that in my last block of entries I silently switched from the page at offset 0x305000 to the page at offset 0x305400.
In this example above we scanned the index for entries '178324 ' and the SQL-statement somewhere in the beginning of this post showed that there are 60 records in TestBig with this PZN-value so there must be 60 entries with this value in the tree too. These entries are spread over 2 leafs and one node: 19 entries can bei found at the page (leaf) at offset 0x305000, 40 entries can be found in the page (leaf) at offset 0x305400 and one entry is found in the page (node) at offset 0x30F800.
As shown it's easy to continue the search within the current page but what does the storage-engine do when the end of a page is reached?
It does this:

ha_myisam::index_next_same()()
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            info->int_keypos >= info->int_maxpos 
                --> reached the end of this page --> search from root of the tree
            _mi_search() with pos = 0x8859800 (= root)
                _mi_fetch_keypage(): root-page (from cache)
                _mi_seq_search() in this page
                    stops at 2nd entry, returns 1
                _mi_search() with pos = 0x5a5400
                    _mi_fetch_keypage() (from cache)
                    _mi_seq_search()
                        stops at 5th entry, returns 1
                    _mi_search() with pos = 0x30f800
                        _mi_fetch_keypage() (from cache)
                        _mi_seq_search()
                            stops at 12th entry, returns 1
                        _mi_search() with pos = 0x305000
                            _mi_fetch_keypage() (from cache)
                            _mi_seq_search()
                                returns -1 (did find only smaller entries)
                            _mi_search(() with -1
                                end of search: continue search one level up
                    _mi_fetch_keypage(): 0x30f800 (from cache)
                    _mi_get_pack_key()
          check key found
          read record at 0xd23d060

As you can see _mi_search() is called recursively to find an entry in a leaf. And in this case there is no entry in a leaf which is greater than the last key (comparing the full 14 bytes) and smaller than the entries in nodes so it returns 2 levels up and takes the entry from the node at offset 0x30f800. It returns with a pointer to this entry to the function _mi_search_next() where the key is compared against the search-key (comparing the first 8 bytes) and as these are identical the offset (the last 6 bytes of this entry) is taken and the record is read from the MYD-file.

In the next call of index_next_same() it works like this:

ha_myisam::index_next_same() 
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            uses the last buffer because: info->buff is 0 --> unchanged 
            _mi_search() with pos = 0x305400
                _mi_fetch_keypage()     (read from disk via my_pread())
                _mi_seq_search() 
                    returns +1
                _mi_search() with pos = -1
                     returns with +1 (search at upper level)
        read record at 0xd7e0b58

As the current buffer is a node it takes the page-offset contained in the last entry used. The page with this number (a node or a leaf) will contain entries with key-values greater than the current entry. So with some extra steps the first entry from this page is taken and the corresponding record is read.

Next calls to index_next_same() will take the following entries from this page until an entry is found which has a different key. In this case _mi_search_next() returns a value HA_ERR_END_OF_FILE. The server now switches over to TestSmall as already described above.

correctness

This text is still a simplified presentation of what's going on in the software. More conflicts can arise but the code handles these situations. And there are also some optimizations build into the code that I didn't describe (this text is already a bit longer than planned).

In case something is wrong with my description please use the mail-function of this side and send me a mail describing my error. I will look at it. Thanks.