an unsystematic journey through the source-code of MySQL and MariaDB: execution: with and without indexes

in my last post I showed how the case without any indexes is handled by the optimizer. For this post I want to show how the plan found by the optimizer is executed. Additionally I want to compare the execution of this case with the execution of a case with one index added on one table.

So let me look into the code for the execution of the case without any indexes and start my description when the optimizer-stage is already done. And my description will stop when the storage-engine is entered, so the text describes the code in the database-server. For my tests I've used MariaDB, for MySQL things may look a bit different.

environment

Again I'm using my usual environment which I already described here: JOIN.

In the first step I want to inspect the case without any index but instead of dropping all indexes on the tables involved for my query I added one index to the table TestBig, this simplified the handling of my tests. For testing the case without any index I told the optimizer in my SQL-statement in the form of a hint to not use this existing index. In the end of this text I will explain why I did this.

So here is the environment:

MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.00 sec)

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

So you can see there is no index on the table TestSmall and one index on the table TestBig, on the column PZN.

And here is my standard-statement:

MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

I've included the explain and marked the hint IGNORE INDEX(PZN)¹⁾ in the SQL-statement and in the explain I also marked the access-type in bold in this box.

expected execution

I've already described the behavior of the server in the case with an index: JOIN (2). In the case described the server scans through the table TestSmall, reading and inspecting record after record, and when a matching record is found to switch over to the table TestBig and search for corresponding records in this table taking the WHERE- and ON-clause into account; after no more records can be found in TestBig the server returns to TestSmall and continues with the table-scan.
And for the case without any index I expected the server to behave in the same way. If no index is available there is no other way for searching the table TestBig than by a table-scan o I expected the execution of the statement to be very slow because for every record found in TestSmall a full table-scan of TestBig has to be made. But if you look at the box above you will see that this statement took 23 seconds for the execution and that's not so bad.

In an older post I introduced some counter-variables and by using these I can see this result in the file /var/log/mysql/error.log:

table = <TestSmall> counter table-scan: 301037      index-scan: 0
table = <TestBig>   counter table-scan: 10000001    index-scan: 0

When I look at these numbers I can see that table TestSmall is read once and also table TestBig is read only once²⁾. Good. So how is this handled in the code?

execution

Before I continue with this text I want to issue a warning: the functions and hierarchies shown here are valid for the statement and environment given above. If you do your own tests this may look a bit different.

the case without index

Now I want to start with the execution of the statement without any indexes on the tables. Execution starts with the table TestSmall, so here is the function hierarchy for reading this table:

mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()
                    // read one record from TestSmall
                    evaluate_join_record()

In the next lines of this text I want to inspect the function sub_select() and see what's happening there.
The code in sub_select() distinguishes between a first read and all following reads. For the first record read everything needed is initialized, the first record is read from TestSmall and this record is inspected. This is the code that reads the first record from the table TestSmall:

  if (rc != NESTED_LOOP_NO_MORE_ROWS)
  {
    error= (*join_tab->read_first_record)(join_tab); 
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Let me omit the function evaluate_join_record() for some seconds. For read_first_record() the function-hierarchy looks like this:

(*join_tab->read_first_record)  -->  join_init_read_record()      
    init_read_record()
        handler::ha_rnd_init_with_error()
            handler::ha_rnd_init()
                ha_myisam::rnd_init()
    rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()

I stopped the presentation of the hierarchy at the engine-level because I didn't want to dive deeper into the code for this post.

After the first record is read and inspected the code in sub_select() continues with a while-loop scanning the table TestSmall until EOF (or NESTED_LOOP_NO_MORE_ROWS). Here is the code:

  while (rc == NESTED_LOOP_OK && join->return_tab >= join_tab)
  {
    ....
    error= info->read_record(info);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Here is how the next record is fetched:

    info->read_record()  -->  rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()

Now I will come back to the function evaluate_join_record(). Here is the code executed (simplified):

static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab,
                     int error)
{
  ....
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());
    ....
  }

  if (!select_cond || select_cond_result)
  {
   ....
    if (found)
    {
      enum enum_nested_loop_state rc;
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0); 
    }
    ......
  }
  DBUG_RETURN(NESTED_LOOP_OK);
}

The record fetched is inspected in the macro MY_TEST³⁾. If this record fulfills the condition the line with (*join_tab->next_select)() calls another function which in this case (the case without any index) is the function sub_select_cache()⁴⁾. In this function the record is stored in an internal structure, a cache. It then returns to sub_select() and the cycle continues with the next fetch from the table TestSmall until all records from this table are read.

now with index

Please allow me to stop this description (I do know that I'm just in the middle of the execution) and switch to the case with an index on the table TestBig. First I want to show the statement:

MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.40 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

The explain tells me that the execution of this statement starts with a table-scan on TestSmall and the table TestBig is read via the index PZN. To verify this I looked at my counter-variables, these showed this result:

table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 311⁵⁾

and you see there is no table-scan on the table TestBig.

The execution starts with a table-scan on TestSmall. This is handled similar to the case without index except for the handling in the function evaluate_join_record(). Now the hierarchy looks like this:

sub_select()
    evaluate_join_record()
        sub_select()                via:  rc= (*join_tab->next_select)(join, join_tab+1, 0); 
            // read a record from TestBig (=join_tab+1)
            evaluate_join_record()
                end_send_group()

The fetching of the record from TestBig is handled in 2 cases (similar to the cases described above):

fetching the first record:
sub_select()
    evaluate_join_record()
        sub_select()
            (*join_tab->read_first_record)() ->  join_read_always_key()
                handler::ha_index_init()
                    ha_myisam::index_init()
                handler::ha_index_read_map()
                    ha_myisam::index_read_map()
            evaluate_join_record()
                end_send_group()

all other records are fetched this way:
sub_select()
    evaluate_join_record()
        sub_select()
            info->read_record() -> join_read_next_same()
                handler::ha_index_next_same()
                    ha_myisam::index_next_same()
            evaluate_join_record()
                end_send_group()

As you can see the structure of the code is similar but the functions called from evaluate_join_record() differs. In this case it extracts the value of the column PZN from the record from TestSmall and reads record after record from TestBig using the index PZN with this value. When no more records are found with this value in the column PZN it returns to TestSmall and continues the table-scan.⁶⁾

in general

In evaluate_join_record() the next level is called by the statement (*join_tab->next_select)(join, join_tab+1, 0). The variable next_select is defined here:

sql_select.h:
typedef struct st_join_table {
  ...
  READ_RECORD::Setup_func read_first_record;
  Next_select_func next_select;
  ...
} JOIN_TAB;

and the type Next_selec_func is defined this way:

typedef enum_nested_loop_state
(*Next_select_func)(JOIN *, struct st_join_table *, bool);

The return-type is an enumeration defined as:

enum enum_nested_loop_state
{
  NESTED_LOOP_KILLED= -2, NESTED_LOOP_ERROR= -1,
  NESTED_LOOP_OK= 0, NESTED_LOOP_NO_MORE_ROWS= 1,
  NESTED_LOOP_QUERY_LIMIT= 3, NESTED_LOOP_CURSOR_LIMIT= 4
};

join_tab+1

You certainly noticed the expression join_tab+1 occurring multiple times as a parameter in the function-calls in the code-boxes. The reason of this is: in the beginning of the optimizer-stage an object of type JOIN is created. This structure contains the variable join_tab which points to a list of entries of type JOIN_TAB. So the statement join_tab+1 points to the next entry in this list.

Let me show you some lines from the comment above the function sub_select():

  @param join      pointer to the structure providing all context info for
                   the query
  @param join_tab  the first next table of the execution plan to be retrieved
  @param end_records  true when we need to perform final steps of retrival

By calling this function the first parameter contains a lot of information about the query, the second parameter points to the JOIN_TAB to handle with this call and the last parameter is a flag.

If you look at the code-hierarchy above you will see that the function sub_select() is called recursively, so by the expression join_tab+1 the entries in this list are handled one after the other.

with index: final steps

Before I go back to the case without index (=my starting point) I want to show the last steps done for the case with index, it's not finished yet.

The central routines as described above were located in the function sub_select(). This function is called from do_select() in this form:

    if (join->outer_ref_cond && !join->outer_ref_cond->val_int())
      error= NESTED_LOOP_NO_MORE_ROWS;
    else
      error= sub_select(join,join_tab,0);    
    if ((error == NESTED_LOOP_OK || error == NESTED_LOOP_NO_MORE_ROWS) &&
        join->thd->killed != ABORT_QUERY)
      error= sub_select(join,join_tab,1);

In this code you can see that the function sub_select() is called twice. When called the first time all the work is done as described some lines above, it's called again with identical parameters except for the last one (above marked in red), which stands for bool end_of_records.

Now with a value of 1 for end_of_records the code executed in the function called looks like this:

in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }

For this second call of sub_select() the hierarchy looks like this:

do_select()
    sub_select()                 with table TestSmall and a value of 1 for the parameter end_of_records
        sub_select()             with table TestBig (by join_tab+1)
            end_send_group()     with table NULL

This finishes the handling of this query, at least I will stop here with my description.

back to the case without index

Now I want to go back to my starting point which I've left in the middle of the execution.
As described above the table TestSmall is read sequentially, record after record is fetched and the record fetched is inspected in evaluate_join_record(). If it is accepted it is given to the function sub_select_cache() which stores this record in a cache. When the table TestSmall is read to the end the code returns from sub_select() to the calling function do_select(). And here my description stopped, up to this point only the table TestSmall is inspected, the table TestBig is not touched. So let me now describe how the table TestBig is handled.

Some lines above I showed some code in the function do_select(). In these lines in do_select() the function sub_select() is called twice and for the case without an index my description until now shows what's happening only in the first call of this function. For the second call of sub_select() the hierarchy looks like this:

do_select()
    sub_select()     called with table TestSmall, but now with a value of 1 for the parameter end_of_records
        sub_select_cache()     via (*join_tab->next_select), with table TestBig
            JOIN_CACHE::join_records()
                JOIN_CACHE::join_matching_records()
                    JOIN_TAB_SCAN::open()
                        join_init_read_record()
                            init_read_record()
                                handler::ha_rnd_init_with_error()
                                    handler::ha_rnd_init()
                                        ha_myisam::rnd_init()
                            rr_sequential()
                                handler::ha_rnd_next()
                                    ha_myisam::rnd_next()    1st record of table TestBig read

With the last function-call shown in this box the first record is read from TestBig.

Let's go back to the top of this hierarchy. The function do_select() calls sub_select() (the second call) with a value of 1 for the last parameter. In sub_select() this is the code for handling this:

in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }

As end_of_records has a value of 1 the code following the if()-statement is executed and next_select() calls sub_select_cache() with the next entry from the JOIN_TAB-list (=TestBig). In this function you will find some code similar to the code in sub_select():

in sub_select_cache():
  if (end_of_records)
  {
    rc= cache->join_records(FALSE);
    if (rc == NESTED_LOOP_OK || rc == NESTED_LOOP_NO_MORE_ROWS)
      rc= sub_select(join, join_tab, end_of_records);
    DBUG_RETURN(rc);
  }

As end_of_records is still 1 the code within the braces is executed and the function JOIN_CACHE::join_records() is called. The hierarchy above shows the code-hierarchy executed up to the point where the first record is read from TestBig. With this record the code returns to JOIN_CACHE::join_matching_records(). Now a loop in this function handles reading the records from TestBig and it looks like that:

  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* /* Read each possible candidate from the buffer and look for matches *//* 
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

Record after record is fetched from TestBig by calling join_tab_scan->next():

JOIN_CACHE::join_matching_records()
    JOIN_TAB_SCAN::next()
        rr_sequential()            via info->read_record()
            handler::ha_rnd_next()
                ha_myisam::rnd_next()

With this record from TestBig the JOIN has to be made so the code walks sequentially through all entries of a cache containing the records from TestSmall and tries to find a match (see: inner loop in the box above). When this is done the next record is fetched from TestBig by the statement join_tab_scan->next(), the next iteration of the outer loop in the code-box above.

Two special cases happen when reading a record from TestBig. For the first case the code in JOIN_TAB_SCAN::next() is called with the record already read (as described above). This is handled by this statement:

if (is_first_record)
    is_first_record= FALSE;
  else
    err= info->read_record(info);

In the second case a record is read from TestBig which does not fulfill the (partial) WHERE-clause. This is handled here:

  while (!err && select && (skip_rc= select->skip_record(thd)) <= 0)
  {
    ....
    /* 
      Move to the next record if the last retrieved record does not
      meet the condition pushed to the table join_tab.
    */
    err= info->read_record(info);
    ....
  }

The function SQL_SELECT::skip_record() contains this check and the code looks like:

  inline int skip_record(THD *thd)
  {
    int rc= MY_TEST(!cond || cond->val_int());
    if (thd->is_error())
      rc= -1;
    return rc;
  }

The check is done within the macro MY_TEST⁷⁾ and so the function evaluate_join_record() is not needed.

One piece is missing: in sub_select_cache() the function JOIN_CACHE::join_records() is called and my description above showed that the table TestBig is accessed. But if you look at the lines in sub_select_cache() you will see that a second function is called: sub_select(). This call simply calls the function end_send_group() which finishes the whole activity.

And here I want to stop my description. Looking at my text I think that I've shown how the case without index is handled and also I compared it to the case with one index.
If you think some part is missing please drop me a line using the mail-function on the right side of this blog.

why I used the hint IGNORE INDEX

Initially I wanted to describe the case without any index only. But soon I came upt with the question: how is this handled in the case of an index? So I made a lot of comparisons how the 2 cases are treated in the code. To switch from one case to another I had to

drop the index and do the next test without any index

or for switching back

create the index and start the test-case using the index

Here you can see that dropping the index takes about 100 seonds, creating it again takes about 2 minutes:

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> drop index PZN on TestBig;
Query OK, 10000000 rows affected (1 min 41.39 sec)     
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 59.57 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

Instead of waiting 2 minutes before I can start the test of the other case I used the hint, doing it this way was faster.

correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

some notes:

1) also included in this statement is the hint SQL_NO_CACHE. This hint tells the database-server to do the fetches again instead of using the internal cache. This helped me using this statement again and again and ..... So this statement contains two hints.

2) the table TestSmall contains 301.036 and the table TestBig contains 10.000.000 records so each table is indeed scanned only once (add 1 for detecting the EOF-situation).

3) you will find a description of the inspection-process here: JOIN (3).

4) I will stop here, at least for this post.

5) the table TestBig is accessed every time a matching record is found in TestSmall so the number 311 does not tell that 311 records are read from TestBig but that TestBig is accessed 311 times including reporting NESTED_LOOP_NO_MORE_ROWS multiple times.

6) you will find mor details of this behavior here: JOIN (2)

7) I've already described this macro here: JOIN (3)

Tuesday, 5 April 2016

execution: with and without indexes