an unsystematic journey through the source-code of MySQL and MariaDB: September 2015

Tuesday 15 September 2015

optimizer: no indexes

If you look into the blog-archive of this site you will see some posts that describe my observations of the handling of a SQL-statement by the optimizer in MariaDB. The statement is always the same, the structure of the tables is the same, the data in the table is the same - just a few points around this have changed. In this post I would like to add another post to this series: what happens if the tables involved in the query do not have any index at all?

environment

You may look here to find a description of the environment used in this post. Only this has changed:

MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.01 sec)

MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)

MariaDB [TestOpt]>

So you see there are no indexes defined on the tables and I want to describe how the database-server handles a statement in this case.

warning

All posts in this series refer to MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for other versions and for MySQL.

The example uses MyISAM as the storage engine for the tables, some of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of my statements in this post depend on the data in the tables, and this is my data. If you do your own tests don't expect to see the same results.

the statement

This is the statement I'm looking at:

MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (24.22 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

As you can see I've included the explain.

Please do not ask me about the meaning of this statement, it is only used for this demonstration, nothing else.

As before there are 301,036 records in the table TestSmall and 10 million records in the table TestBig.

In some posts before I described how the same statement is executed: a table-scan is done on TestSmall. Each record read is inspected and if it satisfies the (partial) WHERE-condition the table TestBig is read, usually via an index using the information from the record just read from TestSmall plus the information in the SQL-statement to fetch the corresponding records from TestBig.

In this post I assume no indexes on the table so following this strategy would mean: read the table TestSmall in a table-scan and when a match is found search the full table TestBig for the corresponding record(s) in TestBig. This means that the big table TestBig will be read by a table-scan multiple times. Or the execution starts with TestBig and for each match it searches the table TestSmall by a table-scan, So it would search the second table multiple times.

Sounds silly. If you look at the statement above you will see that on my PC the server needed about 25 seconds for presenting the result of the query. This look like the server did go a different way to handle the query.

counters

In an older post I introduced some counter-variables to count the number of times a function is called, and to count this at runtime. Using this approach I get the following result in the file /var/log/mysql/error.log after the SQL-statement is executed:

table = <testsmall> counter table-scan: 301,037      index-scan: 0
table = <testbig>   counter table-scan: 10,000,001   index-scan: 0

So you can see for this statement no index-function is accessed as there are no indexes. And each table is accessed once for each record in it (plus one more time to detect the end-situation).

Looks like it scans one table and then the other table and does the JOIN-operation elsewhere. Setting a breakpoint in the function ha_myisam::rnd_next() confirmed this observation.

function hierarchy

In this series I'm mostly interested in the handling of the statement in the optimizer-stage and I will do so again in this post and omit a lot of other aspects. Here is the function-hierarchy in the optimizer in the handling of the query:

 make_join_statistics()
     JOIN_TAB::scan_time()                                          // for table TestSmall
         TABE::stat_records() 
         handler::scan_time()
     JOIN_TAB::scan_time()                                          // for table TestBig
         TABE::stat_records() 
         handler::scan_time()
     choose_plan()
         greedy_search()
             best_extension_by_limited_search()
                 best_access_path()                                 // start with table TestSmall
                     JOIN_TAB::scan_time()
                         TABE::stat_records() 
                         handler::scan_time()
                 // recursion:
                 best_extension_by_limited_search()                 // add table TestBig to the plan
                     best_access_path()
                         JOIN_TAB::scan_time()
                             TABE::stat_records()
                             handler::scan_time()
                 // start a new plan: 
                 best_access_path()                                 // start with table TestBig
                     JOIN_TAB::scan_time()
                         TABE::stat_records()
                         handler::scan_time()

Again I have to write: don't generalize this hierarchy. It depends on the type of statement I'm looking at but it should give you an overview of the process.

pseudo-code

I'll try to explain the handling: we have 2 tables, TestSmall and TestBig, and we have the SQL-statement shown above. So:
the inspection starts with the table TestSmall. The time for reading this table in a table-scan is computed to:

best_access_path():
 records_read = 301,036
 read_time = 6,127

The cost for this (partial) plan is computed to:

best_extension_by_limited_search():
       /* Compute the cost of extending the plan with 's' */
      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time + 
                        current_record_count / (double) TIME_FOR_COMPARE;

and the results are:

 current_record_count = 301,036 = 1 * 301,036
 current_read_time    = 66,334 = 0 + 6,127 + 301,036/5

As we do not have another plan yet this plan is the best plan so far and therefore the values are stored:

best_extension_by_limited_search():
            best_record_count= current_record_count;
            best_read_time=    current_read_time;

Next we extend this partial plan by adding the table TestBig. For computing the cost-values the function best_extension_by_limited_search() is called again (a recursion) and the results are:

best_access_path():
 records_read = 10,000,000
 read_time = 6,348,409

so for accessing these tables the cost is computed to:

best_extension_by_limited_search():
 current_record_count =  3,010,360,000,000 = 301,036 * 10,000,000
 current_read_time    = 602,078,414,743   = 66,334 + 6,348,409 + (3,010,360,000,000 / 5)

Ans as this is the best plan so far the values are stored:

best_extension_by_limited_search():
            best_record_count= current_record_count;  
            best_read_time=    current_read_time;

Now we have to store the plan, not only the results:

best_extension_by_limited_search():
        if (current_read_time < join->best_read)  
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions, 
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

and as join->best_read was initialized with DBL_MAX the expression in the if() computes to true and the plan is copied to join->best_positions. The variable partial_join_cardinality is computed some lines above in this way:

best_extension_by_limited_search():
      double partial_join_cardinality= current_record_count *
                                        pushdown_cond_selectivity;    // = (301,036 * 10,000,000) * 1

As there are no more tables left to inspect this iteration ends.

In the next iteration it starts fresh with the table TestBig. These are the values for reading this table:

best_access_path():
 records_read = 10,000,000
 read_time = 211,613

so the cost for accessing this table is computed to:

best_extension_by_limited_search():
 current_record_count = 10,000,000 = 1 * 10,000,000
 current_read_time    = 2,211,613 = 0 + 211,613 + 10,000,000/5

Next the optimizer would have to extend this plan with the next table in the list, the table TestSmall. But before doing this it checks whether it can expect a better plan:

best_extension_by_limited_search():
       /*
        Prune some less promising partial plans. This heuristic may miss
        the optimal QEPs, thus it results in a non-exhaustive search.
      */
 
         if (best_record_count > current_record_count ||   
            best_read_time > current_read_time ||
            ....

The if() computes to:

       if ( 301,036 > 10,000,000 ||
            66,334 > 221,1613 ||
            ....

this computes to false so the computation stops here, this plan is not extended.

Now there are no more tables so the whole inspection of plans stops here. And a plan was found: start with TestSmall and extend it by TestBig.

remark

Some lines above I presented this line:

current_record_count =  3,010,360,000,000 = 301,036 * 10,000,000

This line shows the calculation of the number of records fetched for the plan currently inspected. This multiplication tells us that for every record fetched from TestSmall the whole table TestBig must be read (or vice versa). This contradicts the final plan where each table is read only once, so an addition of the two numbers should be used instead of a multiplication. At least this did not affect the decision for a plan.

correctness

This text is a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

Tuesday 1 September 2015

9 records

This text here belongs to the series that describes the handling of a SQL-statements by the optimizer when a table has multiple indexes on it. These are the older posts:

this series started here: optimizer: indexes
one minor piece omitted in the above post was described here: 0xA8
another tiny piece omitted but described in detail here: 0xCA
finding the best plan: optimizer: indexes (2)

In this post I would like to continue this description and describe another tiny point omitted in the last post.

some hints

All posts in this series refer to MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for MySQL.

The example uses MyISAM as the storage engine for the tables and indexes, most of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of the data in this post depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers.

9 records

The very tiny point I want to describe in this post is the number 9 which occurred in the last post but wasn't explained there. In these places the number 9 appeared in the last post:

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Ind1,Ind2,PZN,Hersteller,PznHerst,HerstPzn
          key: PznHerst
      key_len: 14
          ref: TestOpt.A.PZN,const
         rows: 9
        Extra: Using where; Using index
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

You will find the number in the description how the second table (=TestBig) is accessed, it's marked in bold.

In the same post you will find the number in these lines:

(7) TestBig:
records_read: 9
read_time:    391,573
key:          6  = PznHerst

And the same number also appeared when the cost of the total plan is calculated:

(8)
current_record_count= 2,709,324 = 301,036 * 9
current_read_time   = 999,772  = 66,334 + 391,573 + 2,709,324/5

So we know it has some relationship with the table TestBig. But what does this number mean?

meaning

In the explain above you can find a hint: 9 rows. So the meaning of this number 9 is: by accessing the table TestBig via the index PznHerst the optimizer estimates to retrieve 9 records for every tuple (PZN, Hersteller) via this index.

And therefore this value is a cost-factor in the calculation of the extended plan, the plan that starts with the table TestSmall and extends this plan by the table TestBig using the index PznHerst.

where does this number come from

So the optimizer uses this number, but where does it get this number from¹⁾?

A short answer will be: from the MYI-file:

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01                         uchar file_version[4];
       00 03                               uchar options[2];
       03 28                               uchar header_length[2];
       00 b0                               uchar state_info_length[2];
       00 64                               uchar base_info_length[2];
       01 30                               uchar base_pos[2];
       00 0e                               uchar key_parts[2];                 /* Key parts */    
000010 00 00                               uchar unique_key_parts[2];          /* Key parts + unique parts */
       08                                  uchar keys;                         /* number of keys in file */
       00                                  uchar uniques;                      /* number of UNIQUE definitions */ 
       08                                  uchar language;                     /* Language for indexes */ 
       01                                  uchar max_block_size_index;         /* max keyblock size */ 
       00                                  uchar fulltext_keys; 
       00                                  uchar not_used;                     /* To align to 8 */ 

       00 00                               ulong update_count;                   /* Updated for each write lock */ 
       30                                  uint8 changed;                        /* Changed since myisamchk */ 
       ff                                  uint sortkey;                         /* sorted by this key (not used) */ 
       00 00 00 00 00 98 96 80             MI_STATUS_INFO:   ha_rows records;    /* Rows in table */ 
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   ha_rows del;        /* Removed rows */ 
       00 00 00 00 00 98 96 80             ha_rows split;                        /* number of split blocks */ 
       ff ff ff ff ff ff ff ff             my_off_t dellink;                     /* Link to next removed block */
       00 00 00 00 4e 2e cc 00             MI_STATUS_INFO:   my_off_t key_file_length; 
       00 00 00 00 33 a9 ba 30             MI_STATUS_INFO:   my_off_t data_file_length;    
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t empty;     /* lost space in datafile */
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t key_empty; /* lost space in indexfile */
       00 00 00 00 00 00 00 00             ulonglong auto_increment; 
       00 00 00 00 00 00 00 00             MI_STATE_INFO:   ha_checksum checksum; 
       00 00 28 c6                         ulong process;                        /* process that updated table last */
000070 00 00 00 2e                         ulong unique;                         /* Unique number for this process */ 
       00 00 00 00                         ulong status; 
       00 00 00 01                         ulong update_count;                   /* Updated for each write lock */
# file-offsets in MYI-file 
       00 00 00 00 06 bf b0 00             my_off_t *key_root;                   /* Start of key trees */
                                              first node of index EVP
       00 00 00 00 0e fb cc 00                index Ind1
       00 00 00 00 15 bb 7c 00                index HAP
       00 00 00 00 25 c6 2c 00                index Ind2
       00 00 00 00 2e 4e 50 00                index PZN
       00 00 00 00 35 9e 00 00                index Hersteller
       00 00 00 00 41 e6 64 00                index PznHerst
       00 00 00 00 4e 2e c8 00                index HerstPzn 
#
       ff ff ff ff ff ff ff ff             my_off_t *key_del;                    /* delete links for trees */
       00 00 00 00                         ulong sec_index_changed;              /* Updated when new sec_index */ 
       00 00 00 00                         ulong sec_index_used;                 /* which extra index are in use */ 
       55 20 07 53                         ulong version;                        /* timestamp of create */
0000d0 00 00 00 00 00 00 00 ff             ulonglong key_map;                    /* Which keys are in use */ 
       00 00 00 00 55 20 07 53             time_t create_time;                   /* Time when created database */
0000e0 00 00 00 00 00 00 00 00             time_t recover_time;                  /* Time for last recover */
       00 00 00 00 55 20 09 1d             time_t check_time;                    /* Time for last check */
0000f0 00 00 00 00 00 98 96 80             my_off_t rec_per_key_rows;            /* Rows when calculating rec_per_key */
# for the parts of the indexes: the number of records: 
       00 00 03 71                         ulong rec_per_key_part[]
                                              index EVP, part EVP
       00 00 01 c0                            index Ind1, part PZN
000100 00 00 00 06                            index Ind1, parts PZN + ArtikelText 
       00 00 00 01                            index Ind1, parts PZN + ArtikelText + Hersteller
       00 00 03 d8                            index HAP, part HAP
       00 00 0f 22                            index Ind2, part Hersteller
000110 00 00 00 01                            index Ind2, parts Hersteller + ArtikelText
       00 00 00 01                            index Ind2, parts Hesteller + ArtikeklText + PZN
       00 00 01 c0                            index PZN, part PZN
       00 00 0f 22                            index Hersteller, part Hersteller
# --- this is the interesting part: ----------------------------------------------------
000120 00 00 01 c0                            index PznHerst, part PZN 
       00 00 00 09                            index PznHerst, parts PZN + Hersteller
# --------------------------------------------------------------------------------------
       00 00 0f 22                            index HerstPzn, part Hersteller
       00 00 00 09                            index HerstPzn, parts Hersteller + PZN
#
000130 .............

If you look into the file myisamdef.h you will find the typedef struct st_mi_state_info and I've added the variables of this structure as a description together with some explanations next to the hex-values taken from the MYI-file.

When you scroll down this list to the end you will see the number 9 and this is exactly the number 9 I'm writing about.

We are here in the area of statistics, these numbers are statistics about the distribution of records (=keys) in the index. There are more statistics but I want to focus on the number 9.

Let's look at the last group of lines in the MYI-file above and start with the first entry. As the first index is named EVP and created on one column (=EVP) only so you will see only one number in the list above: 0x0371 (=881d). And therefore the optimizer expects to find 881 records on average²⁾ when accessing this index with a value.
Next there are 3 numbers which belong to the index Ind1, the next index in the list of indexes defined on this table. As before the fist number 0x01C0 (=448d) describes the estimated number of records when accessing this index via a tuple (PZN), the second number 0x06 tells that it expects 6 records to read when it accesses the index with a tuple (PZN, ArtikelText), again an incomplete access to the index. And the 3rd number is 0x01, so when you access this index with a tuple (PZN, Artikeltext, Hersteller) you will find only one record. Please keep in mind that these numbers are estimations as exact numbers are not available.
I want to omit the other entries so I describe only the entries for the index PznHerst as this index is used in the execution of our statement. The first number is 0x01C0 (=448d), so for accessing this index with a tuple (PZN) we can expect to read 448 records on average. The next number is (drum-rolling in the background and a short pause to increase the suspense) 0x09 (end of drum-rolling) and this is the number of interest in this post. So for searching this index with a tuple (PZN, Hersteller) we expect to read 9 records.

Done. No not, how is the number 9 computed?

querying the table

So let's look at the data and check whether these number are correct:

MariaDB [TestOpt]> select count( distinct Pzn) from TestBig;
+----------------------+
| count( distinct Pzn) |
+----------------------+
|                22298 |
+----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select count(distinct Hersteller) from TestBig;
+----------------------------+
| count(distinct Hersteller) |
+----------------------------+
|                       2581 |
+----------------------------+
1 row in set (1.82 sec)

MariaDB [TestOpt]> select count( distinct Pzn, Hersteller) from TestBig;
+----------------------------------+
| count( distinct Pzn, Hersteller) |
+----------------------------------+
|                          1158300 |
+----------------------------------+
1 row in set (24.39 sec)

MariaDB [TestOpt]>

The first query tells us that we will find 22,298 different values in the column PZN. Keep in mind that we have 10 mio records in our table, so 10,000,000 / 22,298 = 448³⁾et voila we have the first of our numbers. If we assume that our entries for PZN are uniformly distributed so we will have 448 records in the table for each value in the column PZN. This is unrealistic, some values will occur more frequently then others but we are making guesses and as long as we don't have exact numbers we will stick with the value 448.

So the value 9 is left. The second query doesn't help, but the third query is useful. Let's take the result from the 3rd query and compute again: 10,000,000 / 1,158,300 = 9⁴⁾ and this is our number! In the 10 mio. records in the table TestBig we assume to get 9 record if we query the table using the index PznHerst with a tuple consisting of a value for PZN and Hersteller.

That's it.

But wehre do these numbers come from?

repair

Let's use an operation that recalculates these numbers:

MariaDB [TestOpt]> repair table TestBig;
+-----------------+--------+----------+----------+
| Table           | Op     | Msg_type | Msg_text |
+-----------------+--------+----------+----------+
| TestOpt.TestBig | repair | status   | OK       |
+-----------------+--------+----------+----------+
1 row in set (7 min 19.10 sec)

MariaDB [TestOpt]>

I've used this operation because it can be done multiple times without affecting the contents of the table. There are other operations that too compute these statistics.

In general the algorithm behind her repair-operation is: it copies the data from the MYD-file to a temporary file. For every index it sorts the contents (maybe sorts them in parts and merges the parts to a single stream), counts them, computes the numbers given above and finally writes them into the MYI-file. It deletes the MYD-file and renames the temp-file to TestBig.MYD.

Let's look a bit deeper into the code. Here is a simplified function-hierarchy:

ha_myisam::repair()
    mi_repair_by_sort()
        _create_index_by_sort()
            find_all_keys()
                sort_key_read()
                    sort_get_next_record()
                        _mi_read_cache()
                            _my_b_read()
                                inline_mysql_file_read()
                        _mi_rec_check()
                        _mi_make_key()
            merge_index()
                merge_buffers()
                    sort_key_write()
                        ha_key_cmp()
        update_key_parts()

For our case I'm mostly interested in the functions sort_key_write() and update_key_parts().

In sort_key_write() this variable is defined:

static int sort_key_write(MI_SORT_PARAM *sort_param, const void *a)
{
  uint diff_pos[2];
  .....

This little array is used in the called function ha_key_cmp(). And there this array is used in this form:

    diff_pos[0]    number of first keypart where values differ, counting from one.
    diff_pos[1]    points to first value in tuple b that is different from corresponding value in tuple a.

Let's look at all possible case and look at the results returned in the var. diff_pos. In the function ha_key_cmp() we compare a tuple like (a,b) with another tuple (c,d) . These are possible results:

comparing (a,b) and (c,b)
case #1: a != c
         set diff_pos to (1, 0)
case #2: a == c, b != d
         set diff_pos to (2,8)
case #3: (a,b) == (c,d)
         set diff_pos to (3,14)

as our index is defined on (PZN, Hersteller) no other cases are possible,

In the calling function sort_key_write() the array diff_pos is interpreted and used for counting:

    sort_param->unique[diff_pos[0]-1]++;

This statement increases the counter, it counts how many sequences of identical records are found.

If the comparisons and countings are done I found these numbers in the var. sort_param.unique:

sort_param.unique[0] = 22297               [0] + 1 --> 22298
sort_param.unique[1] = 1136002             [0] + [1] + 1 --> 1158300 
sort_param.unique[2] = 8841700             [0] + [1] + [2] + 1 --> 10000000
rest  of list: 0

You can see that the numbers counted are 1 less then the numbers found via the SQL-statement above. And they add up to the number of records in the table.

In the function update_key_parts() these numbers are used in the computation:

  for (parts=0 ; parts < keyinfo->keysegs  ; parts++)
  {
    count+=unique[parts];
    unique_tuples= count + 1;    
    ......
    else
      tmp= (tuples + unique_tuples/2) / unique_tuples;

Let's do these computations by ourselves:
The var. tuples contains the number of records in the table: 10,000,000.
In the first iteration of the loop the var. unique_tuples contains the value 22,298. The computation results in the value 448 for the var. tmp, and this is the first value you've seen above.
Now we take the second iteration: the var. unique_tuples contains the value 1,158,300 and so the result of the computation is 9.

Done. That's the value this post started with. Some lines later it will be written to the MYI-file.

correctness

1) and the numbers for the other indexes too

2) the number is correct if one assumes evenly distribution of keys, but what else can we assume?

3) these numbers are rounded, up or down

4) more precise: 8.63