Tuesday 15 September 2015

optimizer: no indexes

If you look into the blog-archive of this site you will see some posts that describe my observations of how the optimizer in MariaDB handles a SQL-statement. The statement is always the same, the structure of the tables is the same, the data in the tables is the same - just a few points around it have changed. With this post I would like to add another entry to this series: what happens if the tables involved in the query do not have any index at all?


environment

You may look here to find a description of the environment used in this post. Only this has changed:
MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.01 sec)

MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)

MariaDB [TestOpt]> 
So you see there are no indexes defined on the tables and I want to describe how the database-server handles a statement in this case.


warning

All posts in this series refer to MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for other versions and for MySQL.

The example uses MyISAM as the storage engine for the tables, some of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of my statements in this post depend on the data in the tables, and this is my data. If you do your own tests don't expect to see the same results.


the statement

This is the statement I'm looking at:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (24.22 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
As you can see I've included the explain.

Please do not ask me about the meaning of this statement, it is only used for this demonstration, nothing else.

As before there are 301,036 records in the table TestSmall and 10 million records in the table TestBig.

In earlier posts I described how the same statement is executed: a table-scan is done on TestSmall. Each record read is inspected and if it satisfies the (partial) WHERE-condition the table TestBig is read, usually via an index, using the information from the record just read from TestSmall plus the information in the SQL-statement to fetch the corresponding records from TestBig.

In this post I assume no indexes on the tables, so following this strategy would mean: read the table TestSmall in a table-scan and whenever a match is found search the full table TestBig for the corresponding record(s). This means that the big table TestBig would be read by a table-scan multiple times. Or the execution starts with TestBig and for each match it searches the table TestSmall by a table-scan, so it would scan the second table multiple times.

Sounds silly. If you look at the statement above you will see that on my PC the server needed about 25 seconds to present the result of the query. This looks like the server went a different way to handle the query.


counters

In an older post I introduced some counter-variables to count the number of times a function is called, and to count this at runtime. Using this approach I get the following result in the file /var/log/mysql/error.log after the SQL-statement is executed:
table = <testsmall> counter table-scan: 301,037      index-scan: 0
table = <testbig>   counter table-scan: 10,000,001   index-scan: 0
So you can see for this statement no index-function is accessed as there are no indexes. And each table is accessed once for each record in it (plus one more time to detect the end-situation).

Looks like it scans one table and then the other table and does the JOIN-operation elsewhere. Setting a breakpoint in the function ha_myisam::rnd_next() confirmed this observation.
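Conceptually this corresponds to the flat BNL (block nested loop) join mentioned in the Extra column of the explain above: scan TestSmall once, buffer the rows that match its part of the WHERE-clause, then scan TestBig once and probe the buffered rows. The following toy sketch only illustrates this strategy with made-up data; it is not MariaDB's join-buffer code, and all names in it are invented:

#include <cstdio>
#include <string>
#include <vector>

struct Row { std::string pzn, hersteller; };

int main()
{
  // tiny stand-ins for the two tables
  std::vector<Row> test_small= { {"111","00020"}, {"222","00020"}, {"333","99999"} };
  std::vector<Row> test_big=   { {"111","36367"}, {"111","36367"}, {"222","11111"} };

  // pass 1: scan TestSmall once and buffer the rows matching A.Hersteller = '00020'
  std::vector<Row> join_buffer;
  for (const Row &a : test_small)
    if (a.hersteller == "00020")
      join_buffer.push_back(a);

  // pass 2: scan TestBig once; for every row matching B.Hersteller = '36367'
  // probe all buffered rows for A.PZN = B.PZN
  unsigned long count= 0;
  for (const Row &b : test_big)
    if (b.hersteller == "36367")
      for (const Row &a : join_buffer)
        if (a.pzn == b.pzn)
          count++;

  printf("count(A.PZN) = %lu\n", count);   // 2 for this toy data
  return 0;
}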


function hierarchy

In this series I'm mostly interested in the handling of the statement in the optimizer-stage and I will do so again in this post and omit a lot of other aspects. Here is the function-hierarchy in the optimizer in the handling of the query:
 make_join_statistics()
     JOIN_TAB::scan_time()                                          // for table TestSmall
         TABLE::stat_records()
         handler::scan_time()
     JOIN_TAB::scan_time()                                          // for table TestBig
         TABLE::stat_records()
         handler::scan_time()
     choose_plan()
         greedy_search()
             best_extension_by_limited_search()
                 best_access_path()                                 // start with table TestSmall
                     JOIN_TAB::scan_time()
                         TABLE::stat_records()
                         handler::scan_time()
                 // recursion:
                 best_extension_by_limited_search()                 // add table TestBig to the plan
                     best_access_path()
                         JOIN_TAB::scan_time()
                             TABLE::stat_records()
                             handler::scan_time()
                 // start a new plan: 
                 best_access_path()                                 // start with table TestBig
                     JOIN_TAB::scan_time()
                         TABLE::stat_records()
                         handler::scan_time()

Again I have to write: don't generalize this hierarchy. It depends on the type of statement I'm looking at but it should give you an overview of the process.


pseudo-code

I'll try to explain the handling: we have 2 tables, TestSmall and TestBig, and we have the SQL-statement shown above. So:
the inspection starts with the table TestSmall. The time for reading this table in a table-scan is computed to:
best_access_path():
 records_read = 301,036
 read_time = 6,127
The cost for this (partial) plan is computed to:
best_extension_by_limited_search():
       /* Compute the cost of extending the plan with 's' */
      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time + 
                        current_record_count / (double) TIME_FOR_COMPARE; 
and the results are:
 current_record_count = 301,036 = 1 * 301,036
 current_read_time    = 66,334 = 0 + 6,127 + 301,036/5
As we do not have another plan yet this plan is the best plan so far and therefore the values are stored:
best_extension_by_limited_search():
            best_record_count= current_record_count;
            best_read_time=    current_read_time;

Next we extend this partial plan by adding the table TestBig. For computing the cost-values the function best_extension_by_limited_search() is called again (a recursion) and the results are:
best_access_path():
 records_read = 10,000,000
 read_time = 6,348,409
so for accessing these tables the cost is computed to:
best_extension_by_limited_search():
 current_record_count =  3,010,360,000,000 = 301,036 * 10,000,000
 current_read_time    = 602,078,414,743   = 66,334 + 6,348,409 + (3,010,360,000,000 / 5)
And as this is the best plan so far the values are stored:
best_extension_by_limited_search():
            best_record_count= current_record_count;  
            best_read_time=    current_read_time;
Now we have to store the plan, not only the results:
best_extension_by_limited_search():
        if (current_read_time < join->best_read)  
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions, 
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }
and as join->best_read was initialized with DBL_MAX the expression in the if() computes to true and the plan is copied to join->best_positions. The variable partial_join_cardinality is computed some lines above in this way:
best_extension_by_limited_search():
      double partial_join_cardinality= current_record_count *
                                        pushdown_cond_selectivity;    // = (301,036 * 10,000,000) * 1
As there are no more tables left to inspect this iteration ends.
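Just to double-check the arithmetic, here is a tiny standalone sketch (not MariaDB code; only the numbers from above are hard-coded, with TIME_FOR_COMPARE = 5 as in the source) that reproduces both cost values:

#include <cstdio>

static const double TIME_FOR_COMPARE= 5.0;

int main()
{
  // step 1: partial plan consisting of the table TestSmall only
  double record_count= 1.0, read_time= 0.0;               // the empty plan
  double records_read= 301036.0, access_time= 6127.0;     // best_access_path() for TestSmall

  double current_record_count= record_count * records_read;
  double current_read_time= read_time + access_time +
                            current_record_count / TIME_FOR_COMPARE;
  printf("TestSmall alone:     rows=%.0f  cost=%.0f\n",
         current_record_count, current_read_time);        // 301036 / 66334

  // step 2: extend this plan by the table TestBig
  records_read= 10000000.0; access_time= 6348409.0;       // best_access_path() for TestBig
  record_count= current_record_count;
  read_time= current_read_time;

  current_record_count= record_count * records_read;
  current_read_time= read_time + access_time +
                     current_record_count / TIME_FOR_COMPARE;
  printf("TestSmall + TestBig: rows=%.0f  cost=%.0f\n",
         current_record_count, current_read_time);        // 3010360000000 / 602078414743
  return 0;
}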

In the next iteration it starts fresh with the table TestBig. These are the values for reading this table:
best_access_path():
 records_read = 10,000,000
 read_time = 211,613
so the cost for accessing this table is computed to:
best_extension_by_limited_search():
 current_record_count = 10,000,000 = 1 * 10,000,000
 current_read_time    = 2,211,613 = 0 + 211,613 + 10,000,000/5
Next the optimizer would have to extend this plan with the next table in the list, the table TestSmall. But before doing this it checks whether it can expect a better plan:
best_extension_by_limited_search():
       /*
        Prune some less promising partial plans. This heuristic may miss
        the optimal QEPs, thus it results in a non-exhaustive search.
      */
 
         if (best_record_count > current_record_count ||   
            best_read_time > current_read_time ||
            ....
The if() computes to:
       if ( 301,036 > 10,000,000 ||
            66,334 > 2,211,613 ||
            ....
This computes to false, so the computation stops here: this plan is not extended.
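A minimal sketch of this pruning check, again with the numbers from above hard-coded and only the simplified condition quoted above (the real function checks a few more things):

#include <cstdio>

int main()
{
  // best values seen so far at this position: TestSmall scanned alone
  double best_record_count=    301036.0;
  double best_read_time=        66334.0;
  // the new partial plan that starts with TestBig
  double current_record_count= 10000000.0;
  double current_read_time=    0.0 + 211613.0 + 10000000.0 / 5.0;   // = 2,211,613

  if (best_record_count > current_record_count ||
      best_read_time > current_read_time)
    printf("extend this partial plan with the next table\n");
  else
    printf("prune: the plan starting with TestBig (cost so far %.0f) is not extended\n",
           current_read_time);
  return 0;
}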

Now there are no more tables so the whole inspection of plans stops here. And a plan was found: start with TestSmall and extend it by TestBig.

remark

Some lines above I presented this line:
current_record_count =  3,010,360,000,000 = 301,036 * 10,000,000
This line shows the calculation of the number of records fetched for the plan currently inspected. The multiplication implies that for every record fetched from TestSmall the whole table TestBig must be read (or vice versa). This contradicts the final plan, in which each table is read only once, so an addition of the two numbers would describe the execution better than a multiplication. At least this did not affect the decision for a plan.

correctness

This text is a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

Tuesday 1 September 2015

9 records


This text here belongs to the series that describes the handling of a SQL-statement by the optimizer when a table has multiple indexes on it; the older posts can be found in the blog-archive of this site.
In this post I would like to continue this description and describe another tiny point omitted in the last post.


some hints

All posts in this series refer to MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for MySQL.

The example uses MyISAM as the storage engine for the tables and indexes, most of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of the data in this post depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers.


9 records

The very tiny point I want to describe in this post is the number 9, which occurred in the last post but wasn't explained there. The number 9 appeared in these places in the last post:
MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Ind1,Ind2,PZN,Hersteller,PznHerst,HerstPzn
          key: PznHerst
      key_len: 14
          ref: TestOpt.A.PZN,const
         rows: 9
        Extra: Using where; Using index
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
You will find the number in the description of how the second table (=TestBig) is accessed: it is the value 9 in the rows line.

In the same post you will find the number in these lines:
(7) TestBig:
records_read: 9
read_time:    391,573
key:          6  = PznHerst

And the same number also appeared when the cost of the total plan is calculated:
(8)
current_record_count= 2,709,324 = 301,036 * 9
current_read_time   = 999,772  = 66,334 + 391,573 + 2,709,324/5

So we know it has some relationship with the table TestBig. But what does this number mean?


meaning

In the explain above you can find a hint: rows: 9. So the meaning of this number 9 is: when accessing the table TestBig via the index PznHerst the optimizer estimates that it will retrieve 9 records for every tuple (PZN, Hersteller).

And therefore this value is a cost-factor in the calculation of the extended plan, the plan that starts with the table TestSmall and extends this plan by the table TestBig using the index PznHerst.


where does this number come from

So the optimizer uses this number, but where does it get this number from 1)?

A short answer will be: from the MYI-file:
august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01                         uchar file_version[4];
       00 03                               uchar options[2];
       03 28                               uchar header_length[2];
       00 b0                               uchar state_info_length[2];
       00 64                               uchar base_info_length[2];
       01 30                               uchar base_pos[2];
       00 0e                               uchar key_parts[2];                 /* Key parts */    
000010 00 00                               uchar unique_key_parts[2];          /* Key parts + unique parts */
       08                                  uchar keys;                         /* number of keys in file */
       00                                  uchar uniques;                      /* number of UNIQUE definitions */ 
       08                                  uchar language;                     /* Language for indexes */ 
       01                                  uchar max_block_size_index;         /* max keyblock size */ 
       00                                  uchar fulltext_keys; 
       00                                  uchar not_used;                     /* To align to 8 */ 

       00 00                               ulong update_count;                   /* Updated for each write lock */ 
       30                                  uint8 changed;                        /* Changed since myisamchk */ 
       ff                                  uint sortkey;                         /* sorted by this key (not used) */ 
       00 00 00 00 00 98 96 80             MI_STATUS_INFO:   ha_rows records;    /* Rows in table */ 
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   ha_rows del;        /* Removed rows */ 
       00 00 00 00 00 98 96 80             ha_rows split;                        /* number of split blocks */ 
       ff ff ff ff ff ff ff ff             my_off_t dellink;                     /* Link to next removed block */
       00 00 00 00 4e 2e cc 00             MI_STATUS_INFO:   my_off_t key_file_length; 
       00 00 00 00 33 a9 ba 30             MI_STATUS_INFO:   my_off_t data_file_length;    
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t empty;     /* lost space in datafile */
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t key_empty; /* lost space in indexfile */
       00 00 00 00 00 00 00 00             ulonglong auto_increment; 
       00 00 00 00 00 00 00 00             MI_STATE_INFO:   ha_checksum checksum; 
       00 00 28 c6                         ulong process;                        /* process that updated table last */
000070 00 00 00 2e                         ulong unique;                         /* Unique number for this process */ 
       00 00 00 00                         ulong status; 
       00 00 00 01                         ulong update_count;                   /* Updated for each write lock */
# file-offsets in MYI-file 
       00 00 00 00 06 bf b0 00             my_off_t *key_root;                   /* Start of key trees */
                                              first node of index EVP
       00 00 00 00 0e fb cc 00                index Ind1
       00 00 00 00 15 bb 7c 00                index HAP
       00 00 00 00 25 c6 2c 00                index Ind2
       00 00 00 00 2e 4e 50 00                index PZN
       00 00 00 00 35 9e 00 00                index Hersteller
       00 00 00 00 41 e6 64 00                index PznHerst
       00 00 00 00 4e 2e c8 00                index HerstPzn 
#
       ff ff ff ff ff ff ff ff             my_off_t *key_del;                    /* delete links for trees */
       00 00 00 00                         ulong sec_index_changed;              /* Updated when new sec_index */ 
       00 00 00 00                         ulong sec_index_used;                 /* which extra index are in use */ 
       55 20 07 53                         ulong version;                        /* timestamp of create */
0000d0 00 00 00 00 00 00 00 ff             ulonglong key_map;                    /* Which keys are in use */ 
       00 00 00 00 55 20 07 53             time_t create_time;                   /* Time when created database */
0000e0 00 00 00 00 00 00 00 00             time_t recover_time;                  /* Time for last recover */
       00 00 00 00 55 20 09 1d             time_t check_time;                    /* Time for last check */
0000f0 00 00 00 00 00 98 96 80             my_off_t rec_per_key_rows;            /* Rows when calculating rec_per_key */
# for the parts of the indexes: the number of records: 
       00 00 03 71                         ulong rec_per_key_part[]
                                              index EVP, part EVP
       00 00 01 c0                            index Ind1, part PZN
000100 00 00 00 06                            index Ind1, parts PZN + ArtikelText 
       00 00 00 01                            index Ind1, parts PZN + ArtikelText + Hersteller
       00 00 03 d8                            index HAP, part HAP
       00 00 0f 22                            index Ind2, part Hersteller
000110 00 00 00 01                            index Ind2, parts Hersteller + ArtikelText
       00 00 00 01                            index Ind2, parts Hersteller + ArtikelText + PZN
       00 00 01 c0                            index PZN, part PZN
       00 00 0f 22                            index Hersteller, part Hersteller
# --- this is the interesting part: ----------------------------------------------------
000120 00 00 01 c0                            index PznHerst, part PZN 
       00 00 00 09                            index PznHerst, parts PZN + Hersteller
# --------------------------------------------------------------------------------------
       00 00 0f 22                            index HerstPzn, part Hersteller
       00 00 00 09                            index HerstPzn, parts Hersteller + PZN
#
000130 .............
If you look into the file myisamdef.h you will find the typedef struct st_mi_state_info. I've added the variables of this structure, together with some explanations, as a description next to the hex-values taken from the MYI-file.

When you scroll down this list to the end you will see the number 9 and this is exactly the number 9 I'm writing about.

We are here in the area of statistics, these numbers are statistics about the distribution of records (=keys) in the index. There are more statistics but I want to focus on the number 9.

Let's look at the last group of lines in the MYI-file above and start with the first entry. The first index is named EVP and is created on one column (=EVP) only, so you will see only one number in the list above: 0x0371 (=881d). Therefore the optimizer expects to find 881 records on average 2) when accessing this index with a value.
Next there are 3 numbers which belong to the index Ind1, the next index in the list of indexes defined on this table. As before the first number 0x01C0 (=448d) describes the estimated number of records when accessing this index via a tuple (PZN). The second number 0x06 tells that it expects to read 6 records when it accesses the index with a tuple (PZN, ArtikelText), again an incomplete access to the index. And the 3rd number is 0x01, so when you access this index with a tuple (PZN, ArtikelText, Hersteller) you will find only one record. Please keep in mind that these numbers are estimations, as exact numbers are not available.
I want to omit the other entries and describe only the entries for the index PznHerst, as this index is used in the execution of our statement. The first number is 0x01C0 (=448d), so for accessing this index with a tuple (PZN) we can expect to read 448 records on average. The next number is (drum-rolling in the background and a short pause to increase the suspense) 0x09 (end of drum-rolling) and this is the number of interest in this post. So for searching this index with a tuple (PZN, Hersteller) we expect to read 9 records.
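The entries in rec_per_key_part[] are stored as 4-byte big-endian integers. Here is a tiny sketch that decodes the two values for the index PznHerst exactly as they appear in the dump above (the helper be32() is my own, not part of MyISAM):

#include <cstdio>

// decode a 4-byte big-endian value, as stored in rec_per_key_part[]
static unsigned long be32(const unsigned char *p)
{
  return ((unsigned long) p[0] << 24) | ((unsigned long) p[1] << 16) |
         ((unsigned long) p[2] << 8)  |  (unsigned long) p[3];
}

int main()
{
  // the two entries for the index PznHerst, copied from the MYI dump above
  unsigned char part_pzn[4]=        { 0x00, 0x00, 0x01, 0xc0 };
  unsigned char parts_pzn_herst[4]= { 0x00, 0x00, 0x00, 0x09 };

  printf("PznHerst, part PZN:               %lu\n", be32(part_pzn));          // 448
  printf("PznHerst, parts PZN + Hersteller: %lu\n", be32(parts_pzn_herst));   // 9
  return 0;
}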

Done. Well, not quite: how is the number 9 computed?


querying the table

So let's look at the data and check whether these numbers are correct:
MariaDB [TestOpt]> select count( distinct Pzn) from TestBig;
+----------------------+
| count( distinct Pzn) |
+----------------------+
|                22298 |
+----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select count(distinct Hersteller) from TestBig;
+----------------------------+
| count(distinct Hersteller) |
+----------------------------+
|                       2581 |
+----------------------------+
1 row in set (1.82 sec)

MariaDB [TestOpt]> select count( distinct Pzn, Hersteller) from TestBig;
+----------------------------------+
| count( distinct Pzn, Hersteller) |
+----------------------------------+
|                          1158300 |
+----------------------------------+
1 row in set (24.39 sec)

MariaDB [TestOpt]> 

The first query tells us that we will find 22,298 different values in the column PZN. Keep in mind that we have 10 million records in our table, so 10,000,000 / 22,298 = 448 3), et voilà, we have the first of our numbers. If we assume that the values in the column PZN are uniformly distributed we will have 448 records in the table for each value in the column PZN. This is unrealistic, some values will occur more frequently than others, but we are making guesses and as long as we don't have exact numbers we will stick with the value 448.

So the value 9 is left. The second query doesn't help, but the third query is useful. Let's take the result from the 3rd query and compute again: 10,000,000 / 1,158,300 = 9 4) and this is our number! Among the 10 million records in the table TestBig we expect to get 9 records if we query the table using the index PznHerst with a tuple consisting of a value for PZN and Hersteller.

That's it.

But where do these numbers come from?


repair

Let's use an operation that recalculates these numbers:
MariaDB [TestOpt]> repair table TestBig;
+-----------------+--------+----------+----------+
| Table           | Op     | Msg_type | Msg_text |
+-----------------+--------+----------+----------+
| TestOpt.TestBig | repair | status   | OK       |
+-----------------+--------+----------+----------+
1 row in set (7 min 19.10 sec)

MariaDB [TestOpt]> 
I've used this operation because it can be done multiple times without affecting the contents of the table. There are other operations that compute these statistics too, for example ANALYZE TABLE.

In general the algorithm behind the repair-operation is: it copies the data from the MYD-file to a temporary file. For every index it sorts the contents (maybe sorts them in parts and merges the parts to a single stream), counts them, computes the numbers given above and finally writes them into the MYI-file. It deletes the MYD-file and renames the temp-file to TestBig.MYD.

Let's look a bit deeper into the code. Here is a simplified function-hierarchy:
ha_myisam::repair()
    mi_repair_by_sort()
        _create_index_by_sort()
            find_all_keys()
                sort_key_read()
                    sort_get_next_record()
                        _mi_read_cache()
                            _my_b_read()
                                inline_mysql_file_read()
                        _mi_rec_check()
                        _mi_make_key()
            merge_index()
                merge_buffers()
                    sort_key_write()
                        ha_key_cmp()
        update_key_parts()
For our case I'm mostly interested in the functions sort_key_write() and update_key_parts().

In sort_key_write() this variable is defined:
static int sort_key_write(MI_SORT_PARAM *sort_param, const void *a)
{
  uint diff_pos[2];
  .....
This little array is used in the called function ha_key_cmp(). And there this array is used in this form:
    diff_pos[0]    number of first keypart where values differ, counting from one.
    diff_pos[1]    points to first value in tuple b that is different from corresponding value in tuple a.
Let's look at all possible cases and at the results returned in the var. diff_pos. In the function ha_key_cmp() we compare a tuple like (a,b) with another tuple (c,d). These are the possible results:
comparing (a,b) and (c,d)
case #1: a != c
         set diff_pos to (1, 0)
case #2: a == c, b != d
         set diff_pos to (2,8)
case #3: (a,b) == (c,d)
         set diff_pos to (3,14)
As our index is defined on (PZN, Hersteller) no other cases are possible.

In the calling function sort_key_write() the array diff_pos is interpreted and used for counting:
    sort_param->unique[diff_pos[0]-1]++;
This statement increments a counter; it counts how many sequences of identical key-values are found.

When the comparisons and countings are done I found these numbers in the var. sort_param.unique:
sort_param.unique[0] = 22297               [0] + 1 --> 22298
sort_param.unique[1] = 1136002             [0] + [1] + 1 --> 1158300 
sort_param.unique[2] = 8841700             [0] + [1] + [2] + 1 --> 10000000
rest  of list: 0
You can see that the numbers counted are 1 less than the numbers found via the SQL-statement above. And they add up to the number of records in the table.
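To illustrate the counting, here is a toy sketch with made-up key values: it walks a stream of tuples already sorted on (PZN, Hersteller), compares every entry with its predecessor and increments the counter of the first differing keypart, just like sort_key_write() does via diff_pos. Only the counting principle matters here, the data and names are invented:

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main()
{
  // a tiny, made-up key stream, already sorted by (PZN, Hersteller)
  std::vector<std::pair<std::string,std::string> > keys= {
    {"100","A"}, {"100","A"}, {"100","B"},
    {"200","A"}, {"200","A"}, {"200","A"},
    {"300","C"}
  };

  unsigned long uniq[2]= {0, 0};           // mimics sort_param->unique[0..1]
  for (size_t i= 1; i < keys.size(); i++)
  {
    if (keys[i].first != keys[i-1].first)
      uniq[0]++;                           // the first keypart differs
    else if (keys[i].second != keys[i-1].second)
      uniq[1]++;                           // only the second keypart differs
    // completely identical tuples are counted in unique[2] in the real code
  }

  printf("distinct PZN values:              %lu\n", uniq[0] + 1);             // 3
  printf("distinct (PZN,Hersteller) tuples: %lu\n", uniq[0] + uniq[1] + 1);   // 4
  return 0;
}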

In the function update_key_parts() these numbers are used in the computation:
  for (parts=0 ; parts < keyinfo->keysegs  ; parts++)
  {
    count+=unique[parts];
    unique_tuples= count + 1;    
    ......
    else
      tmp= (tuples + unique_tuples/2) / unique_tuples;
Let's do these computations by ourselves:
The var. tuples contains the number of records in the table: 10,000,000.
In the first iteration of the loop the var. unique_tuples contains the value 22,298. The computation results in the value 448 for the var. tmp, and this is the first value you've seen above.
Now we take the second iteration: the var. unique_tuples contains the value 1,158,300 and so the result of the computation is 9.
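Here is a small standalone sketch of exactly this computation, with the counters found above hard-coded; the variable names mirror those in update_key_parts(), but the program itself is only an illustration:

#include <cstdio>

int main()
{
  unsigned long tuples= 10000000;                        // records in TestBig
  unsigned long uniq[2]= { 22297, 1136002 };             // sort_param.unique[], see above
  unsigned long count= 0;

  for (int parts= 0; parts < 2; parts++)                 // the index PznHerst has 2 keyparts
  {
    count+= uniq[parts];
    unsigned long unique_tuples= count + 1;
    unsigned long tmp= (tuples + unique_tuples/2) / unique_tuples;
    printf("rec_per_key_part[%d] = %lu\n", parts, tmp);  // prints 448, then 9
  }
  return 0;
}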

Done. That's the value this post started with. Some lines later it will be written to the MYI-file.


correctness

This text is a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.



some notes:
1) and the numbers for the other indexes too
2) the number is correct if one assumes an even distribution of keys, but what else can we assume?
3) these numbers are rounded, up or down
4) more precise: 8.63

Friday 14 August 2015

Impressum


to be compliant with German law I have to add these lines:

Angaben gemäß § 5 TMG:


August Quint
Gronaustrasse 7
D-65205 Wiesbaden

Kontakt:

Mail: August.Quint@googlemail.com

Verantwortlich für den Inhalt nach § 55 Abs. 2 RStV:

August Quint

Haftungsausschluss (Disclaimer)

Haftung für Inhalte
Als Diensteanbieter sind wir gemäß § 7 Abs.1 TMG für eigene Inhalte auf diesen Seiten nach den allgemeinen Gesetzen verantwortlich. Nach §§ 8 bis 10 TMG sind wir als Diensteanbieter jedoch nicht verpflichtet, übermittelte oder gespeicherte fremde Informationen zu überwachen oder nach Umständen zu forschen, die auf eine rechtswidrige Tätigkeit hinweisen. Verpflichtungen zur Entfernung oder Sperrung der Nutzung von Informationen nach den allgemeinen Gesetzen bleiben hiervon unberührt. Eine diesbezügliche Haftung ist jedoch erst ab dem Zeitpunkt der Kenntnis einer konkreten Rechtsverletzung möglich. Bei Bekanntwerden von entsprechenden Rechtsverletzungen werden wir diese Inhalte umgehend entfernen.
Haftung für Links
Unser Angebot enthält Links zu externen Webseiten Dritter, auf deren Inhalte wir keinen Einfluss haben. Deshalb können wir für diese fremden Inhalte auch keine Gewähr übernehmen. Für die Inhalte der verlinkten Seiten ist stets der jeweilige Anbieter oder Betreiber der Seiten verantwortlich. Die verlinkten Seiten wurden zum Zeitpunkt der Verlinkung auf mögliche Rechtsverstöße überprüft. Rechtswidrige Inhalte waren zum Zeitpunkt der Verlinkung nicht erkennbar. Eine permanente inhaltliche Kontrolle der verlinkten Seiten ist jedoch ohne konkrete Anhaltspunkte einer Rechtsverletzung nicht zumutbar. Bei Bekanntwerden von Rechtsverletzungen werden wir derartige Links umgehend entfernen.
Urheberrecht
Die durch die Seitenbetreiber erstellten Inhalte und Werke auf diesen Seiten unterliegen dem deutschen Urheberrecht. Die Vervielfältigung, Bearbeitung, Verbreitung und jede Art der Verwertung außerhalb der Grenzen des Urheberrechtes bedürfen der schriftlichen Zustimmung des jeweiligen Autors bzw. Erstellers. Downloads und Kopien dieser Seite sind nur für den privaten, nicht kommerziellen Gebrauch gestattet. Soweit die Inhalte auf dieser Seite nicht vom Betreiber erstellt wurden, werden die Urheberrechte Dritter beachtet. Insbesondere werden Inhalte Dritter als solche gekennzeichnet. Sollten Sie trotzdem auf eine Urheberrechtsverletzung aufmerksam werden, bitten wir um einen entsprechenden Hinweis. Bei Bekanntwerden von Rechtsverletzungen werden wir derartige Inhalte umgehend entfernen.

Datenschutzerklärung:

Datenschutz
Die Nutzung unserer Webseite ist in der Regel ohne Angabe personenbezogener Daten möglich. Soweit auf unseren Seiten personenbezogene Daten (beispielsweise Name, Anschrift oder eMail-Adressen) erhoben werden, erfolgt dies, soweit möglich, stets auf freiwilliger Basis. Diese Daten werden ohne Ihre ausdrückliche Zustimmung nicht an Dritte weitergegeben.
Wir weisen darauf hin, dass die Datenübertragung im Internet (z.B. bei der Kommunikation per E-Mail) Sicherheitslücken aufweisen kann. Ein lückenloser Schutz der Daten vor dem Zugriff durch Dritte ist nicht möglich.
Der Nutzung von im Rahmen der Impressumspflicht veröffentlichten Kontaktdaten durch Dritte zur Übersendung von nicht ausdrücklich angeforderter Werbung und Informationsmaterialien wird hiermit ausdrücklich widersprochen. Die Betreiber der Seiten behalten sich ausdrücklich rechtliche Schritte im Falle der unverlangten Zusendung von Werbeinformationen, etwa durch Spam-Mails, vor.

Datenschutzerklärung für die Nutzung von Google Analytics
Diese Website benutzt Google Analytics, einen Webanalysedienst der Google Inc. ("Google"). Google Analytics verwendet sog. "Cookies", Textdateien, die auf Ihrem Computer gespeichert werden und die eine Analyse der Benutzung der Website durch Sie ermöglichen. Die durch den Cookie erzeugten Informationen über Ihre Benutzung dieser Website werden in der Regel an einen Server von Google in den USA übertragen und dort gespeichert. Im Falle der Aktivierung der IP-Anonymisierung auf dieser Webseite wird Ihre IP-Adresse von Google jedoch innerhalb von Mitgliedstaaten der Europäischen Union oder in anderen Vertragsstaaten des Abkommens über den Europäischen Wirtschaftsraum zuvor gekürzt.
Nur in Ausnahmefällen wird die volle IP-Adresse an einen Server von Google in den USA übertragen und dort gekürzt. Im Auftrag des Betreibers dieser Website wird Google diese Informationen benutzen, um Ihre Nutzung der Website auszuwerten, um Reports über die Websiteaktivitäten zusammenzustellen und um weitere mit der Websitenutzung und der Internetnutzung verbundene Dienstleistungen gegenüber dem Websitebetreiber zu erbringen. Die im Rahmen von Google Analytics von Ihrem Browser übermittelte IP-Adresse wird nicht mit anderen Daten von Google zusammengeführt.
Sie können die Speicherung der Cookies durch eine entsprechende Einstellung Ihrer Browser-Software verhindern; wir weisen Sie jedoch darauf hin, dass Sie in diesem Fall gegebenenfalls nicht sämtliche Funktionen dieser Website vollumfänglich werden nutzen können. Sie können darüber hinaus die Erfassung der durch das Cookie erzeugten und auf Ihre Nutzung der Website bezogenen Daten (inkl. Ihrer IP-Adresse) an Google sowie die Verarbeitung dieser Daten durch Google verhindern, indem sie das unter dem folgenden Link verfügbare Browser-Plugin herunterladen und installieren: http://tools.google.com/dlpage/gaoptout?hl=de.

Datenschutzerklärung für die Nutzung von Google Adsense
Diese Website benutzt Google AdSense, einen Dienst zum Einbinden von Werbeanzeigen der Google Inc. ("Google"). Google AdSense verwendet sog. "Cookies", Textdateien, die auf Ihrem Computer gespeichert werden und die eine Analyse der Benutzung der Website ermöglicht. Google AdSense verwendet auch so genannte Web Beacons (unsichtbare Grafiken). Durch diese Web Beacons können Informationen wie der Besucherverkehr auf diesen Seiten ausgewertet werden.
Die durch Cookies und Web Beacons erzeugten Informationen über die Benutzung dieser Website (einschließlich Ihrer IP-Adresse) und Auslieferung von Werbeformaten werden an einen Server von Google in den USA übertragen und dort gespeichert. Diese Informationen können von Google an Vertragspartner von Google weiter gegeben werden. Google wird Ihre IP-Adresse jedoch nicht mit anderen von Ihnen gespeicherten Daten zusammenführen.
Sie können die Installation der Cookies durch eine entsprechende Einstellung Ihrer Browser Software verhindern; wir weisen Sie jedoch darauf hin, dass Sie in diesem Fall gegebenenfalls nicht sämtliche Funktionen dieser Website voll umfänglich nutzen können. Durch die Nutzung dieser Website erklären Sie sich mit der Bearbeitung der über Sie erhobenen Daten durch Google in der zuvor beschriebenen Art und Weise und zu dem zuvor benannten Zweck einverstanden.

Datenschutzerklärung für die Nutzung von Google +1
Erfassung und Weitergabe von Informationen:
Mithilfe der Google +1-Schaltfläche können Sie Informationen weltweit veröffentlichen. über die Google +1-Schaltfläche erhalten Sie und andere Nutzer personalisierte Inhalte von Google und unseren Partnern. Google speichert sowohl die Information, dass Sie für einen Inhalt +1 gegeben haben, als auch Informationen über die Seite, die Sie beim Klicken auf +1 angesehen haben. Ihre +1 können als Hinweise zusammen mit Ihrem Profilnamen und Ihrem Foto in Google-Diensten, wie etwa in Suchergebnissen oder in Ihrem Google-Profil, oder an anderen Stellen auf Websites und Anzeigen im Internet eingeblendet werden.
Google zeichnet Informationen über Ihre +1-Aktivitäten auf, um die Google-Dienste für Sie und andere zu verbessern. Um die Google +1-Schaltfläche verwenden zu können, benötigen Sie ein weltweit sichtbares, öffentliches Google-Profil, das zumindest den für das Profil gewählten Namen enthalten muss. Dieser Name wird in allen Google-Diensten verwendet. In manchen Fällen kann dieser Name auch einen anderen Namen ersetzen, den Sie beim Teilen von Inhalten über Ihr Google-Konto verwendet haben. Die Identität Ihres Google-Profils kann Nutzern angezeigt werden, die Ihre E-Mail-Adresse kennen oder über andere identifizierende Informationen von Ihnen verfügen.



Monday 3 August 2015

optimizer: indexes (2)

Some time ago I started to describe what's happening in the code of (mostly) MariaDB when it handles a SQL-statement. The SQL-statement in this series of posts is always the same, the structure of the tables involved and the data stored in the tables are identical; I've changed only a few parts around it and looked at how the code reacts to these modifications. With this post I would like to continue this.

This text here belongs to the series that describes the handling of multiple indexes on one table. These are the older posts:
  • this series started here: indexes
  • one minor piece omitted in the above post was described here: 0xA8
  • another tiny piece omitted but described in detail here: 0xCA
In this post I would like to continue this description.


the environment

I want to describe the environment I'm using (again).

For this series of posts I'm looking at how the optimizer in the code of MariaDB 1) inspects a SQL-statement and finds a good plan for collecting the data requested in the SQL-statement. In this post I look at how the optimizer treats indexes on tables, so I created these indexes:
MariaDB [TestOpt]> show index from TestSmall;
Empty set (0.00 sec)

MariaDB [TestOpt]> show index from TestBig;  
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | EVP        |            1 | EVP         | A         |       11350 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            2 | ArtikelText | A         |     1666666 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            3 | Hersteller  | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HAP        |            1 | HAP         | A         |       10162 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            2 | ArtikelText | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            3 | PZN         | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PZN        |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PznHerst   |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PznHerst   |            2 | Hersteller  | A         |     1111111 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HerstPzn   |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HerstPzn   |            2 | PZN         | A         |     1111111 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
14 rows in set (0.00 sec)

MariaDB [TestOpt]> 
As you can see, the table TestSmall contains no indexes at all and the table TestBig contains 8 indexes. Please do not try to see any meaning in these indexes, they are only intended for the demonstration here.


the statement

This is the statement whose way through the code of MariaDB I want to follow:
MariaDB [(none)]> use TestOpt;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.45 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Ind1,Ind2,PZN,Hersteller,PznHerst,HerstPzn
          key: PznHerst
      key_len: 14
          ref: TestOpt.A.PZN,const
         rows: 9
        Extra: Using where; Using index
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
I've included the explain of the statement.

You can see the optimizer chooses to start with the table TestSmall doing a full table-scan (no index available) and reading 300K records. After reading one record from TestSmall it applies the WHERE-clause to this record and when a match is found it switches over to the table TestBig and searches for records matching the value for Hersteller (from the WHERE-clause) and PZN (from the record just read from TestSmall) using the index PznHerst.


warning

As in the last posts I did all my tests with MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanation should also be valid for other versions and for MySQL 2).

The example uses MyISAM as the storage engine for the tables and indexes, some (but not all) of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers or plans.


prerequisites

In an earlier post I ended with the cost-values for accessing each table alone. Let me repeat the results:
TestSmall:
found_records  =    301,036
read_time      =      6,127
s->worst_seeks =     18,382
TestBig:
found_records  = 10,000,000
read_time      =    211,613
s->worst_seeks =    634,840
As there are indexes defined on the table TestBig, the numbers of records to inspect when this table is accessed via an index are computed too:
table TestBig:
index 3 = Ind2       = 204,922 records
index 5 = Hersteller = 455,132 records
index 7 = HerstPzn   = 380,971 records
Please keep in mind that these are estimated values.


choosing a plan

A lot of words until here, so I want to come to the topic of this post: how does the code make the decision for the best plan, the plan described in the explain above? Which alternatives are inspected? What are the arguments for this plan? Why are alternatives rejected?

Here is the function hierarchy of the handling of this statement during inspection and decision-making:
make_join_statistics()
    choose_plan()
        greedy_search()
            best_extension_by_limited_search()
                best_access_path()        --> (1)     will explain this some lines later
                --> (2)
                best_extension_by_limited_search()
                    best_access_path()    --> (3)
                --> (4)
                best_access_path()        --> (5)
                --> (6)
                best_extension_by_limited_search()
                    best_access_path()     --> (7)
                --> (8)
warning: this hierarchy is valid for the SQL-statement given above and for the environment I am working in. If you do your own tests this hierarchy may look different. And in the hierarchy I've added some hints to steps which I would like to explain in some detail later.

Let's start with an overview of the steps involved in the inspection of this statement:
The strategy is to start with a table and calculate the cost of accessing this table alone. This plan is extended by adding the second table to it and calculating the cost of accessing the records of the second table; finally the cost of accessing these 2 tables together is calculated. And as no more tables exist 3) this calculation is finished.
Now the code tries the opposite way: the inspection begins with the second table and calculates the cost of accessing this table alone. Then this plan is extended by adding the first table, the cost of accessing this table is calculated, and finally the costs are combined.

As we now have two values a comparison is possible and the plan with the lower cost-value is chosen; the other plan and its values are forgotten. As no other possibilities exist the computation stops and we have a plan to execute. So the server starts executing the plan computed.

Let's look a bit deeper into the code. First I want to show the places in the code whose computation results I will present soon.
The hints above marked as (1), (3), (5) and (7) refer to these lines of code in the function best_access_path():
  /* Update the cost information for the current partial plan */
  pos->records_read= records;
  pos->read_time=    best;
  pos->key=          best_key;
  pos->table=        s;
  pos->ref_depend_map= best_ref_depends_map;
  pos->loosescan_picker.loosescan_key= MAX_KEY;
  pos->use_join_buffer= best_uses_jbuf;
 

the other hints refer to the function best_extension_by_limited_search() and especially these lines of code:
      /* Find the best access method from 's' to the current partial plan */
      POSITION loose_scan_pos;
      best_access_path(join, s, remaining_tables, idx, disable_jbuf,
                       record_count, join->positions + idx, &loose_scan_pos);

      /* Compute the cost of extending the plan with 's' */

      current_record_count= record_count * position->records_read;
      current_read_time=read_time + position->read_time +
                        current_record_count / (double) TIME_FOR_COMPARE;

So let's dive into the code and look at what the results of the computations are.

Everything starts from scratch with an empty plan, so the first table inspected is the table TestBig.
Stop! Why do we start with the table TestBig? If you look at the SQL-statement this table is found in the second place and it's the table with the most records in it. So why?

If you look into the code of choose_plan() you see this call near the beginning:
  my_qsort2(join->best_ref + join->const_tables,
            join->table_count - join->const_tables, sizeof(JOIN_TAB*),
            jtab_sort_func, (void*)join->emb_sjm_nest);
In this function the entries are sorted using the comparison-function join_tab_cmp() (via the var. jtab_sort_func). And in this function the number of records are compared:
  if (jt1->found_records > jt2->found_records)
    return 1;
and as the values are 301,036 (for jt1, aka. TestSmall) and 204,922 (for jt2, aka. TestBig) it returns a value of 1, which results in swapping the entries. So the JOIN_TAB representing the table TestBig is in the first position of the list and therefore the code starts the inspection with this table.
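A minimal sketch of this sorting step, with the two record estimates from above and a simple comparator standing in for join_tab_cmp(); the struct and names are invented for the illustration:

#include <algorithm>
#include <cstdio>
#include <vector>

struct JoinTab { const char *name; unsigned long found_records; };

int main()
{
  // the two tables with the record estimates mentioned above
  std::vector<JoinTab> tabs= { {"TestSmall", 301036}, {"TestBig", 204922} };

  // mimic join_tab_cmp(): order by the (estimated) number of records, ascending
  std::sort(tabs.begin(), tabs.end(),
            [](const JoinTab &a, const JoinTab &b)
            { return a.found_records < b.found_records; });

  for (const JoinTab &t : tabs)
    printf("%s (%lu records)\n", t.name, t.found_records);   // TestBig comes first
  return 0;
}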

And for the table TestBig it calculates the costs, so here is the result of the step marked with (1):
(1) TestBig: 
records_read: 204,922
read_time:    19,858
key:          3 = Ind2

The whole cost for accessing this table alone is computed as:
(2)
current_record_count= 204,922
current_read_time   = 60,843

Let's add the next table to this partial plan. So we inspect the table TestSmall and find this path:
(3) TestSmall:
records_read: 301,036
read_time:    128,679
key;          NULL

Now we put all the numbers together and calculate the cost of accessing both tables by the ways found so far:
(4) TestBig + TestSmall:
current_record_count= 61,688,899,192 = 204,922 * 301,036
current_read_time   = 12,337,969,360 =  60,843 + 128,679 + 61,688,899,192/5

If you look at the numbers you will see that the values involved in the computation are a combination of the numbers taken from step (2) and (3).

Here the inspection of this plan ends and we store it:
        if (current_read_time < join->best_read)  
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions, 
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

The best plan so far has this (initial) value: join->best_read = DBL_MAX = 1.7976931348623157e+308
so our plan just computed has a better value and is therefore copied to join->best_positions.

So we go back to the beginning but in this case we start with the next table in the list: TestSmall. Here are the results:
(5) TestSmall
records_read: 301036
read_time:    6127
key:          NULL

The whole cost for accessing this table alone is computed as:
(6)
current_record_count= 301,036
current_read_time   = 66,334

We extend this plan by adding the table TestBig. For accessing this table these costs are computed:
(7) TestBig:
records_read: 9
read_time:    391,573
key:          6  = PznHerst

From these numbers we calculate the cost of the total plan:
(8)
current_record_count= 2,709,324 = 301,036 * 9
current_read_time   = 999,772  = 66,334 + 391,573 + 2,709,324/5

As no more tables are in our list the inspection stops here and we can compare the cost of this plan with the best plan so far:
        if (current_read_time < join->best_read)  
        {
          memcpy((uchar*) join->best_positions, (uchar*) join->positions,   
                 sizeof(POSITION) * (idx + 1));
          join->record_count= partial_join_cardinality;
          join->best_read= current_read_time - 0.001;
        }

and as 999,772 is less than 12,337,969,360 the code copies the plan to the final position.
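To see the two plans side by side, here is a small standalone sketch that redoes the cost arithmetic of the steps (1) to (8) with the numbers from above (TIME_FOR_COMPARE = 5; the helper extend() is mine, not MariaDB code) and picks the cheaper plan:

#include <cstdio>

static const double TIME_FOR_COMPARE= 5.0;

// extend a partial plan by one table, following the formula from
// best_extension_by_limited_search() quoted above
static void extend(double &record_count, double &read_time,
                   double records_read, double access_time)
{
  record_count*= records_read;
  read_time+= access_time + record_count / TIME_FOR_COMPARE;
}

int main()
{
  // plan 1: TestBig first (via index Ind2), then TestSmall by table-scan
  double rc1= 1.0, rt1= 0.0;
  extend(rc1, rt1, 204922.0, 19858.0);      // steps (1)/(2)
  extend(rc1, rt1, 301036.0, 128679.0);     // steps (3)/(4)

  // plan 2: TestSmall first by table-scan, then TestBig via index PznHerst
  double rc2= 1.0, rt2= 0.0;
  extend(rc2, rt2, 301036.0, 6127.0);       // steps (5)/(6)
  extend(rc2, rt2, 9.0, 391573.0);          // steps (7)/(8)

  printf("TestBig -> TestSmall: cost %.0f\n", rt1);   // about 12,337,969,360
  printf("TestSmall -> TestBig: cost %.0f\n", rt2);   // about 999,772
  printf("chosen plan: %s\n",
         rt2 < rt1 ? "TestSmall -> TestBig" : "TestBig -> TestSmall");
  return 0;
}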

We would go back to the beginning but as there are no more tables on the list the computation stops here and we finally have a plan to execute.

That's all. The best plan is found and in the next step it will be executed.


correctness

This text is a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.


some notes:
1) in this post I'm using version 10.0.10
2) I didn't check this, sorry
3) valid for this test-case = for this statement

Friday 10 July 2015

0xCA

In the last post I described a numerical value: 0xA8, which was used in an earlier post but not explained there - the last post cleared this up (I hope). If you read that earlier post you will see another value used: 0xCA, again not explained, so I want to use this post to explain the meaning of this value and where it comes from.

some hints

As in the last post I did all my tests here with MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for MySQL.

Please do not generalize the results presented here. A lot of the data in this post depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like the values 0xA8 in JOIN_TAB->const_keys).

usage

Where is the value 0xCA used? Let's simply look into the code:
/**
  Find shortest key suitable for full table scan.

  @param table                 Table to scan
  @param usable_keys           Allowed keys

  @return
    MAX_KEY     no suitable key found
    key index   otherwise
*/

uint find_shortest_key(TABLE *table, const key_map *usable_keys)
{
  double min_cost= DBL_MAX;
  uint best= MAX_KEY;
  if (!usable_keys->is_clear_all())
  {
    for (uint nr=0; nr < table->s->keys ; nr++)
    {
      if (usable_keys->is_set(nr))
      {
        double cost= table->file->keyread_time(nr, 1, table->file->records());
        if (cost < min_cost)
        {
          min_cost= cost;
          best=nr;
        }
      }
    }
  }
  return best;
}
As you can see it looks for the (estimated) fastest way to do a full table-scan using an index. For finding the best way it iterates over all indexes defined for the table, and the parameter usable_keys with the value 0xCA tells the code which indexes contain usable information and therefore for which of them costs should be calculated.

And this is the function hierarchy:
make_join_statistics()
    get_quick_record_count()
        SQL_SELECT::test_quick_select()        parameter: &head->covering_keys
            find_shortest_key()                parameter: usable_keys
As shown above the parameter has the name usable_keys in the function find_shortest_key(). One level up in the hierarchy it comes from head->covering_keys:
SQL_SELECT::test_quick_select(
    ........
    /* Calculate cost of full index read for the shortest covering index */
    if (!head->covering_keys.is_clear_all())
    {
      int key_for_use= find_shortest_key(head, &head->covering_keys);
      ....
    }
    .........
}
And head is a variable of type TABLE *, here is the definition:
class SQL_SELECT :public Sql_alloc {
 public:
  ....
  TABLE *head;
  ....
};
So let's look into the definition of a TABLE:
struct TABLE
{
  ....
  /* 
    Map of keys that can be used to retrieve all data from this table 
    needed by the query without reading the row.
  */
  key_map covering_keys;
  ....
};

where does it come from

Where does covering_keys get its value? This happens here:
static void update_field_dependencies(THD *thd,
                                      Field *field,
                                      TABLE *table)
{
    ....
    table->covering_keys.intersect(field->part_of_key);   
    ....
  DBUG_VOID_RETURN;
}
This function is called with the parameters table TestBig and field Hersteller. In this call the operation is 0xFF AND 0xEA, resulting in a value of 0xEA. In a second call the field is PZN and it computes 0xEA AND 0xDA, resulting in a value of 0xCA, our value.

Some lines above I've shown where covering_keys belongs to, and I will soon explain where part_of_key belongs to.

Here is the function hierarchy:
handle_select()
    mysql_select()
        JOIN::prepare()
            setup_without_group()
                setup_conds()
                    Item_cond::fix_fields()
                        Item_func::fix_fields()
                            Item_field::fix_fields()
                                find_field_in_tables()
                                    find_field_in_table_ref()
                                        find_field_in_table()
                                            update_field_dependencies()
Warning:: the function-hierarchy shown here depends on the SQL-statement used. So in your tests this hierarchy may look different.

Now we have this status: I wanted to explain where the value 0xCA comes from and I said: take the values 0xFF, 0xEA and 0xDA and AND them together, this will produce the value 0xCA. So now I have to explain three values instead of one, a remarkable progress.

part_of_key

The member-var. part_of_key is defined in the class Field:
 class Field
{
  ......
  /* Field is part of the following keys */
  key_map key_start, part_of_key, part_of_key_not_clustered;
  ....
};
and it contains, as a bitmap, the information in which indexes this field is used. The var. key_start was explained in the last post.

So here is a list of all indexes defined in my case on the table TestBig:
MariaDB [TestOpt]> show index from TestBig;  
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | EVP        |            1 | EVP         | A         |       11350 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            2 | ArtikelText | A         |     1666666 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind1       |            3 | Hersteller  | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HAP        |            1 | HAP         | A         |       10162 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            2 | ArtikelText | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Ind2       |            3 | PZN         | A         |    10000000 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PZN        |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PznHerst   |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | PznHerst   |            2 | Hersteller  | A         |     1111111 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HerstPzn   |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | HerstPzn   |            2 | PZN         | A         |     1111111 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
14 rows in set (0.00 sec)

MariaDB [TestOpt]> 
The index with the name EVP is the first one in the list and in the bitmaps it is represented by bit 0. Similarly the index with the name Ind1 is the second one defined and therefore represented by bit 1 in the bitmaps, and so on. So I can explain the values used (a small decoding sketch follows this list):
  • 0xFF = 011111111b: as we have 8 indexes this value represents an initial value: use all indexes
  • 0xEA = 011101010b: one may use indexes Ind1, Ind2, Hersteller, PznHerst and HerstPzn: all indexes which use the field Hersteller
  • 0xDA = 011011010b: one may use indexes Ind1, Ind2, PZN, PznHerst and HerstPzn: all indexes which use the field PZN
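
As an illustration only (the order of the index names is taken from the SHOW INDEX output above; the code is my own, not part of MariaDB), this small sketch decodes such a bitmap into index names, here for the final value 0xCA:
#include <bitset>
#include <cstdio>

int main()
{
  // index names in creation order: bit 0 = EVP, bit 1 = Ind1, ...
  const char *names[8] = { "EVP", "Ind1", "HAP", "Ind2",
                           "PZN", "Hersteller", "PznHerst", "HerstPzn" };

  std::bitset<8> usable_keys(0xCA);                      // the value computed above

  for (int bit = 0; bit < 8; bit++)
    if (usable_keys.test(bit))
      printf("bit %d set -> index %s\n", bit, names[bit]);
  // prints Ind1, Ind2, PznHerst and HerstPzn: the indexes containing both PZN and Hersteller
  return 0;
}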

looking for part_of_key: You will find this code-excerpt in the function TABLE_SHARE::init_from_binary_frm_image():
    for (uint key=0 ; key < keys ; key++,keyinfo++)                     // iterate over the indexes defined
    {
      .....
      for (i=0; i < key_parts; key_part++, i++)                         // iterate over the parts of this index
      {
        Field *field;
        ....
        field= key_part->field= share->field[key_part->fieldnr-1];      // get the Field-object for this part of this index
        ....
        if (i == 0)                                                     // if this is the first part of the index
          field->key_start.set_bit(key);                                // set the corresponding bit in key_start (see last post)
        if (field->key_length() == key_part->length &&
            !(field->flags & BLOB_FLAG))
        {
          if (handler_file->index_flags(key, i, 0) & HA_KEYREAD_ONLY)
          {
            share->keys_for_keyread.set_bit(key);                       // a bit careless: executed 14 times to set 8 bits
            field->part_of_key.set_bit(key);                         // here these bits are set
            if (i < keyinfo->user_defined_key_parts)
              field->part_of_key_not_clustered.set_bit(key);
          }
          if (handler_file->index_flags(key, i, 1) & HA_READ_ORDER)
            field->part_of_sortkey.set_bit(key);
        }
        ....
      }
      ....
    }

The value covering all indexes (aka 0xFF) is copied here:
inline void setup_table_map(TABLE *table, TABLE_LIST *table_list, uint tablenr)
{
  ....
  table->covering_keys= table->s->keys_for_keyread;
  ....
}

With these lines I want to stop. I hope I could explain where the value 0xCA comes from and what it means.


correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

Thursday 11 June 2015

0xA8

In my last post I showed how the optimizer handles the case that multiple indexes exist on one table. In that text I described how accessing a table via an index-scan is inspected. Here is the code in the function SQL_SELECT::test_quick_select() that decides whether to inspect or to ignore an index:
int SQL_SELECT::test_quick_select(
                  ....
                  key_map keys_to_use,
                  ....)
{
    ....
    for (idx=0 ; idx < head->s->keys ; idx++, key_info++)
    {
      ....
      if (!keys_to_use.is_set(idx))
         continue;
      ....
    } 
    ....
}

and here is the function-hierarchy:
make_join_statistics()
   get_quick_record_count()
      SQL_SELECT::test_quick_select()

The var. keys_to_use is a bitmap (of type key_map) and 64 bits long. Every bit represents an index which can or cannot be used. As the value of this variable was 0xA8 (= 010101000b) during my test, only the indexes 3, 5 and 7 are inspected, the others are omitted.
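
Reduced to the essentials (plain C++ of mine, a std::bitset instead of the real key_map), the effect of the is_set() check with keys_to_use = 0xA8 looks like this:
#include <bitset>
#include <cstdio>

int main()
{
  std::bitset<64> keys_to_use(0xA8);          // bits 3, 5 and 7 are set

  for (int idx = 0; idx < 8; idx++)           // 8 indexes are defined on TestBig
  {
    if (!keys_to_use.test(idx))
      continue;                               // this index is ignored
    printf("inspecting index %d\n", idx);     // printed for idx 3, 5 and 7
  }
  return 0;
}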

In this post I want to show where this number 0xA8 came from. Here I take only a small piece of a much bigger mosaic, but I want to inspect this tiny piece in more detail. The var. keys_to_use from the code-excerpt above is a parameter of SQL_SELECT::test_quick_select(). This function is called from get_quick_record_count(), where this parameter is named keys, and that function is called from make_join_statistics(), where this parameter is &s->const_keys, with s being a pointer to a JOIN_TAB. So we will have to find out where JOIN_TAB.const_keys is set.

some hints

As in the last post I did all my tests with MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for MySQL.

Please do not generalize the results presented here. A lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like the value 0xA8 in JOIN_TAB->const_keys).

the statement

The whole statement is internally represented by an object of type JOIN, which among other variables contains a pointer to 2 objects 1) of type JOIN_TAB (in sql_select.h):
class JOIN :public Sql_alloc
{
   .... 
public:
  JOIN_TAB *join_tab, **best_ref;
   .... 
}

The struct JOIN_TAB has a member-variable const_keys, defined (in sql_select.h):
typedef struct st_join_table {
  ....
  key_map const_keys;   /**< Keys with constant part */
  ....
} JOIN_TAB;

The JOIN_TAB representing the table TestSmall is not of interest because no indexes were defined on this table, but for the JOIN_TAB representing the table TestBig this member-variable contains the value 0xA8. So here is how this variable is filled:
make_join_statistics()
    update_ref_and_keys()
        add_key_fields()            // calls itself recursively
            add_key_field()
                Field::get_possible_keys()

Each column of a table is represented internally by an object of type Field2), and the TABLE_SHARE keeps pointers to these objects (in table.h):
/**
  This structure is shared between different table objects. There is one
  instance of table share per one table in the database.
*/

struct TABLE_SHARE
{
  ....
  /* The following is copied to each TABLE on OPEN */
  Field **field;
  ....
};

And as the value is returned from the function Field::get_possible_keys(), here is an excerpt of the comment and the function (in field.cc):
/*
  @brief
  Return possible keys for a field

  @details
  Return bit map of keys over this field which can be used by the range
  optimizer. For a field of a generic table such keys are all keys that starts
  from this field.
  ....
  @return map of possible keys
*/

key_map Field::get_possible_keys()
{
  DBUG_ASSERT(table->pos_in_table_list);
  return (table->pos_in_table_list->is_materialized_derived() ?
          part_of_key : key_start);
}

For the object representing the column Hersteller the function Field::get_possible_keys() returns the value 0xA8 (from the member-var. key_start3)). Here is an excerpt from add_key_field() that shows how this value gets into JOIN_TAB.const_keys:
      ....
      JOIN_TAB *stat=field->table->reginfo.join_tab;
      key_map possible_keys=field->get_possible_keys();
      ....
      if (is_const)
      {
        stat[0].const_keys.merge(possible_keys);
         ....
      }

In words: the function Field::get_possible_keys() returns the value 0xA8 for the object representing the column Hersteller; this value is stored in the var. possible_keys and some lines later merged into JOIN_TAB.const_keys.

So we can narrow down the question to: where does the Field-object get this value?

Field

And the answer is: these values are set while handling the statement USE Database. Let me describe how this is done.

Here is the function-hierarchy:
do_handle_one_connection()
    do_command()
        dispatch_command()
            mysqld_list_fields()
                open_normal_and_derived_tables()
                    open_tables()
                        open_and_process_table()
                            open_table()
                                tdc_acquire_share()
                                    open_table_def()
                                        TABLE_SHARE::init_from_binary_frm_image()
                                            create_key_infos()
                                            make_field()

So here is the way to the value 0xA8:
In open_table_def() the file "./TestOpt/TestBig.frm" is opened and the first 64 bytes are read from this file. The length of the frm-file is taken from bytes 10 to 13 of this block. The 64 bytes just read are copied to the var. buf, and the rest of the frm-file is appended to buf by reading the rest of the file. Then the file is closed and the function init_from_binary_frm_image() is called.
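
Here is a small stand-alone sketch of mine (not the server code) that reads the first 64 bytes of a frm-file and extracts this length; with the TestBig.frm shown below it should print 0x11e9:
#include <cstdio>
#include <cstdint>

int main()
{
  FILE *f = fopen("./TestOpt/TestBig.frm", "rb");
  if (f == NULL)
    return 1;

  unsigned char head[64];                                // the first 64 bytes of the frm-file
  if (fread(head, 1, sizeof(head), f) != sizeof(head))
  {
    fclose(f);
    return 1;
  }
  fclose(f);

  // bytes 10 to 13: length of the whole frm-file, stored little-endian
  uint32_t frm_len = (uint32_t) head[10]         |
                     ((uint32_t) head[11] << 8)  |
                     ((uint32_t) head[12] << 16) |
                     ((uint32_t) head[13] << 24);
  printf("length of frm-file: 0x%x (%u bytes)\n", (unsigned) frm_len, (unsigned) frm_len);
  return 0;
}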

In TABLE_SHARE::init_from_binary_frm_image() the internal copy of the frm-file is interpreted and the needed information is extracted. You will find a description of the structure of the frm-file here, so let's look into the file TestBig.frm; all comments are made by me and they are not complete4):
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --read-bytes=1024 TestBig.frm
000000 fe 01 
       0a                             version of frm
       09 
       12 00                          len of extra2-segment (MariaDB)
       56 00                          offset -> disk_buff
       01 00 
       e9 11 00 00                    len of frm-file
       58 0f
000010 61 00                          len of record
       00 00 00 00                    max. rows
       00 00 00 00                    min. rows
       00                             flag: one record db
       02 f9 00 
       09 00                          DB create options
000020 00                             flag: new frm-file
       05 
       00 00 00 00                    avg. row len
       08
       00                             transaction and page checksum: HA_CHOICE_UNDEF
       00                             
       00                             row type: ROW_TYPE_DEFAULT
       00
       00 00                          stats_sample_pages
       00                             HA_STATS_AUTO_RECALC_DEFAULT
       00 00 58
000030 0f 00 00 
       aa 86 01 00                    MySQL version
       10 00 00 00                    len of extra-segment (starts in 0x100F, contains name of engine + more)
       00 00                          
       00 
       00 00                          key block size
# extra2-segment:
000040 00                             type
       10                             len of segment
       c1 a5 3b b2 da e1 11 e4 
       82 e3 bc 5f f4 85
000050 7f 33 
       1f 10 00 00                    var. pos = offset into frm-file  = 0x101F
                                      position of pos = 0x12 (value taken from 4-5) + constant FRM_HEADER_SIZE (=64d)
# disk_buff:
000056 08                             no. of indexes defined
       0e                             no. of columns used in indexes 5)
       00 00 
       35 00                          len of something
# first index: EVP
00005C 41 00                          flags   -> 0x1000 index has comments ???
       04 00                          len of key
       01                             no. of parts of this index
       00                             algorithm: HA_KEY_ALG_UNDEF
       00 00                          size of block
# first part of this index: column EVP
       03 80                          no. of column, ANDed with 0x3ff
       11 00                          offset in record (-1 -> 16d) ???
       00                             flag for key-part
       03 82                          key-type
       04 00                          len
# second index: Ind1
       69 00 
       26 00 
       03 
       00 
       00 00 
# first part of this index: column PZN
       02 80 
       0a 00 
       00 
       00 80 
       07 00 
# second part of this index: column ArtikelText
       06 80 42 00 00 00 80 1a 00 
# third part of this index: column Hersteller
       07 80 5d 00 00 00 80 05 00 
# third index (including description of parts):
       41 00 04 00 01 00 00 00        index HAP
       04 80 15 00 00 03 82 04 00     column HAP
# fourth index:
       69 00 26 00 03 00 00 00        index Ind2
       07 80 5d 00 00 00 80 05 00     column Hersteller
       06 80 42 00 00 00 80 1a 00     column ArtikelText
       02 80 0a 00 00 00 80 07 00     column PZN
# fifth index:
       41 00 07 00 01 00 00 00        index PZN
       02 80 0a 00 00 00 80 07 00     column PZN
# sixth index:
       41 00 05 00 01 00 00 00        index Hersteller
       07 80 5d 00 00 00 80 05 00     column Hersteller
# seventh index:
       41 00 0c 00 02 00 00 00        index PznHerst
       02 80 0a 00 00 00 80 07 00     column PZN
       07 80 5d 00 00 00 80 05 00     column Hersteller
# eighth index:
000100 41 00 0c 00 02 00 00 00        index HerstPzn
       07 80 5d 00 00 00 80 05 00     column Hersteller
       02 80 0a 00 00 00 80 07 00     column PZN

# the name of the indexes defined:
       ff 45 56 50                                EVP
       ff 49 6e 64 31                             Ind1
       ff 48 41 50                                HAP
       ff 49 6e 64 32                             Ind2
       ff 50 5a 4e                                PZN
000130 ff 48 65 72 73 74 65 6c 6c 65 72           Hersteller
       ff 50 7a 6e 48 65 72 73 74                 PznHerst
       ff 48 65 72 73 74 50 7a 6e                 HerstPzn
       ff 00 00
000150 00

Looks confusing. It's not, let me try to explain some parts of this bunch of bytes. At position 6 of the frm-file we see the value 0x0056; this is an offset into the frm-file, and in the code this area is called disk_buff. The array disk_buff starts with the number of keys (= indexes) defined for our table: 8, followed by the number of columns used in these indexes: 14. A record_offset is computed and the result is 4014d (= 0xfae), which points into a part of the frm-file not shown here. The type of the engine and some details about this engine are stored there.
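
Just to read along with the dump, here is a trivial sketch (my own code, covering only the simple case described here and not the extended encoding hinted at in note 5) that picks the two counters from the start of disk_buff:
#include <cstdio>

int main()
{
  // the first bytes at file-offset 0x0056 of TestBig.frm, copied from the dump above
  unsigned char disk_buff[] = { 0x08, 0x0e, 0x00, 0x00, 0x35, 0x00 };

  unsigned keys      = disk_buff[0];          // 8 indexes defined on TestBig
  unsigned key_parts = disk_buff[1];          // 14 columns used in these indexes

  printf("keys = %u, key_parts = %u\n", keys, key_parts);
  return 0;
}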

Let's look into the array disk_buff (starting at file-offset 0x0056). The first bytes of disk_buff are skipped and the function create_key_infos()6) is called with the parameter strpos pointing at disk_buff[6]. This function iterates over the indexes; in pseudocode it looks like this:
  # iterate over the 8 indexes defined
  for (i=0 ; i < keys ; i++, keyinfo++)
  {
     interpret the next 8 bytes
     store the values in keyinfo
     # iterate over all columns this index is built from
     for (j=keyinfo->user_defined_key_parts ; j-- ; key_part++)
     {
         the next 9 bytes contain information about the column this index is built from
         they will be stored in keypart
     }
     keyinfo->key_part points to keypart
  }
The last step is to copy the names of the indexes.

Here is the data it operates on:
august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax --skip-bytes=0x113f  TestBig.frm
column #1: Id
00113f 00 00 00                               
       08 00                                  len of field
       02 00 00                               position of record within buffer
       00 80                                  flag
       00                                     unireg_type 
       00
       00                                     interval-no.
       fe                                     MYSQL_TYPE_STRING
       08 00
00114f 00 
column #2: PZN
       00 00 00 07 00 0a 00 00 00 80 00 00 00 fe 08 00 00 
column #3: EVP
       00 00 00 09 00 11 00 00 03 82 00 00 00 f6 08 00 00
column #4: HAP
       00 00 00 09 00 15 00 00 03 82 00 00 00 f6 08 00 00 
column #5: ArtikelBez
       00 00 00 28 00 19 00 00 00 80 00 00 00 0f 08 00 00 
column #6: ArtikelText
       00 00 00 1a 00 42 00 00 00 80 00 00 00 0f 08 00 00
column #7: Hersteller
       00 00 00 05 00 5d 00 00 00 80 00 00 00 fe 08 00 00
names of the columns:
       ff                                       delimiter
       49 64 ff                                 Id
       50 5a 4e ff                              PZN
       45 56 50 ff                              EVP
       48 41 50 ff                              HAP
       41 72 74 69 6b 65 6c 42 65 7a ff         ArtikelBez
       41 72 74 69 6b 65 6c 54 65 78 74 ff      ArtikelText
       48 65 72 73 74 65 6c 6c 65 72 ff         Hersteller
       00
0011e9

For this function the work is done and it returns.

Back in TABLE_SHARE::init_from_binary_frm_image() some work is done, like copying the names of the columns defined for this table, but this is not my topic here so I omit these details. Later it again iterates over all columns defined and calls the function make_field() to create a new object of type Field (or a derived type) for each column.

With the Field-objects constructed the code again iterates over all indexes defined and sets the bits in the var. key_start:
    for (uint key=0 ; key < keys ; key++,keyinfo++)                     // iterate over the indexes defined
    {
      .....
      for (i=0; i < key_parts; key_part++, i++)                         // iterate over the parts of this index
      {
        Field *field;
        ....
        field= key_part->field= share->field[key_part->fieldnr-1];      // get the Field-object for this part of the index
        ....
        if (i == 0)                                                     // if this is the first part of the index
          field->key_start.set_bit(key);                                // set the corresponding bit in key_start
        ....
      }
      ....
    }

By inspecting the indexes 3 (= Ind2), 5 (= Hersteller) and 7 (= HerstPzn) it sets the corresponding bits in the var. key_start of the Field-object representing the column Hersteller, as these indexes start with the column Hersteller. So the result will be the value 010101000b = 0xA8.
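
To check this result with a few lines of plain C++ (again only a sketch of mine; the first column of every index is hard-coded from the SHOW INDEX output of TestBig): setting one bit for every index whose first part is the column Hersteller yields exactly 0xA8:
#include <bitset>
#include <cstdio>
#include <cstring>

int main()
{
  // first column of each index of TestBig, in creation order (bit 0 = EVP, bit 1 = Ind1, ...)
  const char *first_part[8] = { "EVP", "PZN", "HAP", "Hersteller",
                                "PZN", "Hersteller", "PZN", "Hersteller" };

  std::bitset<8> key_start;                   // stand-in for Field::key_start of the column Hersteller
  for (int key = 0; key < 8; key++)
    if (strcmp(first_part[key], "Hersteller") == 0)
      key_start.set(key);                     // sets the bits 3, 5 and 7

  printf("key_start = 0x%lX\n", key_start.to_ulong());   // prints: key_start = 0xA8
  return 0;
}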

And this finishes my description.

correctness

This text is still a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

some notes:
1) the pointer join_tab points to the first of (in our case) 2 JOIN_TABs. You can look at JOIN->table_count for the correct number of tables.
2) these objects are created when you USE a database.
3) here is the definition taken from field.h:
 class Field
{
  ......
  /* Field is part of the following keys */
  key_map key_start, part_of_key, part_of_key_not_clustered;
  ....
};
4) I didn't want to document the structure of the frm-file, I only wanted to know where the value 0xA8 came from
5) up to 512 columns can be used here because one bit of the preceding byte can be borrowed.
6) it's in table.cc