Tuesday 1 September 2015

9 records


This text here belongs to the series that describes the handling of a SQL-statements by the optimizer when a table has multiple indexes on it. These are the older posts:
In this post I would like to continue this description and describe another tiny point omitted in the last post.


some hints

All posts in this series refer to MariaDB, the code-examples presented here are taken from the source-code of MariaDB 10.0.10, but the explanations should also be valid for MySQL.

The example uses MyISAM as the storage engine for the tables and indexes, most of the statements here are only valid for this engine.

Please do not generalize the results presented here. A lot of the data in this post depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers.


9 records

The very tiny point I want to describe in this post is the number 9 which occurred in the last post but wasn't explained there. In these places the number 9 appeared in the last post:
MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: Ind1,Ind2,PZN,Hersteller,PznHerst,HerstPzn
          key: PznHerst
      key_len: 14
          ref: TestOpt.A.PZN,const
         rows: 9
        Extra: Using where; Using index
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
You will find the number in the description how the second table (=TestBig) is accessed, it's marked in bold.

In the same post you will find the number in these lines:
(7) TestBig:
records_read: 9
read_time:    391,573
key:          6  = PznHerst

And the same number also appeared when the cost of the total plan is calculated:
(8)
current_record_count= 2,709,324 = 301,036 * 9
current_read_time   = 999,772  = 66,334 + 391,573 + 2,709,324/5

So we know it has some relationship with the table TestBig. But what does this number mean?


meaning

In the explain above you can find a hint: 9 rows. So the meaning of this number 9 is: by accessing the table TestBig via the index PznHerst the optimizer estimates to retrieve 9 records for every tuple (PZN, Hersteller) via this index.

And therefore this value is a cost-factor in the calculation of the extended plan, the plan that starts with the table TestSmall and extends this plan by the table TestBig using the index PznHerst.


where does this number come from

So the optimizer uses this number, but where does it get this number from1)?

A short answer will be: from the MYI-file:
august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01                         uchar file_version[4];
       00 03                               uchar options[2];
       03 28                               uchar header_length[2];
       00 b0                               uchar state_info_length[2];
       00 64                               uchar base_info_length[2];
       01 30                               uchar base_pos[2];
       00 0e                               uchar key_parts[2];                 /* Key parts */    
000010 00 00                               uchar unique_key_parts[2];          /* Key parts + unique parts */
       08                                  uchar keys;                         /* number of keys in file */
       00                                  uchar uniques;                      /* number of UNIQUE definitions */ 
       08                                  uchar language;                     /* Language for indexes */ 
       01                                  uchar max_block_size_index;         /* max keyblock size */ 
       00                                  uchar fulltext_keys; 
       00                                  uchar not_used;                     /* To align to 8 */ 

       00 00                               ulong update_count;                   /* Updated for each write lock */ 
       30                                  uint8 changed;                        /* Changed since myisamchk */ 
       ff                                  uint sortkey;                         /* sorted by this key (not used) */ 
       00 00 00 00 00 98 96 80             MI_STATUS_INFO:   ha_rows records;    /* Rows in table */ 
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   ha_rows del;        /* Removed rows */ 
       00 00 00 00 00 98 96 80             ha_rows split;                        /* number of split blocks */ 
       ff ff ff ff ff ff ff ff             my_off_t dellink;                     /* Link to next removed block */
       00 00 00 00 4e 2e cc 00             MI_STATUS_INFO:   my_off_t key_file_length; 
       00 00 00 00 33 a9 ba 30             MI_STATUS_INFO:   my_off_t data_file_length;    
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t empty;     /* lost space in datafile */
       00 00 00 00 00 00 00 00             MI_STATUS_INFO:   my_off_t key_empty; /* lost space in indexfile */
       00 00 00 00 00 00 00 00             ulonglong auto_increment; 
       00 00 00 00 00 00 00 00             MI_STATE_INFO:   ha_checksum checksum; 
       00 00 28 c6                         ulong process;                        /* process that updated table last */
000070 00 00 00 2e                         ulong unique;                         /* Unique number for this process */ 
       00 00 00 00                         ulong status; 
       00 00 00 01                         ulong update_count;                   /* Updated for each write lock */
# file-offsets in MYI-file 
       00 00 00 00 06 bf b0 00             my_off_t *key_root;                   /* Start of key trees */
                                              first node of index EVP
       00 00 00 00 0e fb cc 00                index Ind1
       00 00 00 00 15 bb 7c 00                index HAP
       00 00 00 00 25 c6 2c 00                index Ind2
       00 00 00 00 2e 4e 50 00                index PZN
       00 00 00 00 35 9e 00 00                index Hersteller
       00 00 00 00 41 e6 64 00                index PznHerst
       00 00 00 00 4e 2e c8 00                index HerstPzn 
#
       ff ff ff ff ff ff ff ff             my_off_t *key_del;                    /* delete links for trees */
       00 00 00 00                         ulong sec_index_changed;              /* Updated when new sec_index */ 
       00 00 00 00                         ulong sec_index_used;                 /* which extra index are in use */ 
       55 20 07 53                         ulong version;                        /* timestamp of create */
0000d0 00 00 00 00 00 00 00 ff             ulonglong key_map;                    /* Which keys are in use */ 
       00 00 00 00 55 20 07 53             time_t create_time;                   /* Time when created database */
0000e0 00 00 00 00 00 00 00 00             time_t recover_time;                  /* Time for last recover */
       00 00 00 00 55 20 09 1d             time_t check_time;                    /* Time for last check */
0000f0 00 00 00 00 00 98 96 80             my_off_t rec_per_key_rows;            /* Rows when calculating rec_per_key */
# for the parts of the indexes: the number of records: 
       00 00 03 71                         ulong rec_per_key_part[]
                                              index EVP, part EVP
       00 00 01 c0                            index Ind1, part PZN
000100 00 00 00 06                            index Ind1, parts PZN + ArtikelText 
       00 00 00 01                            index Ind1, parts PZN + ArtikelText + Hersteller
       00 00 03 d8                            index HAP, part HAP
       00 00 0f 22                            index Ind2, part Hersteller
000110 00 00 00 01                            index Ind2, parts Hersteller + ArtikelText
       00 00 00 01                            index Ind2, parts Hesteller + ArtikeklText + PZN
       00 00 01 c0                            index PZN, part PZN
       00 00 0f 22                            index Hersteller, part Hersteller
# --- this is the interesting part: ----------------------------------------------------
000120 00 00 01 c0                            index PznHerst, part PZN 
       00 00 00 09                            index PznHerst, parts PZN + Hersteller
# --------------------------------------------------------------------------------------
       00 00 0f 22                            index HerstPzn, part Hersteller
       00 00 00 09                            index HerstPzn, parts Hersteller + PZN
#
000130 .............
If you look into the file myisamdef.h you will find the typedef struct st_mi_state_info and I've added the variables of this structure as a description together with some explanations next to the hex-values taken from the MYI-file.

When you scroll down this list to the end you will see the number 9 and this is exactly the number 9 I'm writing about.

We are here in the area of statistics, these numbers are statistics about the distribution of records (=keys) in the index. There are more statistics but I want to focus on the number 9.

Let's look at the last group of lines in the MYI-file above and start with the first entry. As the first index is named EVP and created on one column (=EVP) only so you will see only one number in the list above: 0x0371 (=881d). And therefore the optimizer expects to find 881 records on average2) when accessing this index with a value.
Next there are 3 numbers which belong to the index Ind1, the next index in the list of indexes defined on this table. As before the fist number 0x01C0 (=448d) describes the estimated number of records when accessing this index via a tuple (PZN), the second number 0x06 tells that it expects 6 records to read when it accesses the index with a tuple (PZN, ArtikelText), again an incomplete access to the index. And the 3rd number is 0x01, so when you access this index with a tuple (PZN, Artikeltext, Hersteller) you will find only one record. Please keep in mind that these numbers are estimations as exact numbers are not available.
I want to omit the other entries so I describe only the entries for the index PznHerst as this index is used in the execution of our statement. The first number is 0x01C0 (=448d), so for accessing this index with a tuple (PZN) we can expect to read 448 records on average. The next number is (drum-rolling in the background and a short pause to increase the suspense) 0x09 (end of drum-rolling) and this is the number of interest in this post. So for searching this index with a tuple (PZN, Hersteller) we expect to read 9 records.

Done. No not, how is the number 9 computed?


querying the table

So let's look at the data and check whether these number are correct:
MariaDB [TestOpt]> select count( distinct Pzn) from TestBig;
+----------------------+
| count( distinct Pzn) |
+----------------------+
|                22298 |
+----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select count(distinct Hersteller) from TestBig;
+----------------------------+
| count(distinct Hersteller) |
+----------------------------+
|                       2581 |
+----------------------------+
1 row in set (1.82 sec)

MariaDB [TestOpt]> select count( distinct Pzn, Hersteller) from TestBig;
+----------------------------------+
| count( distinct Pzn, Hersteller) |
+----------------------------------+
|                          1158300 |
+----------------------------------+
1 row in set (24.39 sec)

MariaDB [TestOpt]> 

The first query tells us that we will find 22,298 different values in the column PZN. Keep in mind that we have 10 mio records in our table, so 10,000,000 / 22,298 = 4483) et voila we have the first of our numbers. If we assume that our entries for PZN are uniformly distributed so we will have 448 records in the table for each value in the column PZN. This is unrealistic, some values will occur more frequently then others but we are making guesses and as long as we don't have exact numbers we will stick with the value 448.

So the value 9 is left. The second query doesn't help, but the third query is useful. Let's take the result from the 3rd query and compute again: 10,000,000 / 1,158,300 = 94) and this is our number! In the 10 mio. records in the table TestBig we assume to get 9 record if we query the table using the index PznHerst with a tuple consisting of a value for PZN and Hersteller.

That's it.

But wehre do these numbers come from?


repair

Let's use an operation that recalculates these numbers:
MariaDB [TestOpt]> repair table TestBig;
+-----------------+--------+----------+----------+
| Table           | Op     | Msg_type | Msg_text |
+-----------------+--------+----------+----------+
| TestOpt.TestBig | repair | status   | OK       |
+-----------------+--------+----------+----------+
1 row in set (7 min 19.10 sec)

MariaDB [TestOpt]> 
I've used this operation because it can be done multiple times without affecting the contents of the table. There are other operations that too compute these statistics.

In general the algorithm behind her repair-operation is: it copies the data from the MYD-file to a temporary file. For every index it sorts the contents (maybe sorts them in parts and merges the parts to a single stream), counts them, computes the numbers given above and finally writes them into the MYI-file. It deletes the MYD-file and renames the temp-file to TestBig.MYD.

Let's look a bit deeper into the code. Here is a simplified function-hierarchy:
ha_myisam::repair()
    mi_repair_by_sort()
        _create_index_by_sort()
            find_all_keys()
                sort_key_read()
                    sort_get_next_record()
                        _mi_read_cache()
                            _my_b_read()
                                inline_mysql_file_read()
                        _mi_rec_check()
                        _mi_make_key()
            merge_index()
                merge_buffers()
                    sort_key_write()
                        ha_key_cmp()
        update_key_parts()
For our case I'm mostly interested in the functions sort_key_write() and update_key_parts().

In sort_key_write() this variable is defined:
static int sort_key_write(MI_SORT_PARAM *sort_param, const void *a)
{
  uint diff_pos[2];
  .....
This little array is used in the called function ha_key_cmp(). And there this array is used in this form:
    diff_pos[0]    number of first keypart where values differ, counting from one.
    diff_pos[1]    points to first value in tuple b that is different from corresponding value in tuple a.
Let's look at all possible case and look at the results returned in the var. diff_pos. In the function ha_key_cmp() we compare a tuple like (a,b) with another tuple (c,d) . These are possible results:
comparing (a,b) and (c,b)
case #1: a != c
         set diff_pos to (1, 0)
case #2: a == c, b != d
         set diff_pos to (2,8)
case #3: (a,b) == (c,d)
         set diff_pos to (3,14)
as our index is defined on (PZN, Hersteller) no other cases are possible,

In the calling function sort_key_write() the array diff_pos is interpreted and used for counting:
    sort_param->unique[diff_pos[0]-1]++;
This statement increases the counter, it counts how many sequences of identical records are found.

If the comparisons and countings are done I found these numbers in the var. sort_param.unique:
sort_param.unique[0] = 22297               [0] + 1 --> 22298
sort_param.unique[1] = 1136002             [0] + [1] + 1 --> 1158300 
sort_param.unique[2] = 8841700             [0] + [1] + [2] + 1 --> 10000000
rest  of list: 0
You can see that the numbers counted are 1 less then the numbers found via the SQL-statement above. And they add up to the number of records in the table.

In the function update_key_parts() these numbers are used in the computation:
  for (parts=0 ; parts < keyinfo->keysegs  ; parts++)
  {
    count+=unique[parts];
    unique_tuples= count + 1;    
    ......
    else
      tmp= (tuples + unique_tuples/2) / unique_tuples;
Let's do these computations by ourselves:
The var. tuples contains the number of records in the table: 10,000,000.
In the first iteration of the loop the var. unique_tuples contains the value 22,298. The computation results in the value 448 for the var. tmp, and this is the first value you've seen above.
Now we take the second iteration: the var. unique_tuples contains the value 1,158,300 and so the result of the computation is 9.

Done. That's the value this post started with. Some lines later it will be written to the MYI-file.


correctness

This text is a simplified presentation of what's going on in the software. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.



some notes:
1) and the numbers for the other indexes too
2) the number is correct if one assumes evenly distribution of keys, but what else can we assume?
3) these numbers are rounded, up or down
4) more precise: 8.63