an unsystematic journey through the source-code of MySQL and MariaDB: October 2014

Wednesday, 22 October 2014

JOIN

By playing with an SQL-statement two questions arised. After diving into the code I was able to answer at least one question and I would like to show my results here. The question was: how does the database-server handle a JOIN? The other question was: among all possible ways for handling this query why does the database-server choose this way? But this is another story.

Let's come back to the first question and look into the results of my observations.

MyISAM

A lot of my statements here are specific for the MyISAM-engine because the tables used for the tests are of type MyISAM. Other storage-engines can handle things differently so don't apply the results presented here schematically to other engines.

And as a second argument a lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets etc.).

preparations

Let's create some tables, put some data into them and start with the test.

MariaDB [(none)]> create database TestOpt;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestOpt;
Database changed
MariaDB [TestOpt]> create table TestBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> create table TestSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (2.21 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (1 min 12.97 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestSmall(PZN);
Query OK, 301036 rows affected (2.08 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 28.40 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]>

And this is the statement I would like to inspect:

MariaDB [TestOpt]>  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.41 sec)

MariaDB [TestOpt]>

This statement is silly, it has no special meaning except to be used for this demonstration here.

Let the database-server tell us how it handles this statement:

MariaDB [TestOpt]> explain  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 6
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

Looks like it takes the table TestSmall (aka A) to scans all records in it, applies WHERE to the records found and for those records meeting the WHERE-condition it searches the corresponding record(s) from TestBig (aka B) via the index. It then puts everything together and presents the result.

code

To look further into the handling of this statement I used the modifications presented here and counted the number of calls of some routines. Here is the result:

select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36035';
table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 311

The assumption above seems to be correct: the table TestSmall has approx. 300K records in it as you can see from the INSERT-statement above, these are read via a table-scan (counted by the variable counterTableScan).
For the table TestBig no table-scan is made (var. counterTableScan is 0) but instead index-related-functions are called, only 311times these functions are called. As this table contains 10mio records this is a wise decision to check only this small amount of records.

Let's inspect the data in these two tables:

MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.38 sec)

MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
|   461016 |
+----------+
1 row in set (12.80 sec)

MariaDB [TestOpt]>

In the table TestSmall we have 60 records, in the table TestBig we have approx. 460K records fulfilling the WHERE-clause.

scanning TestSmall

There are three functions in the storage-engine for handling a table-scan:

rnd_init() is called once for starting a table-scan
rnd_next() returns one record. This function is called again and again until all records are read. In this case it returns MHA_ERR_END_OF_FILE.
rnd_end() is called once after all records are scanned

Using these functions the table TestSmall is read. Record after record is fetched and the column Hersteller is checked. After 2100 records a record with Hersteller = '00020' is encountered.

MariaDB [TestOpt]> select SQL_NO_CACHE * from TestSmall where Hersteller = '00020' limit 1;
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN    | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 178324 | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.06 sec)

MariaDB [TestOpt]>

This is the first record with Hersteller = '00020' in TestSmall. In this record the value for PZN is '173824'. Let's look into TestBig for records with this value in the column PZN:

MariaDB [TestOpt]> select SQL_NO_CACHE * from TestBig where PZN = '178324' limit 3;
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| 01046189 | 178324 |  10.45 | 7.99 | MARLENE MS K1 AD K BLACK M      2 ST | BASILIKUM AETHERISCHES OEL | 20860      |
| 01089558 | 178324 | 141.72 | 0.00 | HEVERTOPLEX NR148N POPULUS     30 ML | NOVO HELISEN DEP BIR/ER/HA | 01250      |
| 01240196 | 178324 |  17.90 | 5.22 | BIOCHEMIE 10 NATR SULF D12     80 ST | SUPRIMA BETTUN M EG 50X70  | 36245      |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
3 rows in set (0.08 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestBig where PZN = '178324';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestOpt]>

As you can see there are 60 records in it and I presented only the first 3 records.

TestBig via index

There are a lot of functions for accessing an index, their names start with index_xxxxx(). But only two of them are of interest for our test-case:

index_read_map() starts the search. It looks for the first occurrence of a key (in var. key) in the index. If this key is found it returns the corresponding record via the var. buf, otherwise HA_ERR_KEY_NOT_FOUND is returned.
index_next_same() searches for the next entry in the index and returns the corresponding record in the var. buf. If there is no entry identical to the key given it returns HA_ERR_END_OF_FILE.

But we need more information: how is the index organized?

BTREE

MyISAM stores all index-related data in the file with the extension MYI. In our test-case only one index is created but the following statements are also valid in the case of multiple indexes.

The index in MyISAM is of type BTREE, you will find a description of this here. And you will find the description of the structure of the MYI-file here: 20.2 The .MYI File.

tree

Let's dive a bit deeper into the handling of an index and into our test-case.
The tree is organized in pages of size 1K. There are 2 types of pages: nodes and leafs. A node contains one ore more entries with this structure:

a pointer to another node or leaf containing entries with keys less than the following key (5 bytes)
the key = the value of the column of one record (1+7 bytes)
the offset in MYD of the corresponding record (6 bytes)

After the last entry in a node follow a pointer (5 bytes) that contains the number of a page with key-values greater than the last key in this node.

An entry conists of 19 bytes so up to 53 entries can be found in a node ( because 2 + 53*19 + 5 < 1024).

For a leaf the structure differs a bit:

the key = the value of the column(s) which build the index
the offset in MYD of the corresponding record

A leaf is the end of a path in the tree, nodes occur on the way to the leaf. Or said differently: nodes have successors (nodes or leafs), leafs do not. This explains why a leaf contains no references to other pages. As an entry in a leaf is 14 bytes long up to 73 entries fit into a page.

In MyISAM you can easily detect if a page is a node: in the first byte of the page the MSB ist set (so the first byte is >= 0x80).

You can find a nice picture demonstrating this in the Wikipedia-article.

I will give you examples for this structure soon. As written before these examples are taken from my tests and depend on my data.

unique

In the tree the entries are unique and they are sorted, usually ascending.

In the index the entries are unique meaning that for every record in the table there is one entry in the index. As the entry contains the file-offset from the data-file the uniqueness is guaranteed when you look at the full 14 bytes of an entry. This has nothing to do with the statement CREATE UNIQUE INDEX which guarantees uniqueness in the first 8 bytes (in our test-case).

function-calls

Coming back to the query: a record is found in TestSmall, the PZN is extracted and has the value '178324'. Now the server switches over to the table TestBig and searches for records with this value in the column PZN. As TestBig has an index on the column PZN it will use it. Here is the function-hierarchy:

ha_myisam::index_read_map(() 
    mi_rkey()
        _mi_search() with pos = 0x08859800 (aka: root of tree)
            _mi_search() with pos = 0x5a5400
                _mi_search() with pos = 0x30f800
                    _mi_search() with pos = 0x305000
        (*info->read_record) calls _mi_read_dynamic_record()

index_read_map() is the function the database-server calls for reading the first record to the given PZN-value. From the moment of entering this function up to the moment this function is left everything is the responsibility of the storage-engine. The PZN-value to search for is given in the parameter key and in our case it looks like this: 0x00 (a flag) + '178324 '. As the column PZN is created with a width of 7 bytes the key also contains a trailing space. So the length for the comparisons will be 8. index_read_map() calls mi_rkey() for further processing.

mi_rkey() now searches through the tree and it calls the function _mi_search() to do the walk-through recursively. It starts the search at 0x08859800 which is a file-offset in the MYI-file. As written before this is valid for my test-case, your data may differ. One can find the start of the tree (=the root) in the MYI-file (output is shortened):

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00    --> 0x08859800

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01 00 03 01 7e 00 b0 00 64 00 c4 00 01
000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00
000020 00 98 96 80 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 98 96 80 ff ff ff ff ff ff ff ff 00 00 00 00
000040 08 85 9c 00 00 00 00 00 33 a9 ba 30 00 00 00 00
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 53 36
000070 00 00 00 28 00 00 00 00 00 00 00 01 00 00 00 00      the offset of the root of the tree
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00      (8 bytes)
000090 00 00 00 00 54 32 ad c8 00 00 00 00 00 00 00 01
       ........

As I've created only one index there is only one entry for the file-offset, there can be multiple offsets. You can also find this value at runtime in the var. info->s->state.key_root[0].

Here is what I found in the MYI-file in the root of the tree (a bit shortened):

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x08859800 --read-bytes=1024 TestBig.MYI
8859800 83 97                               this is a node: 0x80 is set
                                            this page is filled up to 0x0397 bytes

        00 00 00 0b 66                      page-number of another node/leaf
        01 31 37 32 33 31 30 20             PZN = 0x01 + '172310 '
        00 00 0e c1 4a 38                   offset in MYD for this record

        00 00 00 16 95                      page 0x1695 --> offset = 0x1695 * 0x400 = 0x5A5400
        01 32 34 38 32 39 35 20             PZN = 0x01 + '248295 '
        00 00 12 20 9b 6c                   offset in MYD-for this record
........................

We are looking for a value of 0x01 (changed already from 0x00) + '178324 ' and found that it is between the first and the second entry in this page. So from the second entry we take the page-number 0x1695, multiply it with the length of a page (=0x400) and get a new offset: 0x5A5400. Let's look what we will find in MYI at this place:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x5A5400 --read-bytes=1024 TestBig.MYI
5a5400 83 e3                                this is a node
                                            this page contains up to 0x03e3 bytes

       00 00 00 0b 65                       
       01 31 37 33 35 34 35 20              PZN = 0x01 + '173545 '
       00 00 0e 7d 12 9c 

       00 00 00 0b 9c 
       01 31 37 34 37 38 36 20              PZN = 0x01 + '174786 '
       00 00 2b ce 20 40 

       00 00 00 0b d2 
       01 31 37 36 32 35 39 20              PZN = 0x01 + '176259 '
       00 00 2a 32 2a f8 

       00 00 00 0c 08
5a5440 01 31 37 37 39 38 30 20              PZN = 0x01 + '177980 '
       00 00 2b 12 d9 d0                    

       00 00 00 0c 3e                       page 0x0c3e --> file offset = 0x0c3e * 0x400 = 0x30f800
       01 31 37 39 32 30 30 20              PZN = 0x01 + '179200 '
       00 00 11 bc 9b 24 
.........................

Again the software walks through this page comparing the keys. The first key greater than our search key is the fifth entry so we continue our search at 0x30f800. This is what we find there:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x30F800 --read-bytes=1024 TestBig.MYI
30f800 83 e3                                another node

       00 00 00 0c 07 
       01 31 37 38 30 30 35 20              PZN = 0x01 + '178005 '
       00 00 31 10 b1 74 
.................
30f8c0 00 00 00 0c 12 
       01 31 37 38 32 36 34 20              PZN = 0x01 + '178264 '
       00 00 01 a1 a8 b4 

       00 00 00 0c 13 
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
30f8e0 00 00 09 7d 8d f0 

       00 00 00 0c 14                       page = 0x0c14 --> offset 0x305000
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '    
       00 00 0d 23 d0 60                    file-offset = 0x0d23d060

       00 00 00 0c 15                       page = 0x0c15 --> fiile-offset = 0x305400
       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 ' 
       00 00 12 c9 f6 9c 
..........................

Again the software compares the keys and stops when it finds an entry greater or equal to our search-key. As we are interested in the first record with our key we continue our search at file-offset 0x305000:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305000 --read-bytes=1024 TestBig.MYI
305000 03 f2                                this is a leaf!

       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 09 fd 3b 88 

305010 01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 0a 9f ea 2c 
.........
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 33 58 5e e4 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '       
3052f0 00 00 00 3d 17 c4                    file-offset = 0x003d17c4

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '
       00 00 00 76 74 20                    file-offset = 0x00767420
.........

Starting the comparison from the first entry in this page the code in _mi_search() finds an entry (marked in bold above). As this is a leaf we can stop here and return (and return and return and return, as this was a recursion).
The record can be read from the MYD-file at offset 0x003d17c4:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x0000003d17c4 --read-bytes=100 TestBig.MYD
3d17c4 03 
       00 5e                                len record = 94d
       02                                   no. of filler-bytes
       00 80                                flags
       30 31 30 34 36 31 38 39              Id = '01046189'
       31 37 38 33 32 34 20                 PZN = '178324 '
       80 00 0a 2d                          EVP = 10 + 45 (0x0a, 0x2d) 
       80 00 07 63                          HAP =  7 + 99 (0x07, 0x63)
       24 4d 41 52 4c 45 4e 45              ArtikelBez (len 0x24)
       20 4d 53 20 4b 31 20 41 
       44 20 4b 20 42 4c 41 43 
       4b 20 4d 20 20 20 20 20 
       20 32 20 53 54 
       1a 42 41 53 49 4c 49 4b              ArtikelText (len 0x1a)
       55 4d 20 41 45 54 48 45 
       52 49 53 43 48 45 53 20 
       4f 45 4c 
       32 30 38 36 30                       Hersteller = 20860'
       00 00                                filler-bytes

You can find an explanation of this structure in my post with the title od.

Finito, in TestBig we got the first record which is given back to the database-server in the var. buf. Some internal values are stored, among others the entry found in MYI is kept for further processing (the key + offset, now 14 bytes).

Record #1 is found but it does not match to the condition of the WHERE-clause. Nevertheless step 1 is done.

So let's go on. The database-server continues to search for more records in TestBig for the record from TestSmall. Here is the function-hierarchy for the next step(s):

ha_myisam::index_next_same()
    mi_rnext_same()
        _mi_search_next()
             check page: info->int_keypos < info->int_maxpos
               --> fetch next entry from this page
        compare the first 8 bytes of this entry
        identical: (*info->read_record) calls _mi_read_dynamic_record()

Now the entry-point into the storage-engine is the function index_next_same(). As the name suggests it searches for the next entry in the tree with the same key (8 bytes). So the last page was checked and as it contains more entries so the next entry is taken from this page.

This step will be repeated until a key is found greater than the search-key. Here this happens:

august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305400 --read-bytes=1024 TestBig.MYI
305400 03 f2                                this is a leaf
.....................
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #59
       00 00 32 52 38 f0 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #60
       00 00 32 6c 11 30                    file-offset = 0x326c1130

       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 '        greater then '178324 '
       00 00 00 6c b5 0c                    file-offset = 0x6cb50c
.....................

After record #60 the entry with the key '178353 ' is returned but the function mi_rnext_same() detects that this value differs from the previous one and returns HA_ERR_END_OF_FILE to index_next_same() which returns this value to the database-server.
The server now knows that there are no more records with the key '178324 ' in the table TestBig. It switches over to the table TestSmall and continues with the table-scan. For the next record from TestSmall with a value of Hersteller = '00020' it switches over to TestBig and calls ha_myisam::index_read_map(() giving it as a parameter the value of PZN from the then current record from TestSmall.

more explanations
I'm sure you've detected the fact that in my last block of entries I silently switched from the page at offset 0x305000 to the page at offset 0x305400.
In this example above we scanned the index for entries '178324 ' and the SQL-statement somewhere in the beginning of this post showed that there are 60 records in TestBig with this PZN-value so there must be 60 entries with this value in the tree too. These entries are spread over 2 leafs and one node: 19 entries can bei found at the page (leaf) at offset 0x305000, 40 entries can be found in the page (leaf) at offset 0x305400 and one entry is found in the page (node) at offset 0x30F800.
As shown it's easy to continue the search within the current page but what does the storage-engine do when the end of a page is reached?
It does this:

ha_myisam::index_next_same()()
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            info->int_keypos >= info->int_maxpos 
                --> reached the end of this page --> search from root of the tree
            _mi_search() with pos = 0x8859800 (= root)
                _mi_fetch_keypage(): root-page (from cache)
                _mi_seq_search() in this page
                    stops at 2nd entry, returns 1
                _mi_search() with pos = 0x5a5400
                    _mi_fetch_keypage() (from cache)
                    _mi_seq_search()
                        stops at 5th entry, returns 1
                    _mi_search() with pos = 0x30f800
                        _mi_fetch_keypage() (from cache)
                        _mi_seq_search()
                            stops at 12th entry, returns 1
                        _mi_search() with pos = 0x305000
                            _mi_fetch_keypage() (from cache)
                            _mi_seq_search()
                                returns -1 (did find only smaller entries)
                            _mi_search(() with -1
                                end of search: continue search one level up
                    _mi_fetch_keypage(): 0x30f800 (from cache)
                    _mi_get_pack_key()
          check key found
          read record at 0xd23d060

As you can see _mi_search() is called recursively to find an entry in a leaf. And in this case there is no entry in a leaf which is greater than the last key (comparing the full 14 bytes) and smaller than the entries in nodes so it returns 2 levels up and takes the entry from the node at offset 0x30f800. It returns with a pointer to this entry to the function _mi_search_next() where the key is compared against the search-key (comparing the first 8 bytes) and as these are identical the offset (the last 6 bytes of this entry) is taken and the record is read from the MYD-file.

In the next call of index_next_same() it works like this:

ha_myisam::index_next_same() 
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            uses the last buffer because: info->buff is 0 --> unchanged 
            _mi_search() with pos = 0x305400
                _mi_fetch_keypage()     (read from disk via my_pread())
                _mi_seq_search() 
                    returns +1
                _mi_search() with pos = -1
                     returns with +1 (search at upper level)
        read record at 0xd7e0b58

As the current buffer is a node it takes the page-offset contained in the last entry used. The page with this number (a node or a leaf) will contain entries with key-values greater than the current entry. So with some extra steps the first entry from this page is taken and the corresponding record is read.

Next calls to index_next_same() will take the following entries from this page until an entry is found which has a different key. In this case _mi_search_next() returns a value HA_ERR_END_OF_FILE. The server now switches over to TestSmall as already described above.

correctness

This text is still a simplified presentation of what's going on in the software. More conflicts can arise but the code handles these situations. And there are also some optimizations build into the code that I didn't describe (this text is already a bit longer than planned).

In case something is wrong with my description please use the mail-function of this side and send me a mail describing my error. I will look at it. Thanks.

Wednesday, 1 October 2014

table-scan or index-scan

observation

in one of the tests I made some time ago I made the observation that adding an index to an existing table slowed down a SELECT-statement. That's something I didn't expect so I started looking into this effect. And I would like to present my results so far here.

prerequisites

Instead of creating and dropping the index over and over again I created 2 tables of identical structure and put the same data in these. So I created the tables TestWitoutI (table without index) and TestWithI (table with index). And here they are:

 MariaDB [TestCol]> show create table TestWithI\G
*************************** 1. row ***************************
       Table: TestWithI
Create Table: CREATE TABLE `TestWithI` (
  `Id` char(8) DEFAULT NULL,
  `PZN` char(7) DEFAULT NULL,
  `EVP` char(8) DEFAULT NULL,
  `HAP` char(8) DEFAULT NULL,
  `ArtikelBez` char(40) DEFAULT NULL,
  `ArtikelText` char(26) DEFAULT NULL,
  `Hersteller` char(5) DEFAULT NULL,
  KEY `Hersteller` (`Hersteller`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

MariaDB [TestCol]> show create table TestWithoutI\G
*************************** 1. row ***************************
       Table: TestWithoutI
Create Table: CREATE TABLE `TestWithoutI` (
  `Id` char(8) DEFAULT NULL,
  `PZN` char(7) DEFAULT NULL,
  `EVP` char(8) DEFAULT NULL,
  `HAP` char(8) DEFAULT NULL,
  `ArtikelBez` char(40) DEFAULT NULL,
  `ArtikelText` char(26) DEFAULT NULL,
  `Hersteller` char(5) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

MariaDB [TestCol]>

the observation

and here are the results of my test. At first the version without any index:

 MariaDB [TestCol]> show index from TestWithoutI;
Empty set (0.00 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (10.74 sec)

MariaDB [TestCol]>

And now the same test but the table contains an index on one column (this column does appear in the WHERE-clause of the query):

 MariaDB [TestCol]> show index from TestWithI;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestWithI |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (18.93 sec)

MariaDB [TestCol]>

Surprise: on the table TestWitoutI the SELECT-statement took approx. 10 seconds, on the table TestWithI it took approx. 18 seconds.

suspicion

Could it be that in the case of the table TetWithI the database-server prefers an index-scan over a table-scan? Let the server explain what it did:

 MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithoutI
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
1 row in set (0.00 sec)

MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithI where Hersteller <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithI
         type: index
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: NULL
         rows: 10000000
        Extra: Using where; Using index
1 row in set (0.01 sec)

MariaDB [TestCol]>

Indeed it chooses the index-scan over the table-scan. To be absolutely sure we will look into the code to verify this.

modifying code

Let's simply count the number of read-accesses of the table and the index-tree.

Therefore we need:

2 counters
initialize them
count every time the next record or the next entry from the index-tree is read
and finally output the values

The tables for testing this behaviour are of type MyISAM so we have to modify the files ha_myisam.h and ha_myisam.cc, you will find both in the directory storage.
In the file ha_myisam.h please add this line at the end of the definition of the class. It should now look like this:

 
  int counterTableScan, counterIndexScan;

};

Now in ha_myiam.cc please add these lines in the functions as described:

 in the function ha_myisam::index_init():
    counterTableScan = counterIndexScan = 0;

in the function ha_myisam::rnd_init():
    counterTableScan = counterIndexScan = 0;


in the function ha_myisam::rnd_next():
    counterTableScan++;

in the function ha_myisam::index_next():
    counterIndexScan++;


in the function ha_myisam::index_end():
  fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\n\n", table_share->table_name.str, counterTableScan, counterIndexScan);

in the function ha_myisam::rnd_end():
  fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\n\n", table_share->table_name.str, counterTableScan, counterIndexScan);

Add these lines, compile your server, start it and you are ready for the tests.

As you can see the counters are set to zero, incremented and finally written to stderr (you will find the output in the error-log which is the file /var/log/mysql/error.log in my installation).
table-scan: the initialisation-function is rnd_init(). Every time a row is fetched the function rnd_next() is called. And finally the scan ends with a call of rnd_end().
index-scan: things are not so easy here because there are a lot more functions for handling this. But our case is easy so we can live with these functions: index_init() for initialization, index_next() for one entry fetched and index_end() for termination.

result

Let' do our tests. The results are the same as above but in the log-file we find these lines:

 
table = <TestWithoutI> counter table-scan: 10000001 index-scan: 0
table = <TestWithI> counter table-scan: 0 index-scan: 10000000

So our suspicion is verified¹⁾: for the table TestWithoutI a table-scan is made because no other way for accessing the data is available. In the case of the table TestWithI two ways are possible: use a full table-scan or go via the index Hersteller. And in this case the way via the index is chosen.

optimizer

In the handling of a SELECT-statement the optimizer is the part responsible for finding a good way for accessing the data. So let's step away from any further testing or playing with code and look at the sources.
It would help us if we can find an introductory text about the optimizer used in MySQL/MariaDB. Unfortunately I didn't find much except this book: here and here. You will find a short introduction into this topic on page 170 of Sasha's book. The book was published in 2007 so keep in mind that the code in MySQL/MariaDB has evolved in the meantime but the book is still helpful (and this is valid not only for this topic).

The optimizer works by finding all possible paths to answer the query²⁾. For every possible path it calculates a number which we can interpret as the 'cost' for this path and finally it compare these numbers to choose the path with the lowest 'cost'. Don't take the 'cost' literally, it's just a number representing the time and the resources needed for handling the query. The more time and resources it needs the higher the number has to be, the numbers have no other meanings.

In the file handler.h you will find these functions for calculating 'costs':

   virtual double scan_time()
  { return ulonglong2double(stats.data_file_length) / IO_SIZE + 2; }

  /**
     The cost of reading a set of ranges from the table using an index
     to access it.
     
     @param index  The index number.
     @param ranges The number of ranges to be read.
     @param rows   Total number of rows to be read.
     
     This method can be used to calculate the total cost of scanning a table
     using an index by calling it using read_time(index, 1, table_size).
  */
  virtual double read_time(uint index, uint ranges, ha_rows rows)
  { return rows2double(ranges+rows); }

  /**
    Calculate cost of 'keyread' scan for given index and number of records.

     @param index    index to read
     @param ranges   #of ranges to read
     @param rows     #of records to read
  */
  virtual double keyread_time(uint index, uint ranges, ha_rows rows);

and in handler.cc:

double handler::keyread_time(uint index, uint ranges, ha_rows rows)
{
  /*
    It is assumed that we will read trough the whole key range and that all
    key blocks are half full (normally things are much better). It is also
    assumed that each time we read the next key from the index, the handler
    performs a random seek, thus the cost is proportional to the number of
    blocks read. This model does not take into account clustered indexes -
    engines that support that (e.g. InnoDB) may want to overwrite this method.
    The model counts in the time to read index entries from cache.
  */
  ulong len= table->key_info[index].key_length + ref_length;
  if (index == table->s->primary_key && table->file->primary_key_is_clustered())
    len= table->s->stored_rec_length;
  double keys_per_block= (stats.block_size/2.0/len+1);
  return (rows + keys_per_block-1)/ keys_per_block +
         len*rows/(stats.block_size+1)/TIME_FOR_COMPARE ;
}

These are the basic functions valid for all storage-engines. If the writer of a new storage-engine has a different opinion then he must overwrite these functions in his derived class.

Back to MyISAM: let me describe these functions as they were used in my test-case:
read_time(): in my tests this function was never called so I omit this one here.
scan_time(): estimating a value for doing a full table-scan on the whole file. In my tests the size of the datafile is about 1GB, IO_SIZE is defined as 4K, so the result was the value 251466.84375. Take it as it is.
keyread_time(): does some computations on the structure of the MYI-file³⁾. the variable len is calculated ass 13 ( = 6 + 7), a leaf on the BTREE-index has the size of 1K so the var. keys_per_block gets the value 40.384615384615387 (this says that there are 40 entries in a block when a block is half filled). In the next step 2 values are computed, added and returned. The first value computes the (estimated) number of blocks in the index by dividing the number of rows in the table by the number of entries in one block; this gives a value of 247620.02285714282. The second value is computed as the len of all entries of the column Hersteller (= 10mio * 13) divided by the size of an index-block, this value is divided by 5: the result was 25365. Add these 2 number and you will get a return value of 272985.02285714285. Take these number as given and do not forget that these numbers are valid only for my test-case.

As you can see: 272985 > 251466 so the 'cost' of an index-scan is estimated to be higher than the 'cost' of a table-scan.

So why does the server choose an index-scan?

overwriting the result

As Sasha described in his book the server breaks a statement into pieces called JOINs. In my test-case a simple statement was used so there is only one JOIN (a degenerated one). You may look into the file sql/sql_select.cc to see how the server handles this statement. There is a structure for handling this: JOIN_TAB (in sql/sql_select.h). This struct contains a member-variable read_first_record which was set with the address of the the function join_init_read_record(). This value will be overwritten later in the function make_join_readinfo() by these lines:

       tab->index=find_shortest_key(table, & table->covering_keys);  
       tab->read_first_record= join_read_first;                          // overwrites: join_init_read_record
       /* Read with index_first / index_next */
       tab->type= tab->type == JT_ALL ? JT_NEXT : JT_HASH_NEXT;

The comment is not included in the original source, I've added it.

The old value of the member-var. was the address of the function join_init_read_record(), a function that starts a full table-scan. The new value is the address of the function join_read_first()., this function does the same thing but for an index-scan.

If you look in the code above the place where the member-var. read_first_record is overwritten you see this statement in an if()-construct: else if ( !table->covering_keys.is_clear_all() &&
covering_keys is a bitmap that contains a list of usable indexes. To simplify it: if this bitmap is not zero then an index exists with usable information so the optimizer decidess to use it. And in this test-case an index on the column Hersteller exists and this column is also referenced in the WHERE-clause.

So the decision is: if there is an index and it contains usable information then prefer the index over a table-scan.

To verify this you can simply change the WHERE-clause of the SQL-statement so that it uses a different column for the comparison, like: WHERE PZN <> 'ABCDE'
You will see that the optimizer chooses a table-scan.

 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where PZN <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (12.69 sec)

MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithI where PZN <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithI
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
1 row in set (0.01 sec)

MariaDB [TestCol]>

judgment

Is the decision to prefer an index-scan over a table-scan correct? I think so.

Let me show you this:

 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller = 'ABCDE';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller = 'ABCDE';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (8.65 sec)

MariaDB [TestCol]>

This example clearly showed the effect of using an index. The difference between the example in the intro of this post and this one are the number of records inspected. The more entries of the index-tree the software has to fetch the more time it needs for the work.

Let me give you another example:

 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller = '00020'; 
+----------+
| count(*) |
+----------+
|     1903 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|     1903 |
+----------+
1 row in set (7.90 sec)

MariaDB [TestCol]>

Nearly 2000 records (out of 10 mio) are found and this was done fast by using an index. In the example of the test for equality no record was found in the tree Our example of testing for a not-existing value is an extreme example, it's not useful as a counter-example because this test should happen only infrequently.

criticisms

By playing with the code and my examples I found that some functions were called mulitple times and they are called with identical parameters. This is unnecessary.

scan_time(): this function was called 3times (SELECT from TestWithI) or twice (SELECT from TestWithoutI) and in all cases the function did the same calculations returning the same value. To be fair: the computations were simple.

keyread_time(): this function was only called in the case of the SELECT from TestWithI. It contains a paramater ranges that is not used in the computations but contained different values in the calls. This function was called 4times, returning the same value in each call.

a personal word: this looks like I a harsh criticism. But these things happen, nevertheless this is unnecessary. It should be cleaned someday.
If you have a chance to look at code I wrote I'm sure you can find similar effects. Please be lenient toward me.

some notes:

1) in case that you've seen a difference of 1 in these numbers: I've forgotten to count the call of index_first(). A bug, I didn't correct it.

2) this is over-simplified. In his book Sasha describes why this approach is unrealistic and what the optimizer really does. For our case the description is valid so far.

3) you will find more information about the structure of the MYI-file here: 20.2 The .MYI File