Wednesday, 22 October 2014

JOIN

By playing with an SQL-statement two questions arised. After diving into the code I was able to answer at least one question and I would like to show my results here. The question was: how does the database-server handle a JOIN? The other question was: among all possible ways for handling this query why does the database-server choose this way? But this is another story.

Let's come back to the first question and look into the results of my observations.

MyISAM

A lot of my statements here are specific for the MyISAM-engine because the tables used for the tests are of type MyISAM. Other storage-engines can handle things differently so don't apply the results presented here schematically to other engines.

And as a second argument a lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets etc.).

preparations

Let's create some tables, put some data into them and start with the test.
MariaDB [(none)]> create database TestOpt;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestOpt;
Database changed
MariaDB [TestOpt]> create table TestBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> create table TestSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  decimal(7,2),
    ->                 HAP  decimal(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestOpt]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (2.21 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (1 min 12.97 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestSmall(PZN);
Query OK, 301036 rows affected (2.08 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 28.40 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> 

And this is the statement I would like to inspect:
MariaDB [TestOpt]>  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.41 sec)

MariaDB [TestOpt]> 
This statement is silly, it has no special meaning except to be used for this demonstration here.

Let the database-server tell us how it handles this statement:
MariaDB [TestOpt]> explain  select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: PZN
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 6
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>
Looks like it takes the table TestSmall (aka A) to scans all records in it, applies WHERE to the records found and for those records meeting the WHERE-condition it searches the corresponding record(s) from TestBig (aka B) via the index. It then puts everything together and presents the result.

code

To look further into the handling of this statement I used the modifications presented here and counted the number of calls of some routines. Here is the result:
select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36035';
table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 311
The assumption above seems to be correct: the table TestSmall has approx. 300K records in it as you can see from the INSERT-statement above, these are read via a table-scan (counted by the variable counterTableScan).
For the table TestBig no table-scan is made (var. counterTableScan is 0) but instead index-related-functions are called, only 311times these functions are called. As this table contains 10mio records this is a wise decision to check only this small amount of records.

Let's inspect the data in these two tables:
MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.38 sec)

MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
|   461016 |
+----------+
1 row in set (12.80 sec)

MariaDB [TestOpt]> 
In the table TestSmall we have 60 records, in the table TestBig we have approx. 460K records fulfilling the WHERE-clause.

scanning TestSmall

There are three functions in the storage-engine for handling a table-scan:
  • rnd_init() is called once for starting a table-scan
  • rnd_next() returns one record. This function is called again and again until all records are read. In this case it returns MHA_ERR_END_OF_FILE.
  • rnd_end() is called once after all records are scanned
Using these functions the table TestSmall is read. Record after record is fetched and the column Hersteller is checked. After 2100 records a record with Hersteller = '00020' is encountered.
MariaDB [TestOpt]> select SQL_NO_CACHE * from TestSmall where Hersteller = '00020' limit 1;
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN    | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 178324 | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.06 sec)

MariaDB [TestOpt]> 
This is the first record with Hersteller = '00020' in TestSmall. In this record the value for PZN is '173824'. Let's look into TestBig for records with this value in the column PZN:
MariaDB [TestOpt]> select SQL_NO_CACHE * from TestBig where PZN = '178324' limit 3;
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| 01046189 | 178324 |  10.45 | 7.99 | MARLENE MS K1 AD K BLACK M      2 ST | BASILIKUM AETHERISCHES OEL | 20860      |
| 01089558 | 178324 | 141.72 | 0.00 | HEVERTOPLEX NR148N POPULUS     30 ML | NOVO HELISEN DEP BIR/ER/HA | 01250      |
| 01240196 | 178324 |  17.90 | 5.22 | BIOCHEMIE 10 NATR SULF D12     80 ST | SUPRIMA BETTUN M EG 50X70  | 36245      |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
3 rows in set (0.08 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestBig where PZN = '178324';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestOpt]> 
As you can see there are 60 records in it and I presented only the first 3 records.

TestBig via index

There are a lot of functions for accessing an index, their names start with index_xxxxx(). But only two of them are of interest for our test-case:
  • index_read_map() starts the search. It looks for the first occurrence of a key (in var. key) in the index. If this key is found it returns the corresponding record via the var. buf, otherwise HA_ERR_KEY_NOT_FOUND is returned.
  • index_next_same() searches for the next entry in the index and returns the corresponding record in the var. buf. If there is no entry identical to the key given it returns HA_ERR_END_OF_FILE.
But we need more information: how is the index organized?

BTREE

MyISAM stores all index-related data in the file with the extension MYI. In our test-case only one index is created but the following statements are also valid in the case of multiple indexes.

The index in MyISAM is of type BTREE, you will find a description of this here. And you will find the description of the structure of the MYI-file here: 20.2 The .MYI File.

tree

Let's dive a bit deeper into the handling of an index and into our test-case.
The tree is organized in pages of size 1K. There are 2 types of pages: nodes and leafs. A node contains one ore more entries with this structure:
  • a pointer to another node or leaf containing entries with keys less than the following key (5 bytes)
  • the key = the value of the column of one record (1+7 bytes)
  • the offset in MYD of the corresponding record (6 bytes)
After the last entry in a node follow a pointer (5 bytes) that contains the number of a page with key-values greater than the last key in this node.

An entry conists of 19 bytes so up to 53 entries can be found in a node ( because 2 + 53*19 + 5 < 1024).

For a leaf the structure differs a bit:
  • the key = the value of the column(s) which build the index
  • the offset in MYD of the corresponding record
A leaf is the end of a path in the tree, nodes occur on the way to the leaf. Or said differently: nodes have successors (nodes or leafs), leafs do not. This explains why a leaf contains no references to other pages. As an entry in a leaf is 14 bytes long up to 73 entries fit into a page.

In MyISAM you can easily detect if a page is a node: in the first byte of the page the MSB ist set (so the first byte is >= 0x80).

You can find a nice picture demonstrating this in the Wikipedia-article.

I will give you examples for this structure soon. As written before these examples are taken from my tests and depend on my data.

unique

In the tree the entries are unique and they are sorted, usually ascending.

In the index the entries are unique meaning that for every record in the table there is one entry in the index. As the entry contains the file-offset from the data-file the uniqueness is guaranteed when you look at the full 14 bytes of an entry. This has nothing to do with the statement CREATE UNIQUE INDEX which guarantees uniqueness in the first 8 bytes (in our test-case).

function-calls

Coming back to the query: a record is found in TestSmall, the PZN is extracted and has the value '178324'. Now the server switches over to the table TestBig and searches for records with this value in the column PZN. As TestBig has an index on the column PZN it will use it. Here is the function-hierarchy:
ha_myisam::index_read_map(() 
    mi_rkey()
        _mi_search() with pos = 0x08859800 (aka: root of tree)
            _mi_search() with pos = 0x5a5400
                _mi_search() with pos = 0x30f800
                    _mi_search() with pos = 0x305000
        (*info->read_record) calls _mi_read_dynamic_record()
index_read_map() is the function the database-server calls for reading the first record to the given PZN-value. From the moment of entering this function up to the moment this function is left everything is the responsibility of the storage-engine. The PZN-value to search for is given in the parameter key and in our case it looks like this: 0x00 (a flag) + '178324 '. As the column PZN is created with a width of 7 bytes the key also contains a trailing space. So the length for the comparisons will be 8. index_read_map() calls mi_rkey() for further processing.

mi_rkey() now searches through the tree and it calls the function _mi_search() to do the walk-through recursively. It starts the search at 0x08859800 which is a file-offset in the MYI-file. As written before this is valid for my test-case, your data may differ. One can find the start of the tree (=the root) in the MYI-file (output is shortened):
august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00    --> 0x08859800

august@AMD4:~/MariaDB/data/TestOpt$  od -t x1 -Ax  --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01 00 03 01 7e 00 b0 00 64 00 c4 00 01
000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00
000020 00 98 96 80 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 98 96 80 ff ff ff ff ff ff ff ff 00 00 00 00
000040 08 85 9c 00 00 00 00 00 33 a9 ba 30 00 00 00 00
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 53 36
000070 00 00 00 28 00 00 00 00 00 00 00 01 00 00 00 00      the offset of the root of the tree
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00      (8 bytes)
000090 00 00 00 00 54 32 ad c8 00 00 00 00 00 00 00 01
       ........
As I've created only one index there is only one entry for the file-offset, there can be multiple offsets. You can also find this value at runtime in the var. info->s->state.key_root[0].

Here is what I found in the MYI-file in the root of the tree (a bit shortened):
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x08859800 --read-bytes=1024 TestBig.MYI
8859800 83 97                               this is a node: 0x80 is set
                                            this page is filled up to 0x0397 bytes

        00 00 00 0b 66                      page-number of another node/leaf
        01 31 37 32 33 31 30 20             PZN = 0x01 + '172310 '
        00 00 0e c1 4a 38                   offset in MYD for this record

        00 00 00 16 95                      page 0x1695 --> offset = 0x1695 * 0x400 = 0x5A5400
        01 32 34 38 32 39 35 20             PZN = 0x01 + '248295 '
        00 00 12 20 9b 6c                   offset in MYD-for this record
........................
We are looking for a value of 0x01 (changed already from 0x00) + '178324 ' and found that it is between the first and the second entry in this page. So from the second entry we take the page-number 0x1695, multiply it with the length of a page (=0x400) and get a new offset: 0x5A5400. Let's look what we will find in MYI at this place:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x5A5400 --read-bytes=1024 TestBig.MYI
5a5400 83 e3                                this is a node
                                            this page contains up to 0x03e3 bytes

       00 00 00 0b 65                       
       01 31 37 33 35 34 35 20              PZN = 0x01 + '173545 '
       00 00 0e 7d 12 9c 

       00 00 00 0b 9c 
       01 31 37 34 37 38 36 20              PZN = 0x01 + '174786 '
       00 00 2b ce 20 40 

       00 00 00 0b d2 
       01 31 37 36 32 35 39 20              PZN = 0x01 + '176259 '
       00 00 2a 32 2a f8 

       00 00 00 0c 08
5a5440 01 31 37 37 39 38 30 20              PZN = 0x01 + '177980 '
       00 00 2b 12 d9 d0                    

       00 00 00 0c 3e                       page 0x0c3e --> file offset = 0x0c3e * 0x400 = 0x30f800
       01 31 37 39 32 30 30 20              PZN = 0x01 + '179200 '
       00 00 11 bc 9b 24 
.........................
Again the software walks through this page comparing the keys. The first key greater than our search key is the fifth entry so we continue our search at 0x30f800. This is what we find there:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x30F800 --read-bytes=1024 TestBig.MYI
30f800 83 e3                                another node

       00 00 00 0c 07 
       01 31 37 38 30 30 35 20              PZN = 0x01 + '178005 '
       00 00 31 10 b1 74 
.................
30f8c0 00 00 00 0c 12 
       01 31 37 38 32 36 34 20              PZN = 0x01 + '178264 '
       00 00 01 a1 a8 b4 

       00 00 00 0c 13 
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
30f8e0 00 00 09 7d 8d f0 

       00 00 00 0c 14                       page = 0x0c14 --> offset 0x305000
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '    
       00 00 0d 23 d0 60                    file-offset = 0x0d23d060

       00 00 00 0c 15                       page = 0x0c15 --> fiile-offset = 0x305400
       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 ' 
       00 00 12 c9 f6 9c 
..........................
Again the software compares the keys and stops when it finds an entry greater or equal to our search-key. As we are interested in the first record with our key we continue our search at file-offset 0x305000:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305000 --read-bytes=1024 TestBig.MYI
305000 03 f2                                this is a leaf!

       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 09 fd 3b 88 

305010 01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 0a 9f ea 2c 
.........
       01 31 37 38 32 38 37 20              PZN = 0x01 + '178287 '
       00 00 33 58 5e e4 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '       
3052f0 00 00 00 3d 17 c4                    file-offset = 0x003d17c4

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '
       00 00 00 76 74 20                    file-offset = 0x00767420
.........
Starting the comparison from the first entry in this page the code in _mi_search() finds an entry (marked in bold above). As this is a leaf we can stop here and return (and return and return and return, as this was a recursion).
The record can be read from the MYD-file at offset 0x003d17c4:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x0000003d17c4 --read-bytes=100 TestBig.MYD
3d17c4 03 
       00 5e                                len record = 94d
       02                                   no. of filler-bytes
       00 80                                flags
       30 31 30 34 36 31 38 39              Id = '01046189'
       31 37 38 33 32 34 20                 PZN = '178324 '
       80 00 0a 2d                          EVP = 10 + 45 (0x0a, 0x2d) 
       80 00 07 63                          HAP =  7 + 99 (0x07, 0x63)
       24 4d 41 52 4c 45 4e 45              ArtikelBez (len 0x24)
       20 4d 53 20 4b 31 20 41 
       44 20 4b 20 42 4c 41 43 
       4b 20 4d 20 20 20 20 20 
       20 32 20 53 54 
       1a 42 41 53 49 4c 49 4b              ArtikelText (len 0x1a)
       55 4d 20 41 45 54 48 45 
       52 49 53 43 48 45 53 20 
       4f 45 4c 
       32 30 38 36 30                       Hersteller = 20860'
       00 00                                filler-bytes
You can find an explanation of this structure in my post with the title od.

Finito, in TestBig we got the first record which is given back to the database-server in the var. buf. Some internal values are stored, among others the entry found in MYI is kept for further processing (the key + offset, now 14 bytes).

Record #1 is found but it does not match to the condition of the WHERE-clause. Nevertheless step 1 is done.

So let's go on. The database-server continues to search for more records in TestBig for the record from TestSmall. Here is the function-hierarchy for the next step(s):
ha_myisam::index_next_same()
    mi_rnext_same()
        _mi_search_next()
             check page: info->int_keypos < info->int_maxpos
               --> fetch next entry from this page
        compare the first 8 bytes of this entry
        identical: (*info->read_record) calls _mi_read_dynamic_record()
Now the entry-point into the storage-engine is the function index_next_same(). As the name suggests it searches for the next entry in the tree with the same key (8 bytes). So the last page was checked and as it contains more entries so the next entry is taken from this page.

This step will be repeated until a key is found greater than the search-key. Here this happens:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305400 --read-bytes=1024 TestBig.MYI
305400 03 f2                                this is a leaf
.....................
       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #59
       00 00 32 52 38 f0 

       01 31 37 38 33 32 34 20              PZN = 0x01 + '178324 '        record #60
       00 00 32 6c 11 30                    file-offset = 0x326c1130

       01 31 37 38 33 35 33 20              PZN = 0x01 + '178353 '        greater then '178324 '
       00 00 00 6c b5 0c                    file-offset = 0x6cb50c
.....................
After record #60 the entry with the key '178353 ' is returned but the function mi_rnext_same() detects that this value differs from the previous one and returns HA_ERR_END_OF_FILE to index_next_same() which returns this value to the database-server.
The server now knows that there are no more records with the key '178324 ' in the table TestBig. It switches over to the table TestSmall and continues with the table-scan. For the next record from TestSmall with a value of Hersteller = '00020' it switches over to TestBig and calls ha_myisam::index_read_map(() giving it as a parameter the value of PZN from the then current record from TestSmall.

more explanations
I'm sure you've detected the fact that in my last block of entries I silently switched from the page at offset 0x305000 to the page at offset 0x305400.
In this example above we scanned the index for entries '178324 ' and the SQL-statement somewhere in the beginning of this post showed that there are 60 records in TestBig with this PZN-value so there must be 60 entries with this value in the tree too. These entries are spread over 2 leafs and one node: 19 entries can bei found at the page (leaf) at offset 0x305000, 40 entries can be found in the page (leaf) at offset 0x305400 and one entry is found in the page (node) at offset 0x30F800.
As shown it's easy to continue the search within the current page but what does the storage-engine do when the end of a page is reached?
It does this:
ha_myisam::index_next_same()()
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            info->int_keypos >= info->int_maxpos 
                --> reached the end of this page --> search from root of the tree
            _mi_search() with pos = 0x8859800 (= root)
                _mi_fetch_keypage(): root-page (from cache)
                _mi_seq_search() in this page
                    stops at 2nd entry, returns 1
                _mi_search() with pos = 0x5a5400
                    _mi_fetch_keypage() (from cache)
                    _mi_seq_search()
                        stops at 5th entry, returns 1
                    _mi_search() with pos = 0x30f800
                        _mi_fetch_keypage() (from cache)
                        _mi_seq_search()
                            stops at 12th entry, returns 1
                        _mi_search() with pos = 0x305000
                            _mi_fetch_keypage() (from cache)
                            _mi_seq_search()
                                returns -1 (did find only smaller entries)
                            _mi_search(() with -1
                                end of search: continue search one level up
                    _mi_fetch_keypage(): 0x30f800 (from cache)
                    _mi_get_pack_key()
          check key found
          read record at 0xd23d060
As you can see _mi_search() is called recursively to find an entry in a leaf. And in this case there is no entry in a leaf which is greater than the last key (comparing the full 14 bytes) and smaller than the entries in nodes so it returns 2 levels up and takes the entry from the node at offset 0x30f800. It returns with a pointer to this entry to the function _mi_search_next() where the key is compared against the search-key (comparing the first 8 bytes) and as these are identical the offset (the last 6 bytes of this entry) is taken and the record is read from the MYD-file.

In the next call of index_next_same() it works like this:
ha_myisam::index_next_same() 
    mi_rnext_same()
        _mi_search_next() with pos = 0x8859800 (=root)
            uses the last buffer because: info->buff is 0 --> unchanged 
            _mi_search() with pos = 0x305400
                _mi_fetch_keypage()     (read from disk via my_pread())
                _mi_seq_search() 
                    returns +1
                _mi_search() with pos = -1
                     returns with +1 (search at upper level)
        read record at 0xd7e0b58
As the current buffer is a node it takes the page-offset contained in the last entry used. The page with this number (a node or a leaf) will contain entries with key-values greater than the current entry. So with some extra steps the first entry from this page is taken and the corresponding record is read.

Next calls to index_next_same() will take the following entries from this page until an entry is found which has a different key. In this case _mi_search_next() returns a value HA_ERR_END_OF_FILE. The server now switches over to TestSmall as already described above.


correctness

This text is still a simplified presentation of what's going on in the software. More conflicts can arise but the code handles these situations. And there are also some optimizations build into the code that I didn't describe (this text is already a bit longer than planned).

In case something is wrong with my description please use the mail-function of this side and send me a mail describing my error. I will look at it. Thanks.