Let's come back to the first question and look into the results of my observations.
MyISAM
A lot of my statements here are specific for the MyISAM-engine because the tables used for the tests are of type MyISAM. Other storage-engines can handle things differently so don't apply the results presented here schematically to other engines.And as a second argument a lot of the data presented here depends on the data in the tables, and this is my data. If you do your own tests don't expect to see the same numbers (like values for keys, file-offsets etc.).
preparations
Let's create some tables, put some data into them and start with the test.MariaDB [(none)]> create database TestOpt;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> use TestOpt;
Database changed
MariaDB [TestOpt]> create table TestBig (
-> Id char(8),
-> PZN char(7),
-> EVP decimal(7,2),
-> HAP decimal(7,2),
-> ArtikelBez varchar(40),
-> ArtikelText varchar(26),
-> Hersteller char(5) )
-> engine=MYISAM
-> ;
Query OK, 0 rows affected (0.04 sec)
MariaDB [TestOpt]> create table TestSmall (
-> Id char(8),
-> PZN char(7),
-> EVP decimal(7,2),
-> HAP decimal(7,2),
-> ArtikelBez varchar(40),
-> ArtikelText varchar(26),
-> Hersteller char(5) )
-> engine=MYISAM
-> ;
Query OK, 0 rows affected (0.04 sec)
MariaDB [TestOpt]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (2.21 sec)
Records: 301036 Duplicates: 0 Warnings: 0
MariaDB [TestOpt]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (1 min 12.97 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [TestOpt]> create index PZN on TestSmall(PZN);
Query OK, 301036 rows affected (2.08 sec)
Records: 301036 Duplicates: 0 Warnings: 0
MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 28.40 sec)
Records: 10000000 Duplicates: 0 Warnings: 0
MariaDB [TestOpt]>
And this is the statement I would like to inspect:
MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
| 14 |
+--------------+
1 row in set (0.41 sec)
MariaDB [TestOpt]>
This statement is silly, it has no special meaning except to be used for this demonstration here.Let the database-server tell us how it handles this statement:
MariaDB [TestOpt]> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: A
type: ALL
possible_keys: PZN
key: NULL
key_len: NULL
ref: NULL
rows: 301036
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: B
type: ref
possible_keys: PZN
key: PZN
key_len: 6
ref: TestOpt.A.PZN
rows: 448
Extra: Using where
2 rows in set (0.01 sec)
MariaDB [TestOpt]>
Looks like it takes the table TestSmall (aka A) to scans all records in it, applies WHERE to the records found and for those records meeting the WHERE-condition it searches the corresponding record(s) from TestBig (aka B) via the index. It then puts everything together and presents the result.code
To look further into the handling of this statement I used the modifications presented here and counted the number of calls of some routines. Here is the result:select SQL_NO_CACHE count(A.PZN) from TestSmall A straight_join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36035';
table = <TestSmall> counter table-scan: 301037 index-scan: 0
table = <TestBig> counter table-scan: 0 index-scan: 311
The assumption above seems to be correct: the table TestSmall has approx. 300K records in it as you can see from the INSERT-statement above, these are read via a table-scan (counted by the variable counterTableScan).For the table TestBig no table-scan is made (var. counterTableScan is 0) but instead index-related-functions are called, only 311times these functions are called. As this table contains 10mio records this is a wise decision to check only this small amount of records.
Let's inspect the data in these two tables:
MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
| 60 |
+----------+
1 row in set (0.38 sec)
MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
| 461016 |
+----------+
1 row in set (12.80 sec)
MariaDB [TestOpt]>
In the table TestSmall we have 60 records, in the table TestBig we have approx. 460K records fulfilling the WHERE-clause.scanning TestSmall
There are three functions in the storage-engine for handling a table-scan:- rnd_init() is called once for starting a table-scan
- rnd_next() returns one record. This function is called again and again until all records are read. In this case it returns MHA_ERR_END_OF_FILE.
- rnd_end() is called once after all records are scanned
MariaDB [TestOpt]> select SQL_NO_CACHE * from TestSmall where Hersteller = '00020' limit 1;
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| Id | PZN | EVP | HAP | ArtikelBez | ArtikelText | Hersteller |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 178324 | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND | 00020 |
+---------+--------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.06 sec)
MariaDB [TestOpt]>
This is the first record with Hersteller = '00020' in TestSmall. In this record the value for PZN is '173824'. Let's look into TestBig for records with this value in the column PZN:
MariaDB [TestOpt]> select SQL_NO_CACHE * from TestBig where PZN = '178324' limit 3;
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| Id | PZN | EVP | HAP | ArtikelBez | ArtikelText | Hersteller |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
| 01046189 | 178324 | 10.45 | 7.99 | MARLENE MS K1 AD K BLACK M 2 ST | BASILIKUM AETHERISCHES OEL | 20860 |
| 01089558 | 178324 | 141.72 | 0.00 | HEVERTOPLEX NR148N POPULUS 30 ML | NOVO HELISEN DEP BIR/ER/HA | 01250 |
| 01240196 | 178324 | 17.90 | 5.22 | BIOCHEMIE 10 NATR SULF D12 80 ST | SUPRIMA BETTUN M EG 50X70 | 36245 |
+----------+--------+--------+------+--------------------------------------+----------------------------+------------+
3 rows in set (0.08 sec)
MariaDB [TestOpt]> select SQL_NO_CACHE count(*) from TestBig where PZN = '178324';
+----------+
| count(*) |
+----------+
| 60 |
+----------+
1 row in set (0.03 sec)
MariaDB [TestOpt]>
As you can see there are 60 records in it and I presented only the first 3 records.TestBig via index
There are a lot of functions for accessing an index, their names start with index_xxxxx(). But only two of them are of interest for our test-case:- index_read_map() starts the search. It looks for the first occurrence of a key (in var. key) in the index. If this key is found it returns the corresponding record via the var. buf, otherwise HA_ERR_KEY_NOT_FOUND is returned.
- index_next_same() searches for the next entry in the index and returns the corresponding record in the var. buf. If there is no entry identical to the key given it returns HA_ERR_END_OF_FILE.
BTREE
MyISAM stores all index-related data in the file with the extension MYI. In our test-case only one index is created but the following statements are also valid in the case of multiple indexes.The index in MyISAM is of type BTREE, you will find a description of this here. And you will find the description of the structure of the MYI-file here: 20.2 The .MYI File.
tree
Let's dive a bit deeper into the handling of an index and into our test-case.The tree is organized in pages of size 1K. There are 2 types of pages: nodes and leafs. A node contains one ore more entries with this structure:
- a pointer to another node or leaf containing entries with keys less than the following key (5 bytes)
- the key = the value of the column of one record (1+7 bytes)
- the offset in MYD of the corresponding record (6 bytes)
An entry conists of 19 bytes so up to 53 entries can be found in a node ( because 2 + 53*19 + 5 < 1024).
For a leaf the structure differs a bit:
- the key = the value of the column(s) which build the index
- the offset in MYD of the corresponding record
In MyISAM you can easily detect if a page is a node: in the first byte of the page the MSB ist set (so the first byte is >= 0x80).
You can find a nice picture demonstrating this in the Wikipedia-article.
I will give you examples for this structure soon. As written before these examples are taken from my tests and depend on my data.
unique
In the tree the entries are unique and they are sorted, usually ascending.In the index the entries are unique meaning that for every record in the table there is one entry in the index. As the entry contains the file-offset from the data-file the uniqueness is guaranteed when you look at the full 14 bytes of an entry. This has nothing to do with the statement CREATE UNIQUE INDEX which guarantees uniqueness in the first 8 bytes (in our test-case).
function-calls
Coming back to the query: a record is found in TestSmall, the PZN is extracted and has the value '178324'. Now the server switches over to the table TestBig and searches for records with this value in the column PZN. As TestBig has an index on the column PZN it will use it. Here is the function-hierarchy:ha_myisam::index_read_map(()
mi_rkey()
_mi_search() with pos = 0x08859800 (aka: root of tree)
_mi_search() with pos = 0x5a5400
_mi_search() with pos = 0x30f800
_mi_search() with pos = 0x305000
(*info->read_record) calls _mi_read_dynamic_record()
index_read_map() is the function the database-server calls for reading the first record to the given PZN-value. From the moment of entering this function up to the moment this function is left everything is the responsibility of the storage-engine. The PZN-value to search for is given in the parameter key and in our case it looks like this: 0x00 (a flag) + '178324 '. As the column PZN is created with a width of 7 bytes the key also contains a trailing space. So the length for the comparisons will be 8. index_read_map() calls mi_rkey() for further processing.mi_rkey() now searches through the tree and it calls the function _mi_search() to do the walk-through recursively. It starts the search at 0x08859800 which is a file-offset in the MYI-file. As written before this is valid for my test-case, your data may differ. One can find the start of the tree (=the root) in the MYI-file (output is shortened):
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --read-bytes=1024 TestBig.MYI
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00 --> 0x08859800
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --read-bytes=1024 TestBig.MYI
000000 fe fe 07 01 00 03 01 7e 00 b0 00 64 00 c4 00 01
000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00
000020 00 98 96 80 00 00 00 00 00 00 00 00 00 00 00 00
000030 00 98 96 80 ff ff ff ff ff ff ff ff 00 00 00 00
000040 08 85 9c 00 00 00 00 00 33 a9 ba 30 00 00 00 00
000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 53 36
000070 00 00 00 28 00 00 00 00 00 00 00 01 00 00 00 00 the offset of the root of the tree
000080 08 85 98 00 ff ff ff ff ff ff ff ff 00 00 00 00 (8 bytes)
000090 00 00 00 00 54 32 ad c8 00 00 00 00 00 00 00 01
........
As I've created only one index there is only one entry for the file-offset, there can be multiple offsets. You can also find this value at runtime in the var. info->s->state.key_root[0].Here is what I found in the MYI-file in the root of the tree (a bit shortened):
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x08859800 --read-bytes=1024 TestBig.MYI
8859800 83 97 this is a node: 0x80 is set
this page is filled up to 0x0397 bytes
00 00 00 0b 66 page-number of another node/leaf
01 31 37 32 33 31 30 20 PZN = 0x01 + '172310 '
00 00 0e c1 4a 38 offset in MYD for this record
00 00 00 16 95 page 0x1695 --> offset = 0x1695 * 0x400 = 0x5A5400
01 32 34 38 32 39 35 20 PZN = 0x01 + '248295 '
00 00 12 20 9b 6c offset in MYD-for this record
........................
We are looking for a value of 0x01 (changed already from 0x00) + '178324 ' and found that it is between the first and the second entry in this page. So from the second entry we take the page-number 0x1695, multiply it with the length of a page (=0x400) and get a new offset: 0x5A5400. Let's look what we will find in MYI at this place:august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x5A5400 --read-bytes=1024 TestBig.MYI
5a5400 83 e3 this is a node
this page contains up to 0x03e3 bytes
00 00 00 0b 65
01 31 37 33 35 34 35 20 PZN = 0x01 + '173545 '
00 00 0e 7d 12 9c
00 00 00 0b 9c
01 31 37 34 37 38 36 20 PZN = 0x01 + '174786 '
00 00 2b ce 20 40
00 00 00 0b d2
01 31 37 36 32 35 39 20 PZN = 0x01 + '176259 '
00 00 2a 32 2a f8
00 00 00 0c 08
5a5440 01 31 37 37 39 38 30 20 PZN = 0x01 + '177980 '
00 00 2b 12 d9 d0
00 00 00 0c 3e page 0x0c3e --> file offset = 0x0c3e * 0x400 = 0x30f800
01 31 37 39 32 30 30 20 PZN = 0x01 + '179200 '
00 00 11 bc 9b 24
.........................
Again the software walks through this page comparing the keys. The first key greater than our search key is the fifth entry so we continue our search at 0x30f800. This is what we find there:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x30F800 --read-bytes=1024 TestBig.MYI
30f800 83 e3 another node
00 00 00 0c 07
01 31 37 38 30 30 35 20 PZN = 0x01 + '178005 '
00 00 31 10 b1 74
.................
30f8c0 00 00 00 0c 12
01 31 37 38 32 36 34 20 PZN = 0x01 + '178264 '
00 00 01 a1 a8 b4
00 00 00 0c 13
01 31 37 38 32 38 37 20 PZN = 0x01 + '178287 '
30f8e0 00 00 09 7d 8d f0
00 00 00 0c 14 page = 0x0c14 --> offset 0x305000
01 31 37 38 33 32 34 20 PZN = 0x01 + '178324 '
00 00 0d 23 d0 60 file-offset = 0x0d23d060
00 00 00 0c 15 page = 0x0c15 --> fiile-offset = 0x305400
01 31 37 38 33 35 33 20 PZN = 0x01 + '178353 '
00 00 12 c9 f6 9c
..........................
Again the software compares the keys and stops when it finds an entry greater or equal to our search-key. As we are interested in the first record with our key we continue our search at file-offset 0x305000:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305000 --read-bytes=1024 TestBig.MYI
305000 03 f2 this is a leaf!
01 31 37 38 32 38 37 20 PZN = 0x01 + '178287 '
00 00 09 fd 3b 88
305010 01 31 37 38 32 38 37 20 PZN = 0x01 + '178287 '
00 00 0a 9f ea 2c
.........
01 31 37 38 32 38 37 20 PZN = 0x01 + '178287 '
00 00 33 58 5e e4
01 31 37 38 33 32 34 20 PZN = 0x01 + '178324 '
3052f0 00 00 00 3d 17 c4 file-offset = 0x003d17c4
01 31 37 38 33 32 34 20 PZN = 0x01 + '178324 '
00 00 00 76 74 20 file-offset = 0x00767420
.........
Starting the comparison from the first entry in this page the code in _mi_search() finds an entry (marked in bold above). As this is a leaf we can stop here and return (and return and return and return, as this was a recursion).The record can be read from the MYD-file at offset 0x003d17c4:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x0000003d17c4 --read-bytes=100 TestBig.MYD
3d17c4 03
00 5e len record = 94d
02 no. of filler-bytes
00 80 flags
30 31 30 34 36 31 38 39 Id = '01046189'
31 37 38 33 32 34 20 PZN = '178324 '
80 00 0a 2d EVP = 10 + 45 (0x0a, 0x2d)
80 00 07 63 HAP = 7 + 99 (0x07, 0x63)
24 4d 41 52 4c 45 4e 45 ArtikelBez (len 0x24)
20 4d 53 20 4b 31 20 41
44 20 4b 20 42 4c 41 43
4b 20 4d 20 20 20 20 20
20 32 20 53 54
1a 42 41 53 49 4c 49 4b ArtikelText (len 0x1a)
55 4d 20 41 45 54 48 45
52 49 53 43 48 45 53 20
4f 45 4c
32 30 38 36 30 Hersteller = 20860'
00 00 filler-bytes
You can find an explanation of this structure in my post with the title od.Finito, in TestBig we got the first record which is given back to the database-server in the var. buf. Some internal values are stored, among others the entry found in MYI is kept for further processing (the key + offset, now 14 bytes).
Record #1 is found but it does not match to the condition of the WHERE-clause. Nevertheless step 1 is done.
So let's go on. The database-server continues to search for more records in TestBig for the record from TestSmall. Here is the function-hierarchy for the next step(s):
ha_myisam::index_next_same()
mi_rnext_same()
_mi_search_next()
check page: info->int_keypos < info->int_maxpos
--> fetch next entry from this page
compare the first 8 bytes of this entry
identical: (*info->read_record) calls _mi_read_dynamic_record()
Now the entry-point into the storage-engine is the function index_next_same(). As the name suggests it searches for the next entry in the tree with the same key (8 bytes). So the last page was checked and as it contains more entries so the next entry is taken from this page.This step will be repeated until a key is found greater than the search-key. Here this happens:
august@AMD4:~/MariaDB/data/TestOpt$ od -t x1 -Ax --skip-bytes=0x305400 --read-bytes=1024 TestBig.MYI
305400 03 f2 this is a leaf
.....................
01 31 37 38 33 32 34 20 PZN = 0x01 + '178324 ' record #59
00 00 32 52 38 f0
01 31 37 38 33 32 34 20 PZN = 0x01 + '178324 ' record #60
00 00 32 6c 11 30 file-offset = 0x326c1130
01 31 37 38 33 35 33 20 PZN = 0x01 + '178353 ' greater then '178324 '
00 00 00 6c b5 0c file-offset = 0x6cb50c
.....................
After record #60 the entry with the key '178353 ' is returned but the function mi_rnext_same() detects that this value differs from the previous one and returns HA_ERR_END_OF_FILE to index_next_same() which returns this value to the database-server.The server now knows that there are no more records with the key '178324 ' in the table TestBig. It switches over to the table TestSmall and continues with the table-scan. For the next record from TestSmall with a value of Hersteller = '00020' it switches over to TestBig and calls ha_myisam::index_read_map(() giving it as a parameter the value of PZN from the then current record from TestSmall.
more explanations
I'm sure you've detected the fact that in my last block of entries I silently switched from the page at offset 0x305000 to the page at offset 0x305400.
In this example above we scanned the index for entries '178324 ' and the SQL-statement somewhere in the beginning of this post showed that there are 60 records in TestBig with this PZN-value so there must be 60 entries with this value in the tree too. These entries are spread over 2 leafs and one node: 19 entries can bei found at the page (leaf) at offset 0x305000, 40 entries can be found in the page (leaf) at offset 0x305400 and one entry is found in the page (node) at offset 0x30F800.
As shown it's easy to continue the search within the current page but what does the storage-engine do when the end of a page is reached?
It does this:
ha_myisam::index_next_same()()
mi_rnext_same()
_mi_search_next() with pos = 0x8859800 (=root)
info->int_keypos >= info->int_maxpos
--> reached the end of this page --> search from root of the tree
_mi_search() with pos = 0x8859800 (= root)
_mi_fetch_keypage(): root-page (from cache)
_mi_seq_search() in this page
stops at 2nd entry, returns 1
_mi_search() with pos = 0x5a5400
_mi_fetch_keypage() (from cache)
_mi_seq_search()
stops at 5th entry, returns 1
_mi_search() with pos = 0x30f800
_mi_fetch_keypage() (from cache)
_mi_seq_search()
stops at 12th entry, returns 1
_mi_search() with pos = 0x305000
_mi_fetch_keypage() (from cache)
_mi_seq_search()
returns -1 (did find only smaller entries)
_mi_search(() with -1
end of search: continue search one level up
_mi_fetch_keypage(): 0x30f800 (from cache)
_mi_get_pack_key()
check key found
read record at 0xd23d060
As you can see _mi_search() is called recursively to find an entry in a leaf. And in this case there is no entry in a leaf which is greater than the last key (comparing the full 14 bytes) and smaller than the entries in nodes so it returns 2 levels up and takes the entry from the node at offset 0x30f800. It returns with a pointer to this entry to the function _mi_search_next() where the key is compared against the search-key (comparing the first 8 bytes) and as these are identical the offset (the last 6 bytes of this entry) is taken and the record is read from the MYD-file.In the next call of index_next_same() it works like this:
ha_myisam::index_next_same()
mi_rnext_same()
_mi_search_next() with pos = 0x8859800 (=root)
uses the last buffer because: info->buff is 0 --> unchanged
_mi_search() with pos = 0x305400
_mi_fetch_keypage() (read from disk via my_pread())
_mi_seq_search()
returns +1
_mi_search() with pos = -1
returns with +1 (search at upper level)
read record at 0xd7e0b58
As the current buffer is a node it takes the page-offset contained in the last entry used. The page with this number (a node or a leaf) will contain entries with key-values greater than the current entry. So with some extra steps the first entry from this page is taken and the corresponding record is read.Next calls to index_next_same() will take the following entries from this page until an entry is found which has a different key. In this case _mi_search_next() returns a value HA_ERR_END_OF_FILE. The server now switches over to TestSmall as already described above.
correctness
This text is still a simplified presentation of what's going on in the software. More conflicts can arise but the code handles these situations. And there are also some optimizations build into the code that I didn't describe (this text is already a bit longer than planned).In case something is wrong with my description please use the mail-function of this side and send me a mail describing my error. I will look at it. Thanks.