an unsystematic journey through the source-code of MySQL and MariaDB: May 2014

this is my third post about this topic: storing data in a column-oriented form. You can find the other posts here: columns and columns (2).

Everything presented here is based on MyISAM. So in the first post the data was stored exactly as a MyISAM-file, only accessing the data was a little bit modified. In the second post an algorithm was introduced in which a file was created and in this file the values of one column were stored additionally to storing the record in the MYD-file. I could show that this algorithm speeds up a table-scan in some cases.

In this post I like to present an improved algorithm and want to show what you can expect of this algorithm.

warning

The goal of this post is to show how to speed-up a SELECT-statements (and if this is possible). The code presented here is not production-ready, it is (as written before) a proof of concept. It is still not generally usable, it cannot handle UPDATEs or DELETEs, lacks error-checking, handling of all possible cases, handling of concurrent access (no locking) and so on. And it is not thoroughly tested.

In short: don't use this code (except for experimenting). Nevertheless I hope to show what we can expect of the algorithm I will present.

algorithm

Adding to the datafile of a MyISAM-table I want to store the data of one (or more) column(s) in two (or more) additional files.

In MySQL/MariaDB the data is organized as a database (= a folder) and in this database as tables, each table is represented by 3 files (MYD contains the data, MYI contains the index(es), frm contains internal data) . This is valid for the MyISAM-engine, other engines can handle this differently. We will base our test-engine on MyISAM so we will follow this route.

On my PC here MariaDB organizes its data in this directory of the disk: ~/MariaDB/data/
and in this folder I have a database TestCol with tables TestBig and TestSmall. So one can find the data of the table TestBig in the file ~/MariaDB/data/TestCol/TestBig.MYD.

For storing the additional data I will create a subdirectory in this folder and give it the name of the table and append '.cols' to the name of this folder. In this folder I will create one file with the name of the relevant column appended by '.key' and another file with the name of the relevant column appended by '.data. In the key-file I will store the values of the corresponding column plus a reference into the data-file. And in the data-file I will store references to the MYD-file (= file-offsets of the corresponding record).

Let's assume a columnar-storage of the values of the column Hersteller of the table TestMyCols2Big in the database TestCol, then the location of the new files will be in the folder ~/MariaDB/data/TestCol/TestMyCols2Big.cols/. In this folder one can find the files Hersteller.key and Hersteller.data.

In such a key-file the data will be stored in entries with this structure:

len of the value following = 1 byte
the value in the internal representation = no. of bytes as described before
a block-number referencing the data-file = 4 bytes

The data-file is organized as a linear list of blocks of size 1K (=256 integers), but it starts with a single integer (=4 bytes), a reference to the next free block in this list. Each block starts with 2 entries (=header): the number of the next block in this list (0 if there is no next block) and a number of entries in this block (starting with a value of at least 3, because of these 2 numbers in the header plus the first entry). Following the 2 integers of the header are up to 253 integers (=file-offsets in the MYD-file).

There are some restrictions introduced here: this algo can only handle VARCHARs up to 255 byte in length in the key-file. And it can only handle datafiles up to 4GB in size, not bigger ones (MyISAM can). For our tests this should be sufficient.

Let me demonstrate this approach by an example with only 10 records stored in the table and additionally storing the data of the column Hersteller in columnar-form (I marked the relevant parts in bold):

MariaDB [TestCol]> select * from TestMyCols2Big;
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L   | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML              | URETERSCHIENEAC4P66        | 36367      |
| 01000002 | 454652 | 0.00   | 0.00  | GLANDULA SUP SUIS INJ O    100 ST    | STIBIUM ARSENICOSUM  D10   | 10070      |
| 01000003 | 999999 | 7.93   | 11.10 | MEASURIT URIMETER 2K 200ML      1 ST | SPRING STYROPORKISS 30X40  | 26058      |
| 01000004 | 999999 | 0.00   | 23.58 | BORT KNOECHELST PO WE R S      1 ST  | NOVACOM BABY BETTN JUN HB  | 87240      |
| 01000005 | 999999 | 0.00   | 3.04  | MEZEREUM C 6     80 ST               | AZELAINSAEURE              | 36245      |
| 01000006 | 999999 | 0.00   | 0.00  | FUCUS VESICUL D 2    200 ST          | LUMB GURT M EINSA 09PE M   | 36260      |
| 01000007 | 999999 | 12.70  | 6.26  | CARBO ANIMALIS Q18     10 ML         | ZITRONE AETHERISCHES OEL   | 02484      |
| 01000008 | 999999 | 6.91   | 0.00  | GILOFA FEIN AM GR5 MARBEL      1 ST  | ORTEL C3 CERV 2391 G2 B    | 87240      |
| 01000009 | 999999 | 0.00   | 0.00  | SUPRIMA KRANK HEMD44/46 71      1 ST | CYPRIPEDIUM PUBESC D12     | 18017      |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
10 rows in set (0.02 sec)

MariaDB [TestCol]> select * from TestMyCols2Big where Hersteller = '87240';
+----------+--------+------+-------+-------------------------------------+---------------------------+------------+
| Id       | PZN    | EVP  | HAP   | ArtikelBez                          | ArtikelText               | Hersteller |
+----------+--------+------+-------+-------------------------------------+---------------------------+------------+
| 01000004 | 999999 | 0.00 | 23.58 | BORT KNOECHELST PO WE R S      1 ST | NOVACOM BABY BETTN JUN HB | 87240      |
| 01000008 | 999999 | 6.91 | 0.00  | GILOFA FEIN AM GR5 MARBEL      1 ST | ORTEL C3 CERV 2391 G2 B   | 87240      |
+----------+--------+------+-------+-------------------------------------+---------------------------+------------+
2 rows in set (0.00 sec)

MariaDB [TestCol]>

We will look at the values 25677 and 87240 in detail (in the case of 87240 we have 2 entries in this list, all others appear only once).

If we store the values of the column Hersteller in the form as described above the file Hersteller.key will have this content:

august@AMD4:~/MariaDB/data/TestCol/TestMyCols2Big.cols$ od -t x1 -Ax Hersteller.key
000000 05                         len of the entry following
       32 35 36 37 37             25677
       00 00 00 00                look into block #0 of data-file
       
       05                         len 
       33 36 33 36 37             36367
000010 01 00 00 00                reference to block #1

       05                         len 
       31 30 30 37 30             10070
       02 00 00 00                block #2 
       
       05                         len
       32 36 30 35 38             26058
       03 00 00 00                block #3
       
       05                         len
       38 37 32 34 30             87240
       04 00 00 00                block #4
       
and so on.

So every entry in this file is 10 bytes long, this depends on how this table was created. And even there are 2 records in our list with a value of 87240 in the column Hersteller this entry appears only once in the key-file.

Let's look into the file Hersteller.data:

august@AMD4:~/MariaDB/data/TestCol/TestMyCols2Big.cols$ od -t x1 -Ax Hersteller.data
000000 09 00 00 00                next free block is block #9
 
the first block (block #0):       corresponds to: Hersteller = '25677'
       00 00 00 00                no next block in this list
       03 00 00 00                2 entries in header + 1 entry in the list
       00 00 00 00                file-offset of corresponding record in MYD-file
000010 00 00 00 00                filler
       00 00 00 00 
       00 00 00 00
and so on

block #4:                         corresponds to: Hersteller = '87240'
01004  00 00 00 00                no next block in this list
       04 00 00 00                2 entries in header + 2 entries in the list
       9c 01 00 00                filepos of entry #1
001010 38 03 00 00                filepos of entry #2
       00 00 00 00                unused
and so on

august@AMD4:~/MariaDB/data/TestCol/TestMyCols2Big.cols$

In the case of a table-scan the usual way in MyISAM is to go through all the records and therefore the engine has to read the full MYD-file. In the case of this algorithm we only have to read the key-file. If a match is found all corresponding records can be fetched immediately with the values in the corresponding block(s) in the data-file.

Let's look how fast this can be. Please keep in mind that the existing engines are heavily optimized (thanks!) and our engine is not.

code

So I created a storage-engine MYCOLS2. For some internal reasons it was created as a static engine, not as a loadable plugin. You will find the code here. You may also look at this post for a description of the handling of adding an engine: DUMMY.

As this storage-engine is derived from the MyISAM-engine we inherit a lot of code from this engine but some things must be implemented in an extended way. As not all of the columns of a table will be handled as columnar-storage we need a way to detect such a column and handle this column in our very special way: in the CREATE-statement add a comment with the word columnar_storage in it to the definition of this column (I will show this in an example only some lines later). And naturally we have to modify the create()-function to detect and handle this.
We must also modify the code for writing a row to the datafile: write_row() must be implemented to write the data into our additional files and handle the logic behind these files. In this engine I did not implement update_row() and delete_row().

For executing a SELECT-statement we must implement these functions:

rnd_init() for initializing all the structures needed
rnd_next() for scanning through the additional files
rnd_end() for clean-up

Because we want to use the data stored in our additional files we have to implement condition-pushdown and access the WHERE-condition of a SELECT-statement. For doing this I have to implement the function cond_push(). You may look at this post for a description of this: WHERE.
Analog to the engine MYCOLS1 introduced in the earlier post the code in this cond_push()-function is restricted for handling only one condition, not the full condition-tree. And it is restricted to a condition that is part of an AND-condition or a standalone condition. In this routine the information gathered from the WHERE-part is stored in a class ColumnarField.

As the WHERE-part is analyzed in cond_push() the real comparison is done in rnd_next() using the information from the class ColumnarField (if this class exists). For a representation of the columnar-data the classes ColumnKey and ColumnData are used. The class ColumnKey handles the accessing the key-file, the class ColumnData handles accessing the data-file.

testing

For a test I created 2 tables in the database Testcol using my own storage-engine:

MariaDB [(none)]> use TestCol;
Database changed
MariaDB [TestCol]> create table TestMyCols2Big (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS2
    -> ;
Query OK, 0 rows affected (0.05 sec)

MariaDB [TestCol]> create table TestMyCols2Small (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS2
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]>

Please look at the definition of the column Hersteller, the comment tells my storage-engine to store the data of this column in the extra files. Without this comment the table would be a table of the standard MyISAM-type.

And here you see how the contents of the folder TestCol looks after creating the tables (a bit modified):

august@AMD4:~/MariaDB/data/TestCol$ ls -l 
insgesamt 3399856
...........
drwxrw---- 2 august august       4096 Mai 15 10:25 TestMyCols2Big.cols
-rw-rw---- 1 august august        696 Mai 15 10:25 TestMyCols2Big.frm
-rw-rw---- 1 august august          0 Mai 15 10:25 TestMyCols2Big.MYD
-rw-rw---- 1 august august       1024 Mai 15 10:25 TestMyCols2Big.MYI
drwxrw---- 2 august august       4096 Mai 15 10:25 TestMyCols2Small.cols
-rw-rw---- 1 august august        696 Mai 15 10:25 TestMyCols2Small.frm
-rw-rw---- 1 august august          0 Mai 15 10:25 TestMyCols2Small.MYD
-rw-rw---- 1 august august       1024 Mai 15 10:25 TestMyCols2Small.MYI
august@AMD4:~/MariaDB/data/TestCol$ 

august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols2Big.cols
insgesamt 4
-rw-rw---- 1 august august 4 Mai 15 10:25 Hersteller.data
-rw-rw---- 1 august august 0 Mai 15 10:25 Hersteller.key
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols2Small.cols
insgesamt 4
-rw-rw---- 1 august august 4 Mai 15 10:25 Hersteller.data
-rw-rw---- 1 august august 0 Mai 15 10:25 Hersteller.key
august@AMD4:~/MariaDB/data/TestCol$

as you can see 2 additional folders are created: TestMyCols2Big.cols and TestMyCols2Small.cols. In these folders 2 additional files are created: Hersteller.key and Hersteller.data. The data-file contains the number of the next free block in this file, the key-file is empty.

Let's put the usual data into our tables:

MariaDB [TestCol]> insert into TestMyCols2Small  select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (32.72 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestMyCols2Big  select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (1 hour 35 min 19.16 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]>

Because we have some additional work to do these operations are a bit slower than before (this is true especially for the Big-file). And the usual optimizations are not built in into this engine. Accept this, we are looking for other operations.

Let's look how the data is stored in our additional files:

august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols2Big.cols
insgesamt 40768
-rw-rw---- 1 august august 41711620 Mai 16 13:35 Hersteller.data
-rw-rw---- 1 august august    25810 Mai 16 12:02 Hersteller.key
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols2Small.cols
insgesamt 3620
-rw-rw---- 1 august august 3676164 Mai 16 11:48 Hersteller.data
-rw-rw---- 1 august august   28140 Mai 16 11:48 Hersteller.key
august@AMD4:~/MariaDB/data/TestCol$

As you can see the files now contain data. Let's look at thee numbers:

MariaDB [TestCol]> select count(distinct Hersteller) from TestMyCols2Small;
+----------------------------+
| count(distinct Hersteller) |
+----------------------------+
|                       2814 |
+----------------------------+
1 row in set (0.51 sec)

MariaDB [TestCol]> select count(distinct Hersteller) from TestMyCols2Big;  
+----------------------------+
| count(distinct Hersteller) |
+----------------------------+
|                       2581 |
+----------------------------+
1 row in set (16.63 sec)

MariaDB [TestCol]>

As our structure for the key-file is 10 bytes long (valid in this case, this may differ with other tables or columns) the size of the key-files fit nicely to the numbers of distinct entries in this column.

A reminder: TestSmall and TestBig are tables of the (pure) MyISAM-type, they are here for comparison-reason and were created in the first post. TestMyCols2Big and TestMyCols2Small are created using the new storage-engine. The data in these tables are identical: TestBig and TestMyCols2Big contain identical data, the same is valid for TestSmall and TestMyCols2Small.

So let's do the tests:

MariaDB [TestCol]> select count(*) from TestSmall where hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.43 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (13.25 sec)

MariaDB [TestCol]> set optimizer_switch='engine_condition_pushdown=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestCol]> select count(*) from TestMyCols2Small where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.02 sec)

MariaDB [TestCol]> select count(*) from TestMyCols2Big where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (1.93 sec)

MariaDB [TestCol]>

Do not forget to set the internal variable (if you are using MariaDB as I did for my tests), otherwise the function cond_push() will not be called by the MariaDB-server.

As you can see it's faster. Our column-files are much smaller than the original MYD-file so scanning through the key-file should be faster. As this test is based on a comparison for equality it's easy to read through the key-file and immediately have access to the records via the entries in the data-file.

Let's look at the more complex query:

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 17.96 sec)

MariaDB [TestCol]> select count(*) from TestMyCols2Small A  join TestMyCols2Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 6.23 sec)

MariaDB [TestCol]>

Similar result as before in the first and second post: a bit faster but the improvement looks like a negligible value - again.

Let's do some tests at the boundaries (the value ABCDEF does not exist in our data):

MariaDB [TestCol]> select count(*) from TestBig where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (12.42 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller <> 'ABCDEF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (16.01 sec)

MariaDB [TestCol]> set optimizer_switch='engine_condition_pushdown=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestCol]> select count(*) from TestMyCols2Big where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select count(*) from TestMyCols2Big where hersteller <> 'ABCDEF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (25.54 sec)

MariaDB [TestCol]>

From 12 seconds down to 0.01 seconds: wow! That's damn fast.
From 16 seconds up to 26 seconds: pooh.

This can easily be explained: in the first case everything is handled in rnd_next(). Here only the key-file is scanned and scanning through a file of size 26K is pretty fast compared to the 1G in size of the MYD-file. As no match is found no record is read from the MYD-file and rnd_next() returns the constant HA_ERR_END_OF_FILE. In the second case everything is a match, so the corresponding record must be read from the MYD-file and given back to the server, and this read-operation is not optimized in our engine. And in the case of our algorithm it is a random read: the head of the disk must be moved to the correct position before reading a block containing the record. As the data in our data-file is not stored in a sequential way, all records with the same value in the column Hersteller will be read and after all these records are read the code will switch to the next entry in the key-file. This is not an efficient way for reading the whole MYD-file.

Something that I avoided in this tour until now was the use of an index. Let's create some indexes. And as the join is done via the column PZN let's create an index on this column in every table used in our tests:

MariaDB [TestCol]> create index pzn on TestSmall(pzn);
Query OK, 301036 rows affected (1.93 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestBig(pzn);
Query OK, 10000000 rows affected (1 min 45.02 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (19.35 sec)

MariaDB [TestCol]>

And now the same test for our engine:

MariaDB [TestCol]> create index pzn on TestMyCols2Small(pzn);
Query OK, 301036 rows affected (31.38 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestMyCols2Big(pzn);
Query OK, 10000000 rows affected (1 hour 35 min 44.39 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestMyCols2Small A  join TestMyCols2Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (7.37 sec)

MariaDB [TestCol]>

Looks good, a speed-up of a factor of approx. 2.5.

The usual way to improve the handling of a query is to create an index on the column where the filtering is done: Hersteller. OK let's do this:

MariaDB [TestCol]> create index Hersteller on TestSmall(Hersteller);
Query OK, 301036 rows affected (2.38 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index Hersteller on TestBig(Hersteller);
Query OK, 10000000 rows affected (2 min 16.06 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> show index from TestBig;
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | pzn        |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
| TestBig |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
2 rows in set (0.01 sec)

MariaDB [TestCol]> select count(*) from TestSmall where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select count(*) from TestBig where Hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (0.60 sec)

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (0.06 sec)

MariaDB [TestCol]>

OK, I will stop here, because I still can't beat this value.

an observation

During the creation of an index on the table TestMyCols2Big a lot of files were created in the folder TestCol. MariaDB (the same is valid for MySQL) created a temporary table with the name #sql-3c01_2 and therefore created these files:

            #sql-3c01_2.frm
            #sql-3c01_2.MYD
            #sql-3c01_2.MYI

And our engine created this folder #sql-3c01_2.cols and in this folder these files:

            Hersteller.data
            Hersteller.key

After this temporary table was filled with all the data the original files were destroyed and the temporary files renamed to the correct names. As the engine MYCOLS2 does not handle this the folder #sql-3c01_2.cols and in it all files were not destroyed. For a proof of concept how fast a SELECT-statement can be using this algorithm we didn't need this functionality so the files and folders weren't removed.

Saturday 24 May 2014

columns (3)

warning

algorithm

code

testing

an observation