Thursday 24 April 2014

columns (2)

This is my second post about this topic: storing data in a column-oriented form. You can find the first post here.

In the first post the way the data is stored was not changed; it was still a pure MyISAM-file. With the algorithm presented in this post this will change, but only a little bit: I will add one more file in which the data is stored in a column-oriented form.

warning

The goal is to show how to speed up SELECT-statements (and whether this is possible). The code presented here is not production-ready, it is (as written before) a proof of concept. It does not contain as many hardcoded details of the query as the code of the first shot did, but it still is not generally usable. It lacks error-checking, handling of all possible cases, handling of concurrent access (no locking) and so on. And it is not thoroughly tested.

Nevertheless I hope to show what we can expect if we follow this route.

algorithm

In addition to the datafile of a MyISAM-table I want to store the data of one (or more) column(s) in an additional file.

In MySQL/MariaDB the data is organized as a database (= a folder), and in this database as tables; each table is represented by its own files. This is valid for the MyISAM-engine; other engines can handle this differently. We will base our test-engine on MyISAM, so we will follow this route.
On my PC here the data is organized on the disk in this directory: ~/MariaDB/data/
and in this folder I have a database TestCol with tables TestBig and TestSmall. So one can find the table TestBig in ~/MariaDB/data/TestCol/TestBig.MYD.

For storing the additional data I will create a subdirectory in this folder, give it the name of the table and append '.cols'. In this folder I will create a file with the name of the column, appended by '.column', for storing the values of the column in a different form. Let's assume columnar storage of the values of the column Hersteller of the table TestMyCols1Big; then the location of the new file will be: ~/MariaDB/data/TestCol/TestMyCols1Big.cols/Hersteller.column. This gives us unique path-names for the files.
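To make the naming scheme concrete, here is a minimal sketch of how such a path could be assembled. The directory layout follows the example above; the function name and parameters are just assumptions for this illustration, not the engine's actual code.

#include <cstdio>

// Sketch: build the path of a column-file from data directory, database,
// table name and column name, following the naming scheme described above.
static void build_column_path(char *buf, size_t buflen,
                              const char *datadir, const char *db,
                              const char *table, const char *column)
{
  // e.g. ~/MariaDB/data/TestCol/TestMyCols1Big.cols/Hersteller.column
  snprintf(buf, buflen, "%s/%s/%s.cols/%s.column", datadir, db, table, column);
}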

In such a column-file the data will be stored in entries with this structure:
  • len of the value = 1 byte
  • the value in the internal representation = no. of bytes as described before
  • the file-offset of the record in the MYD-file = 4 bytes
There are some restrictions introduced here: this algo can only handle VARCHARs up to 255 bytes in length. And it can only handle datafiles up to 4GB in size, not bigger ones (MyISAM can handle bigger files). For our tests this should be sufficient.
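As a sketch of this layout, the following (hypothetical) helper packs one such entry into a buffer. The 4-byte offset is copied in machine byte-order here (little-endian on my PC, which matches the dump below); the real engine code may do this differently.

#include <cstring>
#include <cstdint>

// pack one entry of the column-file: 1 byte length, the value itself,
// 4 bytes with the position of the record in the MYD-file
static size_t pack_column_entry(unsigned char *dst,
                                const unsigned char *value, uint8_t len,
                                uint32_t myd_offset)
{
  dst[0] = len;                            // 1 byte: length of the value
  memcpy(dst + 1, value, len);             // the value in its internal representation
  memcpy(dst + 1 + len, &myd_offset, 4);   // 4 bytes: offset in the MYD-file
  return 1 + len + 4;                      // total size of this entry
}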

Let me demonstrate this approach by an example (the relevant part is the column Hersteller):
MariaDB [TestCol]> select * from TestMyCols1Big;
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L   | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML              | URETERSCHIENEAC4P66        | 36367      |
| 01000002 | 454652 | 0.00   | 0.00  | GLANDULA SUP SUIS INJ O    100 ST    | STIBIUM ARSENICOSUM  D10   | 10070      |
| 01000003 | 999999 | 7.93   | 11.10 | MEASURIT URIMETER 2K 200ML      1 ST | SPRING STYROPORKISS 30X40  | 26058      |
| 01000004 | 999999 | 0.00   | 23.58 | BORT KNOECHELST PO WE R S      1 ST  | NOVACOM BABY BETTN JUN HB  | 87240      |
| 01000005 | 999999 | 0.00   | 3.04  | MEZEREUM C 6     80 ST               | AZELAINSAEURE              | 36245      |
| 01000006 | 999999 | 0.00   | 0.00  | FUCUS VESICUL D 2    200 ST          | LUMB GURT M EINSA 09PE M   | 36260      |
| 01000007 | 999999 | 12.70  | 6.26  | CARBO ANIMALIS Q18     10 ML         | ZITRONE AETHERISCHES OEL   | 02484      |
| 01000008 | 999999 | 6.91   | 0.00  | GILOFA FEIN AM GR5 MARBEL      1 ST  | ORTEL C3 CERV 2391 G2 B    | 87240      |
| 01000009 | 999999 | 0.00   | 0.00  | SUPRIMA KRANK HEMD44/46 71      1 ST | CYPRIPEDIUM PUBESC D12     | 18017      |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
10 rows in set (0.00 sec)

If we store the values of the column Hersteller in the form as described the additional file will have this content:
august@AMD4:~/MariaDB/data/TestCol/TestMyCols1Big.cols$ od -t x1 -Ax Hersteller.column 
000000 05                         len of the entry following
       32 35 36 37 37             25677
       00 00 00 00                position of corresponding record in MYD-file

       05                         len
       33 36 33 36 37             36367
       64 00 00 00                position

       05                         len
       31 30 30 37 30             10070
       b4 00 00 00                position

       05 
       32 36 30 35 38             26058
       14 01 00 00 

       05 
       38 37 32 34 30             87240
       78 01 00 00 
       and so on
So every entry in this file is 10 bytes long; this depends on how the table is created (here: 1 byte for the len, 5 bytes for the CHAR(5)-value of Hersteller and 4 bytes for the offset).

Scanning through the column Hersteller can be done by reading the full file TestMyCols1Big.MYD (the usual way) or by scanning through the file Hersteller.column which is much smaller. If a match is found we take the value of the position and go to the MYD-file to read the full record and give it back to the database-server.
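The scan itself can be pictured with a few lines of code. This is only a simplified sketch of the idea, not the real rnd_next() of the engine; reading the complete row is only hinted at in a comment, as that part is handled by the inherited MyISAM code.

#include <cstdio>
#include <cstring>
#include <cstdint>

// scan the column-file entry by entry; return 0 on a match, 1 at EOF, -1 on error
int scan_column_file(FILE *colfile, const char *search, size_t search_len)
{
  unsigned char len;
  unsigned char value[255];
  uint32_t myd_offset;

  while (fread(&len, 1, 1, colfile) == 1)
  {
    if (fread(value, 1, len, colfile) != len) return -1;     // malformed file
    if (fread(&myd_offset, 4, 1, colfile) != 1) return -1;

    if (len == search_len && memcmp(value, search, len) == 0)
    {
      // match: here the engine would read the complete row from the
      // MYD-file at myd_offset and hand it back to the server
      return 0;
    }
    // no match: continue with the next entry, the MYD-file is not touched
  }
  return 1;                                                  // end of file, no further match
}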

code

So I created a storage-engine MYCOLS1. For some internal reasons it was created as a static engine, not as a loadable plugin. You will find the code here. You may also look at this post for a description of the handling of adding an engine: DUMMY.

As this storage-engine is derived from the MyISAM-engine we inherit a lot of code from this engine, but some things must be implemented in an extended way. First of all we need a way to detect a column that should be handled in our special way: in the CREATE-statement add a comment with the word columnar_storage in it to the definition of this column (I will show this in an example a few lines later). And naturally we have to modify the create()-function to detect and handle this.
We must also modify the code for writing a row to the datafile: write_row() must be implemented. The same is valid for update_row() and delete_row() (for delete_row() a constant is defined to mark a deleted entry in the column-file).
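To make the first step concrete: detecting the marker in create() can be as simple as looking for the keyword in the column comment. This is a minimal sketch, not the actual MYCOLS1 code; the comment is assumed to be a null-terminated string here.

#include <cstring>

// does the column comment ask for columnar storage?
static bool wants_columnar_storage(const char *comment)
{
  return comment != NULL && strstr(comment, "columnar_storage") != NULL;
}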

For executing a SELECT-statement we must implement our own functions
  • rnd_init() for initializing all the structures needed
  • rnd_next() for scanning through the column-file
  • rnd_end() for clean-up

For this engine I didn't want to put the condition of the query into the code (as I did for the first test), I wanted a more flexible solution. So I need some code for analyzing the WHERE-part of the SQL-statement. You may look at this post for a description of this: WHERE.
I implemented cond_push() for analyzing the WHERE-part. This code is restricted to handling only one condition, not the full WHERE-clause: either a single condition or one condition that is part of an AND-condition. In this routine the information gathered from the WHERE-part is stored in a class ColumnarField.
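To give an idea of what has to be extracted for a condition like Hersteller = '36440', here is a hedged sketch of such a holder; the actual ColumnarField class in the engine may store more (or different) members.

#include <string>

// which comparisons are handled in this sketch
enum ColumnarOp { COL_OP_EQ, COL_OP_NE };

struct ColumnarFieldSketch
{
  std::string column_name;   // e.g. "Hersteller"
  ColumnarOp  op;            // = or <>
  std::string value;         // the constant from the WHERE-part
};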

The WHERE-part is analyzed in cond_push(), but the real comparison is done in rnd_next() using the information from the ColumnarField object (if such an object exists). For the representation of the columnar data the class ColumnBuffer is used; it handles the data in a column-file.

testing

For a test I created 2 tables in the database TestCol using my own storage-engine:
MariaDB [TestCol]> create table TestMyCols1Big (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS1
    -> ;
Query OK, 0 rows affected (0.10 sec)

MariaDB [TestCol]> create table TestMyCols1Small (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS1
    -> ;
Query OK, 0 rows affected (0.36 sec)

MariaDB [TestCol]> 
Please look at the definition of the column Hersteller: the comment tells my storage-engine to store the data of this column in the extra file.

And here you can see what the contents of the folder TestCol look like after creating the tables (output slightly shortened):
august@AMD4:~/MariaDB/data/TestCol$ ls -l
insgesamt 1799756
-rw-rw---- 1 august august        65 Apr 27 10:04 db.opt
.........................
drwxrw---- 2 august august      4096 Mai  7 15:27 TestMyCols1Big.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Big.frm
-rw-rw---- 1 august august         0 Mai  7 15:34 TestMyCols1Big.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:34 TestMyCols1Big.MYI
drwxrw---- 2 august august      4096 Mai  7 15:34 TestMyCols1Small.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Small.frm
-rw-rw---- 1 august august         0 Mai  7 15:34 TestMyCols1Small.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:34 TestMyCols1Small.MYI
august@AMD4:~/MariaDB/data/TestCol$ 

august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Big.cols
insgesamt 0
-rw-rw---- 1 august august 0 Mai  7 15:34 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Small.cols
insgesamt 0
-rw-rw---- 1 august august 0 Mai  7 15:34 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ 
As you can see, 2 additional folders are created: TestMyCols1Big.cols and TestMyCols1Small.cols.
In each of these folders an additional file is created: Hersteller.column. In our tests some data will be written into these files.

Let's put the usual data into our tables:
MariaDB [TestCol]> insert into TestMyCols1Big  select * from TestOK.ABDAOK; 
Query OK, 10000000 rows affected (6 min 14.84 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestMyCols1Small  select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (11.57 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> 
Because we have some additional work to do these operations are a bit slower than before. And the usual optimizations are not built into this engine. Let's accept this; we are interested in other operations.

Again we look at the folder TestCol (=our database):
august@AMD4:~/MariaDB/data/TestCol$ ls -l
insgesamt 2699604
...................
drwxrw---- 2 august august      4096 Mai  7 15:27 TestMyCols1Big.cols
-rw-rw---- 1 august august       698 Mai  7 15:40 TestMyCols1Big.frm
-rw-rw---- 1 august august 895389376 Mai  7 15:48 TestMyCols1Big.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:48 TestMyCols1Big.MYI
drwxrw---- 2 august august      4096 Mai  7 15:34 TestMyCols1Small.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Small.frm
-rw-rw---- 1 august august  26049412 Mai  7 15:49 TestMyCols1Small.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:49 TestMyCols1Small.MYI
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Big.cols
insgesamt 97664
-rw-rw---- 1 august august 100000000 Mai  7 15:48 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Small.cols
insgesamt 2940
-rw-rw---- 1 august august 3010360 Mai  7 15:49 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ 
As you can see the column-files now contain data. As the Big-table contains 10 mio. records and our structure is 10 bytes long (1 for the len, 5 for the column, 4 for the offset) the whole file is 100MB, much smaller than the almost 900MB of the MYD-file.

A reminder: TestSmall and TestBig are tables of the (pure) MyISAM-type; they are here for comparison and were created in the first post. TestMyCols1Big and TestMyCols1Small are created using the new storage-engine.
So let's do the tests:
MariaDB [TestCol]> select count(*) from TestSmall where hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.43 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (13.25 sec)

MariaDB [TestCol]> set optimizer_switch='engine_condition_pushdown=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Small where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (1.53 sec)

MariaDB [TestCol]> 
Do not forget to set the optimizer-switch engine_condition_pushdown (if you are using MariaDB, as I did for my tests), otherwise the function cond_push() will not be called by the server.

As you can see it's faster. This is based on the fact that reading the column-file is about 10 times faster than reading the MYD-file because of the difference in file size.

Let's look at the more complex query:
MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 15.80 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Small A  join TestMyCols1Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 6.12 sec)

MariaDB [TestCol]> 
A similar result as in the first post: a bit faster, but again the improvement is negligible.

Let's do some tests at the boundaries (the value ABCDEF does not exist in our data):
MariaDB [TestCol]> select count(*) from TestBig where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (12.23 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller <> 'ABCDEF'
    -> ;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (16.39 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.68 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller <> 'ABCDEF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (25.26 sec)
From 12 seconds down to 0.7 seconds: wow!
From 16 seconds up to 25 seconds: pooh.

This can easily be explained: in the first case everything is handled in rnd_next(). As no match is found the corresponding record is not read and the search in the column-file continues until the end of this file (rnd_next() is called only once). In the second case everything is a match, so the corresponding record must be read from the MYD-file and given back to the server, and this read-operation is not optimized in our engine.

Something that I avoided in this tour until now was the use of an index. Let's create some indexes. And as the join is done via the column PZN let's create an index on this column in every table used in our tests:
MariaDB [TestCol]> create index pzn on TestSmall (pzn);
Query OK, 301036 rows affected (1.57 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestBig (pzn);
Query OK, 10000000 rows affected (1 min 19.46 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (19.67 sec)

MariaDB [TestCol]> 

And now the same test for our engine:
MariaDB [TestCol]> create index pzn on TestMyCols1Small (pzn);
Query OK, 301036 rows affected (10.96 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestMyCols1Big (pzn);
Query OK, 10000000 rows affected (6 min 33.83 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestMyCols1Small A  join TestMyCols1Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (8.89 sec)
Looks good, a speed-up of a factor of approx. 2.

The usual way to improve the handling of a query is to create an index on the column where the filtering is done: Hersteller. OK let's do this:
MariaDB [TestCol]> create index hersteller on TestSmall (hersteller);
Query OK, 301036 rows affected (2.40 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index hersteller on TestBig (hersteller);
Query OK, 10000000 rows affected (2 min 15.97 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestSmall where hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (0.58 sec)

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (0.06 sec)

MariaDB [TestCol]> 
OK, I will stop here, because I can't beat this value.

columns

In a database the data is usually stored in row-oriented form. But why? Can't we store it column-oriented?

Wikipedia has an article about that topic, you can find it here.

row oriented

Let me give you an example: think of the data organized in the form of a spreadsheet:
[picture: the example data organized as a spreadsheet]

In the case of a MyISAM-file (a file with the extension MYD) of type static this data would be stored like this:
 1074892 | 809032 | 11.31   | 5.04    | KOCHSALZ 0.9% ISOTON GL 250 ML       | KOCHSALZ 0.9% ISOTON GL    | 08650      | 1286800 | 252888 | 516.92  | 310.00  | FREKA BUTTON EINZELS 1.7CM 1X1 ST    | FREKA BUTTON EINZELS 1.7CM | 08650      | 1007388 | 44173  | 12.04   | 7.10    | EASY BAG SYSTEM SCHWERKRAF 1X1 ST    | EASY BAG SYSTEM SCHWERKRAF | 08650      |
and so on

Take this picture as a symbol: the data is stored in an internal format which usually is not human-readable. Different engines have different forms of storing the data, but in general the usual form looks like the data above (I've added the character "|" as a delimiter; it's not stored in the datafile).

On a hard-disk data is stored in the form of a stream of bytes; an organized form of storing the data must be added on top of this stream. So storing the data row by row is the easiest way. In the notation of the spreadsheet above the data is stored in the following order of cells:
B3  C3  D3  E3  F3  G3  H3  B4  C4  D4  E4  F4  G4  H4  B5  C5  and so on

I hope this picture shows this:
[picture: the order of the cells when the data is stored row by row]

Starting with row 1 (at cell B3) simply follow the direction of the arrow in line 3 of the sheet. At the end of row 1 (=at cell H3) continue with row 2 (in line 4 at cell B4 of the sheet) and so on.

column-oriented

How does it look when the data is stored column-wise? Look at this picture:
[picture: the order of the cells when the data is stored column by column]

Simply start with column B of the sheet (at cell B3 again) and follow the arrow. At the end of the data in column B (=at cell B7) switch over to column C, start with cell C3 and follow the arrow in this column down to cell C7, then switch to column D and so on. So the data would be stored in the datafile like this:
1074892 |  1286800 | 1007388 | 1074700 | 1074856 | 809032 | 252888 | 44173  | 41447  | 541598 | 11.31   | 516.92  | 12.04   | 79.76   | 104.86  | 5.04    | 310.00  | 7.10    | 43.64   | 
and so on.

In the cell-notation of the spreadsheet the data will be stored in this form:
B3  B4  B5  B6  B7  C3  C4  C5  C6  C7  D3  D4  D5  D6  D7  E3  E4  and so on

These pictures and descriptions should show the different approaches to storing the data in a datafile. But this is simplified.
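For fixed cell-widths the two orderings can also be expressed as simple offset calculations. This little illustration is my own, not taken from any engine code; it shows where cell (row, col) would end up in the datafile in both cases.

#include <cstddef>

// row-oriented: all cells of a row lie next to each other
size_t row_oriented_offset(size_t row, size_t col,
                           size_t num_cols, size_t cell_width)
{
  return (row * num_cols + col) * cell_width;    // B3 C3 D3 ... then the next row
}

// column-oriented: all cells of a column lie next to each other
size_t column_oriented_offset(size_t row, size_t col,
                              size_t num_rows, size_t cell_width)
{
  return (col * num_rows + row) * cell_width;    // B3 B4 B5 ... then the next column
}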

advantage

When the data is stored in column-oriented fashion the values in one column are stored close to each other. When filtering the data we do not need to read the full record to decide whether this record fulfills the search-criteria. Reading less data (=reading only one column) should be much faster.

disadvantage

Immediately you can see one problem: when a record is added to this table it can simply be written at the end of the datafile in the first case (row-oriented), but in the second case (column-oriented) this looks like a problem:
B3  B4  B5  B6  B7  B8  C3  C4  C5  C6  C7  C8  D3 D4  D5  D6  D7  D8  E3  E4  and so on
And it is a problem: look at the cell-names B8, C8, D8 and so on, which describe the new values. For these values we have to make room in the datafile to store the first column of this row (and we have to move the bytes again for storing the value of column 2 of this record, and again for column 3 and so on).

So we easily found a disadvantage, but are there any advantages to storing the data in columnar form? The answer is: it depends.
Let's look at some examples.

code

When we change the orientation of the data into a column-oriented format we do need to change a lot of things in the code of the database-server. So let's not start with the full picture but instead start with a trivial approach and continue improving this. I want to start with the existing format (=row-oriented) and simply modify the access by filtering the records on the level of the storage-engine. So I will stay with the row-oriented format for the first shot.

Sequentially reading a table is done by the member-function rnd_next() of any storage-engine. So I will modify this routine a bit.

The code shown here is for demonstration purpose only. It's a proof of concept, do not add it to any production version of MySQL/MariaDB.
It is based on the MINTEMPL-engine which is presented here. You will find the code of this engine here. The handling of adding an engine to MySQL/MariaDB is described here.

You have to modify the code of the engine MINTEMPL:
In the file ha_mintempl.h please add this line:
 int rnd_next(uchar *buf);

In the file ha_mintempl.cc please add these lines:
int ha_mintempl::rnd_next(uchar *buf)
{
  DBUG_ENTER("ha_mintempl::rnd_next");
  int retCode;

  if ( 0 == strncmp(table_share->table_name.str, "TestColsBig",11) )
  {
      // this is table TestColsBig
      Field **ptr2Col=table->field;           // get the pointers to the data of the record
      ptr2Col += 6;                           // now it points to the column Hersteller
      void *ptrData = (*ptr2Col)->ptr;
      while ( true )
      {
          retCode = ha_myisam::rnd_next(buf); // fetch the original record
          if ( retCode != 0 )                 // in case of EOF
              DBUG_RETURN( retCode );
          // check for: WHERE hersteller = '36440'
          if ( 0 == strncmp( (const char*)ptrData, "36440", 5) )
              DBUG_RETURN( 0 );               // found one record -> return it
          // not found: continue reading
      }
  } else if ( 0 == strncmp( table_share->table_name.str, "TestColsSmall",13) )
  {
      // this is the table TestColsSmall
      Field **ptr2Col=table->field;           // get the pointers to the data of the record
      ptr2Col += 6;                           // now it points to the column Hersteller
      void *ptrData = (*ptr2Col)->ptr;
      while ( true )
      {
          retCode = ha_myisam::rnd_next(buf); // fetch the original record
          if ( retCode != 0 )                 // in case of EOF
              DBUG_RETURN( retCode );
          // check for: WHERE hersteller = '08650'
          if ( 0 == strncmp(  (const char*)ptrData, "08650", 5) )
              DBUG_RETURN( 0 );               // found one record -> return it
          // not found: continue reading
      }
  }
  // any other table: read the record and return
  DBUG_RETURN( ha_myisam::rnd_next(buf) );
}

Now compile everything and we have another storage-engine which we can use in our tests.

filtering

Our tests will be done on tables without any indexes, so the tables contain only data. Indexes can speed up the operation dramatically, but I didn't use them here; I wanted to demonstrate the results of different approaches to the execution of a table-scan.

When executing a table-scan the db-server-software calls rnd_next() to fetch one row of a table, so this function is called once for every row in the table. The approach in the existing code of the MyISAM-engine is to read a record and give it back to the caller, where the filtering will be done. The idea here is to not act like this but instead filter the records in the storage-engine and give only those rows back to the caller (= the higher level of the db-server) that match the condition (= the WHERE-clause). As I will show you soon I have a table which contains approx. 10 mio. records but only 3% of these records match the given condition. In another table only 0.5% of the 300K records match the condition. So wouldn't it be helpful to return only those records back to the db-server? It would be a much smaller amount of data that the server-software has to handle.

So let's take this little code and see what happens.

preparations

As MINTEMPL is derived from MyISAM (to be precise: it is MyISAM) I created a database TestCol and in this database I created 2 tables of type MyISAM for comparison:
MariaDB [(none)]> create database TestCol;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestCol;
Database changed
MariaDB [TestCol]> create table TestBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> create table TestSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (1.48 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (49.73 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> 

Then I created 2 tables of type MINTEMPL to do my tests:
MariaDB [TestCol]> create table TestColsSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> create table TestColsBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> insert into TestColsSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (1.53 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestColsBig select * from TestOK.ABDAOK; 
Query OK, 10000000 rows affected (49.40 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> 

OK, now we have 2 tables of type MyISAM and 2 tables of type MINTEMPL, and for each storage engine we have a small table (with 300K records) and a big table (with 10 mio. records). The two small tables contain identical data, and so do the two big tables, so we can do the same tests on each and compare the results.

testing

Let's do a quick test:
MariaDB [TestCol]> select count(*) from TestSmall where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.40 sec)

MariaDB [TestCol]> select count(*) from TestColsSmall where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.24 sec)

MariaDB [TestCol]> select count(*) from TestBig where Hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (12.32 sec)

MariaDB [TestCol]> select count(*) from TestColsBig where Hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (7.41 sec)
As you can see I did a simple SELECT on each table: on the table TestSmall (=standard MyISAM, contains 300K records) and TestBig (=standard MyISAM, contains 10 mio. records) and also on the tables TestColsSmall (=of type MINTEMPL, approx. 300K records, our modified code handles the access) and TestColsBig (=of type MINTEMPL, 10 mio. records, our modified code handles the access). Comparing these values, it looks like a speed-up of about 35 to 40%.

As this is such a simple operation let's do something a bit more complex:
MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (4 min 7.80 sec)

MariaDB [TestCol]> select count(*) from TestColsSmall A  join TestColsBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (4 min 2.77 sec)

MariaDB [TestCol]> 
In this case I joined the 2 tables of type MyISAM together and got the first result. For comparison I joined the 2 tables of type MINTEMPL together. You can compare and review the time-values given.

conclusions

If you compare the values of SELECTing from a standard-table to the values of SELECTing from a MINTEMPL-table then you will see a noticeable speed-up in the case of a simple SELECT, but this type of operation is not of much interest. In the case of JOINing 2 tables the improvement is negligible.

So I would say: forget this approach (and the code). We have to think about another algorithm.

a remark

If you look at the code I presented some lines above you will see that this code is unusable in any environment except for this test. It contains a lot of hardcoded constants like the table-names (TestColsBig, TestColsSmall), the position of the field Hersteller (no. 6) and the values to check for ("36440", "08650").

The intention was to have a quick check whether this approach makes sense.

For your own tests you must modify these entries so that they fit to your data.

Tuesday 22 April 2014

aio

aio is an abbreviation; the full name is asynchronous input/output. The title of this post is a bit misleading because I will present 4 algorithms and only one of them works in an asynchronous way. Please excuse this; I don't want to change the title.

Some time ago I started implementing something like asynchronous I/O in MySQL/MariaDB to speed up the read-operation in the case of a full table scan. You will find a description of how MySQL/MariaDB handles a sequential read-operation here. At that time I did know what caching was, but I didn't know that it was used in the MyISAM-engine and how it was used there.

In theory my approach looked so simple and would naturally be faster: in rnd_next() you return one row from the table and in the background you start reading the next row. A little bit of synchronization is needed, OK, but this can be done. The result was, as I described it, not what I expected.

So I started looking into the code of MariaDB/MySQL again, a bit deeper now. What I found led to the post about caches.

After this step I decided to look again at asynchronous I/O. For further examination I wanted a stripped down version of the program to verify my observations. I wanted an easy and simple program, one without all the overhead MySQL/MariaDB needs. So I created a program tst-disk-access which should only do some writing to and reading from a file using a variety of different routines. And I want to describe this program here.

Linux only

This program is Linux-only, and the same is valid for the test-results. I did not create or test any version of this program under Windows. And I have no intention to do so.

If you want to port this program to Windows or write a similar program for Windows this seems to be a good start: Memory-Mapped Files and Asynchronous File I/O.

algorithms

I used 4 different approaches for my tests:
  1. the standard approach
  2. using a cache of 128K
  3. reading or writing in the background
  4. mapping the file into memory
I will describe these algos in more detail. For reading and writing I will use the lower-level functions of the C language, because these are used in MariaDB/MySQL.

103

I assume a record of length 103 bytes with dummy contents and write this record multiple times to a file. When reading, I read one record after the other. The database-server does the same in the case of an INSERT or a SELECT on a MyISAM-table of type static with a record width of 102 bytes.

In one of my older examples I created a table with a record length of 102 bytes (add 1 byte, used internally for a flag). You will find the example here. And so I stayed with this example for this test.

the algorithms

Let me now describe the algorithms I used.

Algorithms 2, 3 and 4 use a buffer of size 128K, because a buffer of this size is used in the MyISAM-engine too. And as I've seen in my earlier tests this size is a good value (please look here).

algo 1: standard

This looks like a quick shot at the job of writing to or reading from a file: use write() or read(). So I write a record-buffer to the file multiple times by calling write() again and again. For reading I fill the record-buffer by a call to read() until the file pointer is at the end of the file.
Nothing spectacular here. read() and write() are synchronous functions; they return to the caller when the job is done.

You will find a more detailed description of the structures and functions used in the code here.
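To illustrate the idea, here is a stripped-down sketch of such a write- and read-loop. The record length, the dummy contents and the missing error-checking are simplifications for this illustration; the real write_normal.c/read_normal.c may differ.

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

enum { RECORD_LEN = 103 };

// algo 1, writing: one write() call per record
void write_standard(const char *fname, long num_records)
{
  char record[RECORD_LEN];
  memset(record, 'A', RECORD_LEN);                       // dummy contents
  int fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  for (long i = 0; i < num_records; i++)
    write(fd, record, RECORD_LEN);                       // one call per record
  close(fd);
}

// algo 1, reading: one read() call per record
void read_standard(const char *fname)
{
  char record[RECORD_LEN];
  int fd = open(fname, O_RDONLY);
  while (read(fd, record, RECORD_LEN) == RECORD_LEN)
    ;                                                    // process the record here
  close(fd);
}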

algo 2: cached

This is the approach used in MariaDB/MySQL in the MyISAM-engine: use a cache of size 128K. When data is written to the file this data will only be copied to this cache. When the cache is full then the cache will be written to the file and then reused for further copying. At the end of the writing-process the cache may be partially filled, and this content of the cache will be written to the file.
When reading from the file a similar procedure is used: first the cache is filled by reading 128K from the file, and the data is given back to the caller from the cache in chunks of 103 bytes. All further requests will be served from the cache until the cache is empty or unable to satisfy the request. Then the next 128K-block of data is read from the file into the cache.

This strategy reduces the number of calls to write() and read() drastically. But it is still a synchronous operation.

This description is simplified; the database-server does not always use such a cache. There are situations where the server uses the standard approach (e.g. in the case of tables of type dynamic).
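A stripped-down sketch of the cached write-path may make this clearer; the read-path works analogously with a 128K read into the cache. Again the record length, the buffer handling and the missing error-checks are simplifications, not the actual code of write_db.c.

#include <fcntl.h>
#include <unistd.h>
#include <cstring>

enum { RECORD_LEN = 103, CACHE_SIZE = 128 * 1024 };

// algo 2: copy records into a 128K cache, call write() only when it is full
void write_cached(const char *fname, long num_records)
{
  char record[RECORD_LEN];
  memset(record, 'A', RECORD_LEN);
  char cache[CACHE_SIZE];
  size_t used = 0;

  int fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  for (long i = 0; i < num_records; i++)
  {
    if (used + RECORD_LEN > CACHE_SIZE)        // cache full: flush it to disk
    {
      write(fd, cache, used);
      used = 0;
    }
    memcpy(cache + used, record, RECORD_LEN);  // only a copy into the cache
    used += RECORD_LEN;
  }
  if (used > 0)                                // the partially filled rest
    write(fd, cache, used);
  close(fd);
}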

algo 3: background

I will also use a cache of the same size in this algo, but the read- and write-operations will be done in the background. And it will be done without calling write() or read().

In the case of writing to the file the program will fill the cache. When the cache is full the program gives the OS the order to write the cache to the file. As this write-operation is done in the background the program can continue to fill the cache with the next records (=overlapping operations). When the cache is full again some synchronization is needed, because the program has to wait until the former write-operation has finished before the next write-job can be given to the OS.

When reading from the file the program tells the OS to read from the file and waits until this job is finished. Then it gives the next read-request to the OS for reading the next block of data. The OS can handle this request in the background while the program is busy copying the data from the cache to the local buffer in the foreground. When the cache is emptied we have to wait until the OS tells us that the next block of data is ready. Then we copy the data from the OS-buffer to our cache and start the next read-request.

You will find a more detailed description of the structures and functions used in the code here.
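Here is a sketch of the overlapping write-path with POSIX aio. It is only an outline of the idea, not the program's actual code: error-checking is omitted and two buffers are used so that one can be filled while the other is being written in the background.

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>

enum { RECORD_LEN = 103, CACHE_SIZE = 128 * 1024 };

void write_background(const char *fname, long num_records)
{
  static char buffers[2][CACHE_SIZE];
  char record[RECORD_LEN];
  memset(record, 'A', RECORD_LEN);

  int fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);

  struct aiocb cb;
  memset(&cb, 0, sizeof(cb));
  cb.aio_fildes = fd;
  const struct aiocb *wait_list[1] = { &cb };

  int    current  = 0;       // the buffer we are filling right now
  size_t used     = 0;       // bytes filled in the current buffer
  off_t  file_pos = 0;       // where the next buffer will be written
  bool   pending  = false;   // is a background write still running?

  for (long i = 0; i < num_records; i++)
  {
    if (used + RECORD_LEN > CACHE_SIZE)
    {
      if (pending)                                   // wait for the previous write
      {
        while (aio_error(&cb) == EINPROGRESS)
          aio_suspend(wait_list, 1, NULL);
        aio_return(&cb);
      }
      cb.aio_buf    = buffers[current];              // hand this buffer to the OS
      cb.aio_nbytes = used;
      cb.aio_offset = file_pos;
      aio_write(&cb);                                // returns immediately
      pending   = true;
      file_pos += used;
      current   = 1 - current;                       // fill the other buffer meanwhile
      used      = 0;
    }
    memcpy(buffers[current] + used, record, RECORD_LEN);
    used += RECORD_LEN;
  }

  if (pending)                                       // wait for the last background write
  {
    while (aio_error(&cb) == EINPROGRESS)
      aio_suspend(wait_list, 1, NULL);
    aio_return(&cb);
  }
  pwrite(fd, buffers[current], used, file_pos);      // flush the partially filled rest
  close(fd);
}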

algo 4: mapping

Again the program uses a cache of size 128K. In this approach everything is handled by a memory-region that is mapped to a part of size 128K of the file. Writing to the file means again filling the cache. When the cache is full then the cache is not 'written' to the file by calling some write-routine but instead we tell the OS that we need another part of the file in memory. The OS detects that the current memory-region has been changed and writes this memory-region to the file and gives us back the requested memory-region.

Reading from the file means that we request from the OS a 128K-part of the file in a memory-region from which we copy our data to our local buffer.

Using memory mapped files is a synchronous operation.

Please keep in mind that we can only map an existing file of the appropriate size into memory; e.g. extending the file by writing behind the current EOF-position resulted in a crash of the program. This is the reason why in the code the file is sized to the calculated length right after opening it for writing.

You will find a more detailed description of the structures and functions used in the code here.
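A sketch of the memory-mapped write-path could look like this: the file is sized first, as just described, then 128K windows of it are mapped one after the other and the records are copied into the mapping (a record may be split across two windows). This is only an outline under these assumptions, not the actual code of write_mmap.c.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

enum { RECORD_LEN = 103, CACHE_SIZE = 128 * 1024 };

void write_mapped(const char *fname, long num_records)
{
  char record[RECORD_LEN];
  memset(record, 'A', RECORD_LEN);

  off_t total = (off_t)num_records * RECORD_LEN;
  int fd = open(fname, O_RDWR | O_CREAT | O_TRUNC, 0644);
  ftruncate(fd, total);                          // the mapping must lie inside the file

  off_t  file_pos = 0;
  size_t rec_off  = 0;                           // position inside the current record
  while (file_pos < total)
  {
    size_t window = (size_t)((total - file_pos) < (off_t)CACHE_SIZE
                             ? (total - file_pos) : (off_t)CACHE_SIZE);
    char *map = (char *)mmap(NULL, window, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, file_pos);
    size_t used = 0;
    while (used < window)                        // fill the window, records may be split
    {
      size_t chunk = RECORD_LEN - rec_off;
      if (chunk > window - used)
        chunk = window - used;
      memcpy(map + used, record + rec_off, chunk);
      used   += chunk;
      rec_off = (rec_off + chunk) % RECORD_LEN;
    }
    munmap(map, window);                         // the OS writes the region back to the file
    file_pos += window;
  }
  close(fd);
}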

asynchronous

In the case of the 3rd algo the function-calls for reading or writing only transfer the order to the OS; they return immediately. This differs from the read()- and write()-calls used in the other algorithms, as these do not return until the data is read from or written to the file (whatever the OS does in this case; the OS can fool you, that's why we have functions for flushing).

With this program I want to compare the 3rd algo against algo 2 (I added algos 1 and 4 only out of curiosity).

OS

The first and the second approach rely on the operating system for speeding up the access to the file. The third and fourth approaches bypass the OS, so the OS can only execute the requests and cannot help too much (this depends on the implementation; maybe the OS can help a bit, see the function madvise()).

code

You will find the code and some text-files with results (raw data) here.

The code is split into several files:
  • a global header-file: tst-disk-access.h
  • entry point, some validating and jumping to the algo selected: tst-disk-access.c
  • the standard-approach (algo 1): write_normal.c and read_normal.c
  • analogous to MyISAM (algo 2): write_db.c and read_db.c
  • asynchronous (algo 3): write_aio.c and read_aio.c
  • memory mapped (algo 4): write_mmap.c and read_mmap.c

correctness

Is this code correct? I hope it is; I tested it. If there is something wrong with this program please drop me a line.

For checking the write-routines I used each algo and created 4 outfiles by using 4 different names and compared these files using the cmp-command. These files were identical.

The read-functions are tested by adding up all bytes read and comparing the final values for all algos (they were identical).
You can enable this by uncommenting a line in the header-file and recompiling the sources. The time-consumption of each algo was measured with this feature disabled.

parameters

The program can write to a file and read from a file. For the write-operation these parameters are valid:
  • select the algorithm: 11, 12, 13 or 14 is valid.
  • number of lines: tell the program how many times the record has to be written to the file
  • optionally the name of the file. If no name is given the name defined in the header-file will be used (you may modify this name)
For the read-operations these parameters are valid:
  • select the algorithm: 1, 2, 3 or 4 is valid
  • optionally the name of the file (as described above)
If you simply start the program a text describing the parameters will be printed.

measuring time

The time-consumption of the program was measured by using the time-command of Linux. Let me give you an example:
august@AMD4:~/workspace/tst-disk-access/Debug$ time ./tst-disk-access 11 10000000
writing to file </home/august/Temp/ABDAOK.MYD> synchronized

<./tst-disk-access>: no of lines written: 10000000

real 0m29.832s
user 0m0.636s
sys 0m23.274s
august@AMD4:~/workspace/tst-disk-access/Debug$ 
This example calls the standard-algorithm, and this routine has written 10 million records (=lines) to the file.

Of the timing-values the one of interest is real, because we want to test how fast the program finishes and returns to do other work. The other values show where most of the work is done, but this is not of interest here. These values would be of interest if you wanted to know how much work you can put on a machine using the different algorithms.

You will find a script test.sh in the Debug-directory included with the source-code. As you can see this shell-script uses the GNU-version of time because this allowed for better logging of the results. And you will also see that this shell-script removes the datafile from the disk before calling the program in the case of a write because not doing so added up to 5 seconds to the time needed.

results

The program was tested on my machine here, but this machine is equipped with 32GB of RAM, which allows the OS to cache a lot of data. So I tested the program again with only 8GB of RAM in the same machine. You can find the raw data of the tests in the files time32.txt and time8.txt in the Debug-folder.

The tests started with 10 million records (=filesize of approx. 1GB) and ended with 1,280 million records (=filesize of approx. 128GB) by simply doubling the number of records.

The results ranged from 0.38 seconds up to 40 minutes, and it's hard to present them in graphical form. As I was mostly interested not in the absolute numbers but in comparing the different algos, I decided to set the (absolute) result of algo 1 to 100% and present the relative values for the other algorithms. As the filesize doubles with every test you can expect a doubling of the time needed for the job.

with 32GB of RAM

Enough words, let's look at the numbers for the PC equipped with 32GB of RAM:

with 8GB of RAM

And here are the numbers for the machine equipped with 8GB of RAM:

looking at the results

Writing to a file seems to be much faster than with algo 1; the results here are always below 100%. Reading from a file comes close to algo 1 if the file has a size that no longer fits into memory. So with a filesize of 32GB and up the OS can't help anymore and the numbers are very similar. If the machine is equipped with 8GB of RAM this effect takes place with a filesize of 8GB (and up).

Please look at these pictures:
algo 1:
algo 2:
algo 3:
algo 4:
As you can see algo 1 utilizes one CPU fully because of the millions of read()- or write()-calls.

algo 1

This algorithm is really slow. You cannot use it for a DB-application. You can use it for writing/reading if the data is less than 1GB (or so) in size. But it's easy to code.

algo 2

This looks like the best algorithm; it's the algo that is used in the MyISAM-engine of MariaDB/MySQL. It's a bit more work for coding and testing, but the results are convincing. If the filesize exceeds the available RAM the time-values get worse. You could expect the time to double if the filesize doubles, but at some point in the test this value jumped up because of this effect. In the case of reading one can see this clearly at a filesize of 32GB and up for the PC with 32GB of RAM (and a filesize of 8GB and up for the PC with 8GB of RAM).

algo 3

In the case of the asynchronous write/read-operations I expected a faster execution of the program because of the overlapping work, but the numbers show that this is not the case. I assume that the reason for this behavior is the bypassing of the OS, which applies some tricks that do not help this algorithm but do help algo 2.

algo 4

I added this only out of curiosity. The values look close to those of algo 2.

conclusion

The best algorithm seems to be algo 2, the version used in the code of the MyISAM-engine. Very close to this is algo 4 (memory mapped), and algo 3 (asynchronous) is disappointing. Forget algo 1.

some thoughts

In real world multiple users are requesting data from or sending data to the database. Many reads and writes occur in (quasi) parallel. The tests here do not mirror this situation. So take the results with a grain of salt.

SSD

What about using an SSD?

My machine here is equipped with an SSD, a Samsung 840 PRO with a capacity of 120GB, and also has a harddisk, a Toshiba DT01ACA200 with 2TB capacity. I did all the tests above using the harddisk, but I decided to test my program using the SSD too. You can compare the results with the harddisk-tests:
writing/reading a 64GB-file using algo 2
PC with 32GB RAM
              SSD            harddisk          harddisk/SSD
writing:   182.70 sec.      412.63 sec.         225.85%
reading:   126.53 sec.      499.47 sec.         394.74%

PC with 8GB RAM
              SSD            harddisk          harddisk/SSD
writing:   173.90 sec.      425.86 sec.         244.89%
reading:   127.16 sec.      499.47 sec.         392.79%

For speeding up the DB-operation you should buy an SSD.

But please wait. Before you rush out and spend your money on a dozen SSDs, please look carefully at your database-design and the SQL-statements. Let me tell you a story: in a project some time ago we heard complaints from the end-users about an application that took about 50 minutes before the first line of the result appeared. 6 months later management made a decision: these complaints must disappear. So someone in the group got the job to look at this phenomenon. The application and the database were ported to our test-environment, and in this environment the application was started. Here it took about 550 seconds until the first line of the result appeared. Digging into the Java-code he found the SQL-statement for this job, which was extracted and tested again using Squirrel. After some playing with this statement and looking into the explain-data it was clear that all information was available but the database-server chose the wrong path for executing the request. So the statement was modified to convince the server of a better path to the result. With the modified statement the server presented the first line of the result within 10 seconds. The end-users were happy.

You can't achieve a speed-up of a factor of 50 by replacing the disks with SSDs.

Monday 7 April 2014

caches (2)

This is my second post about caching in the MyISAM-engine of MySQL/MariaDB. You will find the first post here.

This post describes some of the observations I made while studying the source-code. The earlier post describes the effect of caching when reading from the hard-disk; this article describes this effect when writing to the disk.

standard-result

Let's start directly with a result. For all tests I've used the tables TestStat and TestDyn as described in the earlier post. Here are the results with a server out of the box:
MariaDB [TestCache]> use TestCache;
Database changed

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (48.63 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.18 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (47.76 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (49.14 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.18 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (47.15 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.20 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.75 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.20 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.87 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.29 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> quit
Bye
august@AMD4:~/MariaDB/pgm$ 
As you can see, again writing to a static (=fixed length) table is faster than writing to a dynamic (=varying length) table because of the more complex structure of a dynamic table.

Please keep in mind that this INSERT-statement needs a read-operation before the write-operation; this adds to the time needed for the whole statement, so the effect of caching happens twice. For this reason the numbers for reading and writing can't be compared.

This is how the tables look on the OS-side after inserting the records:
august@AMD4:~/MariaDB/data/TestCache$ ls -l
insgesamt 1880292
-rw-rw---- 1 august august         65 Mär 18 15:47 db.opt
-rw-rw---- 1 august august        685 Mär 18 15:50 TestDyn.frm
-rw-rw---- 1 august august  895389376 Apr  1 16:01 TestDyn.MYD
-rw-rw---- 1 august august       1024 Apr  1 16:04 TestDyn.MYI
-rw-rw---- 1 august august        679 Mär 18 15:50 TestStat.frm
-rw-rw---- 1 august august 1030000000 Apr  1 16:03 TestStat.MYD
-rw-rw---- 1 august august       1024 Apr  1 16:04 TestStat.MYI
august@AMD4:~/MariaDB/data/TestCache$ 

my goal

I want to show you how the storage-engine carries out the write-operation and particularly how the engine uses caches for the write-operation.

into the code

Let's step into the code and look how the server does the write-operation:
  • xxxx::start_bulk_insert() is called at the beginning of the process. It initializes everything needed
  • xxxx::write_row() writes one row to the buffer or the disk. This function is called repeatedly until all records are written to disk
  • xxxx::end_bulk_insert() is called at the end of the write-operation. Here you can clean things up
This is valid for all engines: every storage engine must implement these functions, or the function of the base class will be executed. So replace xxxx with the class name and look into the corresponding file for the details. You will find more on this here.

The description above is valid in the case of writing multiple records to disk. When you insert a single record things look easier as you can see here.

This is the function hierarchy using MyISAM in the case of fixed-length records.
init:
ha_myisam.cc  :  ha_myisam::start_bulk_insert()
    mi_extra.c  :  mi_extra()
        mf_iocache.c  :  int init_io_cache()            inits a buffer (=cache) of size 128K
writing rows:
ha_myisam.cc  :  ha_myisam::write_row()
    mi_write.c  :  mi_write()
        mi_statrec.c  :  _mi_write_static_record()
               my_sys.h  :  my_b_write()
                    mf_iocache.c  :  my_b_flush_io_cache()
                        mysql_file.h  :  mysql_file_write()
                            mysql_file.h  :  inline_mysql_file_write()
                                my_write.c  :  my_write()    writes buffer (approx. 128K) to disk
the final step:
ha_myisam.cc  :  ha_myisam::end_bulk_insert()
    mi_extra.c  :  mi_extra()
        mf_iocache.c  :  end_io_cache()
            mf_iocache.c  :  my_b_flush_io_cache()
                mysql_file.h  :  mysql_file_write()
                    mysql_file.h  :  inline_mysql_file_write()
                        my_write.c  :  my_write()            writes buffer (max. 128K) to disk

This is the function hierarchy using MyISAM in the case of variable-length records:
init:
ha_myisam.cc  :  ha_myisam::start_bulk_insert()
        mi_extra.c  : mi_extra()
            mf_iocache.c  :  int init_io_cache()    inits a buffer (=cache) of size 128K
writing rows:
ha_myisam.cc  :  ha_myisam::write_row()
        mi_write.c  :  mi_write()
            mi_dynrec.c  :  _mi_write_dynamic_record()
                mi_dynrec.c  :  write_dynamic_record()
                    mi_dynrec.c  :  _mi_write_part_record()
                        my_sys.h  :  my_b_write()
                            mf_iocache.c  :  my_b_flush_io_cache()
                                mysql_file.h  :  mysql_file_write()
                                    mysql_file.h  :  inline_mysql_file_write()
                                        my_write.c  :  my_write()    writes buffer (approx. 128K) to disk
the clean-up:
ha_myisam.cc  :  ha_myisam::end_bulk_insert()
        mi_extra.c  :  mi_extra()
            mf_iocache.c  :  end_io_cache()
                mf_iocache.c  :  my_b_flush_io_cache()
                    mysql_file.h  :  mysql_file_write()
                        mysql_file.h  :  inline_mysql_file_write()
                            my_write.c  :  my_write()        calls write() of C std-lib which writes the current contents of the buffer to disk

modifications

As in the case of the read-caches I want to know how this buffer is used and especially what happens when I change the size of the buffer. So let's again modify the code a bit. I will present the places where I added or changed code, but only those modifications that are needed for our monitoring of write-caching.

I've already described the reason and some aspects of these modifications here so please look in my former post.

First: when the cache is unable to satisfy the request the data must be written to the datafile on disk, so let's count how many times the function write() of the C-library is called.
Modify the file storage/myisam/ha_myisam.cc near the top of this file:
extern int counterReadStaticRecord;     // from the former test
extern int counterReadDynamicRecord;
extern int counterWriteStaticRecord;
extern int counterWriteDynamicRecord;
In the same file please look for the function start_bulk_insert() and modify it like this:
void ha_myisam::start_bulk_insert(ha_rows rows, uint flags)
{
  DBUG_ENTER("ha_myisam::start_bulk_insert");
  counterWriteStaticRecord = counterWriteDynamicRecord = 0;

  THD *thd= current_thd;
and next look for end_bulk_insert() in the same file:
int ha_myisam::end_bulk_insert()
{
  DBUG_ENTER("ha_myisam::end_bulk_insert");
  mi_end_bulk_insert(file);
  int err=mi_extra(file, HA_EXTRA_NO_CACHE, 0);
  fprintf(stderr, "no. of writes:\tstatic=%d\tdynamic=%d\n", counterWriteStaticRecord, counterWriteDynamicRecord);
  if (!err && !file->s->deleting)

modify the file storage/myisam/mi_statrec.c near the function _mi_write_static_record():
//##########################################################
// the original code:
//#define my_b_write(info,Buffer,Count)
// ((info)->write_pos + (Count) <=(info)->write_end ?
//  (memcpy((info)->write_pos, (Buffer), (size_t)(Count)),
//   ((info)->write_pos+=(Count)),0) :
//   (*(info)->write_function)((info),(uchar *)(Buffer),(Count)))
int counterWriteStaticRecord;
static inline int AQ_my_b_write(IO_CACHE *info, uchar *Buffer, ulong Count)
{
    if ( (info)->write_pos + (Count) <=(info)->write_end )
    {
     memcpy((info)->write_pos, (Buffer), (size_t)(Count));
     ((info)->write_pos+=(Count));
     return 0;
    }
    counterWriteStaticRecord++;
    return  (*(info)->write_function)((info),(uchar *)(Buffer),(Count));
}
In the function _mi_write_static_record() I changed some lines:
    if (info->opt_flag & WRITE_CACHE_USED)
    {    /* Cash in use */
//      if (my_b_write(&info->rec_cache, record,
//       info->s->base.reclength))
//
       if (AQ_my_b_write(&info->rec_cache, record, info->s->base.reclength))
        goto err;

OK, for static tables we are done, so let's go to dynamic tables: in the file storage/myisam/mi_dynrec.c near the function _mi_write_part_record():
 /* Write a block to datafile */
/*
#define my_b_write(info,Buffer,Count) \
 ((info)->write_pos + (Count) <=(info)->write_end ?\
  (memcpy((info)->write_pos, (Buffer), (size_t)(Count)),\
   ((info)->write_pos+=(Count)),0) : \
   (*(info)->write_function)((info),(uchar *)(Buffer),(Count)))
*/
int counterWriteDynamicRecord;
static inline int AQ_my_b_write(IO_CACHE *info, uchar *Buffer, ulong Count)
{
    if ( (info)->write_pos + (Count) <=(info)->write_end )
    {
     memcpy((info)->write_pos, (Buffer), (size_t)(Count));
     ((info)->write_pos+=(Count));
     return 0;
    }
    counterWriteDynamicRecord++;
 return  (*(info)->write_function)((info),(uchar *)(Buffer),(Count));
}

int _mi_write_part_record(MI_INFO *info,
and in the function _mi_write_part_record():
//    else if (my_b_write(&info->rec_cache,(uchar*) *record-head_length,
//   length+extra_length+del_length))
    else if (AQ_my_b_write(&info->rec_cache,(uchar*) *record-head_length, length+extra_length+del_length))
      goto err;
You will find the reason for the usage of AQ_my_b_write() in my former post.

more modifications

Now we are able to count the number of writes, but we have to add some more modifications to change the size of the buffer.
In the file sql/sys_vars.cc:
static Sys_var_ulong Sys_read_buff_size(   
       "read_buffer_size",
       "Each thread that does a sequential scan allocates a buffer of "
       "this size for each table it scans. If you do many sequential scans, "
       "you may want to increase this value",
       SESSION_VAR(read_buff_size), CMD_LINE(REQUIRED_ARG),
       VALID_RANGE(IO_SIZE*2, INT_MAX32), DEFAULT(128*1024),
       BLOCK_SIZE(IO_SIZE));
As you can see the default value is 128*1024 = 128K (I marked the entry in bold). For the tests I will change this value, recompile and do my tests again.
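
As an illustration of such a change (the value 1M here is only an example, not the value of any particular test run), the modified line would look like this:

       VALID_RANGE(IO_SIZE*2, INT_MAX32), DEFAULT(1024*1024),   // was: DEFAULT(128*1024)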

And finally in include/my_global.h:
// #define IO_SIZE  4096
#define IO_SIZE   64

Now recompile everything (the last modification leads to a recompilation of almost everything) and do the test.

results

Each test is done 3 times; it starts with truncating the target-table so we start with a table of size 0. The values presented here are the average values of the 3 tests. The tests started with a cache of size 128 bytes and ended with a size of 1GB. The SQL-statements used are the same statements as those used for the standard-result above.

conclusion

The default value of these buffers is 128K bytes. If you look at the picture above you will see again that this is a good compromise. And keep in mind that RAM is a finite resource and you're not alone on this server. Each INSERT-statement of the test-examples uses 2 buffers, one for reading the data from the source table and one for writing the data to the target table, so you need double the amount of RAM for such a simple statement. And think of situations where 1000 users are active on your server with multiple and more complex statements: then you will need a lot of RAM, and the graph shows that you will not gain much performance.

In these days of modern economies, and especially the accounting practices of bigger companies, you have to pay for the RAM you assign to your server, even if this billing is only done internally. This can add up to significant numbers.

I think the value of 128K for the size of the cache is a good compromise.

If you are interested in more details about my tests you may look into this text-file.


some numbers

Let's first check the number of writes as reported by the code we added to the original source.
The values are:
number of writes with different sizes of buffer:
size of            number of writes
 buffer           static      dynamic        
   128b          3906250      3293831
   256b          3906250      3293831
   512b          2000000      1679340
     1K          1000000       855184
     2K           500000       432774
     4K           250000       217485
     8K           125000       109022
    16K            62822        54580
    32K            31411        27307
    64K            15705        13658
   128K             7856         6830
   256K             3928         3415
   512K             1964         1707
     1M              982          853
     2M              491          426
     4M              245          213
     8M              122          106
    16M               61           53
    32M               30           26
    64M               15           13
   128M                7            6
   256M                3            3
   512M                1            1
    1G                 0            0

In the last line you see why my testing stopped here without further increasing the size of the buffer: with a buffer-size of 1GB the whole file fits into the buffer.

The writes are counted when the buffer is full and has to be written to disk. These numbers do not include the final write()-call when the remaining contents of the buffer are written to disk in the call to end_bulk_insert(). This explains the value of 0 writes in the last line.

In the first lines you will also see identical numbers for the cache-sizes of 128 and 256 bytes. The reason for this is that a minimal buffer of 256 bytes is used even when the variable is set to a smaller value.

If you multiply the number of writes by the size of the buffer you will always get a value that is smaller than the size of the corresponding file. The difference is less than the size of the buffer used in the test; this remainder fits in the buffer and is written to disk in the call to end_bulk_insert().
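
A quick check with the numbers from the table above makes this concrete; take the static table with the default buffer of 128K:

  7856 writes * 131072 bytes = 1,029,701,632 bytes

which is a little less than the size of the static datafile; the remaining tail (less than 128K) stays in the buffer and goes to disk in the final flush of end_bulk_insert(). The rows for 256K (3928 writes) and 512K (1964 writes) give exactly the same product.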

And you will also see that doubling the size of the cache (approximately) halves the number of writes.

a hint

When the write-buffer is accessed for the first time it is created with a size of 128K by default. But in this first round the buffer is not fully used; the server uses a little bit less (and uses the buffer in full in all following rounds). I will try to explain this behaviour with an example.

In one test my file (=table) had initially a size of 133900 bytes. When I insert data into this table it is appended at the end of this file (no free space in this table). Now a buffer of 128K is allocated and the numbers are as follows:
size of file 133900 = 0x20b0c
size of buffer 131072 = 0x20000
of this buffer the server uses only 128244 bytes (=0x1f4f4).
Now let's add these numbers together:
end of file            0x20b0c
used part of buffer  + 0x1f4f4
total                = 0x40000
After the server has written this buffer to the disk the file has a size of 0x40000.

As you can see we are now on a round number (at least in hex). But why did the server waste these 2828 bytes?
Data is read from the harddisk in blocks and it is written to the harddisk in blocks too. Positioning somewhere inside such a block and writing some bytes at this place means that the block must first be read, the data written at the correct position within the block, and then the block written back to its original place. When we follow the strategy of the server, this read-modify-write is only needed for the very first block; all following flushes begin writing on block boundaries, so the initial read-operation is not needed. This trick speeds up the operation a bit (not tested).

The block-sizes of harddisks are 512 bytes, 1K, 2K or (maybe) 4K currently. So the server subtracts a little bit from the end so that the addition will result in a multiple of the block-size (it assumes a 4K block, which is the default value of IO_SIZE). Here is the code taken from init_io_cache() (in mf_iocache.c):
  if (type == WRITE_CACHE)
    info->write_end=
      info->buffer+info->buffer_length- (seek_offset & (IO_SIZE-1));
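
Plugging the numbers of the example above into this formula reproduces the 128244 bytes. Here is a small standalone check (IO_SIZE is hardcoded to its default of 4096; the variable names are chosen freely, it is not the server code):

/* standalone check of the write_end calculation from init_io_cache(),
   using the numbers from the example above */
#include <stdio.h>

#define IO_SIZE 4096                              /* default block size assumed by the server */

int main() {
  unsigned long buffer_length = 131072;           /* the 128K write cache */
  unsigned long seek_offset   = 0x20b0c;          /* current end of the datafile: 133900 bytes */

  /* usable part of the buffer in the first round */
  unsigned long usable = buffer_length - (seek_offset & (IO_SIZE - 1));

  printf("usable bytes in the first round: %lu\n", usable);                  /* prints 128244 */
  printf("file size after the first flush: 0x%lx\n", seek_offset + usable);  /* prints 0x40000 */
  return 0;
}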

functions

Above I've written about functions of the storage-engine that are called in the execution of the INSERT-stmt. Here is a listing of these functions as they are called by the server:
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::column_bitmaps_signal
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::store_lock
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::start_bulk_insert
<TestMaxVar> ha_maxtempl::write_row     this function is called multiple times, once for each row
<TestMaxVar> ha_maxtempl::end_bulk_insert
<TestMaxVar> ha_maxtempl::enable_indexes
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
For this test I created a table TestMaxVar in the database TestCache. Then I inserted 10 records into this table using the INSERT-statement used above, but restricted to 10 records. You can see which functions are called, how often they are called and in which order they are called.

The table TestMaxVar uses the storage engine MAXTEMPL.
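
Just as a sketch of how such a trace line can be produced (this is not the original MAXTEMPL source; the signature of write_row() and the use of table->s->table_name.str for the table name are assumptions here): an fprintf() at the start of each overridden handler method is enough.

// sketch only: emit a trace line at the start of an overridden handler method
int ha_maxtempl::write_row(uchar *buf)
{
  // produces a line like "<TestMaxVar> ha_maxtempl::write_row" on stderr
  fprintf(stderr, "<%s> ha_maxtempl::write_row\n", table->s->table_name.str);

  /* ... the normal work of the engine follows here ... */
  return 0;
}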

In the case of inserting only a single row this looks like this:
insert into TestMaxVar values(
 '01000000', '999999', '132.36', '0.00', 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677' );

<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::column_bitmaps_signal
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::store_lock
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::write_row
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
As you can see it's a bit simpler. And the functions start_bulk_insert() and end_bulk_insert() are missing in this case, so no caching is done here.