Monday 7 April 2014

caches (2)

This is my second post about caching in the MyISAM-engine of MySQL/MariaDB. You will find the first post here.

This post describes some of the observations I made while studying the source-code. The earlier post described the effect of caching when reading from the hard-disk; this article describes the same effect when writing to disk.

standard-result

Let's start directly with a result. For all tests I've used the tables TestStat and TestDyn as described in the earlier post. Here are the results with a server out of the box:
MariaDB [TestCache]> use TestCache;
Database changed

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (48.63 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.18 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (47.76 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (49.14 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestDyn;
Query OK, 0 rows affected (0.18 sec)

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (47.15 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.20 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.75 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.20 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.87 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> truncate table TestStat;
Query OK, 0 rows affected (0.19 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (31.29 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> quit
Bye
august@AMD4:~/MariaDB/pgm$ 
As you can see, writing to a static (=fixed length) table is again faster than writing to a dynamic (=varying length) table, because of the more complex structure of a dynamic table.

Please keep in mind that this INSERT-statement needs a read-operation before the write-operation; this adds to the time needed for the whole statement, so the effect of caching happens twice. For this reason the numbers for reading and writing can't be compared directly.

This is how the tables look on the OS-side after inserting the records:
august@AMD4:~/MariaDB/data/TestCache$ ls -l
insgesamt 1880292
-rw-rw---- 1 august august         65 Mär 18 15:47 db.opt
-rw-rw---- 1 august august        685 Mär 18 15:50 TestDyn.frm
-rw-rw---- 1 august august  895389376 Apr  1 16:01 TestDyn.MYD
-rw-rw---- 1 august august       1024 Apr  1 16:04 TestDyn.MYI
-rw-rw---- 1 august august        679 Mär 18 15:50 TestStat.frm
-rw-rw---- 1 august august 1030000000 Apr  1 16:03 TestStat.MYD
-rw-rw---- 1 august august       1024 Apr  1 16:04 TestStat.MYI
august@AMD4:~/MariaDB/data/TestCache$ 
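
A quick division of the file sizes by the number of rows already shows the difference in layout: the static file needs 1,030,000,000 / 10,000,000 = 103 bytes per record, every record occupying the full fixed length, while the dynamic file needs only 895,389,376 / 10,000,000 ≈ 89.5 bytes per record on average, because varying-length columns are stored with just the bytes they actually use (plus a small per-record header).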

my goal

I want to show you how the storage-engine carries out the write-operation and particularly how it uses caches while writing.

into the code

Let's step into the code and look at how the server does the write-operation:
  • xxxx::start_bulk_insert() is called at the beginning of the process. It initializes everything needed.
  • xxxx::write_row() writes one row to the buffer or the disk. This function is called repeatedly until all records are written to disk.
  • xxxx::end_bulk_insert() is called at the end of the write-operation. Here you can clean things up.
This is valid for all engines: every storage engine must implement these functions, or the function of the base class will be executed. So replace xxxx with the class name and look into the corresponding file for the details. You will find more on this here.
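
To make the division of labour between these three calls concrete, here is a minimal stand-alone model in plain C. This is only my sketch, not MariaDB code: names and error handling are simplified, and the real my_b_write() fills the remaining space of the buffer before flushing, while this model simply flushes first. The principle is the same: many write_row() calls result in only a few write() calls.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CACHE_SIZE (128 * 1024)           /* the server default: 128K */

static unsigned char *cache, *write_pos;  /* write cache and fill pointer */
static int fd;

static void start_bulk_insert(const char *file)   /* init everything needed */
{
  fd = open(file, O_WRONLY | O_CREAT | O_APPEND, 0660);
  cache = malloc(CACHE_SIZE);
  write_pos = cache;
}

static void flush_cache(void)             /* one real write() per flush */
{
  if (write(fd, cache, (size_t)(write_pos - cache)) < 0)
    perror("write");
  write_pos = cache;
}

static void write_row(const unsigned char *row, size_t len)
{
  if (write_pos + len > cache + CACHE_SIZE)
    flush_cache();                        /* cache is full: go to disk */
  memcpy(write_pos, row, len);            /* fast path: only a memcpy */
  write_pos += len;
}

static void end_bulk_insert(void)         /* write the rest, clean up */
{
  flush_cache();
  free(cache);
  close(fd);
}

int main(void)
{
  unsigned char row[103];                 /* the record length of TestStat */
  memset(row, 'x', sizeof(row));

  start_bulk_insert("/tmp/bulk_demo.MYD");
  for (int i = 0; i < 1000000; i++)       /* a million write_row() calls ... */
    write_row(row, sizeof(row));
  end_bulk_insert();                      /* ... but only ~786 write() calls */
  return 0;
}

Here 1,000,000 rows of 103 bytes are written with roughly 103000000 / 131072 ≈ 786 calls to write(); this is the same ratio we will see in the real server below.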

The description above is valid in the case of writing multiple records to disk. When you insert a single record, things look simpler, as you can see here.

This is the function hierarchy using MyISAM in the case of fixed-length records:
init:
ha_myisam.cc  :  ha_myisam::start_bulk_insert()
    mi_extra.c  :  mi_extra()
        mf_iocache.c  :  init_io_cache()            inits a buffer (=cache) of size 128K
writing rows:
ha_myisam.cc  :  ha_myisam::write_row()
    mi_write.c  :  mi_write()
        mi_statrec.c  :  _mi_write_static_record()
               my_sys.h  :  my_b_write()
                    mf_iocache.c  :  my_b_flush_io_cache()
                        mysql_file.h  :  mysql_file_write()
                            mysql_file.h  :  inline_mysql_file_write()
                                my_write.c  :  my_write()    writes buffer (approx. 128K) to disk
the final step:
ha_myisam.cc  :  ha_myisam::end_bulk_insert()
    mi_extra.c  :  mi_extra()
        mf_iocache.c  :  end_io_cache()
            mf_iocache.c  :  my_b_flush_io_cache()
                mysql_file.h  :  mysql_file_write()
                    mysql_file.h  :  inline_mysql_file_write()
                        my_write.c  :  my_write()            writes buffer (max. 128K) to disk

This is the function hierarchy using MyISAM in the case of variable-length records:
init:
ha_myisam.cc  :  ha_myisam::start_bulk_insert()
        mi_extra.c  : mi_extra()
            mf_iocache.c  :  init_io_cache()    inits a buffer (=cache) of size 128K
writing rows:
ha_myisam.cc  :  ha_myisam::write_row()
        mi_write.c  :  mi_write()
            mi_dynrec.c  :  _mi_write_dynamic_record()
                mi_dynrec.c  :  write_dynamic_record()
                    mi_dynrec.c  :  _mi_write_part_record()
                        my_sys.h  :  my_b_write()
                            mf_iocache.c  :  my_b_flush_io_cache()
                                mysql_file.h  :  mysql_file_write()
                                    mysql_file.h  :  inline_mysql_file_write()
                                        my_write.c  :  my_write()    writes buffer (approx. 128K) to disk
the clean-up:
ha_myisam.cc  :  ha_myisam::end_bulk_insert()
        mi_extra.c  :  mi_extra()
            mf_iocache.c  :  end_io_cache()
                mf_iocache.c  :  my_b_flush_io_cache()
                    mysql_file.h  :  mysql_file_write()
                        mysql_file.h  :  inline_mysql_file_write()
                            my_write.c  :  my_write()        calls write() of the C standard library, which writes the current contents of the buffer to disk

modifications

As in the case of the read-caches I want to know how this buffer is used, and especially what happens when I change the size of the buffer. So let's again modify the code a bit. I will present the places where I added or changed code and mark added/changed lines in bold, restricted to those modifications that are needed for our monitoring of write-caching.

I've already described the reason and some aspects of these modifications here, so please look into my former post.

First: when the cache cannot satisfy a request, its data must be written to the datafile on disk, so let's count how many times the function write() of the C library is called.
Modify the file storage/myisam/ha_myisam.cc near the top of this file:
extern int counterReadStaticRecord;     // from the former test
extern int counterReadDynamicRecord;
extern int counterWriteStaticRecord;
extern int counterWriteDynamicRecord;
In the same file please look for the function start_bulk_insert() and modify it like this:
void ha_myisam::start_bulk_insert(ha_rows rows, uint flags)
{
  DBUG_ENTER("ha_myisam::start_bulk_insert");
  counterWriteStaticRecord = counterWriteDynamicRecord = 0;

  THD *thd= current_thd;
and next look for end_bulk_insert() in the same file:
int ha_myisam::end_bulk_insert()
{
  DBUG_ENTER("ha_myisam::end_bulk_insert");
  mi_end_bulk_insert(file);
  int err=mi_extra(file, HA_EXTRA_NO_CACHE, 0);
  fprintf(stderr, "no. of writes:\tstatic=%d\tdynamic=%d\n", counterWriteStaticRecord, counterWriteDynamicRecord);
  if (!err && !file->s->deleting)

Modify the file storage/myisam/mi_statrec.c near the function _mi_write_static_record():
//##########################################################
// the original code:
//#define my_b_write(info,Buffer,Count)
// ((info)->write_pos + (Count) <=(info)->write_end ?
//  (memcpy((info)->write_pos, (Buffer), (size_t)(Count)),
//   ((info)->write_pos+=(Count)),0) :
//   (*(info)->write_function)((info),(uchar *)(Buffer),(Count)))
int counterWriteStaticRecord;
static inline int AQ_my_b_write(IO_CACHE *info, uchar *Buffer, ulong Count)
{
    if ((info)->write_pos + (Count) <= (info)->write_end)
    {
        /* the record fits into the cache: only a memcpy() */
        memcpy((info)->write_pos, (Buffer), (size_t)(Count));
        (info)->write_pos += (Count);
        return 0;
    }
    /* cache is full: count this miss; the write-function flushes to disk */
    counterWriteStaticRecord++;
    return (*(info)->write_function)((info), (uchar *)(Buffer), (Count));
}
In the function _mi_write_static_record() I changed some lines:
    if (info->opt_flag & WRITE_CACHE_USED)
    {    /* Cash in use */
//      if (my_b_write(&info->rec_cache, record,
//       info->s->base.reclength))
//
       if (AQ_my_b_write(&info->rec_cache, record, info->s->base.reclength))
        goto err;

OK, for static tables we are done, so let's move on to dynamic tables. In the file storage/myisam/mi_dynrec.c, near the function _mi_write_part_record():
 /* Write a block to datafile */
/*
#define my_b_write(info,Buffer,Count) \
 ((info)->write_pos + (Count) <=(info)->write_end ?\
  (memcpy((info)->write_pos, (Buffer), (size_t)(Count)),\
   ((info)->write_pos+=(Count)),0) : \
   (*(info)->write_function)((info),(uchar *)(Buffer),(Count)))
*/
int counterWriteDynamicRecord;
static inline int AQ_my_b_write(IO_CACHE *info, uchar *Buffer, ulong Count)
{
    if ((info)->write_pos + (Count) <= (info)->write_end)
    {
        /* the record (or a part of it) fits into the cache: only a memcpy() */
        memcpy((info)->write_pos, (Buffer), (size_t)(Count));
        (info)->write_pos += (Count);
        return 0;
    }
    /* cache is full: count this miss; the write-function flushes to disk */
    counterWriteDynamicRecord++;
    return (*(info)->write_function)((info), (uchar *)(Buffer), (Count));
}

int _mi_write_part_record(MI_INFO *info,
and in the function _mi_write_part_record():
//    else if (my_b_write(&info->rec_cache,(uchar*) *record-head_length,
//   length+extra_length+del_length))
    else if (AQ_my_b_write(&info->rec_cache,(uchar*) *record-head_length, length+extra_length+del_length))
      goto err;
You will find the reason for the usage of AQ_my_b_write() in my former post.
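
With these changes compiled into the server, every bulk INSERT now reports the counters on stderr (i.e. in the error log). For an INSERT into TestStat with the default 128K cache the line should look roughly like this (the numbers are from my tests and will differ on other machines):

no. of writes:  static=7856     dynamic=0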

more modifications

Now we are able to count the number of writes, but we have to add some more modifications to change the size of the buffer.
in the file sql/sys_vars.cc:
static Sys_var_ulong Sys_read_buff_size(   
       "read_buffer_size",
       "Each thread that does a sequential scan allocates a buffer of "
       "this size for each table it scans. If you do many sequential scans, "
       "you may want to increase this value",
       SESSION_VAR(read_buff_size), CMD_LINE(REQUIRED_ARG),
       VALID_RANGE(IO_SIZE*2, INT_MAX32), DEFAULT(128*1024),
       BLOCK_SIZE(IO_SIZE));
As you can see, the default value is 128*1024 = 128K (I marked the entry in bold). For the tests I will change this value, recompile and run my tests again.

And finally in include/my_global.h:
// #define IO_SIZE  4096
#define IO_SIZE   64

The reason for this last modification: the valid range of read_buffer_size starts at IO_SIZE*2 and values are rounded to a multiple of IO_SIZE (see the VALID_RANGE and BLOCK_SIZE entries above), so with the default IO_SIZE of 4096 the smallest usable cache would be 8K; with IO_SIZE set to 64, caches down to 128 bytes can be tested. Now recompile everything (the modification of IO_SIZE leads to a recompilation of almost everything) and run the test.

results

Each test was done 3 times; it starts by truncating the target-table, so we always begin with a table of size 0. The values presented here are the averages of the 3 tests. The tests started with a cache size of 128 bytes and ended with a size of 1GB. The SQL-statements used are the same as those used for the standard-result above.

conclusion

The default value of these buffers is 128K. If you look at the picture above you will see again that this is a good compromise. Keep in mind that RAM is a finite resource and that you're not alone on this server. Each INSERT-statement of the test-examples uses 2 buffers, one for reading the data from the source table and one for writing the data into the target table, so even such a simple statement needs double the amount of RAM. Now think of situations where 1000 users are active on your server, with multiple and more complex statements: you would need a lot of RAM, and the graph shows that you would not gain much performance.

In these days of modern economies, and especially of the accounting practices in bigger companies, you have to pay for the RAM you assign to your server, even if this billing is only done internally. This can add up to significant numbers.

I think the value of 128K for the size of the cache is a good compromise.

If you are interested in more details about my tests you may look into this text-file.


some numbers

Let's first check the number of writes as reported by the code we added to the original source. The values are:
number of writes with different sizes of buffer:
size of            number of writes
 buffer           static      dynamic        
   128b          3906250      3293831
   256b          3906250      3293831
   512b          2000000      1679340
     1K          1000000       855184
     2K           500000       432774
     4K           250000       217485
     8K           125000       109022
    16K            62822        54580
    32K            31411        27307
    64K            15705        13658
   128K             7856         6830
   256K             3928         3415
   512K             1964         1707
     1M              982          853
     2M              491          426
     4M              245          213
     8M              122          106
    16M               61           53
    32M               30           26
    64M               15           13
   128M                7            6
   256M                3            3
   512M                1            1
    1G                 0            0

The last line shows why my testing stopped here without increasing the size of the buffer any further: with a buffer-size of 1GB the whole file fits into the buffer (both datafiles are smaller than 1GB, as the ls output above shows).

A write is counted whenever the buffer is full and has to be written to disk. These numbers do not include the final write()-call, in which the remaining contents of the buffer are written to disk by end_bulk_insert(). This explains the values of 0 writes in the last line.

In the first lines you will also see identical numbers for the cache-sizes of 128 and 256 bytes. The reason for this is that a minimal buffer of 256 bytes is used even when the variable is set to a smaller value.

If you multiply the number of writes by the size of the buffer you will always get a value that is smaller than the size of the corresponding file. The difference is less than the size of the buffer used in the test; it fits into the buffer and is written to disk in the call of end_bulk_insert().

And you will also see that doubling the size of the cache halves the number of writes (e.g. from 15705 writes with a 64K buffer to 7856 with 128K for the static table).
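
You can roughly verify these counter values with simple arithmetic: the number of flushes is approximately the file size divided by the buffer size (this ignores the partially used first buffer and the final, uncounted flush). A small stand-alone C program with the file sizes from the ls output above:

#include <stdio.h>

int main(void)
{
  /* datafile sizes taken from the ls output above */
  long long size_static  = 1030000000LL;  /* TestStat.MYD */
  long long size_dynamic =  895389376LL;  /* TestDyn.MYD  */

  for (long long buf = 512; buf <= (1LL << 30); buf *= 2)
    printf("buffer %10lld: static ~%8lld   dynamic ~%8lld\n",
           buf, size_static / buf, size_dynamic / buf);
  return 0;
}

For a 128K buffer this prints ~7858 for the static and ~6831 for the dynamic table, close to the measured 7856 and 6830.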

a hint

When the write-buffer is used for the first time it is created with a size of 128K by default. But this size is not fully used: the server uses a little bit less (and then uses the buffer in full in all following rounds). I will try to explain this behaviour with an example.

In one test my file (=table) initially had a size of 133900 bytes. When I insert data into this table it is appended at the end of this file (there is no free space inside the table). Now a buffer of 128K is allocated, and the numbers are as follows:
size of file 133900 = 0x20b0c
size of buffer 131072 = 0x20000
of this buffer the server uses only 128244 bytes (= 0x1f4f4).
Now let's add these numbers together:
end of file            0x20b0c
used part of buffer  + 0x1f4f4
total                = 0x40000
After the server has written this buffer to the disk the file has a size of 0x40000.

As you can see we are now at a round number (at least in hex). But why did the server waste these 2828 bytes?
Data is read from the harddisk in blocks, and it is written to the harddisk in blocks too. Positioning somewhere inside such a block and writing some bytes at that position means that the block must first be read, the data written at the position inside the block, and the block then written back to its original place. With the server's strategy this read-modify-write cycle happens only on the first use of the buffer; all following flushes begin writing on a block boundary, so the initial read-operation is not needed. This trick speeds up the operation a bit (not tested).

The block-sizes of harddisks are currently 512 bytes, 1K, 2K or (maybe) 4K. So the server subtracts a little bit from the end of the buffer so that the addition results in a multiple of the block-size (it assumes a 4K block). Here is the code, taken from init_io_cache() (in mf_iocache.c):
  if (type == WRITE_CACHE)
    info->write_end=
      info->buffer+info->buffer_length- (seek_offset & (IO_SIZE-1));
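
You can replay this computation outside the server; here is a minimal stand-alone sketch using the numbers from the example above (and IO_SIZE as in the unmodified server):

#include <stdio.h>

#define IO_SIZE 4096   /* the default from include/my_global.h */

int main(void)
{
  unsigned long seek_offset   = 133900UL;  /* current size of the file */
  unsigned long buffer_length = 131072UL;  /* the 128K write cache */

  /* the same computation as in init_io_cache(): shorten the first
     buffer so that the first flush ends on an IO_SIZE boundary */
  unsigned long used = buffer_length - (seek_offset & (IO_SIZE - 1));

  printf("used part of first buffer: %lu (0x%lx)\n", used, used);
  printf("file size after 1st flush: %lu (0x%lx)\n",
         seek_offset + used, seek_offset + used);
  return 0;
}

It prints 128244 (0x1f4f4) and 262144 (0x40000), exactly the numbers from the example.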

functions

Above I've written about functions of the storage-engine that are called during the execution of the INSERT-statement. Here is a listing of these functions as they are called by the server:
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::column_bitmaps_signal
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::store_lock
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::start_bulk_insert
<TestMaxVar> ha_maxtempl::write_row     this function is called multiple times, once for each row
<TestMaxVar> ha_maxtempl::end_bulk_insert
<TestMaxVar> ha_maxtempl::enable_indexes
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
For this test I created a table TestMaxVar in the database TestCache. Then I inserted 10 records into this table using the INSERT-statement from above, restricted to 10 records. You can see which functions are called, how often they are called and in which order.

The table TestMaxVar uses the storage engine MAXTEMPL.

In the case of inserting only a single row it looks like this:
insert into TestMaxVar values(
 '01000000', '999999', '132.36', '0.00', 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677' );

<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::column_bitmaps_signal
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::store_lock
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::write_row
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::external_lock
<TestMaxVar> ha_maxtempl::extra
<TestMaxVar> ha_maxtempl::reset
As you can see it's a bit simpler. The functions start_bulk_insert() and end_bulk_insert() are missing in this case, so no caching is done here.