Thursday 27 March 2014

caches

I've looked through the code of MariaDB/MySQL and tried to understand what the program does and how it does this. And (quite naturally) I was sure that I could do better. So I thought of a better approach, a better algorithm, made some drawings on paper, wrote some lines of pseudo-code on paper and then implemented my own code.

And then: I tested my approach and compared it to the results of the original code. Well, what do I say, hmmmm, ..... (please look at the little drawing at the bottom of this page)

OK, I say it: the original code was faster than my code.

There could be only one reason for this: the server caches a lot of the data (and my code didn't).

So how is this caching done by MariaDB/MySQL?

what is caching

You can find a good description of caching on the Wikipedia.
In a few words: if you connect a fast device with a slow one a cache can help speedup some operations. And in our case the fast device is the CPU and the slow device is the harddisk.

OS

Caching is used quite often. The operating system does a lot of caching by itself. I don't want to describe this and I don't want to play with this, I only want to show the effect of caching done by the MySQL/MariaDB-server.

cash

If you look through the code of MySQL and MariaDB you will see the words cache and cash appear, they are used interchangeably. Don't let this confuse you, it's a programmers fun. Newer versions of the software replace the word cash with the word cache (the correct word).

environment

I did all my tests with the MyISAM-engine. In caching data each engine can do whatever it wants so the results may differ between the engines.

My machine here is equipped with 32 GB of RAM. If youre machine is equipped with a different amount of RAM then your results may differ a bit.

types

The MyISAM-engine supports 3 types of tabls: static (fixed-length records), dynamic (variable length records) and packed. I tested only static and dynamic.

standard

Let's start with the standard-approach:
MariaDB [TestOK]> create database TestCache;
Query OK, 1 row affected (0.00 sec)

MariaDB [TestOK]> use TestCache;
Database changed
MariaDB [TestCache]>  create table TestStat (                                                                                                                                                                             
    ->                 Id   char(8) not null,
    ->                 PZN  char(7) not null,
    ->                 EVP  char(8) not null,
    ->                 HAP  char(8) not null,
    ->                 ArtikelBez char(40) not null,
    ->                 ArtikelText char(26) not null,
    ->                 Hersteller  char(5) not null)
    ->                 engine = MyIsam
    -> ;
Query OK, 0 rows affected (0.05 sec)

MariaDB [TestCache]>  create table TestDyn (                                                                                                                                                                             
    ->                 Id   varchar(8) not null,
    ->                 PZN  varchar(7) not null,
    ->                 EVP  varchar(8) not null,
    ->                 HAP  varchar(8) not null,
    ->                 ArtikelBez varchar(40) not null,
    ->                 ArtikelText varchar(26) not null,
    ->                 Hersteller  varchar(5) not null)
    ->                 engine = MyIsam
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCache]> insert into TestStat select * from TestOK.ABDAOK
    -> ;
Query OK, 10000000 rows affected (34.84 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> insert into TestDyn select * from TestOK.ABDAOK
    -> ;
Query OK, 10000000 rows affected (46.90 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCache]> 
As you can see I created a database and in this database I created 2 tables. The tables are identical, except that the table TestStat is of type static (=fixed length records) whereas the table TestDyn is of type dynamic (=variable length records). Both tables contain 10 Mio. records of identical data. The table TestDyn does not contain any updated data, and each record consist of only one block, so no data is spread around the datafile.

This is how the tables look on the OS-side:
august@AMD4:~/MariaDB/data/TestCache$ ls -l
insgesamt 1880292
-rw-rw---- 1 august august         65 Mär 18 15:47 db.opt
-rw-rw---- 1 august august        685 Mär 18 15:50 TestDyn.frm
-rw-rw---- 1 august august  895389376 Mär 18 15:53 TestDyn.MYD
-rw-rw---- 1 august august       1024 Mär 18 16:15 TestDyn.MYI
-rw-rw---- 1 august august        679 Mär 18 15:50 TestStat.frm
-rw-rw---- 1 august august 1030000000 Mär 18 15:51 TestStat.MYD
-rw-rw---- 1 august august       1024 Mär 18 16:15 TestStat.MYI
august@AMD4:~/MariaDB/data/TestCache$ 
If you add the numbers of the column-widths given in the CREATE-statement you will get 102 as the width of the record. Add to this one byte for a flag, multiply this value with 10 Mio. and ŷou will get exactly the size of the file TestStat.MYD. The file TestDyn.MYD is a bit smaller because the data is not filled up to the full width of each column (justification of using the VARCHAR column-type).

Let's do our first tests. Nothing is changed, these tests are all done with the server out of the box. And everything is a table-scan:
MariaDB [TestCache]> select count(*) from TestStat where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (10.63 sec)

MariaDB [TestCache]> select count(*) from TestStat where Hersteller <> 'ABCDF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (10.58 sec)

MariaDB [TestCache]> select count(*) from TestStat where Hersteller <> 'ABCDG';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (10.58 sec)

MariaDB [TestCache]> select count(*) from TestDyn where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (14.16 sec)

MariaDB [TestCache]> select count(*) from TestDyn where Hersteller <> 'ABCDF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (14.14 sec)

MariaDB [TestCache]> select count(*) from TestDyn where Hersteller <> 'ABCDG';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (14.13 sec)

MariaDB [TestCache]> 

All time-values are taken from the frontend mysql as you can see. All tests are done 3 times to compensate for optimizations done by the OS (=caching on the OS-level).

Accessing table TestStat is about 30% faster than accessing the table TestDyn. The reason for this is the more complex structure of the later table. You will find a description of the structure of the MyISAM-tables here. You may also look at this post.

In all future examples of this post I will do the tests 3 times, but I will only present the average value for each test.

into the code

Let's step into the code and look how the server does the read-operation.

All operations that I used in my tests are sequential reads (=table scan) because no other access is possible (no index was created). These operations are executed by the following functions:
  • xxxx::rnd_init() is called at the beginning of the process. it initializes everything needed
  • xxxx::rnd_next() returns one row of the table. this function is called repeatedly until it returns the constant HA_ERR_END_OF_FILE
  • xxxx::rnd_end() is called at the end of the of the read-operation. here you can clean things up
This is valid for all engines, every storage engine must implement these functions or the function of the base class will be executed. So replace xxxx with the class name and look into the corresponding file for the details. You will find more on this here

This is the function hierarchy using MyISAM in the case of fixed-sized records:
ha_myisam.cc  :  ha_myisam::rnd_next()
    mi_scan.c  : mi_scan()
        mi_stat_rec.c  : mi_read_rnd_static_record() 
            mf_iocache.c  : _my_b_read()
                mysql_file.h  : inline_mysql_file_read
                    my_read.c  :  my_read()     reads 128K from the datafile into the buffer

And in the case of variable length records:
ha_myisam.cc  :  ha_myisam::rnd_next()
    mi_scan.c  : mi_scan()
        mi_dynrec.c  :  _mi_read_rnd_dynamic_record()
            mi_cache.c  :  _mi_read_cache()
                mf_iocache.c  :  _my_b_read()
                    mysql_file.h  :  inline_mysql_file_read
                        my_read.c  :  my_read()     reads 128K from the datafile into the buffer
The last function fills the cache and this buffer is used internally by the MyISAM-engine in the next read-operations.

modifications

I want to know how this buffer is used and especially what happens when I change the size of the buffer. So let's modify the code a bit. I will present the places where I added or changed code and mark added/changed lines in bold.

At first: when the cache is unable to satisfy the request the data must be read from the datafile, so let's count how many times the function read() of the C-library is called:
modify the file storage/myisam/ha_mysiam.cc near the top of this file:
extern int counterReadStaticRecord;    
extern int counterReadDynamicReacord;
somewhere in the middle of this file:
int ha_myisam::rnd_init(bool scan)
{
  counterReadStaticRecord = counterReadDynamicReacord = 0;
  if (scan)
    return mi_scan_init(file);
  return mi_reset(file);                        // Free buffers
}

int ha_myisam::rnd_end()
{
  DBUG_ENTER("ha_myisam::rnd_end");
  fprintf(stderr, "no. of reads:\tstatic=%d\tdynamic=%d\n", counterReadStaticRecord, counterReadDynamicReacord);
  ds_mrr.dsmrr_close();
#if !defined(DBUG_OFF) && defined(SQL_SELECT_FIXED_FOR_UPDATE)
  file->update&= ~HA_STATE_AKTIV;               // Forget active row
#endif
  DBUG_RETURN(0);
}

modify the file storage/myisam/mi_statrec.c before the function _mi_read_rnd_static_record():
//##########################################################
// the original code:
//#define my_b_read(info,Buffer,Count) \
//  ((info)->read_pos + (Count) <= (info)->read_end ?\
//   (memcpy(Buffer,(info)->read_pos,(size_t) (Count)), \
//    ((info)->read_pos+=(Count)),0) :\
//   (*(info)->read_function)((info),Buffer,Count))

int counterReadStaticRecord;
static inline int AQ_my_b_read(IO_CACHE *info, uchar *Buffer, ulong Count)
{
     if ( (info)->read_pos + (Count) <= (info)->read_end )
     {
         memcpy(Buffer,(info)->read_pos,(size_t) (Count));
         ((info)->read_pos+=(Count));
         return 0;
     }
     counterReadStaticRecord++;
     return  (*(info)->read_function)((info),Buffer,Count);
}

in the function _mi_read_rnd_static_record():
//   error=my_b_read(&info->rec_cache,(uchar*) buf,share->base.reclength);
/////
    error=AQ_my_b_read(&info->rec_cache,(uchar*) buf,share->base.reclength);
/////

I have to explain a bit what I did here and why:
There is a makro my_b_read, you will find it in include/my_sys.h.
I included it as a comment in the file mi_statrec.c and also duplicated it as a function with a new name: AQ_my_b_read. I hope this function better shows what's going on in this code and it also allows for better debugging. And I added the counter-variable to this inline-function.
Then I replaced the call of the makro my_b_read with the call of my own function AQ_my_b_read.

Let's continue with our modifications, now for the dynamic table.

in storage/myisam/mi_cache.c somewhere near the start of the file add:
int counterReadDynamicReacord;

somewhere in the function _mi_read_cache():
    else
      info->read_pos=info->read_end;   /* All block used */

    counterReadDynamicReacord++;  // <----  add this line
    if (!(*info->read_function)(info,buff,length))
      DBUG_RETURN(0);

why these modifications

A table-scan is done by calling at first the function ha_myisam::rnd_init(). In this function I initialize the counters to 0.
Then the function ha_myisam::rnd_next() ist called until we reached the end of the datafile. Nothing new here.
And finally the function ha_myisam::rnd_end() is called which will report the values of our counter-variables.

In the case of a static table the read-operation is handled in the function AQ_my_b_read in the file mi_statrec.c so here I'm counting.
And for a dynamic table the read-operation is done in the function _mi_read_cache() so I'm counting here.

more modifications

Counting the number of disk-accesses is only a minor point. With thi information we can compare the effect of modifying the size of the buffer. But we didn't modify this size, so let's do it now: in the file sql/sysvars.cc:
static Sys_var_ulong Sys_read_buff_size(   
       "read_buffer_size",
       "Each thread that does a sequential scan allocates a buffer of "
       "this size for each table it scans. If you do many sequential scans, "
       "you may want to increase this value",
       SESSION_VAR(read_buff_size), CMD_LINE(REQUIRED_ARG),
       VALID_RANGE(IO_SIZE*2, INT_MAX32), DEFAULT(128*1024),
       BLOCK_SIZE(IO_SIZE));
as you can see the default value is 128*1024 = 128K (I marked the entry in bold). I can easily change this, recompile and do the tests, but there are restrictions built into the server-code which can lead to the following error-message appearing on start-up:
Sysvar 'read_buffer_size' failed 'min_val <= def_val'
I could try to find the places where these checks happen or look for something else. I decided to modify this entry: in include/my_global.h:
// #define IO_SIZE  4096
#define IO_SIZE   64

now recompile everything (the last modification leads to a recompilation of almost everything) and do the test. The error-message did not appear anymore.

results

Here are the results of my tests. I changed the DEFAULT-value from 128 bytes to 256 bytes to 512 bytes and so on up to 1 GB and did my tests, with both types of tables, and did every test 3times. Here are the results of the tests in graphical form:

If you are interested in the details of my tests please look at this text-file. It contains the raw data of my tests.

conclusion

The default value of these buffers is 128K bytes. If you look at the picture above you will see that this is a good compromise. If you make the buffer smaller everything will need more time. If you increase the size you will not get much improvement (and maybe the request will need more time). And keep in mind that RAM is a finite resource, and you're not alone on this server.
It's a good compromise.


some more information

I collected a lot more of information then the time elapsed for the SELECT. So let me give you some more data.

Let's first check the number of reads.
The values are:
number of reads with different sizes of buffer:
size of            number of reads
 buffer           static      dynamic
   128b          4023438      3497615
   256b          4023438      3497615
   512b          2011719      1748808
     1K          1005860       874404
     2K           502930       437202
     4K           251465       218601
     8K           125733       109301
    16K            62867        54651
    32K            31434        27326
    64K            15717        13663
   128K             7859         6832
   256K             3930         3416
   512K             1965         1708
     1M              983          854
     2M              492          427
     4M              246          214
     8M              123          107
    16M               62           54
    32M               31           27
    64M               16           14
   128M                8            7
   256M                4            4
   512M                2            2
    1G                 1            1

In the last line you see why my testing stopped here: with a buffer-ize of 1GB the whole file is accessed in one read()-call.

In the first lines you will alo see identical numbers for the cache-sizes of 128 and 256 bytes. The reason for this is that with these values a minimal buffer of 256 is used even when it is set to a smaller value.

If you multiply the number of reads with the size of the buffer you will always get a value that is bigger then the size of the corresponding file, so the whole file is read.

And you will also see that doubling the size of the cache halves the number of reads.

support

The support for caching data is already there. In your own storage engine you can use it (or write your own routine which are hopefully better/faster).

Please look at the code in mysys/mf_iocache.c. The buffers are created in the function init_io_cache(). This function is called after a call of ha_myisam::rnd_init(). Here is the hierarchy:
ha_myiam.cc  :  ha_myisam::extra_opt()
    mi_extra.c  : mi_extra()
        mf_iocache.c : init_io_cache()

and the buffers are freed after a call of ha_myisam::rnd_end() in the function end_io_cache(). This is the hierarchy:
ha_myisam.cc  :  ha_myisam::extra()
    mi_extra.c : mi_extra()
        mf_iocache.c : end_io_cache()

functions

Above I've written about functions of the storage-engine that are called in the execution of the SELECT-stmt. Here is a listing of these functions as they are called called by the server:
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::column_bitmaps_signal
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::store_lock
<TestMax> ha_maxtempl::external_lock
<TestMax> ha_maxtempl::table_cache_type
<TestMax> ha_maxtempl::register_query_cache_table
<TestMax> ha_maxtempl::table_cache_type
<TestMax> ha_maxtempl::info
<TestMax> ha_maxtempl::info
<TestMax> ha_maxtempl::scan_time
<TestMax> ha_maxtempl::scan_time
<TestMax> ha_maxtempl::rnd_init
<TestMax> ha_maxtempl::extra_opt
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_next
<TestMax> ha_maxtempl::rnd_end
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::external_lock
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::extra
<TestMax> ha_maxtempl::reset
For this test I created a table TestMax in the database TestCache and inserted 6 records into this table. Then I started the >SELECT-stmt given above. You can see which functions are called, how often they are called and in which order they are called. An example: rnd_next() is called 7 times, once for each of the 6 records in the table and a 7th time, in which the EOF is detected and returned.

The table TestMax uses the storage engine MAXTEMPL.