Tuesday, 22 April 2014

aio

aio is an abbreviation; the full name is asynchronous input/output. The title of this post is a bit misleading because I will present 4 algorithms and only one of them works in an asynchronous way. Please excuse this, I don't want to change the title.

Some time ago I started implementing something like asynchronous I/O in MySQL/MariaDB to speed up the read-operation in the case of a full table scan. You will find a description of how MySQL/MariaDB handles a sequential read-operation here. At that time I knew what caching was, but I didn't know that it was used in the MyISAM-engine or how it was used there.

In theory my approach looked simple and would naturally be faster: in rnd_next() you return one row from the table and in the background you start reading the next row. A little bit of synchronization is needed, OK, but this can be done. The result was, as I described, not what I expected.

So I started looking into the code of MariaDB/MySQL again, a bit deeper now. What I found led to the post about caches.

After this step I decided to look again at asynchronous I/O. For further examination I wanted a stripped down version of the program to verify my observations. I wanted an easy and simple program, one without all the overhead MySQL/MariaDB needs. So I created a program tst-disk-access which should only do some writing to and reading from a file using a variety of different routines. And I want to describe this program here.

Linux only

This program is Linux-only, and the same is true for the test-results. I did not create or test any version of this program under Windows. And I have no intention to do so.

If you want to port this program to Windows or write a similar program for Windows this seems to be a good start: Memory-Mapped Files and Asynchronous File I/O.

algorithms

I used 4 different approaches for my tests:
  1. the standard approach
  2. using a cache of 128K
  3. reading or writing in the background
  4. mapping the file into memory
I will describe these algos in more detail below. For reading and writing I will use the lower-level functions of the language C because these are used in MariaDB/MySQL.

103

I assume a record length of 103 bytes with dummy contents and write this record multiple times to a file. For reading I read one record after the other. The database-server does the same in the case of an INSERT or a SELECT on a MyISAM-table of type static with a record width of 102 bytes.

In one of my older examples I created a table with a record length of 102 bytes (add 1 byte, used internally for a flag, and you get the 103 bytes used here). You will find the example here. And so I stayed with this example for this test.

the algorithms

Let me now describe the algorithms I used.

Algorithms 2, 3 and 4 use a buffer of size 128K, because a buffer of this size is used in the MyISAM-engine too. And as I've seen in my earlier tests this size is a good value (please look here).

algo 1: standard

This looks like a quick shot at the job of writing to or reading from a file: use write() or read(). So I write a record-buffer to the file multiple times by calling write() again and again. For reading I fill the record-buffer by a call to read() until the filepointer is at the end of the file.
Nothing spectacular here. read() and write() are synchronous functions; they return to the caller when the job is done.

You will find a more detailed description of the structures and functions used in the code here.
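To make this more concrete, here is a minimal sketch of the idea. This is not the code of write_normal.c or read_normal.c, only the pattern they follow (error handling omitted): one call to write() or read() per 103-byte record.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECLEN 103

static void write_plain(const char *name, long lines)
{
    char rec[RECLEN];
    memset(rec, 'X', RECLEN);                      /* dummy contents */
    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (long i = 0; i < lines; i++)
        write(fd, rec, RECLEN);                    /* one syscall per record */
    close(fd);
}

static void read_plain(const char *name)
{
    char rec[RECLEN];
    int fd = open(name, O_RDONLY);
    while (read(fd, rec, RECLEN) == RECLEN)        /* until end of file */
        ;                                          /* a real program would process the record here */
    close(fd);
}

For 10 million records this means 10 million calls of write() or read().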

algo 2: cached

This is the approach used in MariaDB/MySQL in the MyISAM-engine: use a cache of size 128K. When data is written to the file it is first only copied to this cache. When the cache is full, the cache is written to the file and then reused for further copying. At the end of the writing-process the cache may be partially filled, and this remaining content of the cache is also written to the file.
For reading from the file a similar procedure is used: first the cache is filled by reading 128K from the file, and the data is given back to the caller from the cache in chunks of 103 bytes. This is repeated; all further requests are served from the cache until the cache is empty or unable to satisfy the request. Then the next 128K-block of data is read from the file into the cache.

This strategy reduces the number of calls to write() and read() drastically. But it is still a synchronous operation.

This description is simplified; the database-server does not always use such a cache. There are situations where the server uses the standard approach (e.g. in the case of tables of type dynamic).
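Here is a simplified sketch of the write-side of this algo. It is neither the MyISAM code nor write_db.c, only the principle: records are copied into a 128K buffer and write() is only called when this buffer is full (and once more at the end for the partially filled buffer).

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECLEN    103
#define CACHESIZE (128 * 1024)

static void write_cached(const char *name, long lines)
{
    char rec[RECLEN], cache[CACHESIZE];
    size_t used = 0;
    memset(rec, 'X', RECLEN);
    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (long i = 0; i < lines; i++) {
        if (used + RECLEN > CACHESIZE) {           /* cache full: write it out */
            write(fd, cache, used);
            used = 0;
        }
        memcpy(cache + used, rec, RECLEN);         /* normally the record is only copied to the cache */
        used += RECLEN;
    }
    if (used > 0)                                  /* the partially filled cache at the end */
        write(fd, cache, used);
    close(fd);
}

With 128K per call to write() the number of syscalls drops from one per record to roughly one per 1272 records.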

algo 3: background

I will also use a cache of the same size in this algo, but the read- and write-operations will be done in the background. And they will be done without calling write() or read().

In the case of writing to the file the program fills the cache. When the cache is full the program hands the OS the order to write the cache to the file. As this write-operation is done in the background the program can continue to fill the cache with the next records (= overlapping operations). When the cache is full again some synchronization is needed, because the program has to wait until the former write-operation has finished before the next write-job can be given to the OS.

When reading from the file the program tells the OS to read from the file and waits until this job is finished. Then it gives the OS the next read-request for the next block of data. The OS can handle this request in the background while the program is busy copying the data from the cache to the local buffer in the foreground. When the cache is emptied we have to wait until the OS tells us that the next block of data is ready. Then we copy the data from the OS-buffer to our cache and start the next read-request.

You will find a more detailed description of the structures and functions used in the code here.
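For illustration, here is a sketch of the write-side using the POSIX AIO interface from <aio.h> (link with -lrt). Whether write_aio.c is structured exactly like this you can check in the sources; the sketch only shows the double-buffering idea: fill one buffer while the OS writes the other one in the background.

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECLEN    103
#define CACHESIZE (128 * 1024)

static void write_async(const char *name, long lines)
{
    char rec[RECLEN], bufs[2][CACHESIZE];          /* fill one buffer while the other is written */
    memset(rec, 'X', RECLEN);
    int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    int cur = 0, pending = 0;
    size_t used = 0;
    off_t offset = 0;
    for (long i = 0; i < lines; i++) {
        if (used + RECLEN > CACHESIZE) {
            if (pending) {                         /* wait for the previous background write */
                const struct aiocb *list[1] = { &cb };
                aio_suspend(list, 1, NULL);
                aio_return(&cb);
            }
            cb.aio_buf    = bufs[cur];
            cb.aio_nbytes = used;
            cb.aio_offset = offset;
            aio_write(&cb);                        /* returns immediately, the OS writes in the background */
            pending = 1;
            offset += used;
            cur = 1 - cur;                         /* continue filling the other buffer */
            used = 0;
        }
        memcpy(bufs[cur] + used, rec, RECLEN);
        used += RECLEN;
    }
    if (pending) {                                 /* wait for the last background write */
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);
        aio_return(&cb);
    }
    if (used > 0)
        pwrite(fd, bufs[cur], used, offset);       /* the final partial buffer, synchronous is fine here */
    close(fd);
}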

algo 4: mapping

Again the program uses a cache of size 128K. In this approach everything is handled by a memory-region that is mapped to a 128K-part of the file. Writing to the file again means filling the cache. When the cache is full the cache is not 'written' to the file by calling some write-routine; instead we tell the OS that we need another part of the file in memory. The OS detects that the current memory-region has been changed, writes this memory-region to the file and gives us back the requested memory-region.

Reading from the file means that we request from the OS a 128K-part of the file in a memory-region from which we copy our data to our local buffer.

Using memory mapped files is a synchronous operation.

Please keep in mind that we can only map an existing file of the appropriate size into memory; extending the file by writing behind the current eof-position resulted in a crash of the program. This is the reason why in the code, after opening the file for writing, it is sized to the calculated length.

You will find a more detailed description of the structures and functions used in the code here.
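Again a sketch of the write-side only, not write_mmap.c itself: the file gets its final size via ftruncate() first, then 128K-windows of the file are mapped, filled and unmapped one after the other (a record may span two windows, which the sketch has to handle).

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RECLEN    103
#define CACHESIZE (128 * 1024)

static void write_mapped(const char *name, long lines)
{
    char rec[RECLEN];
    memset(rec, 'X', RECLEN);
    int fd = open(name, O_RDWR | O_CREAT | O_TRUNC, 0644);
    off_t filesize = (off_t)lines * RECLEN;
    ftruncate(fd, filesize);                       /* size the file first; touching the mapping
                                                      behind eof crashes the program (SIGBUS) */
    off_t  offset = 0;
    size_t window = (size_t)(filesize < CACHESIZE ? filesize : CACHESIZE);
    size_t used   = 0;
    char  *map    = mmap(NULL, window, PROT_WRITE, MAP_SHARED, fd, 0);
    for (long i = 0; i < lines; i++) {
        size_t done = 0;
        while (done < RECLEN) {
            if (used == window) {                  /* window full: ask the OS for the next part of the file */
                munmap(map, window);               /* the OS writes the dirty pages back to the file */
                offset += window;
                window  = (size_t)(filesize - offset < CACHESIZE ? filesize - offset : CACHESIZE);
                map     = mmap(NULL, window, PROT_WRITE, MAP_SHARED, fd, offset);
                used    = 0;
            }
            size_t n = RECLEN - done;
            if (n > window - used)
                n = window - used;
            memcpy(map + used, rec + done, n);
            used += n;
            done += n;
        }
    }
    munmap(map, window);
    close(fd);
}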

asynchronous

In the case of the 3rd algo the function-calls for reading or writing only transfer the order to the OS; they return immediately. This differs from the read()- and write()-calls used in the other algorithms, as these do not return until the data is read from the file or written to the file (whatever the OS really does in this case; the OS can fool you, that's why we have a flush()-function).

With this program I want to compare the 3rd algo against algo 2 (I added algo 1 and 4 only out of curiosity).

OS

The first and the second approach rely on the operating system for speeding up the access to the file. The third and fourth approach bypass the OS, so the OS can only execute the requests and cannot help too much (this depends on the implementation; maybe the OS can help a bit, see the function madvise()).
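If one wanted to give the OS such a hint for the mapped windows, it could look like this (a hypothetical helper, not part of tst-disk-access):

#include <stddef.h>
#include <sys/mman.h>

/* advise the OS that the 128K-window just mapped will be accessed sequentially */
static void advise_sequential(void *map, size_t len)
{
    madvise(map, len, MADV_SEQUENTIAL);            /* the kernel may read ahead more aggressively */
}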

code

You will find the code and some text-files with results (raw data) here.

The code is split into several files:
  • a global header-file: tst-disk-access.h
  • entry point, some validating and jumping to the algo selected: tst-disk-access.c
  • the standard-approach (algo 1): write_normal.c and read_normal.c
  • analogous to MyISAM (algo 2): write_db.c and read_db.c
  • asynchronous (algo 3): write_aio.c and read_aio.c
  • memory mapped (algo 4): write_mmap.c and read_mmap.c

correctness

Is this code correct? I hope it is, I tested it. If there is something wrong with this program please drop me a line.

For checking the write-routines I used each algo to create 4 outfiles with 4 different names and compared these files using the cmp-command. These files were identical.

The read-functions are tested by adding up all bytes read and comparing the final values for all algos (they were identical).
You can enable this check by uncommenting a line in the header-file and recompiling the sources. The time-consumption of each algo was measured with this feature disabled.
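The check itself is nothing more than summing up every byte read; a sketch (the names are mine, not those used in the sources):

#include <stddef.h>

static unsigned long long checksum = 0;

/* call this for every buffer that was read, no matter which algo filled it */
static void add_to_checksum(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        checksum += (unsigned char)buf[i];
}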

parameters

The program can write to a file and read from a file. For the write-operation these parameters are valid:
  • select the algorithm: 11, 12, 13 or 14 is valid.
  • number of lines: tells the program how many times the record has to be written to the file
  • optionally the name of the file. If no name is given the name defined in the header-file will be used (you may modify this name)
For the read-operations these parameters are valid:
  • select the algorithm: 1, 2, 3 or 4 is valid
  • optionally the name of the file (as described above)
If you simply start the program a text describing the parameters will be printed.

measuring time

The time-consumption of the program was measured by using the time-command of Linux. Let me give you an example:
august@AMD4:~/workspace/tst-disk-access/Debug$ time ./tst-disk-access 11 10000000
writing to file </home/august/Temp/ABDAOK.MYD> synchronized

<./tst-disk-access>: no of lines written: 10000000

real 0m29.832s
user 0m0.636s
sys 0m23.274s
august@AMD4:~/workspace/tst-disk-access/Debug$ 
This example calls the standard-algorithm, and this routine has written 10 million records (=lines) to the file.

Of the timing values the value of interest is real, because we want to test how fast the program finishes and returns so that we can do other work. The other values show where most of the work is done, but this is not of interest here. These values would be of interest if you wanted to know how much work you can put on a machine using the different algorithms.

You will find a script test.sh in the Debug-directory included with the source-code. As you can see this shell-script uses the GNU-version of time because this allows for better logging of the results. And you will also see that this shell-script removes the datafile from the disk before calling the program in the case of a write, because not doing so added up to 5 seconds to the time needed.

results

The program was tested on the machine here, but this machine is equipped with 32GB of RAM, which allows the OS to cache a lot of data. So I tested the program again with only 8GB of RAM in the same machine. You can find the raw data of the tests in the files time32.txt and time8.txt in the Debug-folder.

The tests started with 10 million records (= a filesize of approx. 1GB) and ended with 1,280 million records (= a filesize of approx. 128GB), simply doubling the number of records from test to test.

The results ranged from 0.38 seconds up to 40 minutes, and such a range is hard to present in graphical form. As I was not interested in the absolute numbers but in comparing the different algos, I decided to set the (absolute) result of algo 1 to 100% and present the relative values for the other algorithms (for example: if algo 1 needed 100 seconds for a job and another algo needed 40 seconds for the same job, this algo would be shown as 40%). As the filesize doubles with every test you can expect a doubling of the time needed for the job.

with 32GB of RAM

Enough words, let's look at the numbers for the PC equipped with 32GB of RAM:

with 8GB of RAM

And here are the numbers for the machine equipped with 8GB of RAM:

looking at the results

Writing to a file with algos 2, 3 and 4 is much faster than with algo 1; the results here are always below 100%. Reading from a file comes close to algo 1 when the file no longer fits into memory. So with a filesize of 32GB and up the OS can't help anymore and the numbers are very similar. If the machine is equipped with 8GB of RAM then this effect appears at a filesize of 8GB (and up).

Please look at these pictures (one per algo: algo 1, algo 2, algo 3 and algo 4):
As you can see, algo 1 utilizes one CPU fully because of the millions of read()- or write()-calls.

algo 1

This algorithm is really slow. You cannot use it for a DB-application. You can use it for writing/reading if the data is less than 1GB (or so) in size. But it's easy to code.

algo 2

This looks like the best algorithm; it's the algo that is used in the MyISAM-engine of MariaDB/MySQL. It's a bit more work to code and test, but the results are convincing. If the filesize exceeds the available RAM the time-values get worse. You would expect the time to double when the filesize doubles, but at some point in the test this value jumps up because of this effect. In the case of reading one can see this clearly at 32GB and up for the PC with 32GB of RAM (and at a filesize of 8GB and up for the PC with 8GB of RAM).

algo 3

In the case of the asynchronous write/read-operation I expected a faster execution of the program because of the overlapping work, but the numbers show that this is not the case. I assume the reason for this behavior is the bypassing of the OS, which includes some tricks that do not apply to this algorithm but do apply to algo 2.

algo 4

I added this only out of curiosity. The values look close to those of algo 2.

conclusion

The best algorithm seems to be algo 2, the version used in the code of the MyISAM-engine. Very close to this is algo 4 (memory mapped), and algo 3 (asynchronous) is disappointing. Forget algo 1.

some thoughts

In the real world multiple users are requesting data from or sending data to the database. Many reads and writes occur in (quasi-)parallel. The tests here do not mirror this situation. So take the results with a grain of salt.

SSD

What about using an SSD?

My machine here is equipped with an SSD, a Samsung 840 PRO with a capacity of 120GB, and also has a harddisk, a Toshiba DT01ACA200 with 2TB capacity. I did all the tests above using the harddisk, but I decided to test my program with the SSD too. You can compare the results with the harddisk-tests:
writing/reading 64GB-file using algo 2
PC with 32 GB RAM
              SSD           harddisk           disk/SSD
writing:   182,70 sec.      412,63 sec.         225,85%
reading:   126,53 sec.      499,47 sec.         394,74%

PC with 8GB RAM
              SSD           harddisk           disk/SSD
writing:   173,90 sec.      425,86 sec.         244,89%
reading:   127,16 sec.      499,47 sec.         392,79%

For speeding up the DB-operation you should buy an SSD.

But please wait. Before you rush out and spend your money on a dozen SSDs, please look carefully at your database-design and your SQL-statements. Let me tell you a story: in a project some time ago we heard complaints from the end-users about an application that took about 50 minutes before the first line of the result appeared. 6 months later management made a decision: these complaints must disappear. So someone in the group got the job of looking at this phenomenon. The application and the database were ported to our test-environment, and in this environment the application was started. Here it took about 550 seconds until the first line of the result appeared. Digging into the Java-code he found the SQL-statement for this job, which was extracted and tested again using Squirrel. After some playing with this statement and looking into the explain-data it was clear that all information was available but the database-server chose the wrong path for executing the request. So the statement was modified to convince the server of a better path to the result. With this modified statement the server presented the first line of the result within 10 seconds. The end-users were happy.

You can't achieve a speed-up of a factor of 50 by replacing the disks with SSDs.