Friday 27 January 2017

subselect: execution (3)


It wasn't planned to write a text with this content. Initially I wanted to write how the server executes another statement but when I played with the statement I found this behaviour of the server (described some lines below) and I decided to write something about this.

In my last post (subselect: execution (2)) I played with a special SQL-statement, a modified version of a statement I've used before using a subselect, and I showed how this statement is treated in the execution-stage of the server. For this text I simply change the IN-clause in the statement into an EXISTS-clause and look what's happening then. I only did this test in MariaDB, it's up to you to do these test again in MySQL if you are interested in it.

In this text I'm using information which I've already described in older posts and I don't want to repeat this information here, so please look at these posts: caches and od.


warning

Before I describe my observation, my ideas and my fix let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc.. I've changed the SQL-statement and therefore my focus changes. I'm looking in the handling of the extreme-situation when no indexes are defined on the tables (silly, but the server has to handle this somehow). In the older texts in this series I stopped my description when a call into the storage-engine is made. In this text I will mainly look into the code of the storage-engine and the engine used for storing the data in the tables is MyISAM, things may look different with other engines.


the SQL-statement

For this text this is the statement I want to inspect:
MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (25 min 4.84 sec)

MariaDB [TestOpt]> 


Let me add the explain the server returns:
MariaDB [TestOpt]> MariaDB [TestOpt]> explain select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020'\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 

The server starts with a table-scan on the table TestSmalll and in the case of a record matching the WHERE-condition it starts another query on the table TestBig. This query can only be done by a table-scan as no index on this table exists. This explains the time needed for the execution of this statement.


some calculations

Let's check the data and see how many records are involved. So here are the number of records in the table TestSmall that match the (partial) WHERE-condition:
MariaDB [TestOpt]>MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.39 sec)

MariaDB [TestOpt]>
There are 301.036 records in the table TestSmall1) and 60 of these match the (partial) WHERE-condition of the SQL-statement above.

And here is the data for the table TestBig:
MariaDB [TestOpt]>MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
|   461016 |
+----------+
1 row in set (12.40 sec)

MariaDB [TestOpt]> 

Let's look at the first number: by executing the statement above 60 records in TestSmall match the WHERE-condition. When such a record is found the database-server has to switch to the table TestBig and start reading this table in a table-scan, in the worst case when no matching record is found in TestBig it hast to read this table to the end.

In an older post I introduced some counter-variables to count the number of times the function ha_myisam::rnd_next() is called. This is what these counters tell me about the statement shown above:
table = <TestBig>   counter table-scan: 573354474   index-scan: 0
table = <TestSmall> counter table-scan: 301037  index-scan: 0

As there are 301.036 records in the table TestSmall this table is indeed read once (add 1 for detecting the EOF-situation) but what about TestBig? When the first record matching the condition is found in TestBig the table-scan stops so there are less than 60*10mio. read-operations. Let's verify this: I added this text to the subselect-part of the statement above: and Id = 1. As there is no value 1 in the Id-column this part is always false and the server has to read the full table. For this statement it tells at the end:
table = <TestBig> counter table-scan: 600000060 index-scan: 0
table = <TestSmall> counter table-scan: 301037 index-scan: 0

As the table TestBig contains 10 mio. records (add 1 for the EOF) you can see that this table is indeed read 60 times.

Everything seems clear now. But wait: what about the 25 minutes?

Let's omit the time reading the table TestSmall, one third of a second is a tiny fraction of the 25 minutes the whole process takes, so let's only look at the time for reading TestBig. A full table-scan on this table takes about 12 seconds as shown some lines above. As this is done 60 times I would expect the whole process to take about 12 minutes but it does take twice as long. Why this?


into the code

For inspecting this situation I modified the code in ha_myisam.cc again:
  • added a global variable: time_t timeStart;
  • in ha_myisam::rnd_init() initialize the variable: timeStart = time( NULL );
  • in ha_myisam::rnd_end() there was already a line of code for outputting the value of the counter-variables. I modified this line to output the time too:
    fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\ttime = %ld\n", table_share->table_name.str, counterTableScan, counterIndexScan, time(NULL)-timeStart);
For measuring the time used for a table-scan in this environment it's sufficient to work with a precision of 1 second so the function time() may help here.

And here is the result of the next run of my statement:
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 26
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 27
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 27
......
As you can see the first scan is fast, as expected, but all other scans are slow. Why this?

By stepping into the code I found this function:
int _mi_read_cache(IO_CACHE *info, uchar *buff, my_off_t pos, uint length,
     int flag)
{
  ....
  if (pos < info->pos_in_file)
  {
    .....
    if (mysql_file_pread(info->file, buff, read_length, pos, MYF(MY_NABP)))
      DBUG_RETURN(1);
    if (!(length-=read_length))
      DBUG_RETURN(0);
    ....
  }
  DBUG_RETURN(0);
} /* _mi_read_cache */

As described in my older text the data is read from the file in a block of 128K bytes. The variable info->pos_in_file points to the position in the file containing the beginning of the bytes currently in the block (=buffer). When a record is read the bytes are copied from the (global) buffer into the local buffer - simplified. When the global buffer is at the end the next block of 128K bytes will be read from the file. At the end of the table-scan the global buffer contains the last bytes of the file. When the next table-scan starts from the beginning of the file this information will not be reset so the if()-statement is true in most cases because the bytes from the beginning of the table are requested. So the code following the if()-statement will be executed: the data will be read from the disk directly by calling the read()-function of the Std-library. As there are 10 mio. records in this table it means millions of calls of this function2). And that's slow compared to copying the bytes from the buffer.

How can this be fixed?


a quick and dirty fix

By looking into the code I found a function reinit_io_cache() which looks promising So I added a call to this function to the code. Where do I add this call? The solution I will present here is quick and dirty:
  • it's quick: it can be implemented and tested in minutes
  • it's dirty: I added it in the wrong place (I'm sure about this). And it's specific for my test-case.
Nevertheless it's a test only so I did this:
in ha_myisam.cc I added these lines:
  • a global variable: bool is1stScan = true;
  • in the function ha_myisam::rnd_init() I added these lines:
     counterTableScan = counterIndexScan = 0;
     char *tableName = table_share->path.str;
     if ( 0 == strcasecmp(tableName, "./TestOpt/TestBig") )
     {
         if ( !is1stScan )
             reinit_io_cache(&file->rec_cache,READ_CACHE,0L,0,0);
         is1stScan = false;    // for the following calls
     }
    

So recompiling and starting the server and in the next test this result showed up:
MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN and Id = 1) and B.Hersteller = '00020';
Empty set (13 min 22.58 sec)

MariaDB [TestOpt]> 

That looks better. Please keep in mind that this is he statement that enforces a table-scan up to the end (I've marked that part of the statement in bold).

And here are the timing-values of this execution:
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 13
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 13
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
.......
Please compare this to the values presented above: that looks good. I modified the SQL-statement back to the original one and let it run: it took about 11 minutes and 40 seconds, as expected.


That's it for today.



changing the order of the tables

What do you think about changing the place of the tables meaning putting the table TestSmall into the subquery and the table TestBig outside? So the execution will start whith a table-scan on the TestBig and read the table TestSmall only when a match is found? Loks like this:
MariaDB [TestOpt]> select SQL_NO_CACHE  B.PZN, B.ArtikelText  from TestBig B where exists ( select 1 from TestSmall A where A.Hersteller = '00020' and A.PZN = B.PZN and Id = 1) and B.Hersteller = '36367';
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
MariaDB [TestOpt]> 

Forget about this statement. Let's do some quick calculations: it takes about 1/3 of a second to scan the table TestSmall. As there are 460.000 records in TestBig matching the WHERE-condition and for each record found the table TestSmall is scanned so it takes about 460.000/3 seconds (approx. 42 hours), so as I've already written forget about this.




correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) if you look into the explain you will find this number
2) approx. 20 mio. calls as in the case of MyISAM reading a record means at least 2 calls