Thursday 28 July 2016

the real JOIN: a different look at the case without any index


Please allow me to continue playing with my usual statement in my usual environment. I would like to modify the statement a little bit and show you how much changes in the execution of the modified statement.

In this text I will often refer to something "old", like the "old statement" or the "old text". This always means the text the real JOIN: the case without indexes (or something you can find in it). I will not set a link to that text everywhere here, so please keep this reference in mind.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests things may look a bit different.


environment

Again I'm using my usual environment, which I already described in one of my older posts. It's the same SQL-statement, the same tables, the same data in the tables etc. Only my focus changes.


the old situation

In my old text I looked at this statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (34.99 sec)

MariaDB [TestOpt]> 

The explain tells us that the execution starts with reading the table TestSmall. After this is done the table TestBig is read. Diving into the code showed that while reading the table TestSmall and applying the WHERE-clause to each record just read, every record matching the condition is put into a buffer. When reading this table is finished the server starts reading the records from the table TestBig, applies the WHERE-clause and, when a match is found, does the JOIN by searching the buffer for a record (from TestSmall) with the same value of PZN.


the new situation

For this text I want to change the order of the execution, first TestBig and then TestSmall, and see if and what's changing in the execution of this statement. Here is the statement and the explain:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (1 min 6.82 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

As you can see there is a difference: the new statement took 66 seconds vs. 23 seconds for the old statement, nearly 3 times as long. Why? The purpose of this text is to show you the reason.

Please look at the statement more closely: I've used the keyword straight_join instead of join and I've changed the order of the tables in the statement. Had I changed the order of the tables without using the keyword straight_join, the optimizer would have chosen the same execution plan as before and I could not present this text to you.


some numbers

I want to show you some numbers from the data.

First of all TestSmall contains 301,036 records and TestBig contains 10 million records, which you can verify if you look at the explain-lines above.

In TestSmall and TestBig only a small fraction of the records fulfill the (partial) WHERE-condition:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
+--------------+
| count(A.PZN) |
+--------------+
|           60 |
+--------------+
1 row in set (0.44 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|       461016 |
+--------------+
1 row in set (12.78 sec)

MariaDB [TestOpt]> 

These records have to be matched for joining whether the process starts with TestSmall (old case) or with TestBig (new case). So why does this take 23 seconds in the old case and 66 seconds in the new case?


general remarks

I don't want to show function hierarchies in this text; they are almost the same as in the old text. The action takes place in the function sub_select() and below, in the execution stage but above any storage engine. I've used MyISAM as the storage-engine in my tests, but everything should happen identically with other storage-engines (although I didn't check this).


counting

In my old text I added some counters to the MyISAM-engine which led to this output:
table = <TestSmall> counter table-scan:    301,037    index-scan: 0
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0

For the old case these numbers tell us that no index was used and every record was read exactly once, so the JOINing was done internally.

Let's look at the same numbers in the new case1):
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0
table = <TestSmall> counter table-scan: 17,159,109    index-scan: 0

The value for TestBig is identical (= number of records in this table plus one), telling us that each record in TestBig is read once. But in the new case the number for TestSmall looks confusing. Do not hesitate to take your calculator: 17,159,109 / 301,037 = 57. Does this mean that the server has read the table TestSmall 57 times? At least this would explain the time difference. But is it true?


the algorithm

Let's look at the algorithm in the old case: every record from TestSmall that matches the condition is put into a buffer; there were 60 records in the buffer after the table TestSmall was read. Then the server switches over to the table TestBig and reads record after record from this table. If a record fulfills the WHERE-clause (aka Hersteller = '36367') the buffer is scanned for a match. This is realised as a double loop: an outer loop (reading records from TestBig) and an inner loop (walking through the entries in the buffer). Counting the number of iterations in the inner and the outer loop leads to these numbers:
counterOuterLoop:    461,016
counterInnerLoop: 27,660,960

The number of iterations in the outer loop is the number of matching records from TestBig, and 27,660,960 is the number of iterations in the inner loop: this loop is entered 461,016 times and iterates over the 60 entries in the buffer each time.


the new algorithm

The algorithm in the new case is identical to the old case, but there must be some differences that explain the increase in the time needed. So let's start:
The table TestBig is read record after record and the WHERE-condition is applied to each record. When a match is found the needed information is put into the buffer and the server continues reading the next record from TestBig, the same as in the old case. But the buffer can only hold 128K of data, as this is the size that was allocated for it. After 8,205 matching records the buffer is full: it does not have enough space left for another entry. So the server starts the JOINing process: it reads the records from the table TestSmall, applies the WHERE-condition and, if a match is found, searches the entries in the buffer for a match in the column PZN.
When all records from TestSmall are read (and compared) the server marks the buffer as empty and continues reading the records from TestBig, putting matching records into the buffer. When this buffer is full again the server starts reading the records from TestSmall and searches through the buffer for matches. And so the server continues until all records are read from TestBig.

These are the numbers of iterations:
counterOuterLoop:      3,420
counterInnerLoop: 27,660,960

As you can see the number of iterations in the inner loop is identical, as the same comparisons have to be made, independent of the order of the tables. But the number of iterations in the outer loop is much smaller.

Some lines above I showed you that there are 461,016 records in TestBig which fulfill the condition B.Hersteller = '36367'. Again please take your calculator and do this operation: 461,016 / 8,205, which results in approx. 56.2. So the server fills the buffer 56 times and has some records left for an additional iteration.

As you can see the server indeed reads the table TestSmall 57 times, doing a full table-scan each time. This explains the time needed for this statement.

And that ends my journey for today.


some experiments

Well, it doesn't, I'm still curious.

the buffer

As I've written above the server uses a buffer of size 128K. What about doubling the size of the buffer?

So let's do it. The buffer is allocated in this part of the code:
int JOIN_CACHE::alloc_buffer()
{
  ....
  buff_size= get_max_join_buffer_size(optimize_buff_size);
  ....
    if ((buff= (uchar*) my_malloc(buff_size, MYF(MY_THREAD_SPECIFIC)))) 
      break;
  ....
}

When we simply double the contents of the variable buff_size the buffer will have a size of 256K. Done, and here is the result:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (43.97 sec)

MariaDB [TestOpt]> 

Oh, 30% faster. And my counters tell me:
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0
table = <TestSmall> counter table-scan:  8,730,073    index-scan: 0

which means the table TestSmall is now read 29 times.

So let's take back these modifications and continue with a buffer of size 128K.

the effect of the inner loop

The algorithm used in the code for the JOIN is a double loop. What if we eliminate the inner loop to see how much time is spent in this part of the code? We will get a wrong result, of course, but for this test I accepted that.
Here is the modified part of the code:
uchar *JOIN_CACHE_BNL::get_next_candidate_for_match()
{
//  if (!rem_records)
    return 0;
//  rem_records--;
//  return pos+base_prefix_length;
}

Compare this code with the original code (without the comment-characters) and you will see that the modified version simply returns 0. This tells the caller that there are no entries in the buffer, so the inner loop is not executed at all.
Here are the results:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|            0 |
+--------------+
1 row in set (10.37 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|            0 |
+--------------+
1 row in set (53.16 sec)

In the old case about 50% of the time is spent in the inner loop, in the new case only about 20%. In the new case a lot of time is spent reading the table TestSmall multiple times, so only a smaller fraction is gained by omitting the inner loop.


suggestion

First of all I want to state that I did not find an error in the code: it works and it does its job. But I think there is room for improvement; there should be a better (= faster) algorithm. Unfortunately I cannot give you one. You may say that when JOINing tables one usually creates an index over the column(s) involved. Agreed, but please keep in mind that sometimes the optimizer needs a temporary table for the execution of your statement, and then the effect described here can happen.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner workings of this software, errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) I had to modify my code a bit: the counters were initialized in ha_myisam::rnd_init(). This function was called multiple times whereas ha_myisam::rnd_end() was only called once, so the counters were reset to 0 multiple times before their contents were written to the file. So I moved the initialization of the counters to the header-file. It's still not the correct place, but for this test it was sufficient.