Thursday 28 April 2016

the real JOIN: the case without indexes


In my last texts I started with a SQL-statement and looked where and how this is handled inside the server. You can find my results of the handling of this statement in the optimizer-stage here: optimizer: no indexes. This optimizer-stage is followed by the execution-stage and you can find my results here: execution: with and without indexes. This last text describes how the server reads the records from the tables but the text omits the JOIN between the records of these two tables. And in this text I want to deliver this part.

In my last post I had two cases which I compared: I started without any indexes on the two tables involved in the SQl-statement, later I added one (usable) index on one of the tables. For this text I want to look only at the case without any index defined.


warning

Before I dive into the code let me to issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in my last post. It's the same SQL-statement, the same tables, the same data in the tables etc.. Only my focus changes.


this text

And the focus of this text is the JOIN. Where is the code that really puts the records from the two tables together? And how is this done?

So let's look into the code.


a reminder

Before I start let me show you the statement I'm inspecting including the explain:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
Please look at the last line of the explain: this line describes what I'm looking at in this text.


execution

Let me describe the execution of this statement in some form of pseudocode:
In executing this explain the first step is reading the table TestSmall in a table-scan. Each record fetched is inspected, which means applying the (partial) WHERE-clause (aka comparing the value of the column Hersteller against the string-constant '00020'). If this record fulfils the condition it is stored in an internal buffer and the scan continues. In the end of this step the buffer contains the relevant records from TestSmall for the next step.
Next the table TestBig is read by a table-scan. On each record fetched the (partial) WHERE-clause is applied (aka comparing the value of the column Hersteller against the string-constant '36367'). If this is a valid record the JOIN is done with the records already in the buffer (the records from TestSmall). After this operation the table-scan continues until the end.


1st step: reading TestSmall and putting records into the buffer

So the steps of interest for this part of the text are:
  • where is the information put into the cache ?
  • which information is put into the cache ?
  • where is the memory for the cache allocated ?
So let's look what is done with the records fetched from TestSmall.


put record in cache

If a record is found which fulfils the (partial) WHERE-clause (aka Hersteller = '00020') it is put into the cache and later used when the table TestBig is read.
Here is the function hierarchy for putting the data into the cache:
do_select()
    sub_select()
        evaluate_join_record()
            sub_select_cache()
                JOIN_CACHE::put_record()
The records are fetched in sub_select() and inspected in evaluate_join_record(). If a record fulfils the WHERE-clause the function sub_select_cache() is called which calls put_record() to put the information from this record into the cache.

Let me go back a bit: in handling any query each table involved is internally represented by an object of type JOIN_TAB, so for this query two of these exist: one for table TestSmall and one for table TestBig.
Back to the hierarchy: in the function evaluate_join_record() the record read is inspected. If it fulfils the WEHERE-clause the next level is called by this statement: rc= (*join_tab->next_select)(join, join_tab+1, 0);. So next_select() calls the function sub_select_cache() with the JOIN_TAB representing the table TestBig as a parameter (please look at the +1 in the parameter-list).

And this is the way the code accesses the cache involved:
typedef struct st_join_table {
.....
  JOIN_CACHE *cache;
.....
} JOIN_TAB;

class JOIN_CACHE :public Sql_alloc
{
....
  /* Pointer to the beginning of the join buffer */
  uchar *buff;
....
};

And the data from TestSmall is put in a cache that is associated to the JOIN_TAB representing the table TestBig. And buff is the pointer to the memory-area used for this.


the information in the cache

So put_record() writes the data into the memory-area pointed to by buff. It doesn't do this directly but calls the function write_record_data() to do this. This data is written into the buffer1):
  • a flag: 0x80
  • length of next entry: 0x0006
  • the contents of the column PZN: '178324'
  • length of the next entry: 0x0005
  • the contents of the column Hersteller: '00020'
and these 16 bytes are the data put into the cache, extracted from the first record from TestSmall that matches the condition. As you can see from using length-information this information is not of constant length.

Here is the contents of the first entries in the buffer (all entries are in hex, only the first 3 of 60 entries are shown):

flag length PZN length Hersteller
0x80 0x06 0x00 0x31 0x37 0x38 0x33 0x32 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x36 0x38 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x37 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30


allocating memory for the cache

Now some data is put into a buffer but where does this memory-area come from? Somewhere this area must be allocated, and this is the hierarchy:
mysql_select()
    JOIN::optimize()
        JOIN::optimize_inner()
            make_join_readinfo()
                check_join_cache_usage_for_tables()
                    check_join_cache_usage()
                        JOIN_CACHE_BNL::init()
                            JOIN_CACHE::init()
                                JOIN_CACHE::alloc_buffer()
                                    JOIN_CACHE::get_max_join_buffer_size()

The buffer is allocated in the optimizer-stage. The function alloc_buffer() reserves a memory-area of size 128K of bytes for this and stores the pointer to this cache in the var. buff as shown some lines above. The value 128K is returned from get_max_join_buffer_size() which gets this value from the expression size_t limit_sz= join->thd->variables.join_buff_size;.

This finishes the first step: we have the buffer, all relevant data is put into it and TestSmall is read until the end. Now we switch over and start reading the records from TestBig.


reading TestBig and JOINing the records with the cache

Next the table TestBig is read in a table-scan and the WHERE-condition is applied to each record fetched. If the record read fulfils the condition (aka Hersteller = '36367') a record with the same value in the column PZN from TestSmall is needed for a successful JOIN. As the records from TestSmall are already in the cache this cache is searched. This is done in this code inside the function JOIN_CACHE::join_matching_records()2):
  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* /* Read each possible candidate from the buffer and look for matches *//* 
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

OK, this doesn't explain too much so let's look into the details. Everything starts with reading the records from the table TestBig. For each record fetched the (partial) WHERE-condition is applied and if this condition is fulfilled we have a candidate for a JOIN with a record from TestSmall. This has been done within the function next(), in the box above I marked this with the words outer loop.

Of this record from TestBig the value of PZN is extracted and the list of records from TestSmall in the cache is searched for a match. The steps are:
  1. init the search through the cache
  2. get one entry from the cache
  3. compare the value of PZN with the value of PZN of the record from TestBig
  4. if they match: we found a record from TestSmall and a record from TestBig that do JOIN according to our SQL-statement
  5. continue with step 2 until all entries in the cache are checked
I called the steps numbered 2 to 5 the inner loop.

Here is the function hierarchy:
JOIN_CACHE::join_matching_records()
    outer loop:
    JOIN_TAB_SCAN::next()
        JOIN_CACHE_BNL::prepare_look_for_matches()
            JOIN_CACHE::reset()
    inner loop:
    JOIN_CACHE_BNL::get_next_candidate_for_match() 
    JOIN_CACHE_BNL::read_next_candidate_for_match(
        JOIN_CACHE::get_record()
            JOIN_CACHE::read_all_record_fields()
                JOIN_CACHE::read_flag_fields()
                JOIN_CACHE::read_record_field()
    JOIN_CACHE::generate_full_extensions()
        JOIN_CACHE::check_match()
            skip_record()
                end_send_group()        in case of a match, called via: join_tab->next_select)(join, join_tab+1, 0);
    end inner loop
    end outer loop

In the function skip_record() the ON-clause of my SQL-statement is executed. This is the hierarchy for the comparison of the PZN-values:
Item_func_eq::val_int()
    Arg_comparator::compare() 
        Arg_comparator::compare_string()
            Item_field::val_str()
            Item_field::val_str()
            sortcmp()

And that's the whole process for JOINing the records.

Additionally I want to show you some numbers I found by working with this statement.


some numbers

I've added some variables to count the number of iterations in the inner loop and the outer loop described. These are the results in my test:
counterOuterLoop:    461,016
counterInnerLoop: 27,660,960

Let's verify these numbers with some SQL-statements.
The number of records from TestSmall to be checked in the JOIN:
cMariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
+--------------+
| count(A.PZN) |
+--------------+
|           60 |
+--------------+
1 row in set (0.44 sec)

MariaDB [TestOpt]> 
The number of records from TestBig to be checked in the JOIN:
cMariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|       461016 |
+--------------+
1 row in set (12.78 sec)

MariaDB [TestOpt]> 

The first number verifies the number given above in the description of the first step. You will find the second number as the value of the variable counterOuterLoop. And the value of the variable counterInnerLoop is simply the product of these 2 number, meaning the server fetches 10 mio. records from TestBig, found 461,016 matching records and iterates through the inner loop about 27.6 million times in the handling of my SQL-statement.

And that ends my journey for today.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) please do not forget that this is the data found in my environment.
2) for the details of reading records from TestBig, the hierarchy of the code for doing this and some specialities omitted here please look at my last post.