Thursday 28 April 2016

the real JOIN: the case without indexes


In my last texts I started with a SQL-statement and looked where and how this is handled inside the server. You can find my results of the handling of this statement in the optimizer-stage here: optimizer: no indexes. This optimizer-stage is followed by the execution-stage and you can find my results here: execution: with and without indexes. This last text describes how the server reads the records from the tables but the text omits the JOIN between the records of these two tables. And in this text I want to deliver this part.

In my last post I had two cases which I compared: I started without any indexes on the two tables involved in the SQl-statement, later I added one (usable) index on one of the tables. For this text I want to look only at the case without any index defined.


warning

Before I dive into the code let me to issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in my last post. It's the same SQL-statement, the same tables, the same data in the tables etc.. Only my focus changes.


this text

And the focus of this text is the JOIN. Where is the code that really puts the records from the two tables together? And how is this done?

So let's look into the code.


a reminder

Before I start let me show you the statement I'm inspecting including the explain:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
Please look at the last line of the explain: this line describes what I'm looking at in this text.


execution

Let me describe the execution of this statement in some form of pseudocode:
In executing this explain the first step is reading the table TestSmall in a table-scan. Each record fetched is inspected, which means applying the (partial) WHERE-clause (aka comparing the value of the column Hersteller against the string-constant '00020'). If this record fulfils the condition it is stored in an internal buffer and the scan continues. In the end of this step the buffer contains the relevant records from TestSmall for the next step.
Next the table TestBig is read by a table-scan. On each record fetched the (partial) WHERE-clause is applied (aka comparing the value of the column Hersteller against the string-constant '36367'). If this is a valid record the JOIN is done with the records already in the buffer (the records from TestSmall). After this operation the table-scan continues until the end.


1st step: reading TestSmall and putting records into the buffer

So the steps of interest for this part of the text are:
  • where is the information put into the cache ?
  • which information is put into the cache ?
  • where is the memory for the cache allocated ?
So let's look what is done with the records fetched from TestSmall.


put record in cache

If a record is found which fulfils the (partial) WHERE-clause (aka Hersteller = '00020') it is put into the cache and later used when the table TestBig is read.
Here is the function hierarchy for putting the data into the cache:
do_select()
    sub_select()
        evaluate_join_record()
            sub_select_cache()
                JOIN_CACHE::put_record()
The records are fetched in sub_select() and inspected in evaluate_join_record(). If a record fulfils the WHERE-clause the function sub_select_cache() is called which calls put_record() to put the information from this record into the cache.

Let me go back a bit: in handling any query each table involved is internally represented by an object of type JOIN_TAB, so for this query two of these exist: one for table TestSmall and one for table TestBig.
Back to the hierarchy: in the function evaluate_join_record() the record read is inspected. If it fulfils the WEHERE-clause the next level is called by this statement: rc= (*join_tab->next_select)(join, join_tab+1, 0);. So next_select() calls the function sub_select_cache() with the JOIN_TAB representing the table TestBig as a parameter (please look at the +1 in the parameter-list).

And this is the way the code accesses the cache involved:
typedef struct st_join_table {
.....
  JOIN_CACHE *cache;
.....
} JOIN_TAB;

class JOIN_CACHE :public Sql_alloc
{
....
  /* Pointer to the beginning of the join buffer */
  uchar *buff;
....
};

And the data from TestSmall is put in a cache that is associated to the JOIN_TAB representing the table TestBig. And buff is the pointer to the memory-area used for this.


the information in the cache

So put_record() writes the data into the memory-area pointed to by buff. It doesn't do this directly but calls the function write_record_data() to do this. This data is written into the buffer1):
  • a flag: 0x80
  • length of next entry: 0x0006
  • the contents of the column PZN: '178324'
  • length of the next entry: 0x0005
  • the contents of the column Hersteller: '00020'
and these 16 bytes are the data put into the cache, extracted from the first record from TestSmall that matches the condition. As you can see from using length-information this information is not of constant length.

Here is the contents of the first entries in the buffer (all entries are in hex, only the first 3 of 60 entries are shown):

flag length PZN length Hersteller
0x80 0x06 0x00 0x31 0x37 0x38 0x33 0x32 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x36 0x38 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x37 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30


allocating memory for the cache

Now some data is put into a buffer but where does this memory-area come from? Somewhere this area must be allocated, and this is the hierarchy:
mysql_select()
    JOIN::optimize()
        JOIN::optimize_inner()
            make_join_readinfo()
                check_join_cache_usage_for_tables()
                    check_join_cache_usage()
                        JOIN_CACHE_BNL::init()
                            JOIN_CACHE::init()
                                JOIN_CACHE::alloc_buffer()
                                    JOIN_CACHE::get_max_join_buffer_size()

The buffer is allocated in the optimizer-stage. The function alloc_buffer() reserves a memory-area of size 128K of bytes for this and stores the pointer to this cache in the var. buff as shown some lines above. The value 128K is returned from get_max_join_buffer_size() which gets this value from the expression size_t limit_sz= join->thd->variables.join_buff_size;.

This finishes the first step: we have the buffer, all relevant data is put into it and TestSmall is read until the end. Now we switch over and start reading the records from TestBig.


reading TestBig and JOINing the records with the cache

Next the table TestBig is read in a table-scan and the WHERE-condition is applied to each record fetched. If the record read fulfils the condition (aka Hersteller = '36367') a record with the same value in the column PZN from TestSmall is needed for a successful JOIN. As the records from TestSmall are already in the cache this cache is searched. This is done in this code inside the function JOIN_CACHE::join_matching_records()2):
  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* /* Read each possible candidate from the buffer and look for matches *//* 
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

OK, this doesn't explain too much so let's look into the details. Everything starts with reading the records from the table TestBig. For each record fetched the (partial) WHERE-condition is applied and if this condition is fulfilled we have a candidate for a JOIN with a record from TestSmall. This has been done within the function next(), in the box above I marked this with the words outer loop.

Of this record from TestBig the value of PZN is extracted and the list of records from TestSmall in the cache is searched for a match. The steps are:
  1. init the search through the cache
  2. get one entry from the cache
  3. compare the value of PZN with the value of PZN of the record from TestBig
  4. if they match: we found a record from TestSmall and a record from TestBig that do JOIN according to our SQL-statement
  5. continue with step 2 until all entries in the cache are checked
I called the steps numbered 2 to 5 the inner loop.

Here is the function hierarchy:
JOIN_CACHE::join_matching_records()
    outer loop:
    JOIN_TAB_SCAN::next()
        JOIN_CACHE_BNL::prepare_look_for_matches()
            JOIN_CACHE::reset()
    inner loop:
    JOIN_CACHE_BNL::get_next_candidate_for_match() 
    JOIN_CACHE_BNL::read_next_candidate_for_match(
        JOIN_CACHE::get_record()
            JOIN_CACHE::read_all_record_fields()
                JOIN_CACHE::read_flag_fields()
                JOIN_CACHE::read_record_field()
    JOIN_CACHE::generate_full_extensions()
        JOIN_CACHE::check_match()
            skip_record()
                end_send_group()        in case of a match, called via: join_tab->next_select)(join, join_tab+1, 0);
    end inner loop
    end outer loop

In the function skip_record() the ON-clause of my SQL-statement is executed. This is the hierarchy for the comparison of the PZN-values:
Item_func_eq::val_int()
    Arg_comparator::compare() 
        Arg_comparator::compare_string()
            Item_field::val_str()
            Item_field::val_str()
            sortcmp()

And that's the whole process for JOINing the records.

Additionally I want to show you some numbers I found by working with this statement.


some numbers

I've added some variables to count the number of iterations in the inner loop and the outer loop described. These are the results in my test:
counterOuterLoop:    461,016
counterInnerLoop: 27,660,960

Let's verify these numbers with some SQL-statements.
The number of records from TestSmall to be checked in the JOIN:
cMariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
+--------------+
| count(A.PZN) |
+--------------+
|           60 |
+--------------+
1 row in set (0.44 sec)

MariaDB [TestOpt]> 
The number of records from TestBig to be checked in the JOIN:
cMariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|       461016 |
+--------------+
1 row in set (12.78 sec)

MariaDB [TestOpt]> 

The first number verifies the number given above in the description of the first step. You will find the second number as the value of the variable counterOuterLoop. And the value of the variable counterInnerLoop is simply the product of these 2 number, meaning the server fetches 10 mio. records from TestBig, found 461,016 matching records and iterates through the inner loop about 27.6 million times in the handling of my SQL-statement.

And that ends my journey for today.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) please do not forget that this is the data found in my environment.
2) for the details of reading records from TestBig, the hierarchy of the code for doing this and some specialities omitted here please look at my last post.

Tuesday 5 April 2016

execution: with and without indexes


in my last post I showed how the case without any indexes is handled by the optimizer. For this post I want to show how the plan found by the optimizer is executed. Additionally I want to compare the execution of this case with the execution of a case with one index added on one table.

So let me look into the code for the execution of the case without any indexes and start my description when the optimizer-stage is already done. And my description will stop when the storage-engine is entered, so the text describes the code in the database-server. For my tests I've used MariaDB, for MySQL things may look a bit different.

environment

Again I'm using my usual environment which I already described here: JOIN.

In the first step I want to inspect the case without any index but instead of dropping all indexes on the tables involved for my query I added one index to the table TestBig, this simplified the handling of my tests. For testing the case without any index I told the optimizer in my SQL-statement in the form of a hint to not use this existing index. In the end of this text I will explain why I did this.

So here is the environment:
MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.00 sec)

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> 
So you can see there is no index on the table TestSmall and one index on the table TestBig, on the column PZN.

And here is my standard-statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
I've included the explain and marked the hint IGNORE INDEX(PZN)1) in the SQL-statement and in the explain I also marked the access-type in bold in this box.

expected execution

I've already described the behavior of the server in the case with an index: JOIN (2). In the case described the server scans through the table TestSmall, reading and inspecting record after record, and when a matching record is found to switch over to the table TestBig and search for corresponding records in this table taking the WHERE- and ON-clause into account; after no more records can be found in TestBig the server returns to TestSmall and continues with the table-scan.
And for the case without any index I expected the server to behave in the same way. If no index is available there is no other way for searching the table TestBig than by a table-scan o I expected the execution of the statement to be very slow because for every record found in TestSmall a full table-scan of TestBig has to be made. But if you look at the box above you will see that this statement took 23 seconds for the execution and that's not so bad.

In an older post I introduced some counter-variables and by using these I can see this result in the file /var/log/mysql/error.log:
table = <TestSmall> counter table-scan: 301037      index-scan: 0
table = <TestBig>   counter table-scan: 10000001    index-scan: 0
When I look at these numbers I can see that table TestSmall is read once and also table TestBig is read only once2). Good. So how is this handled in the code?

execution

Before I continue with this text I want to issue a warning: the functions and hierarchies shown here are valid for the statement and environment given above. If you do your own tests this may look a bit different.

the case without index

Now I want to start with the execution of the statement without any indexes on the tables. Execution starts with the table TestSmall, so here is the function hierarchy for reading this table:
mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()
                    // read one record from TestSmall
                    evaluate_join_record()

In the next lines of this text I want to inspect the function sub_select() and see what's happening there.
The code in sub_select() distinguishes between a first read and all following reads. For the first record read everything needed is initialized, the first record is read from TestSmall and this record is inspected. This is the code that reads the first record from the table TestSmall:
  if (rc != NESTED_LOOP_NO_MORE_ROWS)
  {
    error= (*join_tab->read_first_record)(join_tab); 
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Let me omit the function evaluate_join_record() for some seconds. For read_first_record() the function-hierarchy looks like this:
(*join_tab->read_first_record)  -->  join_init_read_record()      
    init_read_record()
        handler::ha_rnd_init_with_error()
            handler::ha_rnd_init()
                ha_myisam::rnd_init()
    rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()
I stopped the presentation of the hierarchy at the engine-level because I didn't want to dive deeper into the code for this post.

After the first record is read and inspected the code in sub_select() continues with a while-loop scanning the table TestSmall until EOF (or NESTED_LOOP_NO_MORE_ROWS). Here is the code:
  while (rc == NESTED_LOOP_OK && join->return_tab >= join_tab)
  {
    ....
    error= info->read_record(info);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Here is how the next record is fetched:
    info->read_record()  -->  rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()

Now I will come back to the function evaluate_join_record(). Here is the code executed (simplified):
static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab,
                     int error)
{
  ....
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());
    ....
  }

  if (!select_cond || select_cond_result)
  {
   ....
    if (found)
    {
      enum enum_nested_loop_state rc;
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0); 
    }
    ......
  }
  DBUG_RETURN(NESTED_LOOP_OK);
}

The record fetched is inspected in the macro MY_TEST3). If this record fulfills the condition the line with (*join_tab->next_select)() calls another function which in this case (the case without any index) is the function sub_select_cache()4). In this function the record is stored in an internal structure, a cache. It then returns to sub_select() and the cycle continues with the next fetch from the table TestSmall until all records from this table are read.


now with index

Please allow me to stop this description (I do know that I'm just in the middle of the execution) and switch to the case with an index on the table TestBig. First I want to show the statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.40 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

The explain tells me that the execution of this statement starts with a table-scan on TestSmall and the table TestBig is read via the index PZN. To verify this I looked at my counter-variables, these showed this result:
table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 3115)
and you see there is no table-scan on the table TestBig.

The execution starts with a table-scan on TestSmall. This is handled similar to the case without index except for the handling in the function evaluate_join_record(). Now the hierarchy looks like this:
sub_select()
    evaluate_join_record()
        sub_select()                via:  rc= (*join_tab->next_select)(join, join_tab+1, 0); 
            // read a record from TestBig (=join_tab+1)
            evaluate_join_record()
                end_send_group()

The fetching of the record from TestBig is handled in 2 cases (similar to the cases described above):
fetching the first record:
sub_select()
    evaluate_join_record()
        sub_select()
            (*join_tab->read_first_record)() ->  join_read_always_key()
                handler::ha_index_init()
                    ha_myisam::index_init()
                handler::ha_index_read_map()
                    ha_myisam::index_read_map()
            evaluate_join_record()
                end_send_group()

all other records are fetched this way:
sub_select()
    evaluate_join_record()
        sub_select()
            info->read_record() -> join_read_next_same()
                handler::ha_index_next_same()
                    ha_myisam::index_next_same()
            evaluate_join_record()
                end_send_group()

As you can see the structure of the code is similar but the functions called from evaluate_join_record() differs. In this case it extracts the value of the column PZN from the record from TestSmall and reads record after record from TestBig using the index PZN with this value. When no more records are found with this value in the column PZN it returns to TestSmall and continues the table-scan.6)

in general

In evaluate_join_record() the next level is called by the statement (*join_tab->next_select)(join, join_tab+1, 0). The variable next_select is defined here:
sql_select.h:
typedef struct st_join_table {
  ...
  READ_RECORD::Setup_func read_first_record;
  Next_select_func next_select;
  ...
} JOIN_TAB;
and the type Next_selec_func is defined this way:
typedef enum_nested_loop_state
(*Next_select_func)(JOIN *, struct st_join_table *, bool);
The return-type is an enumeration defined as:
enum enum_nested_loop_state
{
  NESTED_LOOP_KILLED= -2, NESTED_LOOP_ERROR= -1,
  NESTED_LOOP_OK= 0, NESTED_LOOP_NO_MORE_ROWS= 1,
  NESTED_LOOP_QUERY_LIMIT= 3, NESTED_LOOP_CURSOR_LIMIT= 4
};

join_tab+1

You certainly noticed the expression join_tab+1 occurring multiple times as a parameter in the function-calls in the code-boxes. The reason of this is: in the beginning of the optimizer-stage an object of type JOIN is created. This structure contains the variable join_tab which points to a list of entries of type JOIN_TAB. So the statement join_tab+1 points to the next entry in this list.

Let me show you some lines from the comment above the function sub_select():
  @param join      pointer to the structure providing all context info for
                   the query
  @param join_tab  the first next table of the execution plan to be retrieved
  @param end_records  true when we need to perform final steps of retrival
By calling this function the first parameter contains a lot of information about the query, the second parameter points to the JOIN_TAB to handle with this call and the last parameter is a flag.

If you look at the code-hierarchy above you will see that the function sub_select() is called recursively, so by the expression join_tab+1 the entries in this list are handled one after the other.

with index: final steps

Before I go back to the case without index (=my starting point) I want to show the last steps done for the case with index, it's not finished yet.

The central routines as described above were located in the function sub_select(). This function is called from do_select() in this form:
    if (join->outer_ref_cond && !join->outer_ref_cond->val_int())
      error= NESTED_LOOP_NO_MORE_ROWS;
    else
      error= sub_select(join,join_tab,0);    
    if ((error == NESTED_LOOP_OK || error == NESTED_LOOP_NO_MORE_ROWS) &&
        join->thd->killed != ABORT_QUERY)
      error= sub_select(join,join_tab,1);

In this code you can see that the function sub_select() is called twice. When called the first time all the work is done as described some lines above, it's called again with identical parameters except for the last one (above marked in red), which stands for bool end_of_records.

Now with a value of 1 for end_of_records the code executed in the function called looks like this:
in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }
For this second call of sub_select() the hierarchy looks like this:
do_select()
    sub_select()                 with table TestSmall and a value of 1 for the parameter end_of_records
        sub_select()             with table TestBig (by join_tab+1)
            end_send_group()     with table NULL

This finishes the handling of this query, at least I will stop here with my description.


back to the case without index

Now I want to go back to my starting point which I've left in the middle of the execution.
As described above the table TestSmall is read sequentially, record after record is fetched and the record fetched is inspected in evaluate_join_record(). If it is accepted it is given to the function sub_select_cache() which stores this record in a cache. When the table TestSmall is read to the end the code returns from sub_select() to the calling function do_select(). And here my description stopped, up to this point only the table TestSmall is inspected, the table TestBig is not touched. So let me now describe how the table TestBig is handled.

Some lines above I showed some code in the function do_select(). In these lines in do_select() the function sub_select() is called twice and for the case without an index my description until now shows what's happening only in the first call of this function. For the second call of sub_select() the hierarchy looks like this:
do_select()
    sub_select()     called with table TestSmall, but now with a value of 1 for the parameter end_of_records
        sub_select_cache()     via (*join_tab->next_select), with table TestBig
            JOIN_CACHE::join_records()
                JOIN_CACHE::join_matching_records()
                    JOIN_TAB_SCAN::open()
                        join_init_read_record()
                            init_read_record()
                                handler::ha_rnd_init_with_error()
                                    handler::ha_rnd_init()
                                        ha_myisam::rnd_init()
                            rr_sequential()
                                handler::ha_rnd_next()
                                    ha_myisam::rnd_next()    1st record of table TestBig read
With the last function-call shown in this box the first record is read from TestBig.

Let's go back to the top of this hierarchy. The function do_select() calls sub_select() (the second call) with a value of 1 for the last parameter. In sub_select() this is the code for handling this:
in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }

As end_of_records has a value of 1 the code following the if()-statement is executed and next_select() calls sub_select_cache() with the next entry from the JOIN_TAB-list (=TestBig). In this function you will find some code similar to the code in sub_select():
in sub_select_cache():
  if (end_of_records)
  {
    rc= cache->join_records(FALSE);
    if (rc == NESTED_LOOP_OK || rc == NESTED_LOOP_NO_MORE_ROWS)
      rc= sub_select(join, join_tab, end_of_records);
    DBUG_RETURN(rc);
  }

As end_of_records is still 1 the code within the braces is executed and the function JOIN_CACHE::join_records() is called. The hierarchy above shows the code-hierarchy executed up to the point where the first record is read from TestBig. With this record the code returns to JOIN_CACHE::join_matching_records(). Now a loop in this function handles reading the records from TestBig and it looks like that:
  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* /* Read each possible candidate from the buffer and look for matches *//* 
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

Record after record is fetched from TestBig by calling join_tab_scan->next():
JOIN_CACHE::join_matching_records()
    JOIN_TAB_SCAN::next()
        rr_sequential()            via info->read_record()
            handler::ha_rnd_next()
                ha_myisam::rnd_next()

With this record from TestBig the JOIN has to be made so the code walks sequentially through all entries of a cache containing the records from TestSmall and tries to find a match (see: inner loop in the box above). When this is done the next record is fetched from TestBig by the statement join_tab_scan->next(), the next iteration of the outer loop in the code-box above.

Two special cases happen when reading a record from TestBig. For the first case the code in JOIN_TAB_SCAN::next() is called with the record already read (as described above). This is handled by this statement:
if (is_first_record)
    is_first_record= FALSE;
  else
    err= info->read_record(info);

In the second case a record is read from TestBig which does not fulfill the (partial) WHERE-clause. This is handled here:
  while (!err && select && (skip_rc= select->skip_record(thd)) <= 0)
  {
    ....
    /* 
      Move to the next record if the last retrieved record does not
      meet the condition pushed to the table join_tab.
    */
    err= info->read_record(info);
    ....
  } 

The function SQL_SELECT::skip_record() contains this check and the code looks like:
  inline int skip_record(THD *thd)
  {
    int rc= MY_TEST(!cond || cond->val_int());
    if (thd->is_error())
      rc= -1;
    return rc;
  }
The check is done within the macro MY_TEST7) and so the function evaluate_join_record() is not needed.

One piece is missing: in sub_select_cache() the function JOIN_CACHE::join_records() is called and my description above showed that the table TestBig is accessed. But if you look at the lines in sub_select_cache() you will see that a second function is called: sub_select(). This call simply calls the function end_send_group() which finishes the whole activity.


And here I want to stop my description. Looking at my text I think that I've shown how the case without index is handled and also I compared it to the case with one index.
If you think some part is missing please drop me a line using the mail-function on the right side of this blog.


why I used the hint IGNORE INDEX

Initially I wanted to describe the case without any index only. But soon I came upt with the question: how is this handled in the case of an index? So I made a lot of comparisons how the 2 cases are treated in the code. To switch from one case to another I had to
  • drop the index and do the next test without any index
or for switching back
  • create the index and start the test-case using the index
Here you can see that dropping the index takes about 100 seonds, creating it again takes about 2 minutes:
MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> drop index PZN on TestBig;
Query OK, 10000000 rows affected (1 min 41.39 sec)     
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 59.57 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> 
Instead of waiting 2 minutes before I can start the test of the other case I used the hint, doing it this way was faster.




correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) also included in this statement is the hint SQL_NO_CACHE. This hint tells the database-server to do the fetches again instead of using the internal cache. This helped me using this statement again and again and ..... So this statement contains two hints.
2) the table TestSmall contains 301.036 and the table TestBig contains 10.000.000 records so each table is indeed scanned only once (add 1 for detecting the EOF-situation).
3) you will find a description of the inspection-process here: JOIN (3).
4) I will stop here, at least for this post.
5) the table TestBig is accessed every time a matching record is found in TestSmall so the number 311 does not tell that 311 records are read from TestBig but that TestBig is accessed 311 times including reporting NESTED_LOOP_NO_MORE_ROWS multiple times.
6) you will find mor details of this behavior here: JOIN (2)
7) I've already described this macro here: JOIN (3)