Wednesday 7 December 2016

subselect: execution (2)


In my last post (subselect: execution) I played with a special SQL-statement and showed how it is treated in the execution-stage of the server. For this text I simply want to change the order of the tables within this statement and see what happens then. Most of this text looks at the code of MariaDB; at the end I will describe the difference to MySQL.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc. I've changed the SQL-statement and therefore my focus changes.


the SQL-statement

For this text this is the statement I want to inspect:
MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (17.84 sec)

MariaDB [TestOpt]>

Let me add the explain the server returns:
MariaDB [TestOpt]> explain select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020'\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: <subquery2>
         type: eq_ref
possible_keys: distinct_key
          key: distinct_key
      key_len: 7
          ref: func
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
3 rows in set (0.02 sec)

MariaDB [TestOpt]>

As you can see the statement looks very similar to the statement in my last post. Again it's a statement with a subselect. If you look at it a bit closer you will see that the order of the tables has changed: now the table TestBig is in the subselect (=subquery) and TestSmall is in the outer part of the query. This query also takes about 10% more time than the old one; this can be due to the fact that the subselect now returns many more rows (about 400K) than the old subselect (about 60). And this query returns only 4 rows, whereas the old one returned 14 rows.
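
The row counts of the two subselects are easy to verify with two small statements (output omitted here; in my data they return 60 and 461,016 rows respectively, numbers that also show up in an older post of this series):
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';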

Why do we now have only 4 rows returned? Let's look at the result of the old query again:
MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText from TestBig A where A.PZN in ( select B.PZN from TestSmall B where B.Hersteller = '00020') and A.Hersteller = '36367';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | YPSIFLEX 10CMX4M EL FIXIER |
| 178324 | STIBIUM SULF AURANT D 6    |
| 87886  | GILOFA BAS 40 KNSTR 1KANDI |
| 178324 | SOFTCLIX II LANZETTEN      |
| 87952  | RUTIVISCAL N               |
| 178324 | CROTAMITON                 |
| 87952  | NOBASTRETCH KRAEFT 7MX6CM  |
| 87952  | STECHBECKEN 31CM STAHL     |
| 87923  | LIPP LOVE LUXUS SUN PROT12 |
| 87952  | TASTATURHILFE UNT NC21017  |
| 178324 | HOLLISTER KOL KLEBE F 2178 |
| 87952  | MEDI TEST GLUCOSE          |
| 178324 | ESTOSAN TUBE               |
| 87952  | FRUCHTS DUSCHGEL ORANGE    |
+--------+----------------------------+
14 rows in set (17.34 sec)

MariaDB [TestOpt]> 

If you look a bit closer at the result you will see duplicate values in the column PZN. This statement verifies it:
MariaDB [TestOpt]> select SQL_NO_CACHE  distinct A.PZN from TestBig A where A.PZN in ( select B.PZN from TestSmall B where B.Hersteller = '00020') and A.Hersteller = '36367';
+--------+
| PZN    |
+--------+
| 178324 |
| 87886  |
| 87952  |
| 87923  |
+--------+
4 rows in set (15.72 sec)

MariaDB [TestOpt]> 

So there are multiple records with identical values in the column PZN in TestBig, but not in the table TestSmall. This explains the different row counts.

So is this statement treated identically to the one described in the last text? Yes. Here is a description of the engine used for the temp-table created internally: MEMORY Storage Engine. But one term caught my interest: max_heap_table_size. So what will happen when the size of the table exceeds this value?

From the documentation we know that this variable can contain a value between 16 KB and 4,294,966,272 bytes (= 4GB - 1KB); the default is 16 MB. So let's play with this value.
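
One small remark before the experiments: as far as I understand the documentation the limit for an internal in-memory temp-table is the minimum of tmp_table_size and max_heap_table_size, so it's worth looking at both variables before playing with one of them (output not shown, it depends on your configuration; in my tests only max_heap_table_size was lowered, so it was the limiting one):
MariaDB [TestOpt]> select @@tmp_table_size, @@max_heap_table_size;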


playing with max_heap_table_size

Let's start with the default value:
MariaDB [TestOpt]> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|              16777216 |
+-----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (16.66 sec)

MariaDB [TestOpt]>

As you can see the variable has a value of 16 MB. Let's try to set this to a minimal value:
MariaDB [TestOpt]> set max_heap_table_size = 1024;
Query OK, 0 rows affected, 1 warning (0.00 sec)

MariaDB [TestOpt]> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|                 16384 |
+-----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (35.47 sec)

MariaDB [TestOpt]>

Something changes now. I tried to set the variable max_heap_table_size to a minimal value but the server set it to 16 KB instead. And the execution of the statement takes a lot more time; it more than doubles.
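
By the way, the SET-statement above already reported '1 warning'; asking the server for it should show a message about the value for max_heap_table_size being adjusted (I haven't copied the exact text here):
MariaDB [TestOpt]> set max_heap_table_size = 1024;
MariaDB [TestOpt]> show warnings;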

So why does this happen? And where does it happen?

Let's dive into the code.


identical behaviour

In the beginning the behaviour of the server is identical to the last example, so here is only a small reminder of the hierarchy for reading the first record from the table TestSmall1):
mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()
                    join_init_read_record()           called via (*join_tab->read_first_record)()
                        rr_sequential()
                    evaluate_join_record()

As (in my case) evaluate_join_record() did not find a match the server goes to the main loop and reads record after record from TestSmall, inspecting each one:
sub_select()
    rr_sequential()       called via info->read_record()
    evaluate_join_record()

When evaluate_join_record() finds a matching record from TestSmall it acts as described in my last text: the server switches from level 1 (=reading from TestSmall) to level 2 (=handling the temp-table) to level 3 (= reading from TestBig and putting the data found into the temp-table). On this level the hierarchy looks like this:
sub_select()                                level 3
    evaluate_join_record()
        end_sj_materialize()                called via (*join_tab->next_select)()
            fill_record()
                Item_field::save_in_field()
                    save_field_in_field()
            handler::ha_write_tmp_row()
                ha_heap::write_row()

All the same, as before.


new behaviour

With the SET-statement shown some lines above I've restricted the size of a table of type MEMORY to a maximum of 16K bytes, and now things change.

As there are approx. 400K records in TestBig that fulfill the WHERE-condition and therefore have to be put into the temp-table, the size of this table will be at least approx. 2.4 MB (=400K records * 6 bytes for the PZN-value), which exceeds the newly given max. value for this table. So let's look at how this situation is detected and handled.
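
One detail to keep in mind: the temp-table has the distinct_key shown in the explain, so duplicates among these PZN-values are rejected and only the distinct values really land in the table. A statement like this one shows both numbers (output omitted):
MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN), count(distinct B.PZN) from TestBig B where B.Hersteller = '36367';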

Somewhere in the middle of the execution I looked into the /tmp-directory and found this:
august@AMD4:~/MariaDB/pgm$ ls -l /tmp
insgesamt 36
....
-rw-rw---- 1 august august  4104 Nov 30 16:31 #sql_243a_0.MAD
-rw-rw---- 1 august august 16384 Nov 30 16:31 #sql_243a_0.MAI
....
august@AMD4:~/MariaDB/pgm$ 

As you can see 2 files are created with the name of the temp-table plus the extensions MAD and MAI. By looking into the MAD-file I found this:
august@AMD4:~/MariaDB/pgm$ od -t x1 -Ax /tmp/#sql_243a_0.MAD
000000 ff 35 33 35 38 31 38 20 ff 39 39 39 39 39 39 20       '535818 ' '999999 '
000010 ff 32 30 34 36 36 35 20 ff 35 31 33 35 37 33 20       '204665 ' '513573 '
000020 ff 33 38 32 39 37 30 20 ff 32 31 37 38 35 37 20       '382970 ' '217857 '
and so on

This looks like the data-file of the Maria-engine (its format is almost identical to MyISAM2)) and the contents shown are the first values of the column PZN from the table TestBig. The MAI-file looked like the corresponding index-file. So where does this happen in the code?

It can only happen when a new record is added to the temp-table and with this record the temp-table exceeds some given limit. Let's look over there:
sub_select()                      we are at level 3
    evaluate_join_record()
        end_sj_materialize()
            handler::ha_write_tmp_row()
                ha_heap::write_row()
                    heap_write()
                        next_free_record_pos()

In the last function given in this hierarchy this situation is detected:
  if (!(block_pos=(info->records % info->block.records_in_block)))
  {
    if ((info->records > info->max_records && info->max_records) ||
        (info->data_length + info->index_length >= info->max_table_size))
    {
      ....
      my_errno=HA_ERR_RECORD_FILE_FULL;           = 135
      DBUG_RETURN(NULL);
    }
    ....
  }

In the case I inspected the var. info->max_records contained a value of 512 (more on this later), so trying to insert the next record results in the function returning NULL. The code then walks back up the hierarchy to the function end_sj_materialize(), to the point where the call started. Here the code looks like this:
    if ((error= table->file->ha_write_tmp_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_internal_tmp_table_from_heap(thd, table,
                                              sjm->sjm_table_param.start_recinfo, 
                                              &sjm->sjm_table_param.recinfo, error, 1, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }

Normally the function ha_write_tmp_row() returns 0 (when everything is OK) or HA_ERR_FOUND_DUPP_KEY (when the value is already in the temp-table), but in the case of HA_ERR_RECORD_FILE_FULL the contents of the temp-table is written to disk. If you look at the information some lines above you will see that an entry in the MAD-file is 8 bytes long, so with 513 records the size of the temp-file on disk is 513 * 8 = 4104 bytes, which matches the ls-output above.

From this moment on the temp-table is no longer of type MEMORY but of type MARIA: the table is no longer stored in RAM but on disk, and all reading and writing to the temp-table is handled with the disk-based version of the file (or with the cache of this file). You lose the speed-advantage of the RAM-based table but the handling of the statement continues. The situation of overflowing the size-limit for the temp-table in RAM is detected and the server silently switches to a disk-based solution and continues.

This explains the increase in execution-time needed when the max. size of the RAM-based HEAP-table is changed to a minimal value.
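
If you don't want to watch the /tmp-directory there is an easier indicator: the status counter Created_tmp_disk_tables should go up when the statement is executed with the small max_heap_table_size, because (as far as I can tell) the conversion to the disk-based table is counted there. Roughly like this (output not shown):
MariaDB [TestOpt]> flush status;
MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
MariaDB [TestOpt]> show session status like 'Created_tmp%';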

Where does the value 512 for the switch come from? You will find the code in heap_prepare_hp_create_info():
    case HA_KEY_ALG_HASH:
      keydef[key].algorithm= HA_KEY_ALG_HASH;
      mem_per_row+= sizeof(char*) * 2; // = sizeof(HASH_INFO)               result = 16
      break;
....
  mem_per_row+= MY_ALIGN(share->reclength + 1, sizeof(char*));              result = 32
....
  max_rows= (ha_rows) (hp_create_info->max_table_size / mem_per_row);       result = 512
....
  hp_create_info->max_records= (ulong) MY_MIN(max_rows, ULONG_MAX);         result = 512

In the beginning of the execution of this SQL-statement the value for max_records is computed (an estimation) and set: with max_heap_table_size lowered to 16K this is max_table_size / mem_per_row = 16,384 / 32 = 512.

Here my journey ends for today.


what about MySQL?

No, it's not finished yet. MySQL acts very similarly but it creates the file on disk as a MyISAM-file. Here you can see the differences in the code:
In MySQL3), in end_sj_materialize():
    if ((error= table->file->ha_write_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_myisam_from_heap(thd, table,
                                  sjm->table_param.start_recinfo, 
                                  &sjm->table_param.recinfo, error,
                                  TRUE, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }
In MariaDB4), in end_sj_materialize():
    if ((error= table->file->ha_write_tmp_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_internal_tmp_table_from_heap(thd, table,
                                              sjm->sjm_table_param.start_recinfo, 
                                              &sjm->sjm_table_param.recinfo, error, 1, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }

The function called in the MySQL-case:
/**
  If a MEMORY table gets full, create a disk-based table and copy all rows
  to this.

......
*/

bool create_myisam_from_heap(THD *thd, TABLE *table,
                             MI_COLUMNDEF *start_recinfo,
                             MI_COLUMNDEF **recinfo, 
        int error, bool ignore_last_dup,
                             bool *is_duplicate)

The function called in the MariaDB-case:
/*
  If a HEAP table gets full, create a internal table in MyISAM or Maria
  and copy all rows to this
*/


bool
create_internal_tmp_table_from_heap(THD *thd, TABLE *table,
                                    TMP_ENGINE_COLUMNDEF *start_recinfo,
                                    TMP_ENGINE_COLUMNDEF **recinfo, 
                                    int error,
                                    bool ignore_last_dupp_key_error,
                                    bool *is_duplicate)


And that's it for today.




correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) in the last text the reading starts with the table TestBig, in this case it starts with reading from the table TestSmall because this is the table in the outer part of the statement
2) I already described this here: od
3) MySQL version 5.6.22
4) MariaDB version 10.0.10

Tuesday 1 November 2016

subselect: execution


Some weeks ago I wanted to leave this topic and look at other parts of the code, but I found some queries which looked interesting and therefore I stayed on my route. So let me start this text with my usual introductory words.

warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.

environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc. I've changed the SQL-statement and therefore my focus changes.

the SQL-statement

So here is the statement which I want to inspect; I will describe my results in this text:
MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestBig B where B.PZN in ( select A.PZN from TestSmall A where A.Hersteller = '00020') and B.Hersteller = '36367';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | YPSIFLEX 10CMX4M EL FIXIER |
| 178324 | STIBIUM SULF AURANT D 6    |
| 87886  | GILOFA BAS 40 KNSTR 1KANDI |
| 178324 | SOFTCLIX II LANZETTEN      |
| 87952  | RUTIVISCAL N               |
| 178324 | CROTAMITON                 |
| 87952  | NOBASTRETCH KRAEFT 7MX6CM  |
| 87952  | STECHBECKEN 31CM STAHL     |
| 87923  | LIPP LOVE LUXUS SUN PROT12 |
| 87952  | TASTATURHILFE UNT NC21017  |
| 178324 | HOLLISTER KOL KLEBE F 2178 |
| 87952  | MEDI TEST GLUCOSE          |
| 178324 | ESTOSAN TUBE               |
| 87952  | FRUCHTS DUSCHGEL ORANGE    |
+--------+----------------------------+
14 rows in set (16.24 sec)

MariaDB [TestOpt]>

As you can see I'm using a subselect now and I'm interested in how the whole statement and especially the subselect is treated in the execution-stage of MariaDB1).

environment

Just one more repetition: this is what the 2 tables look like:
MariaDB [TestOpt]> desc TestSmall;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| Id          | char(8)      | YES  |     | NULL    |       |
| PZN         | char(7)      | YES  |     | NULL    |       |
| EVP         | decimal(7,2) | YES  |     | NULL    |       |
| HAP         | decimal(7,2) | YES  |     | NULL    |       |
| ArtikelBez  | varchar(40)  | YES  |     | NULL    |       |
| ArtikelText | varchar(26)  | YES  |     | NULL    |       |
| Hersteller  | char(5)      | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
7 rows in set (0.00 sec)

MariaDB [TestOpt]>

The table TestBig has an identical structure; TestSmall contains 301,036 records, TestBig contains 10 million records. The data is from an old project but the contents of the data is not of interest here.

There are no indexes defined on these tables. Silly, I know, but the database-server has to handle this situation.

explain

Let's look at what the server tells us about how it handles the statement:
MariaDB [TestOpt]> explain select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestBig B where B.PZN in ( select A.PZN from TestSmall A where A.Hersteller = '00020') and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: <subquery2>
         type: eq_ref
possible_keys: distinct_key
          key: distinct_key
      key_len: 7
          ref: func
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
3 rows in set (0.00 sec)
MariaDB [TestOpt]> 

Oh, this looks different from the explains of the older examples. As each table is represented internally by an object of type JOIN_TAB we see that we have 3 JOIN_TABs in our example instead of the 2 objects in the older posts. So let's look into this situation and describe the results.
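
A small hint if you want to see what the optimizer made out of the IN-subquery: running the explain with the EXTENDED keyword and asking for the warnings afterwards should print the rewritten statement, with the materialized semi-join visible in it (I've left the output out here):
MariaDB [TestOpt]> explain extended select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestBig B where B.PZN in ( select A.PZN from TestSmall A where A.Hersteller = '00020') and B.Hersteller = '36367'\G
MariaDB [TestOpt]> show warnings\G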


following the code

Before I jump into the code let me show you some numbers; they show what we have to expect.

some numbers

Let's split the SQL-statement into parts and look at the numbers these partial statements deliver:
MariaDB [TestOpt]> select A.PZN from TestSmall A where A.Hersteller = '00020';
+---------+
| PZN     |
+---------+
| 178324  |
| 3717968 |
| 3717974 |
..............
| 8425779 |
| 8425785 |
| 8494562 |
| 8494579 |
+---------+
60 rows in set (0.37 sec)

MariaDB [TestOpt]> 
There are 60 records in TestSmall that match the WHERE-condition, so the inner SQL-statement (=the subselect) will deliver 60 records.

MariaDB [TestOpt]> select b.pzn from TestBig b where b.hersteller = '36367';
+--------+
| pzn    |
+--------+
| 535818 |
| 999999 |
| 999999 |
..............
| 19057  |
| 999999 |
| 999999 |
| 999999 |
| 999999 |
| 999999 |
| 999999 |
| 999999 |
+--------+
461016 rows in set (34.24 sec)

MariaDB [TestOpt]> 
More than 400,000 records from TestBig will match the WHERE-condition and have to be checked against the IN-condition. That's a bit of work; this will take some time.

Let's look at how the server handles this statement.

overview

The function-hierarchy for the execution of this statement looks like this (simplified):
JOIN::exec()
    JOIN::exec_inner()
        do_select()
            sub_select()                                  level 1
                evaluate_join_record()
                    sub_select()                          level 2
                        join_tab_execution_startup()
                            sub_select()                  level 3
                                evaluate_join_record()
The server starts executing this statement by doing a table-scan on the table TestBig. When the first record matching the condition is found, it starts with the handling of a temp-table. In this step it reads the table TestSmall and puts the PZNs of the matching records into it. When this is done it checks if the PZN-value of a record from TestBig can be found in the temp-table and when found acts accordingly.

That's all.
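
Conceptually this materialization corresponds to joining TestBig against a derived table that holds the distinct PZN-values from TestSmall, roughly like the following rewrite (this is only meant as an illustration; I haven't checked whether the optimizer produces exactly the same plan for it):
MariaDB [TestOpt]> select SQL_NO_CACHE  B.PZN, B.ArtikelText from TestBig B join ( select distinct A.PZN from TestSmall A where A.Hersteller = '00020' ) M on (M.PZN = B.PZN) where B.Hersteller = '36367';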

The function sub_select() will appear multiple times in my text here, so to avoid confusion let me introduce some conventions: I will write of level 1 when sub_select() handles the table TestBig, of level 2 when the temp-table is handled and of level 3 when the table TestSmall is handled. I hope this makes my text a bit clearer.


jumping into the code

And now, really, I want to jump into the code. Let's start with the first call of the function sub_select().

level 1

When sub_select() is called for the first time it is given a JOIN_TAB representing the table TestBig (as expected from the explain). In this function it reads and inspects the first record from this table:
  if (rc != NESTED_LOOP_NO_MORE_ROWS)
  {
    error= (*join_tab->read_first_record)(join_tab); 
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }
read_first_record() calls join_init_read_record() which calls rr_sequential() via the statement (*tab->read_record.read_record). So the first record from TestBig is read by this code. evaluate_join_record() inspects this record but in my case it didn't find a match.

So it continues with the main loop:
  while (rc == NESTED_LOOP_OK && join->return_tab >= join_tab)
  {
    .....
    error= info->read_record(info);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }
info->read_record() calls rr_sequential() which reads the next record from TestBig; evaluate_join_record() inspects the record read. When evaluate_join_record() finds a match this code is executed next:
in evaluate_join_record():
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());        applies the (partial) WHERE-condition
    .....
  }
  ....
    if (found)
    {
      ....
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0);  

level 2

This last line calls sub_select() with a JOIN_TAB for a table named /tmp/#sql_247d_02) of type MEMORY (in older versions this type was called HEAP). So the server is now at level 2 and here is what the server does with this empty table in sub_select() at this level:
in sub_select():
  if (rc != NESTED_LOOP_NO_MORE_ROWS && 
      (rc= join_tab_execution_startup(join_tab)) < 0)
    DBUG_RETURN(rc);
The code prepares for switching to the 3rd level:
 in join_tab_execution_startup():
  .......
  else if (tab->bush_children)
  {
    /* It's a merged SJM nest */
    .....
      /*
        Now run the join for the inner tables. The first call is to run the
        join, the second one is to signal EOF (this is essential for some
        join strategies, e.g. it will make join buffering flush the records)
      */
      if ((rc= sub_select(join, join_tab, FALSE/* no EOF */)) < 0 ||
          (rc= sub_select(join, join_tab, TRUE/* now EOF */)) < 0)
      {
        ....
      }

level 3

As you can see sub_select() is called again, this time with a JOIN_TAB representing the table TestSmall. It is called twice, the second time for clean-up operations.

Let's look at what the first call of sub_select() does: similar steps as described for TestBig some lines above are done, so it reads the first record from TestSmall via (*join_tab->read_first_record)() and inspects it in evaluate_join_record(). As no match is found in my case it continues with the main loop, again as described some lines above for the table TestBig. Interesting things happen again when a match is found:
in evaluate_join_record():
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0); 
with this call in the last line a lot of things happen here:
evaluate_join_record()
    end_sj_materialize()                      via (*join_tab->next_select)()                  
        fill_record()
            Item_field::save_in_field()
                save_field_in_field()
        handler::ha_write_tmp_row()
            ha_heap::write_row()
                heap_write()
As the server is currently reading the table TestSmall and there are 60 records matching the condition, 60 values of PZN are taken from these records and stored in this temp-table.
At the end of the table-scan of TestSmall the function sub_select() returns with a value of NESTED_LOOP_OK and is immediately called again with the value TRUE for the parameter end_of_records, called from join_tab_execution_startup():
      if ((rc= sub_select(join, join_tab, FALSE/* no EOF */)) < 0 ||
          (rc= sub_select(join, join_tab, TRUE/* now EOF */)) < 0)
In this clean-up operation not much happens:
join_tab_execution_startup()  
    sub_select()
        end_sj_materialize
So it returns to the calling function soon.

back to level 2

Now we're back at sub_select(); the table TestSmall has been read completely and the temp-table is filled with the PZN-values of the matching records from TestSmall.


a short break

OK, a short break: let me take the time to recap what the server has done already.

We've read record after record from TestBig until we found the first record which matches the (partial) condition. We keep the PZN-value from this record, switch over to a temp-table and read the records from TestSmall, check these records, and if a record matches the (partial) condition on TestSmall we put the PZN-value of this record (from TestSmall) into the temp-table.

And I've stopped at this step.

The next step will be to continue reading and inspecting the records from TestBig. If we find a matching record we will look for its PZN-value in the temp-table. As we already have a matching record read from TestBig, the server will immediately start with the lookup in the temp-table.

So we continue our journey at level 2.


continuing at level 2

Now we try to find the PZN-value (from the record from TestBig) in our temp-table.
sub_select()
    join_read_key()               via (*join_tab->read_first_record)(join_tab);
        join_read_key2()
            handler::ha_index_read_map()
                ha_heap::index_read_map()
In my case it did not find this value in the index so it returns the value HA_ERR_KEY_NOT_FOUND (converted to -1). Next evaluate_join_record() is called which checks the error-code and returns NESTED_LOOP_NO_MORE_ROWS. For this reason sub_select() is left and we're back in evaluate_join_record() of level 1.

back to level 1

So in sub_select() we're in the main-loop again, reading and evaluating record after record from TestBig (already described some lines above).

In evaluate_join_record() the record read is inspected and if it does not match the condition it returns. This cycle, reading and inspecting, continues until a match is found. In this case:
sub_select()                                   level 1
    evaluate_join_record()                     calls the next level via (*join_tab->next_select)(join, join_tab+1, 0);
        sub_select                             level 2   
            join_read_key()                    by reading the first record, via (*join_tab->read_first_record)(join_tab);
                join_read_key2()
                    handler::ha_index_read_map()
                        ha_heap::index_read_map()
so the server switches to level 2 and at this level, in sub_select(), it reaches the code for reading a first record from the temp-table, using the PZN-value of the current record from TestBig (placed in JOIN_TAB->ref->key_buff) as the key to look for in the temp-table.

So now the steps are: read records from TestBig (in level 1). When a match is found in evaluate_join_record() the server switches to level 2 and looks if the PZN-value can be found in the temp-table. These steps continue until all records from TestBig are read and treated.


I hope you could follow this text until here; it was a bit of work to understand and describe these steps.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.



some notes:
1) on MySQL the explain also delivers 3 JOIN_TABs, but they look different
2) did I mention that this may look different on your machine?

Thursday 28 July 2016

the real JOIN: a different look at the case without any index


Please allow me to continue to play with my usual statement in my usual environment. I'd like to modify the statement a little bit and show you how much changes in executing the modified statement.

In this text I will often refer to something "old" like the "old statement" or the "old text". This always means this text (or something you can find in it): the real JOIN: the case without indexes. I will not set any link to this text in the text here, please keep this link in your mind.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same SQL-statement, the same tables, the same data in the tables etc. Only my focus changes.


the old situation

In my old text I looked at this statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (34.99 sec)

MariaDB [TestOpt]> 

The explain tells us that the execution starts with reading the table TestSmall. After this is done the table TestBig is read. Diving into the code showed that while reading the table TestSmall the WHERE-clause is applied to each record just read, and a record matching the condition is put into a buffer. When reading this table is finished the server starts reading the records from the table TestBig, applying the WHERE-clause, and when a match is found it does the JOIN by looking through the buffer to find a record (from TestSmall) with the same value of PZN.


the new situation

For this text here I want to change the order of the execution, first TestBig and then TestSmall, and see if and what changes in the execution of this statement. Here is the statement and the explain:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (1 min 6.82 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.00 sec)

MariaDB [TestOpt]>

As you can see there is a difference: the new statement took 66 seconds vs. 23 seconds for the old statement, so it took nearly 3 times as long. Why? The purpose of this text is to show you the reason for this.

Please look at the statement more closely: I've used the keyword straight_join instead of join and I've changed the order of the tables in the statement. Had I changed the order of the tables without using the keyword straight_join, the optimizer would have chosen the same plan for the execution as before and I could not present this text to you.


some numbers

I want to show you some numbers from the data.

First of all TestSmall contains 301,036 records, TestBig contains 10 million records, which you can verify if you look at the explain-lines above.

In TestSmall and TestBig only a small fraction of the records fulfill the (partial) WHERE-condition:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
+--------------+
| count(A.PZN) |
+--------------+
|           60 |
+--------------+
1 row in set (0.44 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|       461016 |
+--------------+
1 row in set (12.78 sec)

MariaDB [TestOpt]> 

These records have to be matched for joining whether the process starts with TestSmall (old case) or with TestBig (new case). So why does this take 23 seconds in the old case and 66 seconds in the new case?


general remarks

I don't want to show function hierarchies in this text, they are almost the same as in the old text. The action takes place in the function sub_select() and below, in the execution-stage, but it is above any storage engine. I've used MyISAM as storage-engine in my tests but everything will happen identically in other storage-engines (but I didn't check this).


counting

In my old text I added some counters in the MyISAM-engine which lead to this output:
table = <TestSmall> counter table-scan:    301,037    index-scan: 0
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0

For the old case these numbers tell us that no index was used and every record was read once so the JOINing was done internally.

Let's look at the same numbers in the new case1):
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0
table = <TestSmall> counter table-scan: 17,159,109    index-scan: 0

The value for TestBig is identical (= no. of records in this table plus one), telling us that each record in TestBig is read once. In the new case the number for TestSmall looks confusing. Do not hesitate to take your calculator: 17,159,109 / 301,037 = 57. Does this mean that the server has read the table TestSmall 57 times? At least this could explain the time-difference. But is this true?


the algorithm

Let's look at the algorithm in the old case: a record from TestSmall that matches the condition is put into a buffer; there are 60 records in the buffer after the table TestSmall has been read. Now the server switches over to the table TestBig and reads record after record from this table. If a record fulfills the WHERE-clause (aka Hersteller = '36367') the buffer is scanned for a match. This is realised as a double loop: an outer loop (reading records from TestBig) and an inner loop (walking through the entries in the buffer). Counting the number of iterations in the inner loop and the outer loop leads to these numbers:
counterOuterLoop:    461,016
counterInnerLoop: 27,660,960

The number of iterations in the outer loop is the number of matching records from TestBig, and 27,660,960 is the number of iterations in the inner loop: this loop is called 461,016 times and iterates over the 60 entries in the buffer (461,016 * 60 = 27,660,960).


the new algorithm

The algorithm in the new case is identical to the old case, but there must be some differences that can explain the increase in the time needed. So let's start:
The table TestBig is read record after record and on each record the WHERE-condition is applied. When a match is found the needed information is put into the buffer, and the server continues reading the next record from TestBig, same as in the old case. But the buffer can only hold 128K of data as this is the size that was allocated for it. After 8,205 matching records the buffer is full, it does not have enough space left for another entry (each entry needs roughly 16 bytes here: 131,072 / 8,205 is just under 16). So the server starts the JOINing process: it reads the records from the table TestSmall, applies the WHERE-condition and if a match is found searches the entries in the buffer for a match in the column PZN.
When all records from TestSmall are read (and compared) the server marks the buffer as empty and continues reading the records from TestBig and putting matching records into the buffer. When this buffer is full again the server starts reading the records from TestSmall and searches through the buffer for matches. And so the server continues until all records are read from TestBig.

These are the numbers of iterations:
counterOuterLoop:      3,420
counterInnerLoop: 27,660,960

As you can see the number of iterations in the inner loop is identical, as the same comparisons have to be made independent of the order of the tables. But the number of iterations in the outer loop is much smaller.

Some lines above I showed you that there are 461,016 records in TestBig which fulfill the condition B.Hersteller = '36367'. Again please take your calculator and do this operation: 461,016 / 8,205, which results in a number of approx. 56.2. So the server fills the buffer 56 times and has some records left for an additional iteration.

As you can see the server indeed reads the table TestSmall 57 times in full table-scans. This explains the time needed for this statement.
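
By the way, these numbers can be cross-checked without patching the storage-engine: the session status counter Handler_read_rnd_next counts the rows fetched via table-scans, so resetting the status and running the statement should show roughly the sum of my two counters (I'll spare you my output here):
MariaDB [TestOpt]> flush status;
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
MariaDB [TestOpt]> show session status like 'Handler_read%';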

And that ends my journey for today.


some experiments

Well, it doesn't, I'm still curious.

the buffer

As I've written above the server uses a buffer of size 128K. What about doubling the size of the buffer?

So let's do it. The buffer is allocated in this part of the code:
int JOIN_CACHE::alloc_buffer()
{
  ....
  buff_size= get_max_join_buffer_size(optimize_buff_size);
  ....
    if ((buff= (uchar*) my_malloc(buff_size, MYF(MY_THREAD_SPECIFIC)))) 
      break;
  ....
}

When we simply double the contents of the variable buff_size the buffer will have a size of 256K. Done, and here is the result:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (43.97 sec)

MariaDB [TestOpt]> 

Oh, 30% faster. And my counters tell me:
table = <TestBig>   counter table-scan: 10,000,001    index-scan: 0
table = <TestSmall> counter table-scan:  8,730,073    index-scan: 0

which means the table TestSmall is now read 29 times.

So let's revert these modifications and continue with a buffer of size 128K.
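
By the way, patching the server isn't really needed for such an experiment: the 128K that alloc_buffer() works with comes (via get_max_join_buffer_size()) from the session variable join_buffer_size, so raising that variable should have a similar effect, although I haven't repeated my counter-measurements this way:
MariaDB [TestOpt]> set session join_buffer_size = 256 * 1024;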

the effect of the inner loop

The algorithm used in the code for the JOIN is a double loop. What if we eliminate the inner loop and see how much time is spent in this part of the code? Well, we will get a wrong result, but for this test I accepted this.
Here is the modified part of the code:
uchar *JOIN_CACHE_BNL::get_next_candidate_for_match()
{
//  if (!rem_records)
    return 0;
//  rem_records--;
//  return pos+base_prefix_length;
}

Compare this code with the original code (without the comment-characters) and you will see that the modified version simply returns 0. This tells the caller that there are no entries in the buffer, so the inner loop will not be executed at all.
Here are the results:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|            0 |
+--------------+
1 row in set (10.37 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestBig A IGNORE INDEX(PZN) straight_join TestSmall B  on (A.PZN = B.PZN) where B.Hersteller = '00020' and A.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|            0 |
+--------------+
1 row in set (53.16 sec)

In the old case about 50% of the time is spent in the inner loop, in the new case only about 20%. In the new case a lot of time is spent reading the table TestSmall multiple times, so only a smaller fraction is gained by omitting the inner loop.


suggestion

First of all I want to write that I did not find an error in the code, it works and it does its job. But I think there is room for improvement, there should be a better (=faster) algorithm. Unfortunately I cannot give you one. You may say that when JOINing tables one usually creates an index over the column(s) involved. Agreed, but please keep in mind that sometimes the optimizer needs a temporary table for the execution of your statement and then the effect described here can happen.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) I had to modify my code a bit: the counters were initialized in ha_myisam::rnd_init(). This function was called multiple times whereas ha_myisam::rnd_end() was only called once so the counters were reset to 0 multiple times before the contents was written to the file. So I moved the initialization of the counters to the header-file. It's still not the correct place but for this test it was sufficient.

Wednesday 8 June 2016

the real JOIN: the case with an index


This is another text where I want to describe how the server handles certain statements. In my older texts I described some parts of the action inside the server; the focus of my last texts was the handling of a JOIN-operation. In my text the real JOIN: the case without indexes I described the unusual case of JOINing 2 tables without any index that could help. For this text I want to show the usual case: doing the JOIN with a useful index.

When an SQL-statement is sent to the database-server a lot of things happen: among other things the statement is checked for syntax and access rights, a plan for execution has to be searched for and the plan found is executed. This text focuses on the execution-step and especially on the way the JOIN is done.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in my last post. It's the same SQL-statement, the same tables, the same data in the tables etc. Only my focus changes.


the statement

To start I want to show the statement used for this text:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.38 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.00 sec)

MariaDB [TestOpt]> 
As you can see I've already included the explain.


explaining the explain

The execution of this statement is really fast compared to the time for answering the request in the case without any index (= 23 sec.). Let me describe in some words what the lines above mean.

The execution starts with reading the table TestSmall (aka A) in a table-scan. On every record read the WHERE-clause is applied, at least parts of the WHERE. When the record fulfills the condition the server extracts the value of the column PZN from this record, switches to the table TestBig (aka B) and reads all records from TestBig using the index PZN for the given PZN-value. On every record read from TestBig the (partial) WHERE-condition is applied. If a match is found this record from TestBig together with the record from TestSmall are matches for the query. When all records for the given PZN-value are read from TestBig the server returns to the table TestSmall and continues reading the next record, applying the WHERE etc. until the end of TestSmall is reached.
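
To make this a bit more concrete: for every record from TestSmall that survives the WHERE-condition the server effectively does an index-lookup on TestBig like the following one, with the PZN-value of the current TestSmall-record filled in ('178324' is just one of the matching values, taken as an illustration; output omitted):
MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.PZN = '178324' and B.Hersteller = '36367';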

Let's look into the code for the first part, the table-scan on the table TestSmall. This is the hierarchy:
mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()

Let's stay here and look for the action. The function sub_select() is called with a pointer to a structure of type JOIN_TAB as a parameter; in this JOIN_TAB you will find the routines to call for reading a record. This is the code in sub_select() that handles reading the first record from the table:
  if (rc != NESTED_LOOP_NO_MORE_ROWS)
  {
    error= (*join_tab->read_first_record)(join_tab);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

some lines later this code handles all following read-operations:
  while (rc == NESTED_LOOP_OK && join->return_tab >= join_tab)
  {
    ....
    error= info->read_record(info);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

In the JOIN_TAB-structure the variables read_first_record and read_record are set before entering this level1). So here is an excerpt of the structure JOIN_TAB:
typedef struct st_join_table {
  ...
  READ_RECORD::Setup_func read_first_record;
  Next_select_func next_select;
  READ_RECORD read_record;
  ...
} JOIN_TAB;

After a record is read the function evaluate_join_record() is called. This function applies the partial WHERE-condition to this record, and if it does not fulfill the condition the code returns to the calling function and there it continues by reading the next record from the table TestSmall. If the record from TestSmall fulfills the (partial) WHERE-condition it calls another function:
static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab,
                     int error)
{
    ....
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());      the test: does it fulfill the partial WHERE-condition ?

    /* check for errors evaluating the condition */
    ....
  }

    if (found)
    {
      ....
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0);  in this call use the pointer to the next JOIN_TAB
    ....
}

When a match is found in the test shown above the next level is called via JOIN_TAB->next_select() which in my case here is the function sub_select() again. So the hierarchy is now:
sub_select()
    evaluate_join_record()
        sub_select()
On the 2nd level one of the parameters in the call to sub_select() is a pointer to the JOIN_TAB representing the table TestBig (in the function-call above this is the join_tab+1 parameter; it now points to the next JOIN_TAB in the list).

The same code in sub_select() is executed so we have the code for the first access and all following accesses to the table. As this is now the table TestBig the vars. JOIN_TAB.read_first_record and JOIN_TAB.read_record point to the functions for accessing the table via the index PZN. When these routines return NESTED_LOOP_NO_MORE_ROWS the code in sub_select() terminates and returns to evaluate_join_record() which itself returns to sub_select() and there the table-scan on the table TestSmall continues till the end of the file (again signalled by NESTED_LOOP_NO_MORE_ROWS).


remarks

That's it. If you compare this to the action described in my last post you will notice a difference in the amount of work, which results in the difference in the time needed for handling the same query.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) you will find a description of the function-hierarchies for accessing a table here and here and in other texts on this site.

Thursday 28 April 2016

the real JOIN: the case without indexes


In my last texts I started with a SQL-statement and looked where and how this is handled inside the server. You can find my results of the handling of this statement in the optimizer-stage here: optimizer: no indexes. This optimizer-stage is followed by the execution-stage and you can find my results here: execution: with and without indexes. This last text describes how the server reads the records from the tables but the text omits the JOIN between the records of these two tables. And in this text I want to deliver this part.

In my last post I had two cases which I compared: I started without any indexes on the two tables involved in the SQL-statement, later I added one (usable) index on one of the tables. For this text I want to look only at the case without any index defined.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in my last post. It's the same SQL-statement, the same tables, the same data in the tables etc. Only my focus changes.


this text

And the focus of this text is the JOIN. Where is the code that really puts the records from the two tables together? And how is this done?

So let's look into the code.


a reminder

Before I start let me show you the statement I'm inspecting including the explain:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
Please look at the last line of the explain: this line describes what I'm looking at in this text.


execution

Let me describe the execution of this statement in some form of pseudocode:
In executing this explain the first step is reading the table TestSmall in a table-scan. Each record fetched is inspected, which means applying the (partial) WHERE-clause (aka comparing the value of the column Hersteller against the string-constant '00020'). If this record fulfils the condition it is stored in an internal buffer and the scan continues. At the end of this step the buffer contains the relevant records from TestSmall for the next step.
Next the table TestBig is read by a table-scan. On each record fetched the (partial) WHERE-clause is applied (aka comparing the value of the column Hersteller against the string-constant '36367'). If this is a valid record the JOIN is done with the records already in the buffer (the records from TestSmall). After this operation the table-scan continues until the end.


1st step: reading TestSmall and putting records into the buffer

So the steps of interest for this part of the text are:
  • where is the information put into the cache?
  • which information is put into the cache?
  • where is the memory for the cache allocated?
So let's look at what is done with the records fetched from TestSmall.


put record in cache

If a record is found which fulfils the (partial) WHERE-clause (aka Hersteller = '00020') it is put into the cache and later used when the table TestBig is read.
Here is the function hierarchy for putting the data into the cache:
do_select()
    sub_select()
        evaluate_join_record()
            sub_select_cache()
                JOIN_CACHE::put_record()
The records are fetched in sub_select() and inspected in evaluate_join_record(). If a record fulfils the WHERE-clause the function sub_select_cache() is called which calls put_record() to put the information from this record into the cache.

Let me go back a bit: in handling any query each table involved is internally represented by an object of type JOIN_TAB, so for this query two of these exist: one for table TestSmall and one for table TestBig.
Back to the hierarchy: in the function evaluate_join_record() the record read is inspected. If it fulfils the WHERE-clause the next level is called by this statement: rc= (*join_tab->next_select)(join, join_tab+1, 0);. In this case next_select points to the function sub_select_cache(), which is called with the JOIN_TAB representing the table TestBig as a parameter (please look at the +1 in the parameter-list).

And this is the way the code accesses the cache involved:
typedef struct st_join_table {
.....
  JOIN_CACHE *cache;
.....
} JOIN_TAB;

class JOIN_CACHE :public Sql_alloc
{
....
  /* Pointer to the beginning of the join buffer */
  uchar *buff;
....
};

So the data from TestSmall is put into a cache that is associated with the JOIN_TAB representing the table TestBig, and buff is the pointer to the memory-area used for this.


the information in the cache

So put_record() writes the data into the memory-area pointed to by buff. It doesn't do this directly but calls the function write_record_data(). This is the data written into the buffer1):
  • a flag: 0x80
  • length of next entry: 0x0006
  • the contents of the column PZN: '178324'
  • length of the next entry: 0x0005
  • the contents of the column Hersteller: '00020'
and these 16 bytes are the data put into the cache, extracted from the first record from TestSmall that matches the condition. As the length-bytes show, the entries in the cache are not of constant length (a small sketch of this packing follows the hex dump below).

Here are the contents of the first entries in the buffer (all values are in hex, only the first 3 of 60 entries are shown):

flag length PZN length Hersteller
0x80 0x06 0x00 0x31 0x37 0x38 0x33 0x32 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x36 0x38 0x05 0x00 0x30 0x30 0x30 0x32 0x30
0x80 0x07 0x00 0x33 0x37 0x31 0x37 0x39 0x37 0x34 0x05 0x00 0x30 0x30 0x30 0x32 0x30
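
And here is the promised sketch of the packing. Again this is only my simplified model: the flag-byte and the 2-byte lengths (low byte first) follow the dump above, but the helper function is mine, you won't find it in the server.
// Sketch of packing a (flag, length, PZN, length, Hersteller) entry into a
// byte buffer, mimicking the layout of the dump above (lengths are stored
// as 2 bytes, low byte first).
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

static void put_field(std::vector<uint8_t> &buf, const std::string &val)
{
  uint16_t len = static_cast<uint16_t>(val.size());
  buf.push_back(static_cast<uint8_t>(len & 0xff));     // low byte of the length
  buf.push_back(static_cast<uint8_t>(len >> 8));       // high byte of the length
  buf.insert(buf.end(), val.begin(), val.end());       // the column value itself
}

int main()
{
  std::vector<uint8_t> buf;
  buf.push_back(0x80);                 // flag byte, as seen in the dump
  put_field(buf, "178324");            // column PZN
  put_field(buf, "00020");             // column Hersteller

  for (uint8_t b : buf)
    std::printf("0x%02x ", b);         // prints the 16 bytes of the first entry
  std::printf("\n");
  return 0;
}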


allocating memory for the cache

Now some data is put into a buffer but where does this memory-area come from? Somewhere this area must be allocated, and this is the hierarchy:
mysql_select()
    JOIN::optimize()
        JOIN::optimize_inner()
            make_join_readinfo()
                check_join_cache_usage_for_tables()
                    check_join_cache_usage()
                        JOIN_CACHE_BNL::init()
                            JOIN_CACHE::init()
                                JOIN_CACHE::alloc_buffer()
                                    JOIN_CACHE::get_max_join_buffer_size()

The buffer is allocated in the optimizer-stage. The function alloc_buffer() reserves a memory-area of 128K bytes for this and stores the pointer to this cache in the variable buff shown some lines above. The value 128K is returned from get_max_join_buffer_size() which gets this value from the expression size_t limit_sz= join->thd->variables.join_buff_size;.
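
Just to illustrate where the size comes from (this is not the allocation code of the server, only a tiny sketch): the buffer is bounded by the session variable join_buff_size, which is 128K in my environment.
// Minimal illustration of the sizing: the join buffer is bounded by a
// limit like the session variable join_buff_size (128K in my environment).
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
  const std::size_t join_buff_size = 128 * 1024;    // the limit from the session variable
  std::vector<unsigned char> buff(join_buff_size);  // stand-in for JOIN_CACHE::buff
  std::cout << "allocated " << buff.size() << " bytes for the join buffer\n";
  return 0;
}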

This finishes the first step: we have the buffer, all relevant data is put into it and TestSmall is read until the end. Now we switch over and start reading the records from TestBig.


reading TestBig and JOINing the records with the cache

Next the table TestBig is read in a table-scan and the WHERE-condition is applied to each record fetched. If the record read fulfils the condition (aka Hersteller = '36367') a record with the same value in the column PZN from TestSmall is needed for a successful JOIN. As the records from TestSmall are already in the cache this cache is searched. This is done in this code inside the function JOIN_CACHE::join_matching_records()2):
  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* Read each possible candidate from the buffer and look for matches */
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

OK, this doesn't explain too much so let's look into the details. Everything starts with reading the records from the table TestBig. For each record fetched the (partial) WHERE-condition is applied and if this condition is fulfilled we have a candidate for a JOIN with a record from TestSmall. This is done within the function next(); in the box above I marked this with the words outer loop.

From this record from TestBig the value of PZN is extracted and the list of records from TestSmall in the cache is searched for a match. The steps are:
  1. init the search through the cache
  2. get one entry from the cache
  3. compare the value of PZN with the value of PZN of the record from TestBig
  4. if they match: we found a record from TestSmall and a record from TestBig that do JOIN according to our SQL-statement
  5. continue with step 2 until all entries in the cache are checked
I called the steps numbered 2 to 5 the inner loop.

Here is the function hierarchy:
JOIN_CACHE::join_matching_records()
    outer loop:
    JOIN_TAB_SCAN::next()
        JOIN_CACHE_BNL::prepare_look_for_matches()
            JOIN_CACHE::reset()
    inner loop:
    JOIN_CACHE_BNL::get_next_candidate_for_match() 
    JOIN_CACHE_BNL::read_next_candidate_for_match()
        JOIN_CACHE::get_record()
            JOIN_CACHE::read_all_record_fields()
                JOIN_CACHE::read_flag_fields()
                JOIN_CACHE::read_record_field()
    JOIN_CACHE::generate_full_extensions()
        JOIN_CACHE::check_match()
            skip_record()
                end_send_group()        in case of a match, called via: (*join_tab->next_select)(join, join_tab+1, 0);
    end inner loop
    end outer loop

In the function skip_record() the ON-clause of my SQL-statement is executed. This is the hierarchy for the comparison of the PZN-values:
Item_func_eq::val_int()
    Arg_comparator::compare() 
        Arg_comparator::compare_string()
            Item_field::val_str()
            Item_field::val_str()
            sortcmp()
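
The two PZN-values are fetched as strings and compared. A rough stand-in for this check could look like the following sketch; please note that the real sortcmp() is aware of character sets and collations, my sketch simply compares bytes.
// Rough stand-in for the equality check of the ON-clause: both PZN values
// are fetched as strings and compared; the real sortcmp() is collation-aware,
// this sketch just compares bytes.
#include <iostream>
#include <string>

static bool pzn_equal(const std::string &a_pzn, const std::string &b_pzn)
{
  return a_pzn.compare(b_pzn) == 0;   // 0 means "equal", like sortcmp()
}

int main()
{
  std::cout << std::boolalpha
            << pzn_equal("178324", "178324") << "\n"   // true
            << pzn_equal("178324", "371796") << "\n";  // false
  return 0;
}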

And that's the whole process for JOINing the records.

Additionally I want to show you some numbers I found by working with this statement.


some numbers

I've added some variables to count the number of iterations in the inner loop and the outer loop described. These are the results in my test:
counterOuterLoop:    461,016
counterInnerLoop: 27,660,960

Let's verify these numbers with some SQL-statements.
The number of records from TestSmall to be checked in the JOIN:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A where A.Hersteller = '00020';
+--------------+
| count(A.PZN) |
+--------------+
|           60 |
+--------------+
1 row in set (0.44 sec)

MariaDB [TestOpt]> 
The number of records from TestBig to be checked in the JOIN:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(B.PZN) from TestBig B where B.Hersteller = '36367';
+--------------+
| count(B.PZN) |
+--------------+
|       461016 |
+--------------+
1 row in set (12.78 sec)

MariaDB [TestOpt]> 

The first number verifies the number given above in the description of the first step: 60 records from TestSmall qualify and end up in the cache. You will find the second number as the value of the variable counterOuterLoop. And the value of the variable counterInnerLoop is simply the product of these 2 numbers: 60 * 461,016 = 27,660,960. In other words: the server fetches 10 million records from TestBig, finds 461,016 matching records and iterates through the inner loop about 27.6 million times while handling my SQL-statement.
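
For the sake of completeness here is this arithmetic as a tiny piece of code (nothing server-specific, just the multiplication):
// 60 buffered records from TestSmall times 461,016 qualifying records from
// TestBig gives the number of inner-loop iterations.
#include <iostream>

int main()
{
  constexpr long long buffered   = 60;        // matches from TestSmall (in the cache)
  constexpr long long candidates = 461016;    // matches from TestBig (outer loop)
  static_assert(buffered * candidates == 27660960, "inner loop count");
  std::cout << buffered * candidates << "\n"; // 27660960
  return 0;
}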

And that ends my journey for today.


correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) please do not forget that this is the data found in my environment.
2) for the details of reading records from TestBig, the hierarchy of the code for doing this and some specialities omitted here please look at my last post.

Tuesday 5 April 2016

execution: with and without indexes


in my last post I showed how the case without any indexes is handled by the optimizer. For this post I want to show how the plan found by the optimizer is executed. Additionally I want to compare the execution of this case with the execution of a case with one index added on one table.

So let me look into the code for the execution of the case without any indexes and start my description when the optimizer-stage is already done. And my description will stop when the storage-engine is entered, so the text describes the code in the database-server. For my tests I've used MariaDB, for MySQL things may look a bit different.

environment

Again I'm using my usual environment which I already described here: JOIN.

In the first step I want to inspect the case without any index, but instead of dropping all indexes on the tables involved in my query I added one index to the table TestBig, because this simplified the handling of my tests. For testing the case without any index I told the optimizer via a hint in my SQL-statement not to use this existing index. At the end of this text I will explain why I did this.

So here is the environment:
MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.00 sec)

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> 
So you can see there is no index on the table TestSmall and one index on the table TestBig, on the column PZN.

And here is my standard-statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (23.03 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B IGNORE INDEX (PZN) on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.01 sec)

MariaDB [TestOpt]> 
I've included the explain. Please note the hint IGNORE INDEX (PZN)1) in the SQL-statement, and in the explain please note the access-type ALL for both tables together with Using join buffer (flat, BNL join) in the Extra-column.

expected execution

I've already described the behavior of the server in the case with an index: JOIN (2). In that case the server scans through the table TestSmall, reading and inspecting record after record; when a matching record is found it switches over to the table TestBig and searches for corresponding records in this table, taking the WHERE- and ON-clause into account. After no more records can be found in TestBig the server returns to TestSmall and continues with the table-scan.
For the case without any index I expected the server to behave in the same way. If no index is available there is no other way of searching the table TestBig than by a table-scan, so I expected the execution of the statement to be very slow because for every record found in TestSmall a full table-scan of TestBig would have to be made. But if you look at the box above you will see that this statement took 23 seconds to execute, and that's not so bad.

In an older post I introduced some counter-variables and by using these I can see this result in the file /var/log/mysql/error.log:
table = <TestSmall> counter table-scan: 301037      index-scan: 0
table = <TestBig>   counter table-scan: 10000001    index-scan: 0
When I look at these numbers I can see that table TestSmall is read once and also table TestBig is read only once2). Good. So how is this handled in the code?

execution

Before I continue with this text I want to issue a warning: the functions and hierarchies shown here are valid for the statement and environment given above. If you do your own tests this may look a bit different.

the case without index

Now I want to start with the execution of the statement without any indexes on the tables. Execution starts with the table TestSmall, so here is the function hierarchy for reading this table:
mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()
                    // read one record from TestSmall
                    evaluate_join_record()

In the next lines of this text I want to inspect the function sub_select() and see what's happening there.
The code in sub_select() distinguishes between a first read and all following reads. For the first record read everything needed is initialized, the first record is read from TestSmall and this record is inspected. This is the code that reads the first record from the table TestSmall:
  if (rc != NESTED_LOOP_NO_MORE_ROWS)
  {
    error= (*join_tab->read_first_record)(join_tab); 
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Let me set the function evaluate_join_record() aside for a moment. For read_first_record() the function-hierarchy looks like this:
(*join_tab->read_first_record)  -->  join_init_read_record()      
    init_read_record()
        handler::ha_rnd_init_with_error()
            handler::ha_rnd_init()
                ha_myisam::rnd_init()
    rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()
I stopped the presentation of the hierarchy at the engine-level because I didn't want to dive deeper into the code for this post.

After the first record is read and inspected the code in sub_select() continues with a while-loop scanning the table TestSmall until EOF (or NESTED_LOOP_NO_MORE_ROWS). Here is the code:
  while (rc == NESTED_LOOP_OK && join->return_tab >= join_tab)
  {
    ....
    error= info->read_record(info);
    ....
    rc= evaluate_join_record(join, join_tab, error);
  }

Here is how the next record is fetched:
    info->read_record()  -->  rr_sequential()
        handler::ha_rnd_next()
            ha_myisam::rnd_next()
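
Let me condense this pattern into a small model: initialize the scan once, read the first record and then call the next-record function until EOF. The class in this sketch is only a toy stand-in for the storage-engine handler (ha_myisam); the names are mine, not the server's.
// Simplified model of the scan pattern: initialize the scan once, fetch the
// first record, then keep calling the "next record" function until EOF.
// The class is a toy stand-in for the storage-engine handler.
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

class ToyHandler
{
  std::vector<std::string> rows_ = {"row 1", "row 2", "row 3"};
  std::size_t pos_ = 0;
public:
  void rnd_init() { pos_ = 0; }                       // like handler::ha_rnd_init()
  std::optional<std::string> rnd_next()               // like handler::ha_rnd_next()
  {
    if (pos_ >= rows_.size())
      return std::nullopt;                            // EOF
    return rows_[pos_++];
  }
};

int main()
{
  ToyHandler h;
  h.rnd_init();                                       // done once for the table-scan
  for (auto rec = h.rnd_next(); rec; rec = h.rnd_next())
    std::cout << "fetched: " << *rec << "\n";         // evaluate_join_record() would run here
  return 0;
}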

Now I will come back to the function evaluate_join_record(). Here is the code executed (simplified):
static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab,
                     int error)
{
  ....
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());
    ....
  }

  if (!select_cond || select_cond_result)
  {
   ....
    if (found)
    {
      enum enum_nested_loop_state rc;
      /* A match from join_tab is found for the current partial join. */
      rc= (*join_tab->next_select)(join, join_tab+1, 0); 
    }
    ......
  }
  DBUG_RETURN(NESTED_LOOP_OK);
}

The record fetched is inspected in the macro MY_TEST3). If this record fulfils the condition the line with (*join_tab->next_select)() calls another function which in this case (the case without any index) is the function sub_select_cache()4). In this function the record is stored in an internal structure, a cache. It then returns to sub_select() and the cycle continues with the next fetch from the table TestSmall until all records from this table are read.


now with index

Please allow me to stop this description (I do know that I'm just in the middle of the execution) and switch to the case with an index on the table TestBig. First I want to show the statement:
MariaDB [TestOpt]> select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (0.40 sec)

MariaDB [TestOpt]> explain select SQL_NO_CACHE  count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: B
         type: ref
possible_keys: PZN
          key: PZN
      key_len: 8
          ref: TestOpt.A.PZN
         rows: 448
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

The explain tells me that the execution of this statement starts with a table-scan on TestSmall and the table TestBig is read via the index PZN. To verify this I looked at my counter-variables, these showed this result:
table = <TestSmall> counter table-scan: 301037  index-scan: 0
table = <TestBig>   counter table-scan: 0       index-scan: 3115)
and you see there is no table-scan on the table TestBig.

The execution starts with a table-scan on TestSmall. This is handled similarly to the case without index except for the handling in the function evaluate_join_record(). Now the hierarchy looks like this:
sub_select()
    evaluate_join_record()
        sub_select()                via:  rc= (*join_tab->next_select)(join, join_tab+1, 0); 
            // read a record from TestBig (=join_tab+1)
            evaluate_join_record()
                end_send_group()

The fetching of the record from TestBig is handled in 2 cases (similar to the cases described above):
fetching the first record:
sub_select()
    evaluate_join_record()
        sub_select()
            (*join_tab->read_first_record)() ->  join_read_always_key()
                handler::ha_index_init()
                    ha_myisam::index_init()
                handler::ha_index_read_map()
                    ha_myisam::index_read_map()
            evaluate_join_record()
                end_send_group()

all other records are fetched this way:
sub_select()
    evaluate_join_record()
        sub_select()
            info->read_record() -> join_read_next_same()
                handler::ha_index_next_same()
                    ha_myisam::index_next_same()
            evaluate_join_record()
                end_send_group()

As you can see the structure of the code is similar but the functions called from evaluate_join_record() differ. In this case it extracts the value of the column PZN from the record from TestSmall and reads record after record from TestBig using the index PZN with this value. When no more records are found with this value in the column PZN it returns to TestSmall and continues the table-scan.6)
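
Here is a small standalone sketch of this ref access. I'm using a std::multimap as a stand-in for the index PZN on TestBig; the server of course uses the B-tree of the MyISAM-engine, not an in-memory map, and the data shown is just toy data.
// Sketch of the ref access: for every accepted row from TestSmall the value
// of PZN is used to look up the matching rows in an "index" on TestBig.
// A std::multimap stands in for the B-tree index PZN.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Row { std::string PZN; std::string Hersteller; };

int main()
{
  std::vector<Row> testSmall = { {"178324", "00020"}, {"87886", "00020"} };
  std::vector<Row> testBig   = { {"178324", "36367"}, {"178324", "11111"}, {"87886", "36367"} };

  // build the stand-in for the index on TestBig(PZN): key -> row
  std::multimap<std::string, const Row*> pznIndex;
  for (const Row &b : testBig)
    pznIndex.emplace(b.PZN, &b);

  long matches = 0;
  for (const Row &a : testSmall)                           // table-scan on TestSmall
  {
    if (a.Hersteller != "00020")
      continue;                                            // WHERE-clause on A
    auto range = pznIndex.equal_range(a.PZN);              // like join_read_always_key()
    for (auto it = range.first; it != range.second; ++it)  // like join_read_next_same()
      if (it->second->Hersteller == "36367")               // WHERE-clause on B
        ++matches;
  }
  std::cout << "matches: " << matches << "\n";             // prints 2 for this toy data
  return 0;
}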

in general

In evaluate_join_record() the next level is called by the statement (*join_tab->next_select)(join, join_tab+1, 0). The variable next_select is defined here:
sql_select.h:
typedef struct st_join_table {
  ...
  READ_RECORD::Setup_func read_first_record;
  Next_select_func next_select;
  ...
} JOIN_TAB;
and the type Next_select_func is defined this way:
typedef enum_nested_loop_state
(*Next_select_func)(JOIN *, struct st_join_table *, bool);
The return-type is an enumeration defined as:
enum enum_nested_loop_state
{
  NESTED_LOOP_KILLED= -2, NESTED_LOOP_ERROR= -1,
  NESTED_LOOP_OK= 0, NESTED_LOOP_NO_MORE_ROWS= 1,
  NESTED_LOOP_QUERY_LIMIT= 3, NESTED_LOOP_CURSOR_LIMIT= 4
};

join_tab+1

You certainly noticed the expression join_tab+1 occurring multiple times as a parameter in the function-calls in the code-boxes. The reason for this is: in the beginning of the optimizer-stage an object of type JOIN is created. This structure contains the variable join_tab which points to a list of entries of type JOIN_TAB. So the expression join_tab+1 points to the next entry in this list.

Let me show you some lines from the comment above the function sub_select():
  @param join      pointer to the structure providing all context info for
                   the query
  @param join_tab  the first next table of the execution plan to be retrieved
  @param end_of_records  true when we need to perform final steps of retrieval
When calling this function the first parameter contains a lot of information about the query, the second parameter points to the JOIN_TAB to handle in this call and the last parameter is a flag.

If you look at the code-hierarchy above you will see that the function sub_select() is called recursively, so by the expression join_tab+1 the entries in this list are handled one after the other.
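
To show the mechanics of next_select and join_tab+1 in isolation here is a toy model: an array of JOIN_TAB-like entries, each carrying a function pointer, and a sub_select-like function that hands control to the next entry via tab+1. This only mimics the shape of the server code, all names in it are mine.
// Toy model of the next_select mechanism: every table of the plan is
// represented by one entry in an array, each entry carries a function
// pointer, and the chain is walked by passing tab+1 to the next level.
#include <cstdio>

struct JoinTabToy;
typedef void (*NextSelectFunc)(JoinTabToy *tab, bool end_of_records);

struct JoinTabToy
{
  const char    *table_name;
  NextSelectFunc next_select;   // what to call for the next level
};

static void end_send_toy(JoinTabToy *, bool)
{
  std::puts("  end of the chain reached (think end_send_group())");
}

static void sub_select_toy(JoinTabToy *tab, bool end_of_records)
{
  std::printf("  sub_select-like call for table %s, end_of_records=%d\n",
              tab->table_name, end_of_records);
  // hand over to the next entry of the list: exactly the join_tab+1 idea
  (*tab->next_select)(tab + 1, end_of_records);
}

int main()
{
  JoinTabToy tabs[] = {
    { "TestSmall", sub_select_toy },  // its next_select handles the next table
    { "TestBig",   end_send_toy   },  // the last table ends the chain
  };
  std::puts("normal pass:");
  sub_select_toy(&tabs[0], false);
  std::puts("final pass:");
  sub_select_toy(&tabs[0], true);     // like the 2nd call with end_of_records=1
  return 0;
}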

with index: final steps

Before I go back to the case without index (=my starting point) I want to show the last steps done for the case with index; it's not finished yet.

The central routines as described above were located in the function sub_select(). This function is called from do_select() in this form:
    if (join->outer_ref_cond && !join->outer_ref_cond->val_int())
      error= NESTED_LOOP_NO_MORE_ROWS;
    else
      error= sub_select(join,join_tab,0);    
    if ((error == NESTED_LOOP_OK || error == NESTED_LOOP_NO_MORE_ROWS) &&
        join->thd->killed != ABORT_QUERY)
      error= sub_select(join,join_tab,1);

In this code you can see that the function sub_select() is called twice. When it is called the first time all the work is done as described some lines above; then it is called again with identical parameters except for the last one, which stands for bool end_of_records.

Now with a value of 1 for end_of_records the code executed in the function called looks like this:
in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }
For this second call of sub_select() the hierarchy looks like this:
do_select()
    sub_select()                 with table TestSmall and a value of 1 for the parameter end_of_records
        sub_select()             with table TestBig (by join_tab+1)
            end_send_group()     with table NULL

This finishes the handling of this query, at least I will stop here with my description.


back to the case without index

Now I want to go back to my starting point which I've left in the middle of the execution.
As described above the table TestSmall is read sequentially, record after record is fetched and the record fetched is inspected in evaluate_join_record(). If it is accepted it is given to the function sub_select_cache() which stores this record in a cache. When the table TestSmall is read to the end the code returns from sub_select() to the calling function do_select(). And here my description stopped: up to this point only the table TestSmall is inspected, the table TestBig is not touched. So let me now describe how the table TestBig is handled.

Some lines above I showed some code from the function do_select(). There the function sub_select() is called twice, and for the case without an index my description so far only covers the first call of this function. For the second call of sub_select() the hierarchy looks like this:
do_select()
    sub_select()     called with table TestSmall, but now with a value of 1 for the parameter end_of_records
        sub_select_cache()     via (*join_tab->next_select), with table TestBig
            JOIN_CACHE::join_records()
                JOIN_CACHE::join_matching_records()
                    JOIN_TAB_SCAN::open()
                        join_init_read_record()
                            init_read_record()
                                handler::ha_rnd_init_with_error()
                                    handler::ha_rnd_init()
                                        ha_myisam::rnd_init()
                            rr_sequential()
                                handler::ha_rnd_next()
                                    ha_myisam::rnd_next()    1st record of table TestBig read
With the last function-call shown in this box the first record is read from TestBig.

Let's go back to the top of this hierarchy. The function do_select() calls sub_select() (the second call) with a value of 1 for the last parameter. In sub_select() this is the code for handling this:
in sub_select():
  if (end_of_records)
  {
    enum_nested_loop_state nls=
      (*join_tab->next_select)(join,join_tab+1,end_of_records);
    DBUG_RETURN(nls);
  }

As end_of_records has a value of 1 the code inside the if()-block is executed and the call via next_select() ends up in sub_select_cache() with the next entry from the JOIN_TAB-list (=TestBig). In this function you will find some code similar to the code in sub_select():
in sub_select_cache():
  if (end_of_records)
  {
    rc= cache->join_records(FALSE);
    if (rc == NESTED_LOOP_OK || rc == NESTED_LOOP_NO_MORE_ROWS)
      rc= sub_select(join, join_tab, end_of_records);
    DBUG_RETURN(rc);
  }

As end_of_records is still 1 the code within the braces is executed and the function JOIN_CACHE::join_records() is called. The hierarchy above shows the code-hierarchy executed up to the point where the first record is read from TestBig. With this record the code returns to JOIN_CACHE::join_matching_records(). Now a loop in this function handles reading the records from TestBig and it looks like this:
  while (!(error= join_tab_scan->next()))                               outer loop
  {
    ......
    /* Read each possible candidate from the buffer and look for matches */
    while ((rec_ptr= get_next_candidate_for_match()))                   inner loop
    { 
      /* 
        If only the first match is needed, and, it has been already found for
        the next record read from the join buffer, then the record is skipped.
        Also those records that must be null complemented are not considered
        as candidates for matches.
      */
      .........
    }
  }

Record after record is fetched from TestBig by calling join_tab_scan->next():
JOIN_CACHE::join_matching_records()
    JOIN_TAB_SCAN::next()
        rr_sequential()            via info->read_record()
            handler::ha_rnd_next()
                ha_myisam::rnd_next()

With this record from TestBig the JOIN has to be made so the code walks sequentially through all entries of a cache containing the records from TestSmall and tries to find a match (see: inner loop in the box above). When this is done the next record is fetched from TestBig by the statement join_tab_scan->next(), the next iteration of the outer loop in the code-box above.

Two special cases happen when reading a record from TestBig. In the first case the code in JOIN_TAB_SCAN::next() is called with the record already read (as described above). This is handled by this statement:
  if (is_first_record)
    is_first_record= FALSE;
  else
    err= info->read_record(info);

In the second case a record is read from TestBig which does not fulfil the (partial) WHERE-clause. This is handled here:
  while (!err && select && (skip_rc= select->skip_record(thd)) <= 0)
  {
    ....
    /* 
      Move to the next record if the last retrieved record does not
      meet the condition pushed to the table join_tab.
    */
    err= info->read_record(info);
    ....
  } 

The function SQL_SELECT::skip_record() contains this check and the code looks like:
  inline int skip_record(THD *thd)
  {
    int rc= MY_TEST(!cond || cond->val_int());
    if (thd->is_error())
      rc= -1;
    return rc;
  }
The check is done within the macro MY_TEST7) and so the function evaluate_join_record() is not needed.

One piece is missing: in sub_select_cache() the function JOIN_CACHE::join_records() is called and my description above showed that the table TestBig is accessed. But if you look at the lines in sub_select_cache() you will see that a second function is called: sub_select(). This second call ends up in the function end_send_group() which finishes the whole activity.


And here I want to stop my description. Looking at my text I think I've shown how the case without an index is handled and also compared it to the case with one index.
If you think some part is missing please drop me a line using the mail-function on the right side of this blog.


why I used the hint IGNORE INDEX

Initially I wanted to describe the case without any index only. But soon I came up with the question: how is this handled in the case of an index? So I made a lot of comparisons of how the 2 cases are treated in the code. To switch from one case to the other I had to
  • drop the index and do the next test without any index
or for switching back
  • create the index and start the test-case using the index
Here you can see that dropping the index takes about 100 seconds, creating it again takes about 2 minutes:
MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> drop index PZN on TestBig;
Query OK, 10000000 rows affected (1 min 41.39 sec)     
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)

MariaDB [TestOpt]> create index PZN on TestBig(PZN);
Query OK, 10000000 rows affected (1 min 59.57 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestOpt]> show indexes from TestBig;
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestBig |          1 | PZN      |            1 | PZN         | A         |       22321 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> 
Instead of waiting about 2 minutes every time before I could start the test of the other case I used the hint; doing it this way was faster.




correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) also included in this statement is the hint SQL_NO_CACHE. This hint tells the database-server to do the fetches again instead of using the internal cache. This helped me when running this statement again and again. So this statement contains two hints.
2) the table TestSmall contains 301,036 and the table TestBig contains 10,000,000 records so each table is indeed scanned only once (add 1 for detecting the EOF-situation).
3) you will find a description of the inspection-process here: JOIN (3).
4) I will stop here, at least for this post.
5) the table TestBig is accessed every time a matching record is found in TestSmall so the number 311 does not tell that 311 records are read from TestBig but that TestBig is accessed 311 times including reporting NESTED_LOOP_NO_MORE_ROWS multiple times.
6) you will find more details of this behavior here: JOIN (2)
7) I've already described this macro here: JOIN (3)