Wednesday 7 December 2016

subselect: execution (2)


in my last post (subselect: execution) I played with a special SQL-statement and showed how this statement is treated in the execution-stage of the server. For this text I simply want to change the order of the tables within this statement and look what's happening then. Most of this text looks at the code of MariaDB, in the end I will describe the difference to MySQL.


warning

Before I dive into the code let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.


environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc.. I've changed the SQL-statement and therefore my focus changes.


the SQL-statement

For this text this is the statement I want to inspect:
MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (17.84 sec)

MariaDB [TestOpt]>

Let me add the explain the server returns:
MariaDB [TestOpt]> explain select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020'\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: <subquery2>
         type: eq_ref
possible_keys: distinct_key
          key: distinct_key
      key_len: 7
          ref: func
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
3 rows in set (0.02 sec)

MariaDB [TestOpt]>

As you can see the statement looks very similar to the statement in my last post. Again it's a statement with a subselect. If you look at it a bit closer you will see that the order of the tables changed: now the table TestBig is in the subselect (=subquery) and TestSmall is in the outer part of the query. Also this query takes about 10% more time than the old query; this can be due to the fact that the subselect now returns much more rows (about 400K) than the old subselect (about 60). And also this query returns only 4 rows, the old one returned 14 rows.

Why do we now have only 4 rows returned? Let's look at the result of the old query again:
MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText from TestBig A where A.PZN in ( select B.PZN from TestSmall B where B.Hersteller = '00020') and A.Hersteller = '36367';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | YPSIFLEX 10CMX4M EL FIXIER |
| 178324 | STIBIUM SULF AURANT D 6    |
| 87886  | GILOFA BAS 40 KNSTR 1KANDI |
| 178324 | SOFTCLIX II LANZETTEN      |
| 87952  | RUTIVISCAL N               |
| 178324 | CROTAMITON                 |
| 87952  | NOBASTRETCH KRAEFT 7MX6CM  |
| 87952  | STECHBECKEN 31CM STAHL     |
| 87923  | LIPP LOVE LUXUS SUN PROT12 |
| 87952  | TASTATURHILFE UNT NC21017  |
| 178324 | HOLLISTER KOL KLEBE F 2178 |
| 87952  | MEDI TEST GLUCOSE          |
| 178324 | ESTOSAN TUBE               |
| 87952  | FRUCHTS DUSCHGEL ORANGE    |
+--------+----------------------------+
14 rows in set (17.34 sec)

MariaDB [TestOpt]> 

If you look a bit closer at the result you will see duplicate values in the column PZN. This statements verifies this:
MariaDB [TestOpt]> select SQL_NO_CACHE  distinct A.PZN from TestBig A where A.PZN in ( select B.PZN from TestSmall B where B.Hersteller = '00020') and A.Hersteller = '36367';
+--------+
| PZN    |
+--------+
| 178324 |
| 87886  |
| 87952  |
| 87923  |
+--------+
4 rows in set (15.72 sec)

MariaDB [TestOpt]> 

So there are multiple records with identical value in the column PZN in TestBig, but not in the table TestSmall. This explains the different results.

So this statement is treated identical to the one described in the last text? Yes. Here is a description of the engine used for the temp-table created internally: MEMORY Storage Engine. But one term caught my interest: max_heap_table_size. So what will happen when we the size of the table will exceed this value?

From the documentation given we know that this variable can contain a value between 16 MB and 4.294.966.272 (= 4GB - 1KB). So let's play with this value.


playing with max_heap_table_size

Let's start with the fefault-value:
MariaDB [TestOpt]> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|              16777216 |
+-----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (16.66 sec)

MariaDB [TestOpt]>

As you can see the variable has a value of 16 MB. Let's try to set this to a minimal value:
MariaDB [TestOpt]> set max_heap_table_size = 1024;
Query OK, 0 rows affected, 1 warning (0.00 sec)

MariaDB [TestOpt]> select @@max_heap_table_size;
+-----------------------+
| @@max_heap_table_size |
+-----------------------+
|                 16384 |
+-----------------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE  A.PZN, A.ArtikelText  from TestSmall A where A.PZN in ( select B.PZN from TestBig B where B.Hersteller = '36367') and A.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (35.47 sec)

MariaDB [TestOpt]>

Something changes now. I tried to set the value of the variable max_heap_table_size to a minimal value but the server sets the value to 16 KB. And the execution of the statement takes a lot more of time, it more then doubles.

So why does this happen? And wehre does it happen?

Let's dive into the code.


identical behaviour

In the beginning the behaviour of the server is identical to the last example, so here is only a small reminder of the hierarchy for reading the first record from the table TestSmall1):
mysql_select()
    JOIN::exec()
        JOIN::exec_inner()
            do_select()
                sub_select()
                    join_init_read_record()           called via (*join_tab->read_first_record)()
                        rr_sequential()
                    evaluate_join_record()

As (in my case) evaluate_join_record() did not find a match the server goes to the main loop and reads record after record from TestBig and inspects each one:
sub_select()
    rr_sequential()       called via info->read_record()
    evaluate_join_record()

When evaluate_join_record() finds a matching record from TestSmall it acts as described in my last text: the server switches from level 1 (=reading from TestSmall) to level 2 (=handling the temp-table) to level 3 (= reading from TestBig and putting the data found into the temp-table). On this level the hierarchy looks like this:
sub_select()                                level 3
    evaluate_join_record()
        end_sj_materialize()                called via (*join_tab->next_select)()
            fill_record()
                Item_field::save_in_field()
                    save_field_in_field()
            handler::ha_write_tmp_row()
                ha_heap::write_row()

All the same, as before.


new behaviour

With the SQL-statement shown some lines above I've set the size of a table of type MEMORY to max. 16K bytes, and now things will change.

As there are approx. 400K records in TestBig that fulfill the WHERE-condition and therefore have to be put into the temp-table the size of this table will be at least approx. 2.4 MB (=400K records * 6 bytes for the PZN-value) which exceeds the new give max. value for this table. So let's look how this situation is detected and handled.

Somewhere in middle of the execution I looked into the /temp-directory and found this:
august@AMD4:~/MariaDB/pgm$ ls -l /tmp
insgesamt 36
....
-rw-rw---- 1 august august  4104 Nov 30 16:31 #sql_243a_0.MAD
-rw-rw---- 1 august august 16384 Nov 30 16:31 #sql_243a_0.MAI
....
august@AMD4:~/MariaDB/pgm$ 

As you can see 2 files are created with the name of the temp-table plus the extension MAD- and MAI. By looking into the MAD-file I found this:
august@AMD4:~/MariaDB/pgm$ od -t x1 -Ax /tmp/#sql_243a_0.MAD
000000 ff 35 33 35 38 31 38 20 ff 39 39 39 39 39 39 20       '535818 ' '999999 '
000010 ff 32 30 34 36 36 35 20 ff 35 31 33 35 37 33 20       '204665 ' '513573 '
000020 ff 33 38 32 39 37 30 20 ff 32 31 37 38 35 37 20       '382970 ' '217857 '
and so on

This looks like the data-file of the Maria-engine (this contains a mode almost identical to MyIsam2)) and the contents shown are the first values of the column PZN from the table TestBigl. And the MAI-file looked like the corresponding index-file. So where does this happen in the code?

It can only happen when a new record is added to the temp-table and with this record the temp-table exceeds some given limit. Let's look over there:
sub_select()                      we are at level 3
    evaluate_join_record()
        end_sj_materialize()
            handler::ha_write_tmp_row()
                ha_heap::write_row()
                    heap_write()
                        next_free_record_pos()

In the last function given in this hierarchy this situation is detected:
  if (!(block_pos=(info->records % info->block.records_in_block)))
  {
    if ((info->records > info->max_records && info->max_records) ||
        (info->data_length + info->index_length >= info->max_table_size))
    {
      ....
      my_errno=HA_ERR_RECORD_FILE_FULL;           = 135
      DBUG_RETURN(NULL);
    }
    ....
  }

In the case I inspected the var. info->max_records contained a value of 512 (more on this later), trying to insert the next records results in the function returning NULL. So the code marches the hierarchy up to the function end_sj_materialize() to the point where the call started. Here the code looks like this:
    if ((error= table->file->ha_write_tmp_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_internal_tmp_table_from_heap(thd, table,
                                              sjm->sjm_table_param.start_recinfo, 
                                              &sjm->sjm_table_param.recinfo, error, 1, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }

Normally the function ha_write_tmp_row() returns 0 (when everything is OK) or HA_ERR_FOUND_DUPP_KEY (when the value is already in the temp-table) but in the case of HA_ERR_RECORD_FILE_FULL the contents of the temp-table is written on the disk. If you look at the information some lines above you will see that an entry in the MAD-file is 8 bytes long. So with 513 records the size of the temp-file on disk is 4104 bytes.

From this moment on the temp-table is no longer of type MEMORY but of type MARIA, the table is no longer stored in RAM but is stored on disk and all reading and writing to the temp-table is handled with the disk-based version of the file (or with the cache of this file). You will loose the speed-advantage of the RAM-based table but the handling of the statement continues. The situation of overflowing the size-limit for the temp-table in RAM is detected and the server silently switches to a disk-based solution and continues.

This explains the increase in execution-time needed when the max. size of the RAM-based HEAP-table is changed to a minimal value.

Where does the value 512 for the switch comes from? You will find the code in heap_prepare_hp_create_info():
    case HA_KEY_ALG_HASH:
      keydef[key].algorithm= HA_KEY_ALG_HASH;
      mem_per_row+= sizeof(char*) * 2; // = sizeof(HASH_INFO)               result = 16
      break;
....
  mem_per_row+= MY_ALIGN(share->reclength + 1, sizeof(char*));              result = 32
....
  max_rows= (ha_rows) (hp_create_info->max_table_size / mem_per_row);       result = 512
....
  hp_create_info->max_records= (ulong) MY_MIN(max_rows, ULONG_MAX);         result = 512

In the beginning of the execution of this SQL-statement the value for max_records is computed (an estimation) and set.

Here my journey ends for today.


what about MySQL?

No, it's not finished yet. MySQL acts very similar but it creates a file on disk as a MyISAM-file. Here you can see the differences in the code:
MySQL3) MariaDB4)
in end_sj_materialize():
    if ((error= table->file->ha_write_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_myisam_from_heap(thd, table,
                                  sjm->table_param.start_recinfo, 
                                  &sjm->table_param.recinfo, error,
                                  TRUE, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }
in end_sj_materialize():
    if ((error= table->file->ha_write_tmp_row(table->record[0])))
    {
      /* create_myisam_from_heap will generate error if needed */
      if (table->file->is_fatal_error(error, HA_CHECK_DUP) &&
          create_internal_tmp_table_from_heap(thd, table,
                                              sjm->sjm_table_param.start_recinfo, 
                                              &sjm->sjm_table_param.recinfo, error, 1, NULL))
        DBUG_RETURN(NESTED_LOOP_ERROR); /* purecov: inspected */
    }
/**
  If a MEMORY table gets full, create a disk-based table and copy all rows
  to this.

......
*/

bool create_myisam_from_heap(THD *thd, TABLE *table,
                             MI_COLUMNDEF *start_recinfo,
                             MI_COLUMNDEF **recinfo, 
        int error, bool ignore_last_dup,
                             bool *is_duplicate)

/*
  If a HEAP table gets full, create a internal table in MyISAM or Maria
  and copy all rows to this
*/


bool
create_internal_tmp_table_from_heap(THD *thd, TABLE *table,
                                    TMP_ENGINE_COLUMNDEF *start_recinfo,
                                    TMP_ENGINE_COLUMNDEF **recinfo, 
                                    int error,
                                    bool ignore_last_dupp_key_error,
                                    bool *is_duplicate)


And that's it for today.




correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.




some notes:
1) in the last text the reading starts with the table TestBig, in this case it starts with reading from the table TestSmall because this is the table in the outer past of the statement
2) I already described this here: od
3) MySQL vdersion 5.6.22
4) MariaDB version 10.0.10