an unsystematic journey through the source-code of MySQL and MariaDB: 2017

Monday, 17 April 2017

subselect: execution (5)

I want to come back to my initial plan and look into the code, following the flow of C-statements while executing one given SQL-statement. For this text I will look in detail how the server executes the WHERE-clause. I will start my description with the the WHERE applied to the table TestBig (the subselect-part), and then look at the handling of the WHERE for the table TestSmall.

warning

Before I describe my observation let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.

environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables, it's even the same SQL-statement as in my last post. But for this post my focus changes. In this text I will look into the code that realizes the WHERE-clause.

the SQL-statement

For this text this is the statement I want to inspect:

MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (25 min 4.84 sec)

MariaDB [TestOpt]>

WHERE

As I've written in my last text the WHERE-clause for the table TestBig looks like that:

For this text I will sharpen my look and will only look at this detail: the WHERE-clause. So where is the code that executes this WHERE?

the code

You will find the code for handling the WHERE in the file sql_select.cc:

static enum_nested_loop_state
evaluate_join_record(JOIN *join, JOIN_TAB *join_tab,
                     int error)
{
  ....
  COND *select_cond= join_tab->select_cond;
  ....
  if (select_cond)
  {
    select_cond_result= MY_TEST(select_cond->val_int());
  ....
}

I've marked the relevant part in bold.

MY_TEST is simply a macro that converts any output into 1 or 0. So the WHERE-clause is represented by the call of val_int() of the object pointed to by the variable select_cond. That's all.

Naturally that's not all. For this concrete example I will show you how the tree, given in graphical form above, is handled by this simple call. The tree is constructed somewhere else, but this is not part of this text. So let's simply start with an example to show the function-hierarchy that is equivalent to the tree.

applying WHERE to TestBig

And as I've already written I will start with the WHERE applied to records read from the table TestBig.

So here is the record that I will examine:

MariaDB [TestOpt]> select * from TestBig limit 1;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

As you can see the value in the column Hersteller does not match the condition so the left part of the tree results in false. As the right and the left part of the tree are combined via an AND-statement the right part does not need to be examined.

first example: left part of the tree

For the record given here his is the hierarchy that detects and handles this:

Item_cond_and::val_int()        called via the variable select_cond
    Item::val_bool()
        Item_func_eq::val_int()
            Arg_comparator::compare()
                Arg_comparator::compare_string()
                    Item_field::val_str()
                    Item_string::val_str()
                    sortcmp()

The function sortcmp() returns with a value of -1 ( = not equal to 0) because the values differ (36367 ≠ 25677). So the code jumps back, the function item_func_eq::val_int() sets the return-value to 0 (= false) and at the AND-level will returns with false as a return-value.
This record from TestBig is thrown away. The server continues with reading the next record from TestBig.

second example: right part of the tree

The next record fetched from TestBig looks like this:

MariaDB [TestOpt]> select * from TestBig limit 1,1;
+----------+------+------+-------+-------------------------+---------------------+------------+
| Id       | PZN  | EVP  | HAP   | ArtikelBez              | ArtikelText         | Hersteller |
+----------+------+------+-------+-------------------------+---------------------+------------+
| 01000001 | 111  | 0.00 | 18.91 | PK FUER TIERE    250 ML | URETERSCHIENEAC4P66 | 36367      |
+----------+------+------+-------+-------------------------+---------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestOpt]>

As you can see for this record the column Hersteller contains the correct value so the left part of the tree returns true. Now the right part has to be examined. As I've already described the left part of the tree I will generously skip over this part. Here is how this record is checked:

Item_cond_and::val_int()        called via the variable select_cond
    // checking the column Hersteller, same as above, omitted here
    // as the return-value is true for this case the comparison continues:
    Item::val_bool():   
        Item_func_eq::val_int()
            Arg_comparator::compare()
                Arg_comparator::compare_string()
                    Item_field::val_str()
                    Item_field::val_str()
                    sortcmp()

The only difference to the hierarchy given above is inside the function compare_string(). Now two columns have to be compared instead of a column with a string-constant, so compare TestSmall.PZN and TestBig.PZN. As the values are 111 (for A aka TestBig) and 12 (for B = TestSmall, shown later in this text) so the comparisons results in false (aka not equal) and this record is thrown away. The server continues with reading the next record from TestBig and so on.

As you can see from the box in the beginning of this text there are 4 records that will be returned by this statement so there must be 4 records in the table TestBig where both parts of the tree result in true. These records are handled further by the server but this is not part of this text.

coding the AND

Here is the code that realizes the AND:

longlong Item_cond_and::val_int()
{
  DBUG_ASSERT(fixed == 1);
  List_iterator_fast li(list);
  Item *item;
  null_value= 0;
  while ((item=li++))
  {
    if (!item->val_bool())
    {
      if (abort_on_null || !(null_value= item->null_value))
    return 0;               // return FALSE
    }
  }
  return null_value ? 0 : 1;
}

The while-loop realizes the AND, that's all. If one of the computations inside the brackets returns a false the whole statement is false (it's an AND !) so the function is left with a return-value FALSE.

applying WHERE to TestSmall

So I've shown how the WHERE in the subselect-part of the statement is handled. Now I want to show how the WHERE is applied to the table TestSmall. And this is, in graphical form, the condition to be applied:

third example: checking TestSmall

Here is the first record from TestSmall that that matches the WHERE-condition:

MariaDB [TestOpt]> select * from TestSmall where Hersteller = '00020' limit 1;
+---------+------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN  | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 12   | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
+---------+------+------+------+--------------------------------+----------------------------+------------+
1 row in set (0.04 sec)

MariaDB [TestOpt]>

Again I omit the left part of the tree as it's already shown above. So let's look at what happens on the right side of this tree. Here is the function-hierarchy:

Item_cond_and::val_int()        called via the variable select_cond
    // checking the column Hersteller, same as above, omitted here
    // as the return-value is true for this case the comparison continues:
    Item_cache_wrapper::val_bool()
        Item_cache_wrapper::cache()
            Item_cache_int::cache_value()
                Item::val_int_result() 
                    Item_exists_subselect::val_int()
                        Item_subselect::exec()
                             subselect_single_select_engine::exec()
                                JOIN::exec()
                                    JOIN::exec_inner()
                                        do_select()
                                            sub_select() 
                                                // read record from table
                                                // evaluate this record
                                                // until EOF

In the graphical representation the right part of the tree simply contains the box SUBSELCT. This is a statement of its own which is handled as (partially) described above. The result is put into a cache. The hierarchy shows in the lower parts how the table TestBig is scanned and in the upper parts how the cache is checked for results. But that's another story.

correctness

This text is a simplified presentation of what's going on in the software. As I'm still learning the inner-workings of this software errors on my side can happen. In case something is wrong with my description please use the mail-function of this site and send me a mail describing my error. I will look at it. Thanks.

Wednesday, 15 February 2017

subselect: execution (4)

Initially I didn't plan to write this text. My intention was to write how the server executes a special statement but by collecting the data for my text I found something that was more interesting than my initial idea (the same happened with the last text). I decided to put my plan aside and write about my observation.

As this text is about engine condition pushdown (ECP) implemented in MariaDB so please look into an older post of mine with the information about this topic: WHERE. I will use the information presented there and especially the piece of code described in this text.

I want to show and describe what the engine condition pushdown reports on my SQL-statement and to make it short: I don't like the result. Maybe it's a bug in the code, maybe I did something wrong. Let's start with the example.

warning

environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc.. I've changed the SQL-statement and therefore my focus changes. I'm looking in the handling of the extreme-situation when no indexes are defined on the tables (silly, but the server has to handle this somehow). In this text I want to describe the contents of the condition-tree; as this happens in the layer above the storage-engine I think my description is valid for all engines but I didn't check this. Nevertheless I want to tell you I've used the MyISAM-engine for my tests.

the SQL-statement

For this text this is the statement I want to inspect:

MariaDB [TestOpt]> set optimizer_switch='engine_condition_pushdown=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020';

No result given here, I will focus on the data in the condition-tree.

the condition-tree

As condition-pushdown is enabled for this test the function ha_myisam::cond_push() is called and it is called multiple times as I will show you soon. Using the code given here I will receive this output in the console-window of Eclipse:

table_name = <TestSmall>    <select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020'>
COND-ITEM   args: 0 type=[COND_AND_FUNC]    
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [B] [Hersteller]  name=<Hersteller>
STRING-ITEM     str_length=<5>  str_value=<00020>   name=<00020>
SUBSELECT-ITEM  

table_name = <TestBig>  <select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020'>
COND-ITEM   args: 0 type=[COND_AND_FUNC]    
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [A] [Hersteller]  name=<Hersteller>
STRING-ITEM     str_length=<5>  str_value=<36367>   name=<36367>
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [A] [PZN] name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] name=<PZN>

This text will appear before any data is read from the tables.

This form looks ugly so I will present the same data (= the condition-tree) in a better form:
for the table TestSmall:

and for the table TestBig¹⁾:

These 2 trees were given to the function ha_myisam::cond_push() before any data was read from the tables. So let's go to the execution-stage.

executing the statement

As this is the same statement as in my last post so please look into this text for a description of the steps for executing this statement.

The execution begins with a table-scan on the table TestSmall. Before starting this table-scan the function cond_push() is called again with the condition-tree for this table which is a little bit modified, but this is not of interest here. Then reading the table TestSmall starts and when a record matching the condition is found the server switches over to the table TestBig and searches for corresponding records in this table. As described the only way to do this is by a table-scan. Before this table-scan starts cond_push() is called again, this time with the condition-tree for the table TestBig (the call is done in init_read_record()). And this condition-tree is of interest, it looks like this:

table_name = <TestBig>  <select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020'>
COND-ITEM   args: 0 type=[COND_AND_FUNC]    
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [A] [Hersteller]  name=<Hersteller>
STRING-ITEM     str_length=<5>  str_value=<36367>   name=<36367>
FUNC-ITEM   [=] args: 2 type=[EQ_FUNC]  
FIELD-ITEM  [TestOpt] [A] [PZN] name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] name=<PZN>

Of interest are the last 2 lines, for this reason I've marked them in bold. For the next examples I will only present these lines.

If you compare this output with the output presented some lines above you will not see any difference. Instead of explaining what I expected I would like to let the server continue it's work: it starts reading record after record from TestBig and when a match is found it stops reading TestBig and returns to TestSmall. There it continues reading.

When the next match is found in TestSmall it will switch over to the table TestBig again. Before reading the first record from TestBig it calls cond_push(). The line of interest now looks like this:

FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<2>  str_value=<37>  name=<PZN>

Something has changed in the condition-tree, I've marked this part in bold: the line contains a value from the column PZN of the current record in TestSmall (aka B). Let's loop the server through the records in TestBig, maybe until the end when no match is found. It then returns to the table TestSmall and continues reading records from this table.

OK, this behaviour repeats, in total 60 times (for my data here, as described in my last post). So I want to present only these 2 lines from the output of the condition-tree and only for the first 5 calls²⁾:

1st match (repeated):
FIELD-ITEM  [TestOpt] [A] [PZN]                                      name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN]                                      name=<PZN>

2nd match:
FIELD-ITEM  [TestOpt] [A] [PZN] str_length=<6>  str_value=<      >   name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<2>  str_value=<37>       name=<PZN>

3rd match:
FIELD-ITEM  [TestOpt] [A] [PZN] str_length=<6>  str_value=<      >   name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<7>  str_value=<222    >  name=<PZN>

4th match:
FIELD-ITEM  [TestOpt] [A] [PZN] str_length=<3>  str_value=<<   >     name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<3>  str_value=<371>      name=<PZN>

5th match:
FIELD-ITEM  [TestOpt] [A] [PZN] str_length=<6>  str_value=<      >   name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<7>  str_value=<3717997>  name=<PZN>

and so on

For getting the information presented above I used the member-functions of the Item-class (aka COND-class) of derived classes especially the member str_value (of type String). If this is the wrong way to access the information please drop me a line. Thanks.

So let's look at the data and compare these values with the data in the box above:

MariaDB [TestOpt]> select * from TestSmall where Hersteller = '00020' limit 5;
+---------+---------+------+------+--------------------------------+----------------------------+------------+
| Id      | PZN     | EVP  | HAP  | ArtikelBez                     | ArtikelText                | Hersteller |
+---------+---------+------+------+--------------------------------+----------------------------+------------+
| 1002100 | 12      | 3.95 | 1.83 | TINY TOON ADVENTURES KIND 1 ST | TINY TOON ADVENTURES KIND  | 00020      |
| 1025266 | 3717968 | 0.00 | 2.90 | PRESSOTHERM KALT/WA 13X14 1 ST | PRESSOTHERM KALT/WA 13X14  | 00020      |
| 1025267 | 222     | 0.00 | 4.45 | PRESSOTHERM KALT/WA 12X29 1 ST | PRESSOTHERM KALT/WA 12X29  | 00020      |
| 1025268 | 3717980 | 0.00 | 6.30 | PRESSOTHERM KALT/WA 16X26 1 ST | PRESSOTHERM KALT/WA 16X26  | 00020      |
| 1025269 | 3717997 | 0.00 | 6.30 | PRESSOTHERM KALT/WA 20X20 1 ST | PRESSOTHERM KALT/WA 20X20  | 00020      |
+---------+---------+------+------+--------------------------------+----------------------------+------------+
5 rows in set (0.06 sec)

MariaDB [TestOpt]>

expectations

So what did I expect? Engine condition pushdown gives the storage engine a chance to use the information in the WHERE-clause of the statement to speed-up a query. I like the idea.

In my case here the function cond_push() is initially called for the table TestSmall and then for the table TestBig. Then the server starts reading the table TestSmall and calls cond_push() again. Agreed. When it switches over to the table TestBig it calls cond_push() again, giving it the condition-tree for this table. Again: agreed. But if I look at the 5 outputs of the condition-tree I do not see the values found in the PZN-column shown above.

inspecting the condition-tree

The function cond_push() is called before any real access of the data-file happens When this happens the first time a value of 12 is found in the column PZN of the table TestSmall and I expected to see this value in the corresponding line:

output found:
FIELD-ITEM  [TestOpt] [A] [PZN]                                      name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN]                                      name=<PZN>

output expected:
FIELD-ITEM  [TestOpt] [A] [PZN] str_length=<6>  str_value=<      >   name=<PZN>
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<2>  str_value=<12>       name=<PZN>

So let me show in short form the output (the last line of the condition-tree-output) of all 5 calls of cond_push() that I would expect:

1st match:
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<2>  str_value=<12>       name=<PZN>

2nd match:
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<7>  str_value=<3717968>  name=<PZN>

3rd match:
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<3>  str_value=<222>      name=<PZN>

4th match:
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<7>  str_value=<3717980>  name=<PZN>

5th match:
FIELD-ITEM  [TestOpt] [B] [PZN] str_length=<7>  str_value=<3717997>  name=<PZN>

and so on

Please compare these lines with the real output, presented some lines above. These are the differences:

the value from the current record of TestSmall is given in the correct form in the condition-tree representing the WHERE-condition for TestBig
currently the length-information is incorrect, it's one row behind. This should be the length of the string in the current row, e.g. compare the length-information for the 3rd match: actual it's 7 but it should be 3.

evaluation of the results

Engine condition pushdown gives the storage-engine the information which it can use internally to speed-up the request. It should be given the correct information and all the information available. For the statement presented here this means that when cond_push() is called before the first access of the table TestBig the condition-tree should contain the entry 12 (= the value of the column PZN of the current row of TestSmall in my example). And for the 2nd access the tree should contain the value 3717968 (= value of the column PZN of the then current row of the table TestSmall. And so on.

conclusion

I see these 2 possibilities:

it's a bug in the code

the information is there but I used the wrong function for accessing it

I will ask the developers.

update

2017/05/10: problem fixed. You will find a corrected version of the code presented here in this file: WHERE.cc. With this code I got the the values as expected.

correctness

1) I had to give the equal-functions different names otherwise the code for drawing the tree would have some trouble with drawing the correct relations

2) let's ignore the first (initial call to cond_push(), I want to look at the calls before the table-scan starts

Friday, 27 January 2017

subselect: execution (3)

It wasn't planned to write a text with this content. Initially I wanted to write how the server executes another statement but when I played with the statement I found this behaviour of the server (described some lines below) and I decided to write something about this.

In my last post (subselect: execution (2)) I played with a special SQL-statement, a modified version of a statement I've used before using a subselect, and I showed how this statement is treated in the execution-stage of the server. For this text I simply change the IN-clause in the statement into an EXISTS-clause and look what's happening then. I only did this test in MariaDB, it's up to you to do these test again in MySQL if you are interested in it.

In this text I'm using information which I've already described in older posts and I don't want to repeat this information here, so please look at these posts: caches and od.

warning

Before I describe my observation, my ideas and my fix let me issue a warning: the functions, hierarchies and data-examples shown here are valid for the environment on my PC. If you do your own tests this may look a bit different.

environment

Again I'm using my usual environment which I already described in one of my older posts. It's the same tables, the same data in the tables etc.. I've changed the SQL-statement and therefore my focus changes. I'm looking in the handling of the extreme-situation when no indexes are defined on the tables (silly, but the server has to handle this somehow). In the older texts in this series I stopped my description when a call into the storage-engine is made. In this text I will mainly look into the code of the storage-engine and the engine used for storing the data in the tables is MyISAM, things may look different with other engines.

the SQL-statement

For this text this is the statement I want to inspect:

MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020';
+--------+----------------------------+
| PZN    | ArtikelText                |
+--------+----------------------------+
| 178324 | TINY TOON ADVENTURES KIND  |
| 87886  | MONDOS KONDOME COLOR       |
| 87923  | MONDOS KONDOME NATURE      |
| 87952  | MONDOS KONDOME XXL         |
+--------+----------------------------+
4 rows in set (25 min 4.84 sec)

MariaDB [TestOpt]>

Let me add the explain the server returns:

MariaDB [TestOpt]> MariaDB [TestOpt]> explain select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN) and B.Hersteller = '00020'\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: B
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 301036
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: A
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
2 rows in set (0.01 sec)

MariaDB [TestOpt]>

The server starts with a table-scan on the table TestSmalll and in the case of a record matching the WHERE-condition it starts another query on the table TestBig. This query can only be done by a table-scan as no index on this table exists. This explains the time needed for the execution of this statement.

some calculations

Let's check the data and see how many records are involved. So here are the number of records in the table TestSmall that match the (partial) WHERE-condition:

MariaDB [TestOpt]>MariaDB [TestOpt]> select count(*) from TestSmall where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|       60 |
+----------+
1 row in set (0.39 sec)

MariaDB [TestOpt]>

There are 301.036 records in the table TestSmall¹⁾ and 60 of these match the (partial) WHERE-condition of the SQL-statement above.

And here is the data for the table TestBig:

MariaDB [TestOpt]>MariaDB [TestOpt]> select count(*) from TestBig where Hersteller = '36367';
+----------+
| count(*) |
+----------+
|   461016 |
+----------+
1 row in set (12.40 sec)

MariaDB [TestOpt]>

Let's look at the first number: by executing the statement above 60 records in TestSmall match the WHERE-condition. When such a record is found the database-server has to switch to the table TestBig and start reading this table in a table-scan, in the worst case when no matching record is found in TestBig it hast to read this table to the end.

In an older post I introduced some counter-variables to count the number of times the function ha_myisam::rnd_next() is called. This is what these counters tell me about the statement shown above:

table = <TestBig>   counter table-scan: 573354474   index-scan: 0
table = <TestSmall> counter table-scan: 301037  index-scan: 0

As there are 301.036 records in the table TestSmall this table is indeed read once (add 1 for detecting the EOF-situation) but what about TestBig? When the first record matching the condition is found in TestBig the table-scan stops so there are less than 60*10mio. read-operations. Let's verify this: I added this text to the subselect-part of the statement above: and Id = 1. As there is no value 1 in the Id-column this part is always false and the server has to read the full table. For this statement it tells at the end:

table = <TestBig> counter table-scan: 600000060 index-scan: 0
table = <TestSmall> counter table-scan: 301037 index-scan: 0

As the table TestBig contains 10 mio. records (add 1 for the EOF) you can see that this table is indeed read 60 times.

Everything seems clear now. But wait: what about the 25 minutes?

Let's omit the time reading the table TestSmall, one third of a second is a tiny fraction of the 25 minutes the whole process takes, so let's only look at the time for reading TestBig. A full table-scan on this table takes about 12 seconds as shown some lines above. As this is done 60 times I would expect the whole process to take about 12 minutes but it does take twice as long. Why this?

into the code

For inspecting this situation I modified the code in ha_myisam.cc again:

added a global variable: time_t timeStart;
in ha_myisam::rnd_init() initialize the variable: timeStart = time( NULL );

in ha_myisam::rnd_end() there was already a line of code for outputting the value of the counter-variables. I modified this line to output the time too:

fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\ttime = %ld\n", table_share->table_name.str, counterTableScan, counterIndexScan, time(NULL)-timeStart);

For measuring the time used for a table-scan in this environment it's sufficient to work with a precision of 1 second so the function time() may help here.

And here is the result of the next run of my statement:

table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 26
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 27
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 27
......

As you can see the first scan is fast, as expected, but all other scans are slow. Why this?

By stepping into the code I found this function:

int _mi_read_cache(IO_CACHE *info, uchar *buff, my_off_t pos, uint length,
     int flag)
{
  ....
  if (pos < info->pos_in_file)
  {
    .....
    if (mysql_file_pread(info->file, buff, read_length, pos, MYF(MY_NABP)))
      DBUG_RETURN(1);
    if (!(length-=read_length))
      DBUG_RETURN(0);
    ....
  }
  DBUG_RETURN(0);
} /* _mi_read_cache */

As described in my older text the data is read from the file in a block of 128K bytes. The variable info->pos_in_file points to the position in the file containing the beginning of the bytes currently in the block (=buffer). When a record is read the bytes are copied from the (global) buffer into the local buffer - simplified. When the global buffer is at the end the next block of 128K bytes will be read from the file. At the end of the table-scan the global buffer contains the last bytes of the file. When the next table-scan starts from the beginning of the file this information will not be reset so the if()-statement is true in most cases because the bytes from the beginning of the table are requested. So the code following the if()-statement will be executed: the data will be read from the disk directly by calling the read()-function of the Std-library. As there are 10 mio. records in this table it means millions of calls of this function²⁾. And that's slow compared to copying the bytes from the buffer.

How can this be fixed?

a quick and dirty fix

By looking into the code I found a function reinit_io_cache() which looks promising So I added a call to this function to the code. Where do I add this call? The solution I will present here is quick and dirty:

it's quick: it can be implemented and tested in minutes
it's dirty: I added it in the wrong place (I'm sure about this). And it's specific for my test-case.

Nevertheless it's a test only so I did this:

in ha_myisam.cc I added these lines:

a global variable: bool is1stScan = true;

in the function ha_myisam::rnd_init() I added these lines:

 counterTableScan = counterIndexScan = 0;
 char *tableName = table_share->path.str;
 if ( 0 == strcasecmp(tableName, "./TestOpt/TestBig") )
 {
     if ( !is1stScan )
         reinit_io_cache(&file->rec_cache,READ_CACHE,0L,0,0);
     is1stScan = false;    // for the following calls
 }

So recompiling and starting the server and in the next test this result showed up:

MariaDB [TestOpt]> select SQL_NO_CACHE   B.PZN, B.ArtikelText  from TestSmall B where exists ( select 1 from TestBig A where A.Hersteller = '36367' and A.PZN = B.PZN and Id = 1) and B.Hersteller = '00020';
Empty set (13 min 22.58 sec)

MariaDB [TestOpt]>

That looks better. Please keep in mind that this is he statement that enforces a table-scan up to the end (I've marked that part of the statement in bold).

And here are the timing-values of this execution:

table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 13
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 13
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
table = <TestBig> counter table-scan: 10000001 index-scan: 0 time = 12
.......

Please compare this to the values presented above: that looks good. I modified the SQL-statement back to the original one and let it run: it took about 11 minutes and 40 seconds, as expected.

That's it for today.

changing the order of the tables

What do you think about changing the place of the tables meaning putting the table TestSmall into the subquery and the table TestBig outside? So the execution will start whith a table-scan on the TestBig and read the table TestSmall only when a match is found? Loks like this:

MariaDB [TestOpt]> select SQL_NO_CACHE  B.PZN, B.ArtikelText  from TestBig B where exists ( select 1 from TestSmall A where A.Hersteller = '00020' and A.PZN = B.PZN and Id = 1) and B.Hersteller = '36367';
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
MariaDB [TestOpt]>

Forget about this statement. Let's do some quick calculations: it takes about 1/3 of a second to scan the table TestSmall. As there are 460.000 records in TestBig matching the WHERE-condition and for each record found the table TestSmall is scanned so it takes about 460.000/3 seconds (approx. 42 hours), so as I've already written forget about this.

correctness

1) if you look into the explain you will find this number

2) approx. 20 mio. calls as in the case of MyISAM reading a record means at least 2 calls