environment
You may look here to find a description of the environment used in this post. Only this has changed:
MariaDB [TestOpt]> show indexes from TestSmall;
Empty set (0.01 sec)
MariaDB [TestOpt]> show indexes from TestBig;
Empty set (0.01 sec)
MariaDB [TestOpt]>
So you see there are no indexes defined on the tables, and I want to describe how the database server handles a statement in this case.
warning
All posts in this series refer to MariaDB. The code examples presented here are taken from the source code of MariaDB 10.0.10, but the explanations should also be valid for other versions and for MySQL. The example uses MyISAM as the storage engine for the tables; some of the statements here are only valid for this engine.
Please do not generalize the results presented here. A lot of my statements in this post depend on the data in the tables, and this is my data. If you do your own tests don't expect to see the same results.
the statement
This is the statement I'm looking at:
MariaDB [TestOpt]> select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367';
+--------------+
| count(A.PZN) |
+--------------+
|           14 |
+--------------+
1 row in set (24.22 sec)
MariaDB [TestOpt]> explain select SQL_NO_CACHE count(A.PZN) from TestSmall A join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '00020' and B.Hersteller = '36367'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: A
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 301036
Extra: Using where
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: B
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 10000000
Extra: Using where; Using join buffer (flat, BNL join)
2 rows in set (0.00 sec)
MariaDB [TestOpt]>
As you can see, I've included the EXPLAIN output. Please do not ask me about the meaning of this statement; it is only used for this demonstration, nothing else.
As before there are 301,036 records in the table TestSmall and 10 million records in the table TestBig.
In some earlier posts I described how the same statement is executed: a table scan is done on TestSmall. Each record read is inspected, and if it satisfies the (partial) WHERE condition, the table TestBig is read, usually via an index, using the information from the record just read from TestSmall plus the information in the SQL statement to fetch the corresponding records from TestBig.
In this post I assume no indexes on the tables, so following this strategy would mean: read the table TestSmall in a table scan, and whenever a match is found, search the full table TestBig for the corresponding record(s). This means that the big table TestBig would be read by a table scan multiple times. Alternatively the execution could start with TestBig and, for each match, search the table TestSmall by a table scan; then the second table would again be searched multiple times.
Sounds silly. If you look at the statement above you will see that on my PC the server needed about 25 seconds to present the result of the query. It looks like the server took a different route to handle the query.
counters
In an older post I introduced some counter variables to count, at runtime, the number of times a function is called. Using this approach I get the following result in the file:
table = <testsmall>  counter table-scan: 301,037  index-scan: 0
table = <testbig>  counter table-scan: 10,000,001  index-scan: 0
So you can see that for this statement no index function is accessed, as there are no indexes. And each table is accessed once for each record in it (plus one more time to detect the end situation). It looks like the server scans one table and then the other table and does the JOIN operation elsewhere. Setting a breakpoint in the function ha_myisam::rnd_next() confirmed this observation.
function hierarchy
In this series I'm mostly interested in the handling of the statement in the optimizer stage, and I will restrict myself to that again in this post and omit a lot of other aspects. Here is the function hierarchy in the optimizer during the handling of the query:
make_join_statistics()
    JOIN_TAB::scan_time()                      // for table TestSmall
        TABLE::stat_records()
        handler::scan_time()
    JOIN_TAB::scan_time()                      // for table TestBig
        TABLE::stat_records()
        handler::scan_time()
    choose_plan()
        greedy_search()
            best_extension_by_limited_search()
                best_access_path()             // start with table TestSmall
                    JOIN_TAB::scan_time()
                        TABLE::stat_records()
                        handler::scan_time()
                // recursion:
                best_extension_by_limited_search()  // add table TestBig to the plan
                    best_access_path()
                        JOIN_TAB::scan_time()
                            TABLE::stat_records()
                            handler::scan_time()
                // start a new plan:
                best_access_path()             // start with table TestBig
                    JOIN_TAB::scan_time()
                        TABLE::stat_records()
                        handler::scan_time()
Again I have to write: don't generalize this hierarchy. It depends on the type of statement I'm looking at, but it should give you an overview of the process.
pseudo-code
I'll try to explain the handling: we have 2 tables, TestSmall and TestBig, and we have the SQL statement shown above. So: the inspection starts with the table TestSmall. The cost for reading this table in a table scan is computed as:
best_access_path():
records_read = 301,036
read_time = 6,127
The cost for this (partial) plan is computed to:
best_extension_by_limited_search():
/* Compute the cost of extending the plan with 's' */
current_record_count= record_count * position->records_read;
current_read_time=read_time + position->read_time +
current_record_count / (double) TIME_FOR_COMPARE;
and the results are:
current_record_count = 301,036 = 1 * 301,036
current_read_time = 66,334 = 0 + 6,127 + 301,036/5
As we do not have another plan yet this plan is the best plan so far and therefore the values are stored:
best_extension_by_limited_search():
best_record_count= current_record_count;
best_read_time= current_read_time;
Next we extend this partial plan by adding the table TestBig. For computing the cost values the function best_extension_by_limited_search() is called again (a recursion), and the results are:
best_access_path():
records_read = 10,000,000
read_time = 6,348,409
so for accessing these tables the cost is computed to:
best_extension_by_limited_search():
current_record_count = 3,010,360,000,000 = 301,036 * 10,000,000
current_read_time = 602,078,414,743 = 66,334 + 6,348,409 + (3,010,360,000,000 / 5)
And as this is the best plan so far, the values are stored:
best_extension_by_limited_search():
best_record_count= current_record_count;
best_read_time= current_read_time;
Now we have to store the plan, not only the results:
best_extension_by_limited_search():
if (current_read_time < join->best_read)
{
memcpy((uchar*) join->best_positions, (uchar*) join->positions,
sizeof(POSITION) * (idx + 1));
join->record_count= partial_join_cardinality;
join->best_read= current_read_time - 0.001;
}
and as join->best_read was initialized with DBL_MAX the expression in the if() computes to true and the plan is copied to join->best_positions.
The variable partial_join_cardinality is computed some lines above in this way:
best_extension_by_limited_search():
double partial_join_cardinality= current_record_count *
pushdown_cond_selectivity; // = (301,036 * 10,000,000) * 1
As there are no more tables left to inspect, this iteration ends. In the next iteration the search starts fresh with the table TestBig. These are the values for reading this table:
best_access_path():
records_read = 10,000,000
read_time = 211,613
so the cost for accessing this table is computed to:
best_extension_by_limited_search():
current_record_count = 10,000,000 = 1 * 10,000,000
current_read_time = 2,211,613 = 0 + 211,613 + 10,000,000/5
Next the optimizer would have to extend this plan with the next table in the list, the table TestSmall. But before doing this it checks whether it can expect a better plan:
best_extension_by_limited_search():
/*
Prune some less promising partial plans. This heuristic may miss
the optimal QEPs, thus it results in a non-exhaustive search.
*/
if (best_record_count > current_record_count ||
best_read_time > current_read_time ||
....
The if() computes to:
if ( 301,036 > 10,000,000 ||
66,334 > 2,211,613 ||
....
This computes to false, so the computation stops here; this plan is not extended. Now there are no more tables, so the whole inspection of plans stops here. And a plan was found: start with TestSmall and extend it by TestBig.
remark
Some lines above I presented this line:
current_record_count = 3,010,360,000,000 = 301,036 * 10,000,000
This line shows the calculation of the number of records fetched for the plan currently inspected. The multiplication tells us that for every record fetched from TestSmall the whole table TestBig must be read (or vice versa). This contradicts the final plan, in which each table is read only once, so an addition of the two numbers would fit better than a multiplication. At least this did not affect the decision for a plan.