Wednesday 1 October 2014

table-scan or index-scan


observation

In one of the tests I ran some time ago I observed that adding an index to an existing table slowed down a SELECT-statement. That was something I didn't expect, so I started looking into this effect. I would like to present my results so far here.

prerequisites

Instead of creating and dropping the index over and over again I created two tables of identical structure and put the same data into both: TestWithoutI (the table without an index) and TestWithI (the table with an index). Here they are:
 MariaDB [TestCol]> show create table TestWithI\G
*************************** 1. row ***************************
       Table: TestWithI
Create Table: CREATE TABLE `TestWithI` (
  `Id` char(8) DEFAULT NULL,
  `PZN` char(7) DEFAULT NULL,
  `EVP` char(8) DEFAULT NULL,
  `HAP` char(8) DEFAULT NULL,
  `ArtikelBez` char(40) DEFAULT NULL,
  `ArtikelText` char(26) DEFAULT NULL,
  `Hersteller` char(5) DEFAULT NULL,
  KEY `Hersteller` (`Hersteller`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

MariaDB [TestCol]> show create table TestWithoutI\G
*************************** 1. row ***************************
       Table: TestWithoutI
Create Table: CREATE TABLE `TestWithoutI` (
  `Id` char(8) DEFAULT NULL,
  `PZN` char(7) DEFAULT NULL,
  `EVP` char(8) DEFAULT NULL,
  `HAP` char(8) DEFAULT NULL,
  `ArtikelBez` char(40) DEFAULT NULL,
  `ArtikelText` char(26) DEFAULT NULL,
  `Hersteller` char(5) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

MariaDB [TestCol]> 

the observation

And here are the results of my test. First the version without any index:
 MariaDB [TestCol]> show index from TestWithoutI;
Empty set (0.00 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (10.74 sec)

MariaDB [TestCol]> 
And now the same test, but this table contains an index on one column (the column that appears in the WHERE-clause of the query):
 MariaDB [TestCol]> show index from TestWithI;
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestWithI |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+-----------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (18.93 sec)

MariaDB [TestCol]> 
Surprise: on the table TestWithoutI the SELECT-statement took approx. 10 seconds, on the table TestWithI it took approx. 18 seconds.

suspicion

Could it be that in the case of the table TestWithI the database server prefers an index-scan over a table-scan? Let the server explain what it did:
 MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithoutI
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
1 row in set (0.00 sec)

MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithI where Hersteller <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithI
         type: index
possible_keys: Hersteller
          key: Hersteller
      key_len: 6
          ref: NULL
         rows: 10000000
        Extra: Using where; Using index
1 row in set (0.01 sec)

MariaDB [TestCol]> 
Indeed it chooses the index-scan over the table-scan. To be absolutely sure, let's look into the code and verify this.

modifying code

Let's simply count the number of read-accesses of the table and the index-tree.

Therefore we need:
  • two counters,
  • code to initialize them,
  • code to increment the matching counter every time the next record or the next entry from the index-tree is read,
  • and finally code to output the values.

The tables used for testing this behaviour are of type MyISAM, so we have to modify the files ha_myisam.h and ha_myisam.cc; you will find both in the directory storage/myisam.
In the file ha_myisam.h please add this line at the end of the definition of the class. It should now look like this:
 
  int counterTableScan, counterIndexScan;   /* counts rows read via table-scan / entries read via index-scan */

};
Now in ha_myisam.cc please add these lines in the functions as described:
 in the function ha_myisam::index_init():
    counterTableScan = counterIndexScan = 0;

in the function ha_myisam::rnd_init():
    counterTableScan = counterIndexScan = 0;


in the function ha_myisam::rnd_next():
    counterTableScan++;

in the function ha_myisam::index_next():
    counterIndexScan++;


in the function ha_myisam::index_end():
  fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\n\n", table_share->table_name.str, counterTableScan, counterIndexScan);

in the function ha_myisam::rnd_end():
  fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\n\n", table_share->table_name.str, counterTableScan, counterIndexScan);
Add these lines, compile your server, start it and you are ready for the tests.

As you can see, the counters are set to zero, incremented and finally written to stderr (you will find the output in the error-log, which in my installation is the file /var/log/mysql/error.log).
table-scan: the initialisation-function is rnd_init(). Every time a row is fetched the function rnd_next() is called. Finally the scan ends with a call to rnd_end().
index-scan: things are not so easy here because there are a lot more functions involved. But our case is simple, so we can live with these functions: index_init() for initialization, index_next() for fetching one entry and index_end() for termination.
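For orientation, here is a sketch of how the three table-scan functions look with the counters in place. Only the added lines are shown in full; the original bodies differ between server versions and are abbreviated as comments, so treat this as a sketch rather than a literal patch:

  int ha_myisam::rnd_init(bool scan)
  {
    counterTableScan = counterIndexScan = 0;   /* added: reset both counters */
    /* ... original body: prepare the full table-scan ... */
  }

  int ha_myisam::rnd_next(uchar *buf)
  {
    counterTableScan++;                        /* added: one more row fetched */
    /* ... original body: read the next row into buf ... */
  }

  int ha_myisam::rnd_end()
  {
    fprintf(stderr, "\ntable = <%s>\tcounter\ttable-scan: %d\tindex-scan: %d\n\n",
            table_share->table_name.str, counterTableScan, counterIndexScan);
    /* ... original body: clean up after the scan ... */
  }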

result

Let's do our tests. The results are the same as above, but in the log-file we find these lines:
 
table = <TestWithoutI> counter table-scan: 10000001 index-scan: 0
table = <TestWithI> counter table-scan: 0 index-scan: 10000000


So our suspicion is verified1): for the table TestWithoutI a table-scan is made because no other way for accessing the data is available. In the case of the table TestWithI two ways are possible: use a full table-scan or go via the index Hersteller. And in this case the way via the index is chosen.


optimizer

In the handling of a SELECT-statement the optimizer is the part responsible for finding a good way for accessing the data. So let's step away from any further testing or playing with code and look at the sources.
It would help if we could find an introductory text about the optimizer used in MySQL/MariaDB. Unfortunately I didn't find much except Sasha's book (linked here and here). You will find a short introduction to this topic on page 170 of that book. The book was published in 2007, so keep in mind that the code in MySQL/MariaDB has evolved in the meantime, but the book is still helpful (and not only for this topic).

The optimizer works by finding all possible paths to answer the query2). For every possible path it calculates a number which we can interpret as the 'cost' of this path, and finally it compares these numbers and chooses the path with the lowest 'cost'. Don't take the 'cost' literally, it's just a number representing the time and resources needed for handling the query: the more time and resources a path needs, the higher its number has to be. The numbers have no other meaning.
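To make that principle concrete, here is a tiny, purely hypothetical sketch (the type and the cost numbers are made up for illustration; the real optimizer code in sql/sql_select.cc is far more involved): compute one cost number per candidate access path and keep the cheapest one.

  #include <cstdio>

  struct AccessPath            /* hypothetical type, for illustration only */
  {
    const char *name;
    double      cost;          /* e.g. the result of scan_time() or keyread_time() */
  };

  int main()
  {
    AccessPath paths[] = { { "table-scan", 250000.0 },   /* made-up cost numbers */
                           { "index-scan", 270000.0 } };
    const AccessPath *best = &paths[0];
    for (unsigned i = 1; i < sizeof(paths) / sizeof(paths[0]); i++)
      if (paths[i].cost < best->cost)
        best = &paths[i];      /* keep the path with the lowest cost so far */
    printf("chosen: %s (cost %.1f)\n", best->name, best->cost);
    return 0;
  }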

In the file handler.h you will find these functions for calculating 'costs':
   virtual double scan_time()
  { return ulonglong2double(stats.data_file_length) / IO_SIZE + 2; }

  /**
     The cost of reading a set of ranges from the table using an index
     to access it.
     
     @param index  The index number.
     @param ranges The number of ranges to be read.
     @param rows   Total number of rows to be read.
     
     This method can be used to calculate the total cost of scanning a table
     using an index by calling it using read_time(index, 1, table_size).
  */
  virtual double read_time(uint index, uint ranges, ha_rows rows)
  { return rows2double(ranges+rows); }

  /**
    Calculate cost of 'keyread' scan for given index and number of records.

     @param index    index to read
     @param ranges   #of ranges to read
     @param rows     #of records to read
  */
  virtual double keyread_time(uint index, uint ranges, ha_rows rows);
and in handler.cc:
double handler::keyread_time(uint index, uint ranges, ha_rows rows)
{
  /*
    It is assumed that we will read trough the whole key range and that all
    key blocks are half full (normally things are much better). It is also
    assumed that each time we read the next key from the index, the handler
    performs a random seek, thus the cost is proportional to the number of
    blocks read. This model does not take into account clustered indexes -
    engines that support that (e.g. InnoDB) may want to overwrite this method.
    The model counts in the time to read index entries from cache.
  */
  ulong len= table->key_info[index].key_length + ref_length;
  if (index == table->s->primary_key && table->file->primary_key_is_clustered())
    len= table->s->stored_rec_length;
  double keys_per_block= (stats.block_size/2.0/len+1);
  return (rows + keys_per_block-1)/ keys_per_block +
         len*rows/(stats.block_size+1)/TIME_FOR_COMPARE ;
}

These are the basic functions valid for all storage-engines. If the writer of a new storage-engine has a different opinion then they must override these functions in the derived class.
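"Override" here simply means the usual C++ overriding of virtual functions. A minimal self-contained sketch of the idea, with hypothetical class names (this is not the real handler hierarchy):

  #include <cstdio>

  struct FakeHandler                        /* stand-in for the handler base class */
  {
    virtual double scan_time() { return 1000.0; }  /* generic default formula */
    virtual ~FakeHandler() {}
  };

  struct FakeEngineHandler : public FakeHandler    /* stand-in for an engine's ha_xxx class */
  {
    virtual double scan_time() { return 42.0; }    /* the engine knows better and overrides */
  };

  int main()
  {
    FakeEngineHandler h;
    FakeHandler *base = &h;
    printf("cost used by the optimizer: %.1f\n", base->scan_time());  /* prints 42.0 */
    return 0;
  }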

Back to MyISAM: let me describe these functions as they were used in my test-case:
read_time(): in my tests this function was never called so I omit this one here.
scan_time(): estimates the cost of doing a full table-scan over the whole datafile. In my tests the size of the datafile is about 1GB and IO_SIZE is defined as 4K, so the result was the value 251466.84375. Take it as it is.
keyread_time(): does some computations based on the structure of the MYI-file3). The variable len is calculated as 13 ( = 6 + 7), a leaf of the BTREE-index has a size of 1K, so the variable keys_per_block gets the value 40.384615384615387 (meaning a half-filled block holds about 40 entries). In the next step two values are computed, added and returned. The first value estimates the number of blocks in the index by dividing the number of rows in the table by the number of entries in one block; this gives a value of 247620.02285714282. The second value is the length of all entries of the column Hersteller (= 10 million * 13) divided by the size of an index-block and then divided by 5 (the constant TIME_FOR_COMPARE): the result was 25365. Add these two numbers and you get a return value of approx. 272985. Take these numbers as given and do not forget that they are valid only for my test-case.
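To make these numbers reproducible, here is a small standalone program that redoes both calculations with the formulas from handler.h and handler.cc shown above. The inputs are my reconstruction of the test-case and therefore assumptions: a datafile of 1,030,000,000 bytes (working backward from the reported scan_time() value; it also matches 10 million rows of 103 bytes each), IO_SIZE = 4096, an index block size of 1024 bytes, len = 13 and TIME_FOR_COMPARE = 5. Treat it as a sketch of the arithmetic, not as server code:

  #include <cstdio>

  int main()
  {
    /* assumed inputs, reconstructed from the test-case described above */
    double data_file_length = 1030000000.0;  /* ~1 GB datafile (10 million rows * 103 bytes) */
    double IO_SIZE          = 4096.0;
    double rows             = 10000000.0;    /* rows to read via the index */
    double len              = 13.0;          /* key_length (6) + ref_length (7) */
    double block_size       = 1024.0;        /* size of one index block */
    double TIME_FOR_COMPARE = 5.0;

    /* handler::scan_time(): data_file_length / IO_SIZE + 2 */
    double scan_cost = data_file_length / IO_SIZE + 2;

    /* handler::keyread_time(): number of half-filled blocks to read plus compare time */
    double keys_per_block = block_size / 2.0 / len + 1;
    double keyread_cost   = (rows + keys_per_block - 1) / keys_per_block +
                            len * rows / (block_size + 1) / TIME_FOR_COMPARE;

    printf("scan_time    ~ %.5f\n", scan_cost);     /* prints ~ 251466.84375 */
    printf("keyread_time ~ %.5f\n", keyread_cost);  /* prints ~ 272985.9 */
    return 0;
  }

The keyread_time() result comes out at roughly 272985.9, a little above the approx. 272985 quoted above because there the second term was rounded down to 25365 before adding; the conclusion is not affected.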

As you can see: 272985 > 251466 so the 'cost' of an index-scan is estimated to be higher than the 'cost' of a table-scan.

So why does the server choose an index-scan?

overwriting the result

As Sasha described in his book, the server breaks a statement into pieces called JOINs. In my test-case a simple statement was used, so there is only one JOIN (a degenerate one). You may look into the file sql/sql_select.cc to see how the server handles this statement. There is a structure for handling this: JOIN_TAB (in sql/sql_select.h). This struct contains a member-variable read_first_record which is initially set to the address of the function join_init_read_record(). This value is overwritten later in the function make_join_readinfo() by these lines:
       tab->index=find_shortest_key(table, & table->covering_keys);  
       tab->read_first_record= join_read_first;                          // overwrites: join_init_read_record
       /* Read with index_first / index_next */
       tab->type= tab->type == JT_ALL ? JT_NEXT : JT_HASH_NEXT;
The comment is not in the original source; I added it.

The old value of the member-variable was the address of the function join_init_read_record(), a function that starts a full table-scan. The new value is the address of the function join_read_first(), which does the same thing but for an index-scan.
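To illustrate the mechanism (and only the mechanism; the names prefixed with fake_ are hypothetical and just mimic the structure in sql_select.cc): read_first_record is a plain function pointer inside JOIN_TAB, so choosing the access method simply means storing a different function address in that member.

  #include <cstdio>

  struct FakeJoinTab                       /* stand-in for JOIN_TAB, illustration only */
  {
    int (*read_first_record)(FakeJoinTab *tab);
  };

  static int fake_join_init_read_record(FakeJoinTab *)  /* mimics join_init_read_record() */
  {
    printf("start a full table-scan\n");
    return 0;
  }

  static int fake_join_read_first(FakeJoinTab *)        /* mimics join_read_first() */
  {
    printf("start an index-scan\n");
    return 0;
  }

  int main()
  {
    FakeJoinTab tab;
    tab.read_first_record = fake_join_init_read_record;  /* the default choice */
    tab.read_first_record = fake_join_read_first;        /* overwritten, as in make_join_readinfo() */
    return tab.read_first_record(&tab);                  /* prints: start an index-scan */
  }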

If you look in the code a little above the place where the member-variable read_first_record is overwritten, you will see this condition in an if()-construct:
else if ( !table->covering_keys.is_clear_all() &&

covering_keys is a bitmap of the indexes that contain all the columns the query needs. To simplify: if this bitmap is not empty then an index with usable information exists, so the optimizer decides to use it. And in this test-case an index on the column Hersteller exists, and this column is also the one referenced in the WHERE-clause.

So the decision is: if there is an index and it contains usable information then prefer the index over a table-scan.

To verify this you can simply change the WHERE-clause of the SQL-statement so that it uses a different column for the comparison, like: WHERE PZN <> 'ABCDE'
You will see that the optimizer chooses a table-scan.
 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where PZN <> 'ABCDE';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (12.69 sec)

MariaDB [TestCol]> explain select SQL_NO_CACHE count(*) from TestWithI where PZN <> 'ABCDE'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: TestWithI
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10000000
        Extra: Using where
1 row in set (0.01 sec)

MariaDB [TestCol]>


judgment

Is the decision to prefer an index-scan over a table-scan correct? I think so.

Let me show you this:
 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller = 'ABCDE';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller = 'ABCDE';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (8.65 sec)

MariaDB [TestCol]>
This example clearly shows the effect of using an index. The difference between the example in the intro of this post and this one is the number of records inspected: the more entries of the index-tree the software has to fetch, the more time it needs for the work.

Let me give you another example:
 MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithI where Hersteller = '00020'; 
+----------+
| count(*) |
+----------+
|     1903 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestCol]> select SQL_NO_CACHE count(*) from TestWithoutI where Hersteller = '00020';
+----------+
| count(*) |
+----------+
|     1903 |
+----------+
1 row in set (7.90 sec)

MariaDB [TestCol]> 

Nearly 2000 records (out of 10 million) are found, and this was done fast by using the index. In the test for equality above, no record at all was found in the tree. Our original query, which tests for inequality with a non-existing value and therefore touches every record, is an extreme case; it is not useful as a counter-example because such a query should happen only infrequently.

criticisms

By playing with the code and my examples I found that some functions were called multiple times with identical parameters. This is unnecessary.

scan_time(): this function was called three times (SELECT from TestWithI) or twice (SELECT from TestWithoutI), and in all cases it did the same calculations and returned the same value. To be fair: the computations were simple.

keyread_time(): this function was only called in the case of the SELECT from TestWithI. It has a parameter ranges that is not used in the computation but that contained different values in the calls. This function was called four times, returning the same value in each call.

a personal word: this may look like harsh criticism. But these things happen; nevertheless it is unnecessary and should be cleaned up someday.
If you have a chance to look at code I wrote I'm sure you can find similar effects. Please be lenient toward me.


some notes:
1) in case you've noticed a difference of 1 in these numbers: I forgot to count the call of index_first(). A bug; I didn't correct it.
2) this is over-simplified. In his book Sasha describes why this approach is unrealistic and what the optimizer really does. For our case, though, the description is valid.
3) you will find more information about the structure of the MYI-file here: 20.2 The .MYI File