Monday 4 August 2014

columns: some remarks

This is my fifth post on this topic: storing data in a column-oriented form. You can find the other posts here: columns, columns (2), columns (3) and columns (4).

I started these posts with the idea that storing the data in column-oriented form would speed up database-operations, especially queries. And I tried to implement such a structure within MariaDB/MySQL.

After playing with this code and thinking about improving the algorithms for a while I'm not so sure anymore. Yes, this approach is faster, but only in some cases, and these seem to be specialized cases.

descriptions

In the first post on this topic I gave a short explanation of storing data row-oriented or column-oriented in a file. I don't want to repeat that text, so let me describe here in a few words the approaches I used for my experiments: MYCOLS1, MYCOLS2 and MYCOLS3. I tried some more algorithms which I don't want to publish (and naturally other people have their own algorithms too).

first shot

In the first post some (silly) code was presented to show which functions were of interest for our topic.

MYCOLS1

The algo in MYCOLS1 was primitive: when a row was written, the data of one column was also written to an extra file in sequential form. Querying the data was done by scanning through this extra file if the column was found in a condition in the WHERE-clause. If a match was found in the extra file, the corresponding row was read and given back to the database-server to do the rest of the filtering on this row in the usual way. If a more complex WHERE-condition existed, this algo could not check all parts of the condition, because other columns could be involved in the query but were not handled by this algo.1)
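The MYCOLS1 idea can be sketched in a few lines (this is a hypothetical illustration, not the engine's code; the names `scan_column` and `fetch_row` and the toy data are invented here):

```python
# Sketch of MYCOLS1: one extra file per column, scanned sequentially;
# on a match the full row is fetched so the server can finish filtering.

def scan_column(column_values, predicate, fetch_row):
    """Scan the per-column data; yield the full row for each match."""
    for rownr, value in enumerate(column_values):
        if predicate(value):            # check only this one column
            yield fetch_row(rownr)      # read the full row (the MYD-file)

# toy data standing in for the extra file of one column and the row file
prices = [10, 25, 10, 40]
rows = [("a", 10), ("b", 25), ("c", 10), ("d", 40)]

matches = list(scan_column(prices, lambda v: v == 10, lambda n: rows[n]))
# matches == [("a", 10), ("c", 10)]
```

Note that the scan touches only the (small) column file; the (wide) rows are read only for the two matches.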

MYCOLS2

In MYCOLS2 the extra storage was rearranged into a dictionary-like approach. The storage of the column-data was split into a dictionary-part and a reference-part. The dictionary (the .key-file) contained the unique entries of this column (think of the index of a book); the .data-file contained the references to the rows (again think of the index of a book: under an entry you find the page-numbers on which this keyword can be found. Or you may think of a cross-reference-listing, where all variables of a program are listed together with the line-numbers of their occurrences in the source-file).
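In memory the MYCOLS2 layout boils down to a mapping from each unique value to the positions of the rows containing it (again a hypothetical sketch; the function name and sample data are invented):

```python
# Sketch of the MYCOLS2 split: unique values (the .key-file) each point
# to the list of row positions where they occur (the .data-file) --
# like the index of a book with its page-numbers.

def build_dictionary(column_values):
    dictionary = {}                       # unique value -> row positions
    for rownr, value in enumerate(column_values):
        dictionary.setdefault(value, []).append(rownr)
    return dictionary

cities = ["Berlin", "Hamburg", "Berlin", "Munich", "Berlin"]
d = build_dictionary(cities)
# d["Berlin"] == [0, 2, 4]: one dictionary entry, three row references
```

Five stored values shrink to three dictionary entries; the more repetition a column has, the more this layout saves.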

MYCOLS3

In the examples above each query was handled by a table-scan. In MYCOLS3 this changed to an index: MYCOLS3 used the same layout of the column-data as MYCOLS2, but in a query the additional data was treated like an index.
Microsoft SQL Server 2012 has a feature called Column Store Index; maybe it's a similar approach.
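Treating the extra data as an index means the key-entries no longer have to be scanned in full: if they are kept sorted, a single key can be located by binary search. A minimal sketch, assuming a sorted .key-file and parallel .data-file (both invented here):

```python
# Sketch of MYCOLS3: same layout as MYCOLS2, but the dictionary entries
# are kept sorted so a key can be found by binary search, like an index.
import bisect

keys = ["Berlin", "Hamburg", "Munich"]   # sorted .key-file entries
row_refs = [[0, 2, 4], [1], [3]]         # matching .data-file entries

def index_lookup(key):
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return row_refs[pos]             # row numbers to fetch
    return []

# index_lookup("Munich") == [3]
```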


The algorithms presented use a lot of the code of the underlying MyISAM-engine. This code could be removed, but for this demonstration this was not done. Removing the additional MyISAM-code could speed up the read-operations a bit.

All the algos presented here use a hybrid storage: the data was stored in a MYD-file plus in one or more additional files, similar to the example of the book with the additional index or the cross-reference-listing. In handling a query the extra file was scanned, and finally, when a condition was fulfilled, the row was read and given to the database-server for further handling. So all algos presented here were modifications of a row-based approach; they did not overcome the concept of row-based storage.

Further improvements of the algos are possible. With the current implementation in MariaDB/MySQL the optimizer knows nothing about the data-structure of our 'index' (in MYCOLS3). If you look at the GROUP BY-example you see that the data-structure of MYCOLS3 fits the query. The only thing needed is a call to fetch one key from the storage-engine together with the number of occurrences of this key in the data; this can easily be done with the data available in the 'index'. But this means extending the API of the storage-engine and rewriting the optimizer-code to take advantage of this new function.2)
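The GROUP BY case can be made concrete: with the dictionary layout, the count per key is just the length of each entry's reference list, so no row has to be read at all. The imagined engine-call would return (key, count) pairs like this (a hypothetical sketch, not an existing API):

```python
# Sketch of the imagined "fetch key plus number of occurrences" call:
# with the MYCOLS2/MYCOLS3 layout, COUNT(*) per key is simply the
# length of the entry's row-reference list.

dictionary = {"Berlin": [0, 2, 4], "Hamburg": [1], "Munich": [3]}

def keys_with_counts(dictionary):
    """Roughly: SELECT city, COUNT(*) FROM t GROUP BY city."""
    return sorted((key, len(rows)) for key, rows in dictionary.items())

# keys_with_counts(dictionary) ==
#   [("Berlin", 3), ("Hamburg", 1), ("Munich", 1)]
```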



from a different angle

Let's look at the same problem from a different angle: data is stored on one or more harddisks. The steps involved in answering a query contain some steps for reading data from the harddisk. Harddisks are fast but have always been slow compared to RAM or CPUs. Harddisks need some milliseconds to deliver data; current CPUs are able to execute some billion assembly-instructions per second, so in a millisecond (= 0.001 second) the CPU is able to execute some 10 million assembly-instructions (and harddisks do need more than one millisecond to deliver the data). We can speed things up by adding a cache, an additional layer between RAM and harddisk, using some RAM to keep parts of the data in memory based on statistical assumptions. This can speed things up, but the whole data does not fit into the cache, so it can only help sometimes.

Let me describe the disk-operations: one can distinguish between two general types of disk-operations: sequential and random. To simplify the argumentation I will only describe reads.

sequential read:
  • operating system tells the disk which part to read
  • move disk-head to the correct track (cylinder)
  • wait until the sector arrives under the head
  • start reading (simplified)
=> from this moment on the disk can deliver a lot of data (until the full track is read and we have to move the head again). So multiple records can be read one after the other.

random read:
  • operating system tells the disk which part to read
  • move disk-head to the correct track (cylinder)
  • wait until the sector arrives under the head
  • read the sector(s)
=> this looks similar to the description given above, but do not forget: we read only one or some sector(s), and only one record will be given back to the OS. Usually the next record is at a different position on the disk, and therefore the whole process starts again. So we spend some milliseconds waiting for the disk-head to move to the next position.
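The two access patterns can be put into a toy cost model (all numbers here are assumptions for illustration, except the 130 MB/sec transfer rate measured below): a random read pays the seek and rotational delay for every record, a sequential scan pays it essentially once.

```python
# Toy cost model for the two access patterns: assumed 8 ms average
# seek + rotational delay, 130 MB/s sequential transfer rate.

SEEK_MS = 8.0            # assumption, not a measurement
TRANSFER_MB_PER_S = 130  # sequential rate of the harddisk

def sequential_ms(total_mb):
    """One seek, then pure transfer."""
    return SEEK_MS + total_mb / TRANSFER_MB_PER_S * 1000.0

def random_ms(records, record_kb=1):
    """One seek (plus a tiny transfer) per record."""
    per_record = SEEK_MS + (record_kb / 1024.0) / TRANSFER_MB_PER_S * 1000.0
    return records * per_record

# scanning 1000 MB sequentially: one seek, then ~7.7 s of transfer
# reading 100000 records randomly: the 100000 seeks dominate (~800 s)
```

Under these assumptions the random path only wins when it touches a small fraction of the table, which is exactly the trade-off the optimizer has to judge.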

The optimizer chooses the path to read the data, via a table-scan (=sequential read) or via an index (=random read), if an index is available.3)

Observations here on my PC show: sequential reads are fast:
  • the harddisk on my PC is able to read up to 130 MB/sec
  • the SSD is able to read approx. 400 MB/sec

The general idea for speeding up the query is to reduce the amount of data read from the disk. This is the reason why an additional organization of the data (=index) is helpful. Reducing the amount of data read is a must when tables grow bigger. With the numbers above, reading a single table sequentially takes about 7 to 8 seconds for each GB of data, plus the necessary time for further handling of the data (filtering, joining, grouping, sorting etc.). This can add up to minutes if you think of SQL-statements involving 20 (or more) tables, and this time is unacceptable.
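A quick back-of-the-envelope check of these numbers (the transfer rate is the one measured above; the 20-table scenario is the hypothetical one from the text):

```python
# Full-scan time for n GB at a sequential rate of r MB/s.

def scan_seconds(gigabytes, mb_per_s):
    return gigabytes * 1000.0 / mb_per_s

# one 1 GB table from the harddisk: ~7.7 s
# a statement touching 20 such tables: ~154 s, i.e. already minutes
```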
So we have to organize the data anyway, and we can try to find a layout of the data that speeds up querying (and writing and updating and deleting and checking for a condition and ...).

For approx. 30 years we have lived with relational databases and have learned to work with them. We know about tables and records and we think this way. Therefore the data comes into a system in the form of rows and is handled by our software this way. Our concepts are based on rows (or sets of rows). Maybe we need a different model to handle columnar storage of the data. Unfortunately I can't show you a better way.



some notes:
1) this is valid for all algos presented here.
2) To be fair: adding such a function would also speed up the operation with the existing indexes, because it can easily be implemented using the conventional index-structure (conventional = BTREE = as implemented in MyISAM).
3) In one of my tests the optimizer chose the wrong path: a sequential read took approx. 10 secs whereas a random read took approx. 17 secs, and it chose the index when it was available (the test was made with a table of type MyISAM, 10 mio. records in the table, about 1GB in size). Maybe this is a topic for another post here.