Thursday 28 August 2014

encryption (3)

this is the third part of my posting about encrypting data in MariaDB/MySQL. You will find the other parts here: first post and second post.

missing

In the last post I described an approach using a storage engine for encrypting the data. But the examples showed tables consisting of columns with types CHAR and VARCHAR. Other types are possible but were omitted, so I decided to implement, test and demonstrate these types too.

So I will present here a test-case where the table contains columns of these types: INTEGER, DECIMAL, NUMERIC, DATETIME, TIMESTAMP. For handling these types I had to modify my code a bit so I updated the archive containing the code. You will still find the archive here.

why MyISAM

I used MyISAM as the basis of my code because by inheritance I got a relative feature-rich engine in a fast way. And when you write your own storage engine you have to deal with MyISAM anyway because the data of a record is given from the server to your engine in a buffer in which ist is formatted in MyISAM-style.And you'll have to give the data back from the engine to the server in this format.

code

As I have to modify the code I'll take the code from the second post as the basis for my decription here.

In the file ha_mintempl.h please add these lines at the proper place:
 private:
  void     do_encrypt ( uchar *, int);
  void     do_decrypt ( uchar *, int);

In the file ha_mintempl.cc please add these functions at some place:
void ha_mintempl::do_encrypt ( uchar *buf, int len)
{
 // do an XOR with the char 0x55
 for ( int i = 0; i < len; i++)
 {
  buf[i] ^= 0x55;
 }
 return ;
}

void ha_mintempl::do_decrypt ( uchar *buf, int len)
{
 // do an XOR with the char 0x55
 for ( int i = 0; i < len; i++)
 {
  buf[i] ^= 0x55;
 }
 return ;
}

You also have to replace these functions with this new code:
void ha_mintempl::encrypt_full_row( void )
{
    my_bitmap_map *old_sets[2];

    dbug_tmp_use_all_columns(table, old_sets, table->read_set, table->write_set);
    for ( Field **ptr2Col = (Field**) table->field; *ptr2Col ; ptr2Col++)
    {
        if ( !((*ptr2Col)->is_null()) )
        {
            if ( (*ptr2Col)->type() != MYSQL_TYPE_VARCHAR )
            {
                do_encrypt ( (*ptr2Col)->ptr, (*ptr2Col)-> pack_length_in_rec());
            }
            else
            {
                String oldData; //  = new String();
                (*ptr2Col)->val_str(&oldData, &oldData);           // first argument is marked as:  __attribute__((unused))
                String * newData = do_encrypt( &oldData );
                (*ptr2Col)->store( newData->ptr(), newData->length(), system_charset_info);
            }
        }
    }
    dbug_tmp_restore_column_maps(table->read_set, table->write_set, old_sets);
    return;
}
void ha_mintempl::decrypt_full_row( void )
{
    my_bitmap_map *old_sets[2];

    dbug_tmp_use_all_columns(table, old_sets, table->read_set, table->write_set);
    for ( Field **ptr2Col = (Field**) table->field; *ptr2Col ; ptr2Col++)
    {
 if ( !((*ptr2Col)->is_null()) )
 {
            if ( (*ptr2Col)->type() != MYSQL_TYPE_VARCHAR )
     {
  do_decrypt ( (*ptr2Col)->ptr, (*ptr2Col)-> pack_length_in_rec());
     }
     else
     {
                String oldData; //  = new String();
  (*ptr2Col)->val_str(&oldData, &oldData);           // first argument is marked as:  __attribute__((unused))
  String * newData = do_decrypt( &oldData );
  (*ptr2Col)->store( newData->ptr(), newData->length(), system_charset_info);
     }
 }
    }
    dbug_tmp_restore_column_maps(table->read_set, table->write_set, old_sets);
    return;
}

As you can see a type of VARCHAR is handled as before only for other types a pointer to the internal representation in the buffer is read and the len (in internal representation of the data). With these information the data is encrypted directly in the buffer, no conversion between types is needed because this encryption will result in errors: think of an integer, read from the buffer in string-format, which is "encrypted" in our way and then given back via a store()-function. By checking this data an error occurs.......
Handling a VARCHAR is a bit more difficult because you shouldn't modify the len-byte(s), so this is handled the old-way.

testing

create: let's create tables with some more column-types: TestNormalCols is a table of the standard MyISAM-type, used for comparison, TestEncCols shall contain the same data but in our encrypted form:
 MariaDB [TestEnc]> create table TestNormalCols (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  integer,
    ->                 EVP  decimal(6, 2),
    ->                 HAP  numeric(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5),
    ->                 InsertedOn datetime,
    ->                 InsertedTStamp timestamp )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestEnc]> desc TestNormalCols;
+----------------+--------------+------+-----+-------------------+-----------------------------+
| Field          | Type         | Null | Key | Default           | Extra                       |
+----------------+--------------+------+-----+-------------------+-----------------------------+
| Id             | char(8)      | YES  |     | NULL              |                             |
| PZN            | int(11)      | YES  |     | NULL              |                             |
| EVP            | decimal(6,2) | YES  |     | NULL              |                             |
| HAP            | decimal(7,2) | YES  |     | NULL              |                             |
| ArtikelBez     | varchar(40)  | YES  |     | NULL              |                             |
| ArtikelText    | varchar(26)  | YES  |     | NULL              |                             |
| Hersteller     | char(5)      | YES  |     | NULL              |                             |
| InsertedOn     | datetime     | YES  |     | NULL              |                             |
| InsertedTStamp | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+----------------+--------------+------+-----+-------------------+-----------------------------+
9 rows in set (0.01 sec)

MariaDB [TestEnc]> create table TestEncCols (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  integer,
    ->                 EVP  decimal(6, 2),
    ->                 HAP  numeric(7,2),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5),
    ->                 InsertedOn datetime,
    ->                 InsertedTStamp timestamp )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestEnc]> 

INSERT and SELECT: next we will insert one record in both tables and check what's in the table:
 MariaDB [TestEnc]> insert into TestNormalCols values ( '01000000', 999999, 132.36, 0.00, 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677', '2014-08-26 14:59:00', current_timestamp);
Query OK, 1 row affected (0.01 sec)

MariaDB [TestEnc]> insert into TestEncCols values ( '01000000', 999999, 132.36, 0.00, 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677', '2014-08-26 14:59:00', current_timestamp);
Query OK, 1 row affected (0.00 sec)

MariaDB [TestEnc]> select * from TestNormalCols;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller | InsertedOn          | InsertedTStamp      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      | 2014-08-26 14:59:00 | 2014-08-26 15:45:21 |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEncCols;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller | InsertedOn          | InsertedTStamp      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      | 2014-08-26 14:59:00 | 2014-08-26 15:45:27 |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+---------------------+---------------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> 

Looks identical except for the timestamp, but let's dive into the hex-values.

into hex

Our single records looked identical when selected from both tables. To verify that the values are indeed encrypted in the data-file we have to look into the files TestNormalCols.MYD and TestEncCols.MYD.
For understanding the contents of the MYD-files we need some explanations which you can find here od and here 20.1.2 Physical Attributes of Columns.
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormalCols.MYD
000000 03 00 64 00                              block-header
       00 00                                    flags
       30 31 30 30 30 30 30 30                  Id
       3f 42 0f 00                              PZN = 0x0f423f = 999999
       80 84 24                                 EVP = flag + 132 (=0x84) + 36 (=0x24) 
       80 00 00 00                              HAP = flag + 0 + 0
       22 55 52 4f 53 41 46 45                  ArtikelBez
       20 42 45 49 4e 42 45 55 
       54 45 4c 20 37 37 33 32 
       43 20 20 20 20 20 20 31 
       20 20 4c 
       1a 53 50 52 49 4e 47 20                  ArtikelText
       44 45 20 4c 55 20 41 4d 
       20 4b 31 20 4e 4f 20 54 
       4f 50 35 
       32 35 36 37 37                           Hersteller
       6c b4 c9 66 51 12 00 00                  InsertedOn = 0x125166c9b46c = 20140826145900 = YYYYMMDDHHMMSS
       71 8f fc 53                              InsertedTStamp = 0x53fc8f71 = 1409060721 = Tue, 26 Aug 2014 13:45:21 GMT
                                                               = no. of seconds since 1970 (Unix-like)

august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEncCols.MYD
000000 03 00 64 00                              block-header 
       00 00                                    flags
       65 64 65 65 65 65 65 65                  Id
       6a 17 5a 55                              PZN
       d5 d1 71                                 EVP
       d5 55 55 55                              HAP
       22 00 07 1a 06 14 13 10                  ArtikelBez                  
       75 17 10 1c 1b 17 10 00 
       01 10 19 75 62 62 66 67 
       16 75 75 75 75 75 75 64 
       75 75 19 
       1a 06 05 07 1c 1b 12 75                  ArtikelText
       11 10 75 19 00 75 14 18 
       75 1e 64 75 1b 1a 75 01
       1a 05 60 
       67 60 63 62 62 39                        Hersteller
       e1 9c 33 04 47 55 55                     InsertedOn
       22 da a9 06                              InsertedTStamp
000068
august@AMD4:~/MariaDB/data/TestEnc$ 

verification

For the table TestEncCols the data looks encrypted, but let me test this for 2 columns.

PZN: the type of this column is INTEGER and these are stored as 4-byte entries, so let's take the bytes and decrypt them:
PZN:   6a 17 5a 55
       6a:     6a XOR 55 = 3f
       17:     17 XOR 55 = 42
       5a:     5a XOR 55 = 0f
       55:     55 XOR 55 = 00
As you can see the value from the table TestEncCols are identical after you decrypted them.

InsertedTStamp: let's look at another column, one of type TIMESTAMP:
InsertedTStamp:   22 da a9 06 
       22:     22 XOR 55 = 77
       da:     da XOR 55 = 8f
       a9:     a9 XOR 55 = fc
       06:     06 XOR 55 = 53
In the table TestNormalCols you see a value of 0x53fc8f71 in the column InsertedTStamp, in TestEncCols you see a value of 0x53fc8f77 after decryption. The difference is a value of 6, this number tells you that it took me 6 seconds to start the second test after the first one.
Converting the given hex-values into decimals and then using these numbers in this site we can convert these numbers into a human readable form. Looks like the encryption worked.

I've only used INTEGERs for verifying the encryption, you may do these steps for the other types by yourself.

final words

Looks like encrypting data is possible within MariaDB/MySQL. Naturally problems will arise during realisation because this is normal in software-projects, but I'm sure these can be solved. Just do it.

Tuesday 26 August 2014

encryption (2)

this is the second part of my posting about encrypting data in MariaDB/MySQL. You will find the first part here.

In this post I like to demonstrate an approach using a storage-engine.

MINTEMPL

Basis of my code is the engine MINTEMPL which I will extend a bit. You can download the original files from here. As this code is derived from MyISAM so my modified MINTEMPL-engine will also create the standard MYD and MYI-files.

You can copy and paste the code presented in the next lines and insert it into the proper places of the original code of MINTEMPL or use this archive, which contains all the code.

what to add: if you use the original code of MINTEMPL you need to modify 3 files of the MINTEMPL-source. Let me start with the header ha_mintempl.h. Please add these line to this file:
 public:
 int write_row(uchar * buf);
 int update_row(const uchar * old_data, uchar * new_data);
 int rnd_next( uchar *);
 int index_first(uchar *);
 int index_next(uchar *);

 private:
  void encrypt_full_row( void );
  void decrypt_full_row( void );
  String * do_encrypt( String *);
  String * do_decrypt( String *);

Next in the source-file ha_mintempl.cc please add these lines:
 int ha_mintempl::write_row(uchar * buf)
{
    DBUG_ENTER("ha_mintempl::write_row");
    my_bitmap_map *old_sets[2];

    encrypt_full_row();
    DBUG_RETURN( ha_myisam::write_row(buf) );
}

int ha_mintempl::rnd_next(uchar *buf)
{
    DBUG_ENTER("ha_mintempl::rnd_next");
    int result = ha_myisam::rnd_next(buf);

    decrypt_full_row();
    DBUG_RETURN( result );
}

int ha_mintempl::update_row(const uchar *old_data, uchar *new_data)
{
  DBUG_ENTER("ha_mintempl::update_row");
  encrypt_full_row();
  DBUG_RETURN( ha_myisam::update_row(old_data, new_data ) );
}

int ha_mintempl::index_first(uchar *buf)
{
   DBUG_ENTER("ha_mintempl::index_first");
   int retCode = ha_myisam::index_first( buf );
   decrypt_full_row();
   DBUG_RETURN( retCode );
}

int ha_mintempl::index_next(uchar *buf)
{
  DBUG_ENTER("ha_mintempl::index_next");
  int retCode = ha_myisam::index_next( buf );
  decrypt_full_row();
  DBUG_RETURN( retCode );
}


void ha_mintempl::encrypt_full_row( void )
{
    my_bitmap_map *old_sets[2];

    dbug_tmp_use_all_columns(table, old_sets, table->read_set, table->write_set);
    for ( Field **ptr2Col = (Field**) table->field; *ptr2Col ; ptr2Col++)
    {
        if ( !((*ptr2Col)->is_null()) )
        {
            String oldData; //  = new String();
            (*ptr2Col)->val_str(&oldData, &oldData);           // first argument is marked as:  __attribute__((unused))
            String * newData = do_encrypt( &oldData );
            (*ptr2Col)->store( newData->ptr(), newData->length(), system_charset_info);
       }
    }
    dbug_tmp_restore_column_maps(table->read_set, table->write_set, old_sets);
    return;
}

void ha_mintempl::decrypt_full_row( void )
{
    my_bitmap_map *old_sets[2];

    dbug_tmp_use_all_columns(table, old_sets, table->read_set, table->write_set);
    for ( Field **ptr2Col = (Field**) table->field; *ptr2Col ; ptr2Col++)
    {
        if ( !((*ptr2Col)->is_null()) )
        {
            String oldData; //  = new String();
            (*ptr2Col)->val_str(&oldData, &oldData);           // first argument is marked as:  __attribute__((unused))
            String * newData = do_decrypt( &oldData );
            (*ptr2Col)->store( newData->ptr(), newData->length(), system_charset_info);
        }
    }
    dbug_tmp_restore_column_maps(table->read_set, table->write_set, old_sets);
    return;
}
/*
 * our simple and silly 'encryptiont'-routine
 */
String *ha_mintempl::do_encrypt (String *data)
{
    // do an XOR with the char 0x55
    char *tempPtr = (char *)data->ptr();
    for ( int i = 0; i < (int) data->length(); i++)
    {
        tempPtr[i] ^= 0x55;
    }
    return data;
}

/*
 * our simple and silly 'decryptiont'-routine
 */
String *ha_mintempl::do_decrypt (String *data)
{
 // do an XOR with the char 0x55
 char *tempPtr = (char *)data->ptr();
 for ( int i = 0; i < data->length(); i++)
 {
  tempPtr[i] ^= 0x55;
 }
 return data;
}

As we will (later) play with indexes we have to link our engine statically to the database server. So please modify the file CMakeLists.txt to this content:
# for dynamical linking:
# SET(MINTEMPL_PLUGIN_DYNAMIC "ha_mintempl")
# SET(MINTEMPL_SOURCES ha_mintempl.cc ha_mintempl.h)
# MYSQL_ADD_PLUGIN(mintempl ${MINTEMPL_SOURCES} STORAGE_ENGINE MODULE_ONLY)

# for static linking:
SET(MINTEMPL_SOURCES ha_mintempl.cc ha_mintempl)
MYSQL_ADD_PLUGIN(mintempl ${MINTEMPL_SOURCES} 
  STORAGE_ENGINE 
  MANDATORY )
TARGET_LINK_LIBRARIES(mintempl)

Compile everything, start the new database server and we are ready for some tests.

With this code presented above I will encrypt the full row, not only a single column. It's easy to implement this for a single column but I want to show how to encrypt a full row. So let's start with the tests: I will create 2 tables, one unencrypted table (TestNormalFull) for comparison and one encrypted table (TestEncFull), insert a line into each table and then read from each table:
MariaDB [TestEnc]> create table TestNormalFull (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.06 sec)

MariaDB [TestEnc]> create table TestEncFull (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.05 sec)

MariaDB [TestEnc]> insert into TestNormalFull select * from TestOK.ABDAOK limit 1;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> insert into TestEncFull select * from TestOK.ABDAOK limit 1;
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> select * from TestNormalFull;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEncFull;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> 

As you can see the SQL-statements contain nothing special for encrypting or decrypting values. And the result of the 2 queries is identical. Let's look at the MYD-files to verify that the data in the table TestEncFull.MYD is indeed encrypted:
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormalFull.MYD
000000 01                              flag
       30 31 30 30 30 30 30 30         Id

000000 01                              flag
       30 31 30 30 30 30 30 30         Id
       39 39 39 39 39 39 20            PZN      
000010 31 33 32 2e 33 36 20 20         EVP
       30 2e 30 30 20 20 20 20         HAP
000020 55 52 4f 53 41 46 45 20         ArtikelBez
       42 45 49 4e 42 45 55 54
000030 45 4c 20 37 37 33 32 43 
       20 20 20 20 20 20 31 20
000040 20 4c 20 20 20 20 20 20 
       53 50 52 49 4e 47 20 44         ArtikelText
000050 45 20 4c 55 20 41 4d 20
       4b 31 20 4e 4f 20 54 4f
000060 50 35 
       32 35 36 37 37                  Hersteller
000067
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEncFull.MYD
000000 01                              flag
       65 64 65 65 65 65 65 65         Id
       6c 6c 6c 6c 6c 6c 20            PZN
000010 64 66 67 7b 66 63 20 20         EVP
       65 7b 65 65 20 20 20 20         HAP
000020 00 07 1a 06 14 13 10 75         ArtikelBez
       17 10 1c 1b 17 10 00 01
000030 10 19 75 62 62 66 67 16 
       75 75 75 75 75 75 64 75
000040 75 19 20 20 20 20 20 20 
       06 05 07 1c 1b 12 75 11         ArtikelText
000050 10 75 19 00 75 14 18 75 
       1e 64 75 1b 1a 75 01 1a
000060 05 60
       67 60 63 62 62                  Hersteller
000067
august@AMD4:~/MariaDB/data/TestEnc$ 

You see the data in TetEncFull.MYD everything is "encrypted". If you compare the values of the column Hersteller with the values presented in the first post you see that they are identical. The only thing not encrypted in the file TestEncFull.MYD are the space-characters added by the server but this can also be done within the code with slight modifications (untested).

dynamic: let's repeat this test, but this time with tables of dynamic type:
MariaDB [TestEnc]> create table TestNormalDyn (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.05 sec)

MariaDB [TestEnc]> create table TestEncDyn (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestEnc]> insert into TestNormalDyn select * from TestOK.ABDAOK limit 2;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> insert into TestEncDyn select * from TestOK.ABDAOK limit 2;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> select * from TestNormalDyn;
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML            | URETERSCHIENEAC4P66        | 36367      |
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
2 rows in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEncDyn;
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML            | URETERSCHIENEAC4P66        | 36367      |
+----------+--------+--------+-------+------------------------------------+----------------------------+------------+
2 rows in set (0.00 sec)

MariaDB [TestEnc]> 

And here is how the MYD-files look in hex:
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormalDyn.MYD
000000 03 00 60 00                     block-header for record #1
       0c 80               
       30 31 30 30 30 30 30 30         Id
       39 39 39 39 39 39 20            PZN
       06 31 33 32 2e 33 36            EVP
       04 30 2e 30 30                  HAP
       22 55 52 4f 53 41 46 45         ArtikelBez (len + text)
       20 42 45 49 4e 42 45 55 
       54 45 4c 20 37 37 33 32 
       43 20 20 20 20 20 20 31 
       20 20 4c 
       1a 53 50 52 49 4e 47 20        ArtikelText (len + text)
       44 45 20 4c 55 20 41 4d 
       20 4b 31 20 4e 4f 20 54 
       4f 50 35 
       32 35 36 37 37                 Hersteller

       01 00 4d                       block-header for record #2
       0c 80                 
       30 31 30 30 30 30 30 31        Id
       35 33 35 38 31 38 20           PZN
       04 30 2e 30 30                 EVP
       05 31 38 2e 39 31              HAP
       17 50 4b 20 46 55 45 52        ArtikelBez (len + text)
       20 54 49 45 52 45 20 20 
       20 20 32 35 30 20 4d 4c
       13 55 52 45 54 45 52 53        ArtikelText (len + text)
       43 48 49 45 4e 45 41 43 
       34 50 36 36 
       33 36 33 36 37                 Hersteller
0000b4

august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEncDyn.MYD
000000 03 00 60 00                    block-header for record #1
       0c 80              
       65 64 65 65 65 65 65 65        Id
       6c 6c 6c 6c 6c 6c 20           PZN
       06 64 66 67 7b 66 63           EVP (len + text)
       04 65 7b 65 65                 HA (len + text)
       22 00 07 1a 06 14 13 10        ArtikelBez ( len + text)
       75 17 10 1c 1b 17 10 00 
       01 10 19 75 62 62 66 67 
       16 75 75 75 75 75 75 64 
       75 75 19 
       1a 06 05 07 1c 1b 12 75        ArtikelText (len + text)
       11 10 75 19 00 75 14 18 
       75 1e 64 75 1b 1a 75 01 
       1a 05 60 
       67 60 63 62 62                 Hersteller

       01 00 4d                       block-header for record #2
       0c 80                 
       65 64 65 65 65 65 65 64        Id
       60 66 60 6d 64 6d 20           PZN
       04 65 7b 65 65                 EVP (len + text)
       05 64 6d 7b 6c 64              HAP (len + text)
       17 05 1e 75 13 00 10 07        ArtikelBez (len + text)
       75 01 1c 10 07 10 75 75 
       75 75 67 60 65 75 18 19 
       13 00 07 10 01 10 07 06        ArtikelText (len + text)
       16 1d 1c 10 1b 10 14 16 
       61 05 63 63 
       66 63 66 63 62                 Hersteller
0000b4
august@AMD4:~/MariaDB/data/TestEnc$ 

UPDATE: we did an INSERT and a SELECT, but what about UPDATE? This is a bit more complicated in the case of a dynamic table but only internally. So let's test this and especially make one column a bit longer:
MariaDB [TestEnc]> update TestNormalDyn set ArtikelBez = concat(ArtikelBez, '-', 'ABCDEFGHIJK');
Query OK, 2 rows affected, 1 warning (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 1

MariaDB [TestEnc]> update TestEncDyn set ArtikelBez = concat(ArtikelBez, '-', 'ABCDEFGHIJK');
Query OK, 2 rows affected, 1 warning (0.00 sec)
Rows matched: 2  Changed: 2  Warnings: 1

MariaDB [TestEnc]> select * from TestNormalDyn;
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                               | ArtikelText                | Hersteller |
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L-ABCDE | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML-ABCDEFGHIJK      | URETERSCHIENEAC4P66        | 36367      |
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
2 rows in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEncDyn;
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                               | ArtikelText                | Hersteller |
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L-ABCDE | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML-ABCDEFGHIJK      | URETERSCHIENEAC4P66        | 36367      |
+----------+--------+--------+-------+------------------------------------------+----------------------------+------------+
2 rows in set (0.00 sec)

MariaDB [TestEnc]> 

Looks OK, but let's dive into the hex-values. With the help of the post od and this information we can see:
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormalDyn.MYD
000000 05                              block-type (first part of record #1)
       00 66                           len of record
       00 57                           len of this block
       00 00 00 00 00 00 00 b4         ptr. to next block of this record
       0c 80                           flags
       30 31 30 30 30 30 30 30         Id
       39 39 39 39 39 39 20            PZN
       06 31 33 32 2e 33 36            EVP (len + text)
       04 30 2e 30 30                  HAP (len + text)
       28 55 52 4f 53 41 46 45         ArtikelBez (len + text)
       20 42 45 49 4e 42 45 55
       54 45 4c 20 37 37 33 32 
       43 20 20 20 20 20 20 31 
       20 20 4c 2d 41 42 43 44 
       45 
       1a 53 50 52 49 4e 47 20         ArtikelText (len + text))
       44 45 20 4c 55 20 41 4d 
       20                              block ends here
       
       05 00 59 00 43 00 00 00         record #2 (first part of this record)
       00 00 00 00 c8 
       0c 80                           flags
       30 31 30 30 30 30 30 31         Id
       35 33 35 38 31 38 20            PZN
       04 30 2e 30 30                  EVP (len + text)
       05 31 38 2e 39 31               HAP (len + text)
       23 50 4b 20 46 55 45 52         ArtikelBez (len + text) 
       20 54 49 45 52 45 20 20 
       20 20 32 35 30 20 4d 4c 
       2d 41 42 43 44 45 46 47 
       48 49 4a 4b 
       13 55 52                       ArtikelText (len + text)
                                      block ends here

0000b4 09                             block-type (final part of record #1)
       00 0f                          len of data
       01                             no. of filler-bytes at the end of this block
       4b 31 20 4e 4f 20 54 4f        ArtikelText continued
       50 35
       32 35 36 37 37                 Hersteller
       00                             filler-byte
   
       09 00 16 02                    final part of record #2
       45 54 45 52 53 43 48 49        ArtikelText continued
       45 4e 45 41 43 34 50 36
       36     
       33 36 33 36 37                 Hersteller
       00 00                          filler-bytes
                                      end of block, end of record

august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEncDyn.MYD
000000 05 00 66 00 57 00 00 00        record #1 (first part of this record)
       00 00 00 00 b4 
       0c 80                          flags
       65 64 65 65 65 65 65 65        Id
       6c 6c 6c 6c 6c 6c 20           PZN
       06 64 66 67 7b 66 63           EVP (len + text)   
       04 65 7b 65 65                 HAP (len + text)
       28 00 07 1a 06 14 13 10        ArtikelBez (len + text)
       75 17 10 1c 1b 17 10 00
       01 10 19 75 62 62 66 67 
       16 75 75 75 75 75 75 64 
       75 75 19 78 14 17 16 11 
       10 
       1a 06 05 07 1c 1b 12 75        ArtikelText (len + text)
       11 10 75 19 00 75 14 18 
       75                             block ends here
       
       05 00 59 00 43 00 00 00        record #2 (first part of this record)
       00 00 00 00 c8 
       0c 80                          flags
       65 64 65 65 65 65 65 64        Id
       60 66 60 6d 64 6d 20           PZN
       04 65 7b 65 65                 EVP (len + text)
       05 64 6d 7b 6c 64              HAP (len + text)
       23 05 1e 75 13 00 10 07        ArtikelBez (len + text)
       75 01 1c 10 07 10 75 75 
       75 75 67 60 65 75 18 19
       78 14 17 16 11 10 13 12 
       1d 1c 1f 1e 
       13 00 07                       ArtikelText (len + text)
                                      block ends here

       09 00 0f 01                    final part of record #1
       1e 64 75 1b 1a 75 01 1a        ArtikelText coninued
       05 60 
       67 60 63 62 62                 Hersteller
       00                             filler
                                      block ends here, record ends here
 
       09 00 16 02                    final part of record #2
       10 01 10 07 06 16 1d 1c        ArtikelText continued
       10 1b 10 14 16 61 05 63 
       63 
       66 63 66 63 62                 Hersteller
       00 00                          filler
                                      block ends here, record ends here

So this is handled even with the "encrypted" text. DELETE should be no problem because the blocks are marked as free and put into a linked list, so I didn't test this.

index: so let's inspect what happens with an index:
MariaDB [TestEnc]> show index from TestNormalDyn;
Empty set (0.00 sec)

MariaDB [TestEnc]> show index from TestEncDyn;
Empty set (0.00 sec)

MariaDB [TestEnc]> create index Hersteller on TestNormalDyn(Hersteller);
Query OK, 2 rows affected (0.04 sec)
Records: 2  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> create index Hersteller on TestEncDyn(Hersteller);
Query OK, 2 rows affected (0.05 sec)
Records: 2  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> 

We've seen that our data is "encrypted" in the data-file but a copy of the values of the column Hersteller must appear in the index-file. You will find a description of the structure of the MYI-file here. Let's look into the two index-files we just filled with data (I've shortened the output and marked the relevant parts in bold):
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormalDyn.MYI
*
000400 00 1a 01 32 35 36 37 37 00 00 00 00 00 00 01 33
000410 36 33 36 37 00 00 00 00 00 64 00 00 00 00 00 00
000420 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
000800
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEncDyn.MYI
*
000400 00 1a 01 66 63 66 63 62 00 00 00 00 00 64 01 67
000410 60 63 62 62 00 00 00 00 00 00 00 00 00 00 00 00
000420 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
000800
august@AMD4:~/MariaDB/data/TestEnc$ 

As you can see the values of the column Hersteller are stored encrypted in the file TestEncDyn.MYI but the order of appearance has changed compared to the file TextNormalDyn.MYI:
MariaDB [(none)]> select string_xor( unhex( 6663666362 ) ), string_xor( unhex( 6760636262 ) );
+-----------------------------------+-----------------------------------+
| string_xor( unhex( 6663666362 ) ) | string_xor( unhex( 6760636262 ) ) |
+-----------------------------------+-----------------------------------+
| 36367                             | 25677                             |
+-----------------------------------+-----------------------------------+
1 row in set (0.01 sec)

MariaDB [(none)]> 

Instead of explaining this now please allow me to do some more tests on the SQL-level. First of all I need some more data in the table:
MariaDB [TestEnc]> truncate table TestNormalDyn;
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestEnc]> truncate table TestEncDyn;
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestEnc]> insert into TestNormalDyn select * from TestOK.ABDAOK;       
Query OK, 10000000 rows affected (2 min 7.74 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> insert into TestEncDyn select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (2 min 13.34 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestEnc]> show indexes from TestNormalDyn;
+---------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestNormalDyn |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+---------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> show index from TestEncDyn;
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| TestEncDyn |          1 | Hersteller |            1 | Hersteller  | A         |        2581 |     NULL | NULL   | YES  | BTREE      |         |               |
+------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> 

Now we have 10 mio. records in both tables and the index is active on both tables. Let's do some SELECTs using the index:
MariaDB [TestEnc]> select * from TestNormalDyn order by Hersteller limit 10;
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP   | HAP   | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
| 01001854 | 999999 | 11.95 | 8.84  | TINCTURA LAVANDULAE 1/5    250 ML    | DANSAC DUO GES HAFTP50 TRA | 00020      |
| 01008613 | 999999 | 0.00  | 1.37  | MARCOUMAR     50 ST                  | WIBA DEDUCO D 7 URINBTL AN | 00020      |
| 01013982 | 999999 | 7.25  | 42.00 | TETESEPT HAAR KRAFT     40 ST        | ACIDUM PHOS D 3            | 00020      |
| 01021789 | 999999 | 20.90 | 6.26  | CHRUETERMAENN CASSIA     10 ML       | COMPRESSANA NOV K2 A-T4SIL | 00020      |
| 01024190 | 548755 | 91.07 | 50.15 | LM 150 CARBO VEGETAB ARCA     10 ML  | HYPOTHALAM SUIS INJ ORG    | 00020      |
| 01024645 | 999999 | 0.00  | 15.20 | OROS ROLLSTUHLKISS 40X43X5      1 ST | STOKOSEPT UNPARF FLASCHE   | 00020      |
| 01025398 | 999999 | 0.00  | 0.00  | LIDOCAIN     10  G                   | FURANTHRIL 500             | 00020      |
| 01025951 | 999999 | 0.00  | 39.74 | HIRNSTAMM GL D-R   10X1 ML           | SYMPHYTUM COMP             | 00020      |
| 01028733 | 999999 | 2.67  | 3.81  | LM ALLIUM CEPA VI      5  G          | NORTA BA TI 2W M 5 CH14    | 00020      |
| 01029951 | 374002 | 0.00  | 6.56  | HAMAMELIS INJ     50 ST              | TAGONIS                    | 00020      |
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
10 rows in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEncDyn order by Hersteller limit 10;
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP   | HAP   | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
| 01000393 | 999999 | 27.28 | 10.06 | GINSENG D200     10 ML               | DUSCHHOCKER 83047          | 41200      |
| 01001895 | 999999 | 0.00  | 10.17 | PAEONIA OFFIC C 3     10 ML          | PAPRIKA SCHARF             | 41200      |
| 01005147 | 999999 | 0.00  | 0.00  | VITAMIN D3 OEL LOES 100000    100  G | MEDIVEN 2 AD VK GF SCHWA 5 | 41200      |
| 01008793 | 999999 | 20.07 | 0.00  | EYE CARE MAKE UP ROSE 16     35  G   | SANFT U SICHER IL ZW WEI57 | 41200      |
| 01011168 | 999999 | 0.00  | 50.29 | CEPA D12     20 ML                   | GALIUM APARINE D 4         | 41200      |
| 01011673 | 999999 | 0.00  | 9.30  | KARIL 50    4X5 ST                   | GUELDENMOOR 77 KR COCKTAIL | 41200      |
| 01013738 | 999999 | 24.70 | 6.26  | COMPACT KOLOS BTL32MM 3594     10 ST | LATEXHANDSCHUHE UNST GROSS | 41200      |
| 01018792 | 999999 | 0.00  | 3.04  | PRAECIGLUCON     30 ST               | LM GINKGO BILOBA XVIII     | 41200      |
| 01020311 | 265307 | 41.40 | 0.00  | ELEGANCE/2 ADVK OSP MOCCA6      2 ST | SANKT SEVERIN KRAEUTERMASS | 41200      |
| 01022366 | 999999 | 4.79  | 11.20 | LEYHS KINDER GESICHTSCREME    100 ML | KOMPRESSEN GR 2 TUERKIS    | 41200      |
+----------+--------+-------+-------+--------------------------------------+----------------------------+------------+
10 rows in set (0.00 sec)

MariaDB [TestEnc]> 

As you can see the results differ. Instead of explaining the difference now let me show you another observation:
MariaDB [TestEnc]> select count(*) from TestNormalDyn  where Hersteller = '25677';
+----------+
| count(*) |
+----------+
|    17083 |
+----------+
1 row in set (0.06 sec)

MariaDB [TestEnc]> select count(*) from TestEncDyn  where Hersteller = '25677';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> select count(*) from TestEncDyn  ignore index (Hersteller) where Hersteller = '25677';
+----------+
| count(*) |
+----------+
|    17083 |
+----------+
1 row in set (42.18 sec)

MariaDB [TestEnc]> select count(*) from TestEncDyn  where Hersteller = string_xor('25677');
+----------+
| count(*) |
+----------+
|    17083 |
+----------+
1 row in set (0.04 sec)

MariaDB [TestEnc]> 

explanations: when you look at the box above you will see 4 SQL-statements which do the same but their results can differ. For the first statement I want to say: the result on the table TestNormalDyn is correct (please trust me). The second statement found nothing in the table TestEncDyn, a surprising result. You have to look at the hex-dump of the file TestEncDyn.MYI some lines above to understand this result. The database-server walks through the index Hersteller of the table TestEncDyn to find an entry '25677' and it couldn't find one because all entries in TestEncDyn.MYI are encrypted so the result 0 is correct (but it's not what we want). The third statement tells the server to ignore the index so it does a full table scan on TestEncDyn.MYD instead and returns the correct value, and for this operation it needs some more time so we loose the advantage of an index. The fourth statement shows a fast way to the correct result but we are back to the approach using an UDF which was no good idea.

Now please look at the test before this one, the test including orde by Hersteller in the SQL-statement. There you can see different result-sets. This can be explained with the information given before: in both cases the database-server uses the index because the index contains the values of the column Hersteller in sorted order. So the server fetches the first entries from the index and reads the corresponding records from the table. In the unencrypted case the first entry in the index has the value '00020' and in the encrypted case it starts with the value 'adgee' which is the encrypted value for '41200'. This explains the differences in the result-sets.

index-handling code: in the approach presented here the data is encrypted and then given to the proper routines of the engine for further processing so it appears in the index in encrypted form. For encrypting data used in an index without impairing the operability of an index we have to modify the code of the index-handler. One of the more central atomic operations of an index are comparisons so let me describe the needed modifications in some sort of pseudocode: let's assume A and B are rows of a table, we currently do the comparison in the form:
    if ( A.column < B.column ) then ....
instead we have to code it in this form:
    if ( decrypt(A.column) < decrypt( B.column) ) then ....
This would solve the problem shown above with the SQL-statement that contains the order by Hersteller part but it has to be done on multiple places.

other engines

I've chosen MyISAM as the basis for my demonstration here because this engine is feature-rich so implementing encryption is a non-trivial job. For implementing encryption in any other storage engine similar code has to be written. These modification can be done in the functions I've used or in lower level functions of each engine as the link given in the first post showed this link). This could also solve the problem with the index-handling that I've showed.

protection

The data of a database is stored in files. Accessing the files can be restricted via the operating system but an administrator of such a machine should be able to access these files and copy them anyway. When some of the problems of the approach described here (e.g. indexes) are solved then the approach presented here offers some additional protection.

But copying a file is not the only attack possible on the data. Following the approach presented here the data is stored on disk in encrypted form but for handling the data and working on the query the database-server needs the data in unencrypted form. So on disk data is encrypted, in RAM the data is unencrypted. The story of the Heartbleed-bug showed that data in RAM can sometimes be read by some tricks. Other attacks were made via SQL-injection. More attacks are possible.

Encrypting the data on the file-level is not the last word on data-protection. And if you look at my suggestions for modifying the index-code you will add an entry-point for breaking the encryption because in the index you can find hints about the sort-order (think of the history of the Enigma and Bletchley Park). Encryption is never completed.

finally

I started this post by telling the story of a phone-call I received some weeks ago. Core-topic of the discussions was a project about encrypting data in MariaDB/MySQL and I was interested in this project. But I fell through the sieve in the first round, I didn't get the contract.

Thursday 21 August 2014

encryption

the data of a company is usually stored in databases, your personal data is normally not stored in a database (only some people do this). Protecting the data in a database is essential for companies. You need an argument for this statement? Please allow me to do some 'name-dropping': Edward Snowden showed us interesting things. And companies sometime wonder about competitors and their advantages: espionage.

Some weeks ago I was asked by a company to help in a project. The company that asked is working in the business for arranging contracts between companies and software developers, so getting the right information about the job to do wasn't easy in the first discussion. A week later the job description was clear: implementing an encryption-function in MySQL/MariaDB. By asking Mr. Google I found this site: MariaDB Database Encryption. This describes in detail one approach for use in the engines InnoDB and XtraDB.
There was more information available: Mr. Google also told me this about encryption in MariaDB. I continued searching and started thinking about possible solutions.

I like to present some of my thoughts and experiments. As I said in other posts before: the code I will present here is only a proof of concept, it's not ready for commercial use.

As my focus is on the code of MariaDB/MySQL and not in the algorithm or code of any encryption-procedures I've chosen a simple function for simulating encryption: XOR. It's easy to implement, fast, encrypting and decrypting is identical (so I have only one routine to code) and the encrypted text is broken in some seconds (or less). Let's simply use this algorithm (you shouldn't use it for encrypting your data).

functions

MariaDB/MySQL contains functions for encrypting and decrypting your data. You will find a list of these functions here: 12.13 Encryption and Compression Functions. This page also contains examples on how to use these.

As my "encryption" is XOR I would like to demonstrate this using this approach. So I have to write an UDF which implements this. You can find a description of UDFs here:CREATE FUNCTION UDF and an introduction into the rules for writing your own UDF here: 24.3 Adding New Functions to MySQL. And finally there is an example which you will find in the directory sql of the source-code: udf_example.c.

You will find the source-code of my little UDF in the archive in the file: string_xor.c1). Compile it and copy it to the directory where the database-server expects it:
august@AMD4:~/workspace/MariaDB/mariadb-10.0.10/sql$ gcc -shared -o string_xor.so string_xor.c  -I ../include
august@AMD4:~/workspace/MariaDB/mariadb-10.0.10/sql$ cp string_xor.so ~/MariaDB/pgm/lib/plugin/
august@AMD4:~/workspace/MariaDB/mariadb-10.0.10/sql$ 

So let's test this little routine:
 MariaDB [(none)]> CREATE FUNCTION string_xor RETURNS STRING SONAME 'string_xor.so';
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> select hex(string_xor('ABCDE'));
+--------------------------+
| hex(string_xor('ABCDE')) |
+--------------------------+
| 1417161110               |
+--------------------------+
1 row in set (0.03 sec)

MariaDB [(none)]> select string_xor(unhex(1417161110));
+-------------------------------+
| string_xor(unhex(1417161110)) |
+-------------------------------+
| ABCDE                         |
+-------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> 

Using the hex()- und unhex()-functions was necessary because the box above does not show correctly the output of the function string_xor(). If I omit the hex()-function it looks like this:


Thius output looks like encrypted. So let's continue our tests:
MariaDB [(none)]> create database TestEnc;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestEnc;
Database changed
MariaDB [TestEnc]> create table TestNormal (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.05 sec)

MariaDB [TestEnc]> create table TestEnc (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.18 sec)

MariaDB [TestEnc]> insert into TestNormal values ( '01000000', '999999', '132.36', '0.00', 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677'); 
Query OK, 1 row affected (0.00 sec)

MariaDB [TestEnc]> insert into TestEnc values ( '01000000', '999999', '132.36', '0.00', 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', string_xor('25677'));
Query OK, 1 row affected (0.00 sec)

MariaDB [TestEnc]> 

As you can see in the last lines I've inserted a row into each table. For showing the result of these INSERT-statements I did some SELECTs. In the result I've marked the values for the column Hersteller in bold. So let's look what is stored in the table:
MariaDB [(none)]>MariaDB [TestEnc]> select * from TestNormal;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> select * from TestEnc;   
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | g`cbb      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> select Id, PZN, EVP, HAP, ArtikelBez, ArtikelText, string_xor(Hersteller) as Hersteller from TestEnc;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | 25677      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.01 sec)

MariaDB [TestEnc]> 
As you can see accessing the column Hersteller of our table TestEnc shows an encrypted value, you must use the decryption-routine to get the real value of this column.

Let's look into the MYD-files of the two tables:
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestNormal.MYD
000000 01                              flag
       30 31 30 30 30 30 30 30         Id
       39 39 39 39 39 39 20            PZN
000010 31 33 32 2e 33 36 20 20         EVP
       30 2e 30 30 20 20 20 20         HAP
000020 55 52 4f 53 41 46 45 20         ArtikelBez
       42 45 49 4e 42 45 55 54
000030 45 4c 20 37 37 33 32 43 
       20 20 20 20 20 20 31 20
000040 20 4c 20 20 20 20 20 20 
       53 50 52 49 4e 47 20 44         ArtikelText
000050 45 20 4c 55 20 41 4d 20 
       4b 31 20 4e 4f 20 54 4f
000060 50 35 
       32 35 36 37 37                  Hersteller
000067

august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestEnc.MYD
000000 01                              flag
       30 31 30 30 30 30 30 30         Id
       39 39 39 39 39 39 20            PZN
000010 31 33 32 2e 33 36 20 20         EVP
       30 2e 30 30 20 20 20 20         HAP
000020 55 52 4f 53 41 46 45 20         ArtikelBez
       42 45 49 4e 42 45 55 54
000030 45 4c 20 37 37 33 32 43 
       20 20 20 20 20 20 31 20
000040 20 4c 20 20 20 20 20 20 
       53 50 52 49 4e 47 20 44         ArtikelText
000050 45 20 4c 55 20 41 4d 20 
       4b 31 20 4e 4f 20 54 4f
000060 50 35 
       67 60 63 62 62                  Hersteller (encrypted)
000067
august@AMD4:~/MariaDB/data/TestEnc$ 

At the end of the tables you see the value of the column Hersteller. For table TestNormal you see it in plain text, for table TestEnc you see the encrypted values. To verify our little UDF let's do some calculations:
       value        decrypt                   encrypt
       67           67 XOR 55 = 32 = '2'      32 XOR 55 = 67
       60           60 XOR 55 = 35 = '5'      35 XOR 55 = 60
       63           63 XOR 55 = 36 = '6'      36 XOR 55 = 63
       62           62 XOR 55 = 37 = '7'      37 XOR 55 = 62

final words: can we use this approach? Maybe. Please keep in mind that every SQL-statement must contain the call of the encryption- or decryption-function for every column that contains encrypted values. In most of the cases this has to be done in the client-software. Using a hex-editor or decompiler (e.g. for Java) or some other tools this can easily be detected. So this approach is no real protection.

trigger

Let's switch over to the server-side and implement encryption via a trigger. You will find a description here: CREATE TRIGGER Statement.

Here is the first step of the example using a trigger:
MariaDB [TestEnc]> create table TestTrigger (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestEnc]> delimiter //    
MariaDB [TestEnc]> CREATE TRIGGER test_xor BEFORE INSERT ON TestTrigger
    ->     FOR EACH ROW
    ->     BEGIN
    ->         SET NEW.Hersteller = string_xor( NEW.Hersteller);
    ->     END;//
Query OK, 0 rows affected (0.08 sec)

MariaDB [TestEnc]> delimiter ;
MariaDB [TestEnc]> insert into TestTrigger values ( '01000000', '999999', '132.36', '0.00', 'UROSAFE BEINBEUTEL 7732C      1  L', 'SPRING DE LU AM K1 NO TOP5', '25677');
Query OK, 1 row affected (0.00 sec)

MariaDB [TestEnc]> select * from TestTrigger;
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP  | ArtikelBez                         | ArtikelText                | Hersteller |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00 | UROSAFE BEINBEUTEL 7732C      1  L | SPRING DE LU AM K1 NO TOP5 | g`cbb      |
+----------+--------+--------+------+------------------------------------+----------------------------+------------+
1 row in set (0.00 sec)

MariaDB [TestEnc]> 

If you look at the result of the SELECT you see that only the column Hersteller is encrypted. And if you look at the definition of the trigger you see that it uses the UDF string_xor() defined before and it calls this function only for the column Hersteller.

Looks fine but let's look into the MYD-file:
august@AMD4:~/MariaDB/data/TestEnc$ l -l TestTrigger*
-rw-rw---- 1 august august  679 Aug 19 13:57 TestTrigger.frm
-rw-rw---- 1 august august  103 Aug 19 13:58 TestTrigger.MYD
-rw-rw---- 1 august august 1024 Aug 19 13:58 TestTrigger.MYI
-rw-rw---- 1 august august  337 Aug 19 13:57 TestTrigger.TRG
august@AMD4:~/MariaDB/data/TestEnc$ od -t x1 -Ax TestTrigger.MYD
000000 01 30 31 30 30 30 30 30 30 39 39 39 39 39 39 20
000010 31 33 32 2e 33 36 20 20 30 2e 30 30 20 20 20 20
000020 55 52 4f 53 41 46 45 20 42 45 49 4e 42 45 55 54
000030 45 4c 20 37 37 33 32 43 20 20 20 20 20 20 31 20
000040 20 4c 20 20 20 20 20 20 53 50 52 49 4e 47 20 44
000050 45 20 4c 55 20 41 4d 20 4b 31 20 4e 4f 20 54 4f
000060 50 35 67 60 63 62 62
000067
august@AMD4:~/MariaDB/data/TestEnc$ 

If you look at the hex-dump you will see exactly the same hex-values as in the example above (table TestEnc.MYD) so this approach worked.

How can we access this column with the encrypted value? Unfortunately there is no trigger for SELECT in MariaDB/MySQL. Other database-software also does not have such a trigger as you can see here for Oracle and Microsoft. So we have to create a stored procedure that contains this SELECT-statement and does the processing on the encrypted columns, returning the row after processing it.2). But I don't want to do this here but instead switch over to a different approach.

continued

Next I like to describe an approach that uses a storage engine so this post will be continued.


some notes:
1) I will give you the link to the archive in the coming post.
2) for protecting the data in your database this approach has some advantages: you can put all SQL-statements in stored-procedures and the clients are only allowed to call these stored-procedures. You can even let these stored procedures call other stored procedures which belong to a different user but have the right to access tables and manipulate data. So from the client side you will only see a stored procedure which contains a one-liner (the call of another stored procedure, which is invisible to the client).

Monday 4 August 2014

columns: some remarks

this is my fifth post about this topic: storing data in a column-oriented form. You can find the other posts here: columns, columns (2), columns (3) and columns (4).

I started these posts with the idea that storing the data in column-oriented form would speed-up the database-operations, especially it would speed-up the queries. And I tried to implement such a structure within MariaDB/MySQL.

After playing with this code and thinking about improving the algorithms for a while I'm not so sure anymore. Yes, this approach is faster, but only in some cases and it seems that these are specialized cases.

descriptions

In the first post about this topic I gave a short explanation of storing the data row-oriented or column-oriented in a file. I don't want to repeat this text so let me describe here in a few words the ways I used for some experiments: MYCOLS1, MYCOLS2, MYCOLS3. I tried some more algorithms which I don't want to publish (and naturally other people have their own algorithms too).

first shot

In the first post some (silly) code was presented to show which functions were of interest for our topic.

MYCOLS1

The algo in MYCOLS1 was primitive, when a row was written the data of one column was written to an extra file in sequential form. Querying the data was done by scanning through the extra file if this column was found in a condition in the WHERE-clause. If a match was found in the extra file the corresponding row was read and given back to the database-server to do the rest of the filtering on this row in the usual way. If a more complex WHERE-condition exists this algo could not check for all parts of the condition because other columns were not handled by this algo but could be involved in the query.1)

MYCOLS2

In MYCOL2 the extra storage was rearranged to a dictionary-like approach. The storage of the column-data was split into a dictionary- and a reference-part. The dictionary (the .key-file) contained the unique entries of this column (think of an index of a book), the .data-file contained the references to the rows (again think of the index of a book: under the entry you find the page-numbers on which this keyword can be found. Or you may think of a cross-reference-listing where all variables of a program are listed together with the line-numbers of their occurrences in the source-file).

MYCOLS3

In the examples above each query was handled by a table-scan. In MYCOLS3 this changed to an index: MYCOLS3 used the same layout of the column-data as MYCOLS2, but in a query the additional data was treated like an index.
The Microsoft SQL Server 2012 has a feature called Column Store Index. Maybe it's a similar approach.


The algorithms presented use a lot of the code of the underlying MyISAM-engine, This can be removed but for this demonstration this was not done. Removing additional MyISAM-code could speed up the read-operation a bit.

All the algos presented here use a hybrid storage: the data was stored in a MYD-file plus in one or more additional files, similar to the example of the book with the additional index or the cross-reference-listing. In handling the query the extra file was scanned and finally when a condition was fulfilled the row was read and given to the database-server for further handling. So all algos presented here were modifications of a row-based approach; they did not overcome the concept of row-based storage.

Further improvements of the algos are possible. With the current implementation in MariaDB/MySQL the optimizer knows nothing about the data-structure of our 'index' (in MYCOLS3). If you look at the GROUP BY-example you see that the data-structure of MYCOLS3 fits into the query. The only thing needed is a call to fetch one key from the storage-engine together with the number of occurrences of this key in the data; this can easily be done with the data available in the 'index'. But this means extending the API of the storage-engine and rewriting the optimizer-code to take advantage of this new function.2)



from a different angle

Let's look at the same problem from a different angle: data is stored on one or more harddisks. The steps involved in answering a query contain some steps for reading data from the harddisk. Harddisks are fast but have always been slow compared to RAM or CPUs. Harddisks need some milliseconds to deliver data; current CPUs are able to execute some billion of assembly-instructions per second, so in a millisecond (= 0.001 second) the CPU is able to execute some 10 million assembly-instructions (and harddisks do need more than one millisecond to deliver the data). We can speed things up by adding a cache, an additional layer between RAM and harddik using some RAM to keep parts of the data in RAM based on statistical asumptions. This can speed things up but the whole data does not fit into the cache so it can help only sometimes.

Let me describe the disk-operations: one can distinguish between two general types of disk-operations: sequential and random. To simplify the argumentation I will only describe reads.

sequential read:
  • operating system tells the disk which part to read
  • move disk-head to correct track(cylinder)
  • wait in the track until sector arrives
  • start reading (simplified)
=> from this moment on the disk can deliver a lot of data (until the full track is read and we have to move the head again). So multiple records can be read one after the other.

random read:
  • operating system tells the disk which part to read
  • move disk-head over correct track(cylinder)
  • wait in the track until sector arrives
  • read sector(s)
=> looks similar to the description given above but do not forget: we read only one or some sector(s) and only one record will be given back to the OS. Usually the next records is on a different position on the disk and therefore the whole process starts again. So we do spend some milliseconds by waiting for the disk-head to move to the next correct position.

The optimizer chooses the path to read the data, via a table-scan (=sequential read) or via an index (=random read), if an index is available.3)

Observations here on my PC show: sequential reads are fast:
  • the harddisk on my PC is able to read up to 130 MB/sec
  • the SSD is able to read approx. 400 MB/sec

The general idea for speeding up the query is to reduce the amount of data read from the disk. This is the reason why an additional organization of the data (=index) is helpful. Reducing the amount of data read is a must when tables grew bigger. With the numbers above reading a single table sequentially takes about 6 to 7 seconds for each GB of data plus the necessary time for further handling the data (filtering, joining, grouping, sorting etc.). This can add to minutes if you think of SQL-statements involving 20 (or more) tables and this time required is unacceptable.
So we have to organize the data anyway and we can try to find a layout of the data that speeds up the query (and writing and updating and deleting and checking for a condition and ....). For approx. 30 years we live with relational databases and have learned to work with them. We know about tables and records and we think this way. And therefore the data comes in into a system in the form of rows and is handled by our software this way. Our concepts are based on rows (or a set of rows). Maybe we need a different model to handle columnar-storage of the data. Unfortunately I can't show you a better way.



some notes:
1) this is valid for all algos presented here.
2) To be fair: adding such a function would also speed up the operation with the existing indexes because it can easily be implemented using the conventional index-structure (conventional = BTREE = as implemented in MyISAM).
3) In one of my tests the optimizer chooses the wrong path: a sequential-read took approx. 10 secs whereas a random-read took approx. 17 seconds and it chooses the index when it was availabe (the test was made with a table of type MyISAM, 10 mio. records in the table, about 1GB in size). Maybe this is a topic for another post here.