Friday, April 3, 2015

Fast index create

What is a fast way to create secondary indexes for a write-optimized database engine?

Back in the day the only way to create a secondary index for InnoDB was via incremental maintenance. Insert, update and delete statements would maintain secondary indexes as needed. The CREATE INDEX command for a secondary index would make a copy of the table, define all indexes (primary and secondary) and then scan the source table in PK order and insert rows from the scan into the copy table while maintaining secondary indexes after each insert.  In most cases the result from this is a secondary index subject to changes in a random sequence. That means the secondary index is fully fragmented immediately after index create and there was no way to defragment a secondary index. Fragmentation is bad as it means you are wasting space and using about 1.5X the space for the index compared to the index without fragmentation.

Today InnoDB is able to create a secondary index in the expected way via scan/sort/write and fragmentation is much better. The expected way is to scan the base table or another secondary index to get the columns for the new index, sort those columns and then write out the secondary index in key order. I will ignore the complexity of logging to allow other changes concurrent with the index create. Not only does this reduce fragmentation but it also reduces random IO -- the index is written sequentially and there are no random disk reads to get secondary index leaf pages during the read-modify-write cycle of an update for a b-tree.

The best-practice for fast index creation can change for a write-optimized database engine when the engine supports a blind-write in addition to a read-modify-write internally. By this standard a b-tree is not write-optimized whether it is update-in-place like InnoDB or copy-on-write like WiredTiger. Updates and inserts with a b-tree must read the old copy of the page, modify it and then eventually write the page back to storage. There is always a read before the write even with the InnoDB change buffer where the read is deferred. But with an LSM like RocksDB or WiredTiger and probably with a fractal tree like Tokutek updates can be done as blind-writes in many cases like when the secondary index to be updated is not unique. By blind-write I mean there isn't a read before the write. This means that the random reads that make incremental index maintenance for a b-tree slow can be avoided with a write-optimized engine. The random writes can also be avoided assuming the engine really is write-optimized when flushing changes to storage.

It might take some time for us to learn that the rules have changed. It might also take time for the rules to really change when there are robust write-optimized engines available for MySQL and MongoDB.

Update - today I learned the RocksDB storage engine for MongoDB doesn't do reads for non-unique secondary index maintenance.

1 comment:

  1. In my opinion, the best feature that write optimized data structures can add is hot (re overhead) schema change to MySQL. Our, Tokutek, implementation of this feature is covered in the below link. In short, an extra column in the database can be handled just like any insert into the database (I/O deferred). I think it's one of the most under-rated features...NoSQL like agility in a RDBMS