Wednesday, April 1, 2015

A few possibly true facts about indexes in MongoDB, RocksDB and WiredTiger

These are probably correct as of April 1, 2015. They might not be correct next year. We have a lot of work to do to explain the storage engines available in MongoDB 3. I welcome corrections.

  • The primary key index is not clustered in MongoDB for RocksDB or WiredTiger. I assume this is a limitation of the storage engine API and hope a clustered PK index is supported in the future. The PK index is a map from PK columns to RecordId and (I assume) RocksDB and WiredTiger have a clustered index on RecordId and the documents are in the leaf nodes of the RecordId index. The impact from not having a clustered PK index include:
    • Much less RAM is required to cache the PK index because it is not clustered but if the table is huge and you want to guarantee that at most one disk read is required for a point query then a non-clustered PK requires much more RAM then a clustered PK. With the clustered PK you only need the level above the leaf in RAM and that only requires 1 (key, block pointer) pair per block in the leaf level. With the non-clustered PK you need one (key, block pointer) pair per row in the table.
    • For RocksDB and WiredTiger going from RecordId to the document requires searching an index (b-tree or LSM for WT, LSM for RocksDB). An index search is not required for mmapv1 as the RecordId is a filename + offset in that case. This might make read-mostly workloads on a cached working set faster with mmapv1.
    • If values adjacent in the PK index are not adjacent in the RecordId index then a scan in PK order can do a lot of random IO as it moves around in the RecordId index.
  • The WiredTiger b-tree can use prefix compression for indexes and block compression for data. I think this means that block compression is used for the leaf nodes of the RecordId index. I am not sure if it is also used for the leaf nodes of the other indexes. I guess that only prefix compression is used for the non-leaf nodes of all indexes.
  • With RocksDB we need to be careful about terminology. With leveled compaction the block-based table format is used for MongoDB and most of the data in an LSM level is from pages, and then the block index and bloom filter. This is true for secondary indexes, primary indexes and the RecordId index. Currently all data is in one column family and indexes are distinguished by using the index ID as the prefix of the key given to RocksDB but we can ignore that for now. Lets also ignore bloom filters and just focus on pages and the block index. The interesting details are:
    • There is one key per page in the block index so this is unlikely to use much space compared to the pages. 
    • Key value pairs are stored in the pages in key order and prefix compression is done for them. Prefix compression is not done for the block index. Logically the value in the secondary index is the PK columns but the PK columns are used here to make the secondary index key unique. Logically the value in the PK index is the RecordId, but I think that the implementation puts the RecordId at the end of the PK columns in the key. And for the RecordId index the value is the Document.
    • Block compression, when configured, is done for all pages. So it might be possible that more data can get the benefit from block compression with RocksDB than with WiredTiger and this can be a big deal when wide (covering) secondary indexes are used. However I still have questions about where the block compressor is used for WT.


  1. I do not like how the current storage engine API does not allow for a single "index" to be used for a collection with no secondary indexes.

    1. I don't understand, haven't spent much time looking at the API