Tuesday, February 9, 2016

Compaction priority in RocksDB

Compaction priority is an option in RocksDB that determines the next file to select for compaction. Here I show the impact of this option on the amount of storage writes while running Linkbench. The right value can reduce the amount of data written to storage, which improves storage efficiency and can make an SSD last longer.

The Linkbench schema has 3 tables with 4 indexes. Each of the indexes uses a separate column family with MyRocks. There is one column family for each of the primary key indexes on the link, count and node tables (link_pk, count_pk, node_pk) and another column family for the secondary index on the link table (link_id1_type). The schema for MyRocks is here.


I ran Linkbench with maxid1=1B and requesters=20 for 3 days on a server with 24 CPU cores, 256 GB of RAM and an MLC SSD. I then looked at the compaction IO statistics via SHOW ENGINE ROCKSDB STATUS to determine the amount of storage writes per column family. The graphs below use the integer value for the compaction priority.
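For readers matching the integer values to the option names used in this post, here is a minimal sketch of the mapping. It assumes the integers follow the order of the CompactionPri enum in the RocksDB headers; that assumption agrees with the text here, where compaction_pri=1 and kOldestLargestSeqFirst both have the most writes. Only the three names mentioned in this post are listed, and the helper name compaction_pri_name is hypothetical.

```python
# Assumed mapping from the integer compaction_pri value to its name.
# The integers are an assumption based on the RocksDB CompactionPri enum,
# restricted to the three names that appear in this post.
COMPACTION_PRI = {
    0: "kByCompensatedSize",
    1: "kOldestLargestSeqFirst",
    2: "kOldestSmallestSeqFirst",
}


def compaction_pri_name(value: int) -> str:
    """Return the option name for an integer value, or a marker if it is not listed."""
    return COMPACTION_PRI.get(value, f"unlisted({value})")


if __name__ == "__main__":
    for value in sorted(COMPACTION_PRI):
        print(value, compaction_pri_name(value))
```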

The first graph displays the total amount written to storage per column family for each of the compaction_pri values. The least data is written with kOldestSmallestSeqFirst and the most data is written with kOldestLargestSeqFirst. The difference is about 1.34X. I had been using the default value, kByCompensatedSize, for all of my tests prior to this result.

The next graph shows the amount of data written into RocksDB by the application (MyRocks) per column family. I call this the ingest. The storage writes done by RocksDB include the ingest and the writes done for compaction in the background. As expected, the ingest is similar for each of the compaction_pri values.

The final graph shows the write-amplification, which is the ratio of total-writes / ingest. A smaller write-amplification is usually better. Because the ingest is similar for each value and the total-writes (the first graph) are largest for compaction_pri=1 (kOldestLargestSeqFirst), it has the largest write-amplification.
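The arithmetic behind the write-amplification ratio can be sketched as follows. This is an illustration only: the function names are hypothetical, and the per-column-family numbers are made up to show the calculation. They are not the measurements from this test.

```python
def write_amplification(total_written, ingest):
    """Return total-writes / ingest for each column family."""
    return {cf: total_written[cf] / ingest[cf] for cf in ingest}


def overall_write_amplification(total_written, ingest):
    """Return the ratio summed over all column families."""
    return sum(total_written.values()) / sum(ingest.values())


def worst_to_best_ratio(total_by_pri):
    """Return the largest total divided by the smallest across compaction_pri values."""
    return max(total_by_pri.values()) / min(total_by_pri.values())


# Hypothetical values in GB, NOT results from the Linkbench run.
ingest_gb = {"link_pk": 100.0, "count_pk": 20.0, "node_pk": 40.0, "link_id1_type": 80.0}
total_gb = {"link_pk": 1200.0, "count_pk": 100.0, "node_pk": 200.0, "link_id1_type": 900.0}

if __name__ == "__main__":
    for cf, wa in write_amplification(total_gb, ingest_gb).items():
        print(f"{cf}: write-amplification = {wa:.2f}")
    print(f"overall: {overall_write_amplification(total_gb, ingest_gb):.2f}")
```

Because the ingest is roughly the same for every compaction_pri value, comparing the total written across values (as worst_to_best_ratio does) gives about the same ranking as comparing write-amplification.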
