Thursday, February 11, 2016

RocksDB vs the world for loading Linkbench tables

I like RocksDB, MyRocks and MongoRocks because they are IO efficient thanks to excellent compression and low write-amplification. It is a bonus when I get better throughput with the RocksDB family. Here I show that RocksDB has excellent load performance. To avoid benchmarketing, the context is loading Linkbench tables when all data fits in cache. This workload should be CPU bound and can be mutex bound. I compare engines for MySQL (RocksDB, TokuDB, InnoDB) and MongoDB (RocksDB, WiredTiger) using the same hardware.


I used Linkbench for MySQL and LinkbenchX for MongoDB. The load test was run with generate_nodes=false to limit the load to the count and link tables/collections. The load was repeated with 1 and 8 users (loaders=1 & maxid1=1M, loaders=8 & maxid1=4M). The test server has 144G RAM, 2 sockets, 12 CPU cores (24 HW-threads) and a 400G Intel s3700 SSD.

The oplog, binlog and sync-on-commit were disabled. I did that to remove bottlenecks that might hide stalls in the database engine.

I tested the following binaries:


This shows the average insert rate during the single-threaded load. The performance summary is:
  • MySQL gets more throughput than MongoDB for the same workload because MongoDB uses more CPU. MongoDB also suffers from using more indexes on the Linkbench tables (the extra indexes are internal) as explained previously. I hope they fix this.
  • For MongoDB, data is loaded faster with WiredTiger than RocksDB because RocksDB uses more CPU.
  • MySQL+RocksDB and MySQL+TokuDB were by far the fastest for the single-threaded load, ignoring uncompressed InnoDB, which I ignore because I require compression. Compressed InnoDB suffers from the overhead of doing some (de)compression operations in the foreground.
  • Uncompressed InnoDB is about 10% slower in MySQL 5.7 compared to 5.6 for the single-threaded load. Low-concurrency performance regressions in MySQL are a known problem.

This shows the value of: (CPU utilization * 1000) / insert_rate. That includes CPU consumed in the foreground by the thread processing the user's requests and in the background. The value is inversely correlated with the insert rate. The value is larger for MongoDB than for MySQL, which explains why the insert rate is larger for MySQL.
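The metric can be sketched in a few lines. This is a minimal illustration with hypothetical numbers, not the script I used:

```python
def cpu_per_insert(cpu_util_pct, insert_rate):
    """Return (CPU utilization * 1000) / insert_rate.

    cpu_util_pct: average CPU utilization (foreground + background) during
                  the load, as a percentage.
    insert_rate:  average inserts/second during the load.
    A larger value means more CPU is consumed per insert.
    """
    return cpu_util_pct * 1000.0 / insert_rate

# Hypothetical: at the same 40% CPU utilization, the engine doing
# 20,000 inserts/s costs half as much CPU per insert as the one
# doing 10,000 inserts/s.
fast = cpu_per_insert(40, 20000)  # 2.0
slow = cpu_per_insert(40, 10000)  # 4.0
```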


This shows the average insert rate during the load with 8 user threads. The performance summary is:
  • MySQL+RocksDB is a lot faster than everything else except uncompressed InnoDB and I ignore that because I require compression.
  • Compressed InnoDB suffers from mutex contention on the pessimistic code path. See this.
  • TokuDB suffers from mutex contention and is slower here than at 1 thread. See this.
  • Mongo+RocksDB suffers from mutex contention and the difference between it and WiredTiger is larger here than for the single-threaded load.

This is the rate of context switches per insert. The context switch rate was measured by vmstat. Several of the engines suffer from too much mutex contention.
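The derivation of the metric is simple; a sketch with hypothetical numbers follows (vmstat's cs column reports switches/second, so both inputs are rates over the same interval):

```python
def cs_per_insert(cs_per_sec, inserts_per_sec):
    """Context switches per insert.

    cs_per_sec:      average of the vmstat 'cs' column over the interval.
    inserts_per_sec: average insert rate over the same interval.
    A high value suggests threads are frequently blocking on mutexes.
    """
    return cs_per_sec / float(inserts_per_sec)

# Hypothetical: a mutex-bound engine might show several context
# switches per insert.
rate = cs_per_insert(120000, 30000)  # 4.0
```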

Tuesday, February 9, 2016

Compaction priority in RocksDB

Compaction priority is an option in RocksDB that determines the next file to select for compaction. Here I show the impact of this option on the amount of storage writes while running Linkbench. The right value can reduce the amount of data written to storage which improves storage efficiency and can make an SSD device last longer.

The Linkbench schema has 3 tables with 4 indexes. Each of the indexes uses a separate column family with MyRocks. There is one column family for each of the primary key indexes on the link, count and node tables (link_pk, count_pk, node_pk) and another column family for the secondary index on the link table (link_id1_type). The schema for MyRocks is here.


I ran Linkbench with maxid1=1B and requesters=20 for 3 days on a server with 24 CPU cores, 256 GB of RAM and an MLC SSD. I then looked at the compaction IO statistics via SHOW ENGINE ROCKSDB STATUS to determine the amount of storage writes per column family. The graphs below use the integer value for the compaction priority:

This graph displays the total amount written per column family for each of the compaction_pri options. The least data is written with kOldestSmallestSeqFirst and the most data is written with kOldestLargestSeqFirst. The difference is about 1.34X. I have been using the default value, kByCompensatedSize, for all of my tests prior to this result.
The next graph shows the amount of data written into RocksDB by the application (MyRocks) by column family. I call this the ingest. The storage writes done by RocksDB include the ingest and the writes done for compaction in the background. As expected, the ingest is similar for each of the compaction_pri values.
The final result is the write-amplification which is the ratio of total-writes / ingest. A smaller write-amplification is usually better. Because the total-writes (the first graph) are largest for compaction_pri=1 it has the largest write-amplification.
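The computation behind the graphs can be sketched as follows. The enum values match the integers used above (from RocksDB's options at the time of writing); the byte counts are hypothetical, not my measured results:

```python
# Integer values of compaction_pri as used in the graphs.
COMPACTION_PRI = {
    0: "kByCompensatedSize",      # the default
    1: "kOldestLargestSeqFirst",
    2: "kOldestSmallestSeqFirst",
}

def write_amplification(total_writes_gb, ingest_gb):
    """write-amp = total storage writes / ingest.

    total_writes_gb: all compaction + flush writes reported by
                     SHOW ENGINE ROCKSDB STATUS.
    ingest_gb:       data written into RocksDB by the application.
    Smaller is better for SSD endurance.
    """
    return total_writes_gb / float(ingest_gb)

# Hypothetical: for the same 100 GB of ingest, an option that writes
# 1340 GB in total has 1.34X the write-amplification of one that
# writes 1000 GB.
wa_worst = write_amplification(1340.0, 100.0)
wa_best = write_amplification(1000.0, 100.0)
```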

Monday, February 8, 2016

Concurrent inserts and the RocksDB memtable

RocksDB started as a fork of LevelDB. While it will always be a fork, we describe it as derived from LevelDB given how much code has changed. It inherited a memtable that did not allow concurrent inserts courtesy of a mutex: readers ignored the mutex while writers had to lock it. Writers waiting for the mutex linked their updates together and the thread at the head of the convoy applied all of their changes after getting the mutex.
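The convoy behavior can be modeled in a few lines. This is a simplified sketch of the idea, not RocksDB's actual implementation:

```python
import threading

class Memtable:
    """Toy model of the old single-writer memtable. Writers queue their
    updates, then whichever thread gets the mutex applies everything that
    queued up behind it (the head of the convoy). Readers skip the mutex."""

    def __init__(self):
        self.data = {}
        self.mutex = threading.Lock()          # the single writer mutex
        self.pending = []                      # the convoy of updates
        self.pending_lock = threading.Lock()

    def put(self, key, value):
        # Link the update into the convoy.
        with self.pending_lock:
            self.pending.append((key, value))
        # One writer at a time drains and applies the whole batch. A
        # writer's update may already have been applied by an earlier
        # leader, in which case its own batch is empty.
        with self.mutex:
            with self.pending_lock:
                batch, self.pending = self.pending, []
            for k, v in batch:
                self.data[k] = v

    def get(self, key):
        # Readers ignore the writer mutex in this model.
        return self.data.get(key)
```

With the new options described below, writers no longer serialize on a single mutex for the memtable insert itself.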

Insert performance for RocksDB and LevelDB did not improve beyond one thread when sync on commit was disabled because of contention on the mutex. It could improve with more threads when sync on commit was enabled courtesy of group commit. Write-only workloads are rare for a DBMS that uses a B-Tree because read-modify-write of leaf nodes is done for updates and inserts. Write-only is more common with RocksDB thanks to the merge operator and because non-unique secondary index maintenance doesn't need RMW. We expect to make this more common via SQL features in MyRocks.

We continue to make RocksDB better and today I show how the concurrent insert rate improves by 3X with sync on commit disabled and 2X with it enabled. Hopefully we will get a wiki entry to explain the code changes, until then thank you Nathan. The new behavior is enabled by setting two configuration options to true: allow_concurrent_memtable_write and enable_write_thread_adaptive_yield.

These tests were done using a server with 24 cores (48 HW threads) and enough RAM to cache the database. The test script is here. The workload is write-only with threads doing Put operations on randomly selected keys.

Results with sync on commit disabled

This section has results with sync on commit disabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 3X at high concurrency.

Results with sync on commit enabled

This section has results with sync on commit enabled. The first graph shows the rate in inserts/second for RocksDB with the concurrent memtable enabled (New) and disabled (Old). The second graph shows the ratio of (New/Old) and the speedup is about 2X at high concurrency.

Speedup for MyRocks

I used a smaller server (20 cores instead of 24) to measure the benefit from this change for loading linkbench tables in parallel. The linkbench load step was run with generate_nodes=false to disable the single-threaded load of the node table and maxid1=30M. The database is small enough to fit in cache. The speedup is 1.4X at high concurrency. I will be vague but there is another configuration change that improves the speedup to 2.4X.

Thursday, January 28, 2016

MyRocks vs InnoDB with Linkbench over 7 days

After feedback from my previous post on Linkbench I repeated it for 7 days as the previous test ran for 1 day. The results are the same -- MyRocks sustains more QPS, is more IO efficient and provides better compression on IO-bound Linkbench. Ask your MySQL vendor when they will support MyRocks. The summary is:
  • InnoDB writes between 8X and 14X more data to SSD per transaction than RocksDB
  • RocksDB sustains about 1.5X more QPS
  • Compressed/uncompressed InnoDB uses 2X/3X more SSD space than RocksDB
I encourage others to use long running benchmark tests and present IO efficiency metrics in addition to performance results.


I used the same configuration as described in the previous post with one difference. For this test I ran 168 iterations of the query step and each step ran for 1 hour. The test ran for 7 days while the previous test ran for 1 day. What I describe as QPS below is TPS (transactions/second) and when I use per query below I mean per transaction. The IO efficiency metrics are measured by iostat. I report the database size in GB at the end of each day - hours 1, 24, 48, 72, 96, 120, 144 and 168. For each one hour interval I collect:
  • average QPS
  • iostat reads per query (r/q)
  • iostat KB read per query(rKB/q)
  • iostat KB written per query (wKB/q)
  • iostat reads per second (r/s)
  • iostat KB read per second (rKB/s)
  • iostat KB written per second (wKB/s)
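The per-query metrics are the iostat rates normalized by the transaction rate. A minimal sketch with hypothetical numbers:

```python
def per_query(iostat_rate, qps):
    """Normalize an iostat rate (r/s, rKB/s or wKB/s, averaged over a
    1-hour interval) by the transaction rate for the same interval to
    get r/q, rKB/q or wKB/q."""
    return iostat_rate / float(qps)

# Hypothetical interval: 5000 wKB/s at 2000 transactions/s is 2.5 wKB
# written per transaction.
wkb_per_q = per_query(5000, 2000)  # 2.5
```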
Tests were run for several binaries:
  • myrocks.zlib - Facebook MySQL 5.6, RocksDB with zlib compression
  • innodb56.none - upstream MySQL 5.6.26, InnoDB without compression
  • innodb57.none - upstream MySQL 5.7.10, InnoDB without compression
  • innodb56.zlib - upstream MySQL 5.6.26, InnoDB with zlib compression
  • innodb57.zlib - upstream MySQL 5.7.10, InnoDB with zlib compression

Better Compression

Compressed InnoDB uses about 2X more SSD space than MyRocks. Uncompressed InnoDB uses about 3.1X more SSD space than MyRocks. This graph shows the database size every 24 hours. Note that the database gains data as a function of the QPS rate and MyRocks has more data than InnoDB after 168 hours -- myrocks.zlib has 6.4% more rows than innodb57.none and 7.2% more rows than innodb57.zlib after 7 days.

Better Performance

I'd be happy if MyRocks matched the QPS from InnoDB and only beat it on IO efficiency. But it wins on QPS and IO efficiency. The data below is QPS over time. MyRocks gets at least 1.5X more QPS than compressed InnoDB. It also does a lot better than uncompressed InnoDB but who wants to use 3X more SSD space. The QPS growth at test start for InnoDB with zlib happens because there are stalls until the compressed b-tree pages fragment. I think this is a problem with mutexes in the pessimistic insert/update code for compressed InnoDB.

The graph below shows the average QPS from each 1-hour interval.

Better IO Efficiency

I present IO efficiency metrics here using data from iostat normalized by the QPS rate to show the amount of IO done per transaction. 

This result is remarkable. InnoDB writes between 8X and 14X more to storage per transaction than MyRocks. This means that workloads can use TLC SSD with MyRocks when InnoDB requires MLC and that workloads can use SSD with MyRocks when SSD doesn't have sufficient endurance for InnoDB. Running a busy database on SSD is so much easier than using disk. 

This result doesn't include the additional write-amplification from flash GC that occurs with InnoDB compared to MyRocks because the MyRocks write pattern generates less work for flash GC. I previously described how that increased the flash write-rate by about 1.5X for InnoDB compared to RocksDB for the device I use. This means that the real difference in write rates on SSD might mean that InnoDB writes 12X to 21X more to storage than MyRocks.
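The arithmetic behind the 12X to 21X estimate is just the iostat range scaled by the flash GC factor:

```python
# iostat shows InnoDB writing 8X to 14X more per transaction than
# MyRocks, and flash GC adds ~1.5X more device-level writes for
# InnoDB's write pattern, so the device-level difference is roughly:
low, high, gc_factor = 8, 14, 1.5
device_low = low * gc_factor    # 12.0
device_high = high * gc_factor  # 21.0
```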

Using iostat metrics from hour 168 InnoDB writes 8.7, 10.5, 8.9 and 10.6 times more to storage per transaction compared to RocksDB. Using iostat data from hour 167 the difference is 11.5, 13.9, 11.7 and 14.0 times more data written.

The graph below shows the number of KB written to SSD per transaction from each 1-hour interval.

This graph shows the number of SSD reads per transaction. MyRocks has the smallest rate. While an LSM can have an IO penalty for range queries and range queries are the most frequent operation in Linkbench, that isn't a problem for this workload.
This graph shows the number of KB read from SSD per transaction. The rate for MyRocks is in between the rates for compressed and uncompressed InnoDB. The r/q rate is more important than this rate as long as the difference here isn't extreme. The rate for MyRocks includes reads done for user queries and reads done in the background for compaction.

Absolute iostat results

These graphs show absolute rates from iostat. The data is the average rate per 1-hour interval.

The first graph is the rate for iostat r/s. The rate is larger for MyRocks because it sustains the most QPS.

The next graph shows the rate for iostat wKB/s. Note that InnoDB sustains between 100 and 200 MB/s of writes. The SSD device is much busier doing writes for InnoDB than for MyRocks. IO capacity used for writes isn't available for reads even when endurance isn't an issue. More writes means more erases from flash GC and erases are a source of IO stalls when a read gets stuck behind the erase on the same channel.
The last graph shows the rate for iostat rKB/s. The rates are larger for uncompressed InnoDB and MyRocks than for compressed InnoDB. The SSD is very busy if you combine the iostat rates for rKB/s and wKB/s.

How to build MongoRocks for MongoDB 3.2

This explains how I built MongoDB 3.2 from source with support for RocksDB thanks to help from Igor Canadi. There are more details here and here. My server uses Fedora.

# Install many of the dependencies for MongoRocks
sudo yum install snappy-devel zlib-devel bzip2-devel lz4-devel
sudo yum install scons gcc-c++ git

# Unpack MongoDB 3.2 source in $MONGOSRC

# Directory in which git repos are created
mkdir ~/git

# Get MongoRocks engine
cd ~/git
git clone https://github.com/mongodb-partners/mongo-rocks.git

# get and build RocksDB libraries
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
git checkout --track origin/4.4.fb -b 44fb
make static_lib

# prepare source build with support for RocksDB
cd $MONGOSRC
mkdir -p src/mongo/db/modules/
ln -sf ~/git/mongo-rocks src/mongo/db/modules/rocks

# build mongod & mongo binaries
scons CPPPATH=/home/mdcallag/git/rocksdb/include \
      LIBPATH=/home/mdcallag/git/rocksdb \
      LIBS=lz4 mongod mongo

# install mongod
mkdir -p ~/b/m321
cd ~/b/m321
mkdir data
mkdir bin
cp $MONGOSRC/build/opt/mongo/mongod bin
cp $MONGOSRC/build/opt/mongo/mongo bin

# create mongo.conf file with the text that follows. You must
# change $HOME and consider changing the value for cacheSizeGB
processManagement:
  fork: true
systemLog:
  destination: file
  path: $HOME/b/m321/log
  logAppend: true
storage:
  syncPeriodSecs: 600
  dbPath: $HOME/b/m321/data
  journal:
    enabled: true
operationProfiling.slowOpThresholdMs: 2000
replication.oplogSizeMB: 4000
storage.rocksdb.cacheSizeGB: 40

# start mongod, consider using numactl --interleave=all
bin/mongod --config mongo.conf --master --storageEngine rocksdb

# confirm RocksDB is there
$ head -1 data/db/LOG
2016/01/28-11:54:15.738641 7f7bd45d5e80 RocksDB version: 4.4.0

Wednesday, January 20, 2016

MyRocks vs InnoDB, the insert benchmark and a disk array

This compares MyRocks and InnoDB using the insert benchmark. The test server has a disk array. The workload generates a lot of secondary index maintenance which can stress the server's random IO capacity.

tl;dr - the average insert rate is much better for MyRocks than for InnoDB


The test server has 2 sockets, 8 cores (16 HW threads) per socket, 64GB of RAM and 14 disks with SW RAID 0 and a 2MB RAID stripe.

I tested MyRocks versus InnoDB from MySQL 5.6.26 and 5.7.10 from Oracle. Two configurations were tested for MyRocks. The first is the regular configuration described here. The second is the configuration optimized for load performance and called load-optimized. The load-optimized configuration sets rocksdb_bulk_load=1 to disable unique index checks and uses a smaller memtable to reduce the number of comparisons per insert.

The command line to run the insert benchmark for MyRocks is here. These are links to the my.cnf files for default MyRocks, load-optimized MyRocks, MySQL 5.6 and MySQL 5.7. It is possible that I misconfigured background IO rates for InnoDB. MyRocks is much easier to configure. I have results for 6 configurations:
  • myrocks.def is MyRocks with the default configuration
  • myrocks.opt is MyRocks with the load-optimized configuration
  • mysql56.zlib is MySQL 5.6.26, InnoDB with zlib compression
  • mysql57.zlib is MySQL 5.7.10, InnoDB with zlib compression
  • mysql56.none is MySQL 5.6.26, InnoDB without compression
  • mysql57.none is MySQL 5.7.10, InnoDB without compression


This is a summary of the load performance for people like me who prefer tables. The legend for the table below is:
  • #rows - number of rows inserted. The target is 1B but the test was stopped early for InnoDB because I wasn't willing to wait one week.
  • #secs - number of seconds for which the test ran.
  • avg_ips - average rate for rows inserted per second during the test. Note that the rate declines significantly over time for InnoDB.
  • last_ips - rows inserted per second for the last 100,000 inserts.

My summary of the table below is:
  • I didn't wait for the InnoDB tests to finish as that might have taken more than 1 week.
  • There is a big difference between the insert rates for MyRocks and InnoDB.
  • There is a big difference between the insert rates for the default and load-optimized configurations of MyRocks. This difference is much smaller as the number of insert threads is increased.  The load-optimized configuration is 2.87X faster at 1 thread, 1.48X faster at 4 threads and 1.15X faster at 8 threads.
  • The insert rates for MyRocks increase with the number of insert threads. The results here are from a test with 1 thread. When I repeated a test with 8 threads the insert rate was ~50,000/second for MyRocks while the rates for InnoDB did not improve from those below.

#rows           #secs  avg_ips last_ips        engine
1000000000       96775  10333   10276          myrocks.def
1000000000       25009  39986   30339          myrocks.opt
 788300000      169511   4650    2834          mysql56.zlib
 470200000      164890   2852    1552          mysql57.zlib
 553700000      164989   3356    1925          mysql56.none
 524500000      168829   3107    1690          mysql57.none


This is the insert rate over time for load-optimized MyRocks. There are a few stalls which I have yet to investigate. Otherwise the throughput is almost steady over time.

This is the insert rate for the default configuration of MyRocks. This has fewer stalls than the load-optimized configuration which is reasonable since there is less stress on compaction. Again the insert rate is almost steady over time.

This is the insert rate for all of the InnoDB configurations. The insert rate degrades significantly from the start of the test. I don't understand why MySQL56.zlib has the best throughput. InnoDB page flushing tuning is a complex art.

This is the insert rate over time for the InnoDB configurations using log scale for the y-axis. That makes it easier to see the difference as the rate declines.

Tuesday, January 19, 2016

The advantages of an LSM vs a B-Tree

The log structured merge tree (LSM) is an interesting algorithm. It was designed for disks yet has been shown to be effective on SSD. Not all algorithms grow better with age. A long time ago I met one of the LSM co-inventors, Patrick O'Neil, at the first job I had after graduate school. He was advising my team on bitmap indexes. He did early and interesting work on both topics. I went on to maintain bitmap index code in the Oracle RDBMS for a few years. Patrick O'Neil made my career more interesting.

Performance evaluations are hard. It took me a long time to get expertise in InnoDB, then I repeated that for RocksDB. Along the way I made many mistakes. Advice on doing benchmarks for RocksDB is here and here.

tl;dr - the MyRocks advantage is better compression and less write-amplification

The MyRocks Advantage

There are many benefits of the MyRocks LSM relative to a B-Tree. If you want to try MyRocks the source is on github, and there is a wiki with notes on building, a sample my.cnf and more details on the MyRocks advantage. Track down Yoshinori at the MyRocks tutorial at Percona Live or his talk at FOSDEM to learn more. The advantages include:
  • better compression - on real and synthetic workloads we measure 2X better compression with MyRocks compared to compressed InnoDB. A B-Tree wastes space when pages fragment. An LSM doesn't fragment. While an LSM can waste space from old versions of rows, with leveled compaction the overhead is ~10% of the database size compared to between 33% and 50% for a fragmented B-Tree and I have confirmed such fragmentation in production. MyRocks also uses less space for per-row metadata than InnoDB. Finally, InnoDB disk pages have a fixed size and more space is lost from rounding up the compressed page output (maybe 5KB) to the fixed page size (maybe 8KB).
  • no page reads for secondary index maintenance - MyRocks does not read the old version of a secondary index entry during insert, update or delete maintenance for a non-unique secondary index. So this is a write-only operation compared to read-page, modify-page, write-page for a B-Tree. This greatly reduces the random IO overhead for some workloads (benchmark results to be shared soon).
  • less IO capacity used for persisting changes - a B-Tree does more & smaller writes to persist a database while the MyRocks LSM does fewer & larger writes. Not only is less random IO capacity used, but the MyRocks LSM writes less data to storage per transaction compared to InnoDB as seen in the Linkbench results (look for wKB). In that case the difference was ~10X.
  • less write amplification from flash GC - the write pattern from the MyRocks LSM is friendlier to flash GC compared to InnoDB. This leads to lower write-amplification so that InnoDB was writing up to 18X more data per transaction compared to MyRocks as seen in the Linkbench results.
  • simpler code - RocksDB is much simpler than InnoDB. I have spent years with both -- debugging and reading code. MyRocks still has mutex contention, but the problem is easier to fix because the engine is simpler. New features are easier to add because the engine is simpler. We are moving fast to make MyRocks better.
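The space overheads in the compression bullet above can be combined into a toy model. All numbers are illustrative (a 40% fragmentation overhead in the middle of the 33%-50% range, 5KB compressed output rounded up to an 8KB page), not measurements:

```python
def btree_space(user_data_gb, frag_overhead=0.4, page_kb=8, compressed_kb=5):
    """Rough space model for a compressed B-Tree: compressed page output
    (say 5KB) is rounded up to the fixed page size (say 8KB), and
    fragmentation wastes 33%-50% of the pages (0.4 here)."""
    rounding = page_kb / float(compressed_kb)
    return user_data_gb * rounding * (1 + frag_overhead)

def lsm_space(user_data_gb, old_version_overhead=0.10):
    """With leveled compaction, old row versions cost ~10% extra space."""
    return user_data_gb * (1 + old_version_overhead)

# For the same 100 GB of compressed user data the model gives ~224 GB
# for the B-Tree vs ~110 GB for the LSM, roughly the 2X difference
# measured with Linkbench.
ratio = btree_space(100.0) / lsm_space(100.0)
```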

The MyRocks Disadvantage

There is no free lunch with database algorithms. From theory we can expect better efficiency on writes and worse efficiency on reads with an LSM compared to a B-Tree. But what does that mean in practice? This will take time to figure out, so I try to report IO efficiency metrics -- disk reads, disk KB read and disk KB written per transaction -- in my benchmark results to understand the differences. But at this point the IO efficiency improvements I see from MyRocks are much greater than the IO efficiency losses.

MyRocks is not as feature complete as InnoDB. While the team is moving fast the MySQL storage engine API is hard and getting harder. Missing features include text, geo, native partitioning, online DDL and XA recovery on the master (keep binlog & RocksDB in sync on master). I am not sure that text & geo will ever get implemented. I hope to see XA recovery and partial support for online DDL this year.

MyRocks is not proven like InnoDB. This takes time, a variety of workloads and robust QA. It is getting done but I am not aware of anybody running MyRocks in production today.

An LSM might use more CPU per query because more data structures can be checked to satisfy a query. This can increase the number of CPU cache misses and key comparisons done per query. It is likely to be a bigger deal on CPU-bound workloads and my focus has been on IO-bound. One example of the difference is the tuning that was required to make the MyRocks load performance match InnoDB when using fast storage. Although in that case the problem might be more of an implementation artifact than something fundamental to the LSM algorithm.

An LSM like MyRocks can also suffer from the range read penalty. An LSM might have to check many files to get data for a point or range query. Most LSMs, including MyRocks, use bloom filters to reduce the number of files to be checked during a point query. Bloom filters can eliminate all disk reads for queries of keys that don't exist. We have had some workloads on InnoDB for which 30% of the disk reads were for keys that don't exist.

Bloom filters can't be used on pure range queries. The most frequent query in Linkbench, and the real workload it models, is a short range query. Fortunately this query includes an equality predicate on the leading two columns of a covering secondary index (see the id1_type index in the Linkbench schema) and RocksDB has a feature to create the bloom filter on a prefix of the key (innovation!).
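The idea can be sketched with a toy model. A Python set stands in for the bloom filter (a real filter allows false positives but never false negatives, so the pruning logic is the same), and the key layout mirrors the equality predicate on the leading columns of the id1_type index:

```python
class SstFile:
    """Toy model of an SST file whose filter is built on a key prefix:
    here the first two key components, like (id1, type) in Linkbench."""

    def __init__(self, rows):
        self.rows = rows  # list of ((id1, type, id2), value), sorted by key
        # The "bloom filter" over key prefixes; a real filter is a compact
        # probabilistic structure rather than an exact set.
        self.prefix_filter = {key[:2] for key, _ in rows}

    def may_contain_prefix(self, prefix):
        return prefix in self.prefix_filter

def range_query(files, id1, link_type):
    """For: WHERE id1 = ? AND type = ? ORDER BY id2 -- a range query with
    an equality predicate on the key prefix. Only files whose prefix
    filter matches are read, instead of every file."""
    out = []
    for f in files:
        if f.may_contain_prefix((id1, link_type)):
            out.extend(v for k, v in f.rows if k[:2] == (id1, link_type))
    return out
```

A query for id1=1 skips any file whose filter lacks the (1, 'link') prefix, which is how the prefix bloom filter cuts the number of files checked.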

Even when the prefix bloom filter can't be used the range read penalty can be overstated. It is important to distinguish between logical and physical reads. A logical read means that data is read from a file. A physical read means that a logical read had to use the storage device to get the data. Otherwise a logical read is satisfied from cache. A range query with an LSM can require more logical reads than a B-Tree. But the important question is whether it requires more physical reads.