Wednesday, November 23, 2016

MyRocks: use less IO on writes to have more IO for reads

Holiday is almost here and I wrote a long blog post on write-efficiency yesterday so this one will be short. A longer version of this is in progress because this is an interesting result for me to explain. We assume that an LSM is less efficient for reads because it is more efficient for writes and it is hard to be optimal for all of read, write & space efficiency.

For real workloads it is complicated and for now I include benchmarks in "real workloads".  Here is one interesting result from my IO-bound tests of Linkbench. The summary is that when you spend less on IO to write back changes then you can spend more on IO to handle user queries. That benefit is more apparent on slower storage (disk array) than on faster storage (MLC NAND flash) because slower storage is more likely to be the bottleneck.

IO-bound Linkbench means that I used a server with 50G of RAM and ran Linkbench with maxid1=1B (1B nodes). The MyRocks database was ~400G and the InnoDB database was ~1.6T. Both MyRocks and InnoDB used MySQL 5.6.26. The workload is IO-heavy and the database working set is not cached.

The interesting result is that the difference between MyRocks and InnoDB becomes larger as storage gets slower. Another way to describe this is that InnoDB loses more performance than MyRocks when moving from faster to slower storage. I assume this is because MyRocks uses less IO capacity for writing back database changes so it has more IO capacity for handling user queries.

                Transactions per second
                MyRocks InnoDB  MyRocks/InnoDB
Disk array      2195    414     5.3
Slow SSD        23484   10143   2.3
Fast SSD        28965   21414   1.4

The random operations per second provided by the storage devices above is approximately 1k for the disk array, 10k for the slow SSD and more than 100k for the fast SSD.


  1. Mark,

    I'm not surprised by results at all. It is not very clear how the logical IO distributed over the data stored but with 50GB you have much more cache fit for RocksDB than for Innodb. Even with LSM I would expect data access is not uniform.

    The less cache misses we get naturally the more CPU time vs number of IOs you will get and as such the slower the storage the larger is impact.

    For example if for System1 you take 10ms of CPU time plus 100 disk IOs per operation and for System2 it is 100ms and 10 disk IO operations than the total time would be a lot difference if you have 10ms for IO or 0.01ms

    From what you reported before RocksDB uses more CPU time per operation compared to uncompressed Innodb.

  2. Continuing to compare InnoDB from 5.6 seems ... a little disingenuous.

    You have complained about this from others in the past, now I complain about it to you.. ;)

    1. Did you learn anything from what I wrote?

      Compared to 5.6 TPS from InnoDB in 5.7 will be better but write-efficiency will not because InnoDB is still a b-tree writing pages back. I know because I have used InnoDB 5.7 for some tests, just not every test. Machine time is limited.

      We run InnoDB from MySQL 5.6 in production so that is my primary target for testing. When MyRocks moves to 5.7 and 8 there will be more comparisons with InnoDB from the same version.

      From slides/marketing put out by Oracle/MySQL it seems that only sysbench and only at extremely high concurrency matters. I am happy to advise if Oracle/MySQL wants to consider other benchmarks like Linkbench. You could run that for InnoDB on 5.7. Even better would be to run at low and high concurrency.

    2. I'm making no assertions on better or worse really, I would just prefer modern benchmarks to actually use modern code - it makes it clearer where we should focus our efforts "now".

      I understand your position of course.. And I understand the IO assertion may not be much different (well, at least significantly - it will, given different changes on for instance back ground purge algorithms etc., no matter if it's just a b-tree or not).

      We do a range of benchmarks of course - not just sysbench, we've shown DBT2 runs too for optimizer enhancements. I like linkbench, but I'm also not the one that's doing benchmark or performance analysis, so I'm not going to speak for them either..

      But anyway, I call this out *particularly* because I've always valued the independent analysis you've done here, I'm just complaining now because I wish that you're not doing it on our modern code and continuing to push improvement there. ;)

    3. You are making broad assumptions about the goodness of 5.7. It isn't strictly better than 5.6. It is frequently better at mid to high concurrency, except for simple SELECT statements. It is frequently worse at low to mid concurrency and at simple SELECT statements.

      5.6 isn't that old. 5.7 is GA, so 5.6 is GA-1.

      With respect to pushing improvement for InnoDB. I retired and my focus is on MyRocks. Eventually I get to stop helping make MySQL better - maybe MongoDB is the next project. I wish the owner of MySQL copyright (MySQL -> Sun -> Oracle) made it easier to be an external hacker.

      I am biased but I think I am still publishing the most objective benchmark results that compare alternative engines.

    4. Still love all your posts, hope that got through in my last comment, just .. sad I guess.. <3

  3. This is a good point. I still find plenty of issues with sysbench but it only covers very small piece of functionality and more benchmarks are needed. I would welcome seeing LinkBench results as well but also others.

    1. I give sysbench credit for helping me find/fix the bug explained at

      It also helped me understand the variance that occurs in sysbench read-only tests with MyRocks. The variance is caused by the state of the memtable and L0 files. QPS from a read-only test that follows a read-write test will be lower when there is a lot of data in the memtable and L0. But when the amount of data there (memtable, L0) isn't enough to trigger a memtable flush or L0->L1 compaction then the state doesn't change.

      So a goal for 2017 is to make MyRocks & RocksDB smarter (adaptive) and trigger compaction/flush early when the workload is read-only or read-heavy for a long period of time.

      So sysbench is useful to me, but I am wary of someone using it with a small database and forming conclusions about MyRocks.