Tuesday, September 21, 2021

Review of DiffKV

This is a review of Differentiated Key-Value Storage Management for Balanced I/O Performance that was published in ATC 2021. This is a wonderful paper that resolves several of my concerns about key-value separation (here and here). The authors have experience with key-value separation via Titan which is part of TiDB.

The key idea in the paper is to use leveled compaction for keys and tiered compaction for values. Write-amplification is reduced by not using leveled compaction for values. Read-amplification is reduced, relative to classic key-value separation (see WiscKey), by using tiered compaction for values rather than logs. With classic key-value separation the worst case read-amp for a scan is the need to do a random read from the log for every qualifying key as that random read can include a block decompression and a storage IO. With DiffKV the use of tiered compaction for values provides some ordering to avoid that worst case.

While I enjoyed the paper, it didn't have performance results with compression enabled and I am curious about the impact of decompression overhead on point and range read latency.

A summary of the approach:

  • small values are stored inline in the LSM tree
  • large values are stored in logs, called vLogs. The paper explains a few improvements to make GC faster in this case. But mostly this is classic key-value separation.
  • medium values use the vTree (tiered compaction for values)
  • the user defines the two size limits that determines whether a value is small, medium or large
The vTree

Medium values are stored in the vTree which resembles an LSM that uses tiered compaction. A vTable is the unit of storage in the vTree. A vTable is 8MB by default and stores KV pairs in key order. A sorted group is a sequence vTables for which the keys don't overlap -- (vTable, sorted group) are similar to an (SST, sorted run) in RocksDB. A vTree has multiple levels where each level has one or more sorted groups and the levels have exponentially increasing size.

This is similar to tiered compaction for two reasons. First, there are multiple sorted groups per level. Second, when merges are done to move values from level N to N+1, the merge process only reads data from level N and then writes it to level N+1. It doesn't read and rewrite data already on level N+1. 

To reduce the number of times that values get rewritten the vTree has fewer levels than the LSM tree that stores keys. The smallest N levels in the key's LSM tree share the smallest vTree level. I assume the remaining levels in the key's LSM tree each have their own level in the vTree.

vTree GC

Keys in the LSM tree point into the vTree. When a value is moved between levels in the vTree the key's entry in the LSM tree must be updated to point to the new location in the vTree. DiffKV couples GC in the vTree with GC in the key's LSM tree to avoid extra writes. By coupling I mean that vTree GC is extra work added to compaction done on the key's LSM tree. When a key is moved from level N to level N+1 in the LSM tree then its value might be moved to the next level in the vTree.

By coupling it like this, DiffKV avoids the need to probe the index (key's LSM tree) to determine whether a value is live -- which is an extra overhead that occurs with classic key-value separation. It also avoids generating extra writes to change the point into the vTree that is stored with the key.

This above is called compaction-triggered merge and driven by compaction of the key's LSM tree. Another reason for doing vTree GC is called scan-triggered and triggered by scans that encounter regions of the vTree that need GC.

I am curious whether there is a need to also trigger GC based on vTables that have excessive space-amplification.

vTree rewrites

Leveled compaction has more write-amp than tiered because it rewrites previously written KV pairs. While vTree GC frequently avoids the need to do rewrites, there is one case where a rewrite is needed and I didn't learn enough about how DiffKV handles this.

There can be space wasted from vTables that have low utilization rates because most of the values in the vTable have been deleted. In this case something should be done to copy out (rewrite) the values to a new vTable. And when values are moved the key LSM tree must be updated to reference their new location. For workloads with updates and deletes this will be needed in the largest level of the vTree and might be needed for smaller levels.

Saturday, August 14, 2021

Happy 15th to Percona

I am thrilled to be celebrate Percona's 15th anniversary. My time with MySQL began about the same time as the founding of Percona. Those years, the mid-2000s, were the dark ages for MySQL. There was doubt about the future of MySQL because there were many things that needed to be made better. Fortunately, upstream and the community, with much help from Percona, rallied to fix the problems and add needed features.

Percona has been a huge part of the community. I don't have time to list everything, but examples include saving the user conference, providing an open-source hot backup solution, educating many of us (including me) via their blog posts and helping push the product to get so much better. I have also been a customer as they helped with the MyRocks effort and in the great patch migration (5.6 to 8.0).

I like that the anniversary book starts by mentioning the multi-core scaling problem that InnoDB had in the mid-2000s. My MySQL deployment used 4-core, 1-socket CPUs at the time and 8-core, 2-socket servers were about to arrive. It was difficult to get more than 10k QPS from MySQL/InnoDB on such hardware regardless of the number of cores or sockets. InnoDB didn't benefit from 2-socket servers because of contention in the custom InnoDB mutex (which was also used to guard state in the custom InnoDB rw-lock). A fix was first proposed by Yasufumi Kinoshita, who would soon join Percona. After seeing his presentation at the user conference, my team at Google proposed a similar solution which was accepted by the InnoDB team and a serious problem was resolved.

I also like that the book is full of stories from people in the community. I know many of these people from my time traveling to conferences. I am an introvert except when at conferences. I rarely observe talks because I am out in the hallway track talking to others.

Wednesday, August 11, 2021

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres and MongoDB and other OSS databases. There is so much innovation that we miss out on - FASTER is one example. All (MySQL, MongoDB, Postgres) have storage engine APIs but there are not many OSS implementations of them.

MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking.

The API for Postgres is the most recent and perhaps that will be the most popular. The API for MySQL is the oldest and has become harder as more requirements (like partitioning) are pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when the API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps.

One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid confusion I will use storage engine for things like WiredTiger, RocksDB and InnoDB and database for the systems that provide query languages, replication and more.

I have been involved in three such engines -- MyRocks, MongoRocks and a read-only MySQL engine used long-ago at Google. MyRocks benefited from a large team and Sergey Petrunya at MariaDB and MongoRocks benefited from amazing support from MongoDB the company.

Two interesting things are:

  • The impact of database design decisions on the storage engine API. See what is needed for transactions in the MongoDB API.
  • The impact of the original storage engine on the storage engine API.
    • MongoDB doesn't take advantage of clustered indexes in WiredTiger or RocksDB. An extra (hidden) index must be used to map between DiskLoc and the PK index. See SERVER-14569.
    • I have more research to do before I understand the impact of vacuum and 32-bit transactions IDs on the Postgres API.

I also wonder whether the RocksDB API could be the universal storage engine API. In theory any storage engine implementing that would be able to replace RocksDB in MyRocks or MongoRocks. The goal is then to implement the RocksDB API glue once per database and then be able to use a variety of storage engines based on that. Of course, XKCD has a great take on standards and this might just be naive and wishful thinking on my part.


Saturday, March 13, 2021

Sequential IO and an LSM tree

I see statements that sequential IO is a benefit of using an LSM tree. That is a vague and truthy statement. It is correct that the writes done by compaction are sequential per-file. But I am interested in the IO from the perspective of the storage device. 

The truthier claim is that with an LSM there will be many streams of IO (read & write) that benefit from large IO requests. The reads and writes done by compaction are sequential per file, but there is much concurrency and the storage device will see many concurrent streams of IO.

The IO patterns for an LSM with a busy workload are:

  • N compaction threads where each thread is reading from ~10 files, merging the results and writing the output to create new SSTs. Each thread has one file open at a time that is written sequentially. The reads are sequential per-file. At any point in time there can be 10*N read streams and N write streams. The reads benefit from large IO requests and prefetch. The writes benefit from async write-back. The writes might benefit from clever placement on the storage device (see multistream). The writes are likely to generate large requests to the storage device.
  • The WAL, if enabled, is another write stream. If fsync is done on commit then stream gets a sequence of small writes.
  • User queries generate read requests. For OLTP these are mostly small (random) requests. But in some cases (logical backup, analytics) there can be scans.

Sunday, March 7, 2021

How to submit a PR for RocksDB

I submitted my first PR to RocksDB as an external contributor and have advice on the process.

  • Read the Contribution Guide
  • Run make format on your diffs. This depends on clang-format and you might need to install it.
  • Edit HISTORY.md if there are user-facing changes
  • Optionally run make check to confirm tests pass. Some tests failed because my small servers have 16G of RAM, the tests use /dev/shm and needed more than the 8G that /dev/shm gets on my servers.
  • Confirm that RocksDB LITE builds still work: LITE=1 make db_bench
  • Some of the CI tests are flaky now and have bogus failures (see here). To get CI tests to repeat I edit and push small changes to HISTORY.md. The team is fixing the intermittent test failures. Alas, one of the intermittent problems was from internal gcc failures on ARM -- something I can't reproduce on my x86 hardware. The good news is that after three attempts everything passes.
  • Google for advice on how to do this in git. I am not a git guru but the advice I find is great

Friday, February 26, 2021

More on read-only benchmarks and an LSM

This adds to my last post on the impact of LSM tree shape on read performance. I updated the test script to add one more stage, l0_and_lmax, that runs after compact_all. A full compaction is done for compact_all to show the benefit (more QPS) of reducing the number of trees to search to 1. For the l0_and_lmax stage overwrites are done to get 6 SSTs in the L0. At this point there is data in the L0 and max level of the LSM tree -- thus the name l0_and_lmax.

The updated results and changes to my test scripts are here. The spreadsheet with data and charts is here.

The summary is that query performance suffers if you do all of the work to fully compact the LSM tree and then put 6 SSTs in the L0. By suffers, I mean it is almost as bad as when the LSM tree has data everywhere (memtable, L0, all levels after the L0). This is visible in the charts below -- compare l0_and_lmax with pre_flush and post_flush.

Charts!



Wednesday, February 24, 2021

Read-only benchmarks with an LSM are complicated

I often see benchmark results for an LSM tree where the shape of the tree is not described. When the results are worse than desired I wonder how much of that is inherent to the LSM tree versus the configuration. It might be easier to make an LSM tree look bad than to make a B-Tree look bad.

I have results from RocksDB for which QPS varies by up to 5X depending on the shape of the LSM tree. The workload is read-only and CPU-bound. The impact of the LSM tree shape on CPU overhead is known but not widely known. Hopefully this helps with that.

Read amplification for an LSM is usually worse than for a B-Tree. But read-amp should be described separately for IO vs CPU because (usually) CPU read-amp is the problem. The unit for IO read-amp is pages read from storage (real storage, not a read from the OS page cache). The unit for CPU read-amp is CPU overhead but some combination of cache lines read, keys compared and bloom filters checked per query might be a good proxy.

The extra CPU read-amp comes from the number of trees in the LSM tree that must be searched on queries. Perhaps LSM tree is a misnomer because it usually has more than one tree, but perhaps not enough trees for LSM forest to be a good name. Is it too late to rename LSM tree to LSM trees? The memtable is one tree, each sorted run in the L0 is another tree, and then each level beyond L0 is one more tree. For the configurations I use it is common for there to be from 10 to 20 trees in one RocksDB instance. While bloom filters help, they don't eliminate the overhead. Even with a bloom filter some searching is done (cost is a function of |values|) and then there is the overhead of the bloom filter check. For too much information on the CPU overhead of a RocksDB query, see this post from me.

I will begin using LSM tree shape to describe the state of the LSM tree where the state includes:

  • number of KV pairs in the active memtable
  • number of immutable memtables waiting to be flushed
  • number of sorted runs in the L0
  • number of levels beyond the L0, and their sizes
Just keep in mind that it is hard to interpret performance results for a CPU-bound and read-only workload without understanding the LSM tree shape. It is easy to make the LSM look better or worse -- on purpose or by accident.

Finally, LSM implementations can get better at this by adapting their LSM tree shape to become more read-friendly when a workload shifts to read-only or read-mostly phases.

Performance summary

The tests I do change the per-query CPU overhead. I directly measure and report throughput but it, per-query response times and per-query CPU overhead are correlated.
  • Flushing the memtable has no impact on throughput
  • The impact from too many trees to search is worse for workloads that don't use bloom filters (like range queries, or any queries when bloom filters don't exist). The worst-case was ~5X for range and point without bloom vs ~2X for point with bloom.
  • There are incremental benefits from compacting the L0, compacting the L1 and doing full compaction. Each of these reduces the number of trees to search.
  • Bloom filters in an LSM aren't O(1). There is a search cost, O(logN), to find per-SST filter
  • Feature request for LSM trees - adapt to be more read-friendly when a workload is read-heavy
  • This is more of an issue for read-only benchmarks because the LSM tree shape is usually stuck in the same state for the duration of the test. That state might be good or bad and this can add noise or misleading results to your tests. For a test with writes the LSM tree is likely to have data in all places (memtable, L0, L1, ...) most of the time which might not help read performance but will reduce read response time variance.
Charts!

A spreadsheet with absolute and relative throughput is here. All results are from the configuration with the per-level fanout set to 8.





My tests

I ran tests for three types of queries: short range queries, point queries with bloom filters and point queries without bloom filters. I repeated the queries for four LSM tree shapes described below. By data is in the memtable I mean that the memtable is at least half full but there were no immutable memtables. By data is in L0 I mean there were six SSTs (sorted runs) in L0. 

  • pre_flush - data is in the memtable, L0 and each level beyond L0
  • post_flush - memtable is empty while L0 and larger levels have data
  • compact_l0 - memtable and L0 are empty while L1 and larger levels have data
  • compact_l1 - memtable, L0 and L1 are empty while L2 and larger levels have data
  • compact_all - all data is in one sorted run on Lmax (the max level of the LSM tree)
The query tests are provided by db_bench. I used seekrandom for range queries with --seek_nexts=8 to fetch 8 values per query. I used readrandom for point queries with and without bloom filters. Elsewhere I call these tests range query, point with bloom and point without bloom.

Tests were repeated after loading 1M, 10M, 100M and 200M 128-byte values. All tests used one thread and were run on my small servers. The servers have 16G of RAM and RocksDB used an 8G block cache. The database fit in the block cache for all tests except for 200M values. Test output, command lines, and a diff to db_bench are here. It wasn't easy to get db_bench to cooperate. Some of the parts of the db_bench diff will be useful for future tests and I hope to improve and submit a PR for it. The useful features include new benchmarks (things that go in --benchmarks=...) including:
  • flush - flushes the memtable
  • compact0 - compacts L0 into L1 (and blocks until done)
  • compact1 - compacts L1 into L2 (and blocks until done)
  • waitforcompaction - a hacky attempt to block until compaction is done. Can be used as --benchmarks=fillrandom,compactall,waitforcompaction,readrandom because compactall doesn't block until compaction is done.
Without the options above it was hard to get the LSM tree shape into a deterministic state. Once I had the new features it took a few experiments to find the correct number of overwrites to do to make both the memtable and L0 almost full (275000), while also avoiding doing too many flushes and triggering an L0->L1 compaction.

One interesting problem was that immediately after fillrandom the number of sorted runs in the L0 was almost always zero regardless of the number of values written. The reason is that I run tests with fsync on commit disabled and writes were so fast that at the end of fillrandom there were always too many sorted runs in L0. Once fillrandom ended a compaction was scheduled and it merged all L0 sorted runs into L1. The workaround is to run db_bench twice -- once to do fillrandom, flush memtables and then compact L0. Then again to run overwrite to put some SSTs in L0 and then do query tests. This is in the r.sh script.

# Step 1
db_bench ... --benchmarks=fillrandom,flush,waitforcompaction,compact0

# Step 2
db_bench ... --use_existing_db=1 --benchmarks=overwrite,waitforcompaction,...

LSM tree shape

Below I describe the LSM tree shape using logical rather than physical names for the levels because an artifact of dynamic leveled compaction is that compaction stats might show L0 followed by L4 where L4 is logically L1.  

The results show something that I hope can be improved. When configuring RocksDB you can reason about the read, write and space amplification tradeoffs but then the following happens as the LSM tree grows and your assumptions might be far from the truth. The problem is visible in the 10M and 100M values tests when the per-level fanout is 10 (the default). They use the same number of levels while the 100M test has 10X more values.

For all tests the target size for L1 is 64M and dynamic leveled compaction was enabled. But the interaction between dynamic leveled and my configuration means that the LSM trees for the 10M and 100M values tests use the same number of levels with a dramatically different sizeof(L1) / sizeof(L0) ratio. That ratio is <= 2 for the 10M values test vs ~10 for the 100M values test. The LSM tree uses one too many levels in the 10M values test. 

LSM tree shape by number of values inserted when per-level fanout is 10 (default):
  • 1M
    • memtable is ~80% full with 38,239 values
    • L0 has 6 sorted runs and 33M of data
    • L1 has 87M of data
  • 10M
    • memtable is ~80% full with 38,199 values
    • L0 has 6 sorted run and 34M of data
    • L1 has 80M of data
    • L2 has 829M of data
  • 100M
    • memtable is ~80% full with 38,240 values
    • L0 has 6 sorted runs and 39M of data
    • L1 has 792M of data
    • L2 has 7.77G of data
  • 200M
    • memtable is ~80% full with 38,215 values
    • L0 has 6 sorted runs and 38M of data
    • L1 has 163M of data
    • L2 has 1.61G of data
    • L3 has 16.12G of data
I then repeated the test with the per-level fanout reduced to 8 via the max_bytes_for_level_multiplier option in db_bench. In this case the number of levels is what I expect.
  • 1M
    • memtable is ~80% full with 38,209 values
    • L0 has 6 sorted runs and 33M of data
    • L1 has 87M of data
  • 10M
    • memtable is ~80% full with 38,241 values
    • L0 has 6 sorted runs and 34M of data
    • L1 has 101M of data
    • L2 has 820M of data
  • 100M
    • memtable is ~80% full with 38,264 values
    • L0 has 6 sorted runs and 39M of data
    • L1 has 126M of data
    • L2 has 1.0G of data
    • L3 has 7.9G of data
  • 200M
    • memtable is ~80% full with 38,215 values
    • L0 has 6 sorted runs with 38M of data
    • L1 has 251M of data
    • L2 has 1.98G of data
    • L3 has 15.83G of data