Small Datum: February 2021

Friday, February 26, 2021

Wednesday, February 24, 2021

Read-only benchmarks with an LSM are complicated

I often see benchmark results for an LSM tree where the shape of the tree is not described. When the results are worse than desired I wonder how much of that is inherent to the LSM tree versus the configuration. It might be easier to make an LSM tree look bad than to make a B-Tree look bad.

I have results from RocksDB for which QPS varies by up to 5X depending on the shape of the LSM tree. The workload is read-only and CPU-bound. The impact of the LSM tree shape on CPU overhead is known but not widely known. Hopefully this helps with that.

Read amplification for an LSM is usually worse than for a B-Tree. But read-amp should be described separately for IO vs CPU because (usually) CPU read-amp is the problem. The unit for IO read-amp is pages read from storage (real storage, not a read from the OS page cache). The unit for CPU read-amp is CPU overhead but some combination of cache lines read, keys compared and bloom filters checked per query might be a good proxy.

The extra CPU read-amp comes from the number of trees in the LSM tree that must be searched on queries. Perhaps LSM tree is a misnomer because it usually has more than one tree, but perhaps not enough trees for LSM forest to be a good name. Is it too late to rename LSM tree to LSM trees? The memtable is one tree, each sorted run in the L0 is another tree, and then each level beyond L0 is one more tree. For the configurations I use it is common for there to be from 10 to 20 trees in one RocksDB instance. While bloom filters help, they don't eliminate the overhead. Even with a bloom filter some searching is done (cost is a function of |values|) and then there is the overhead of the bloom filter check. For too much information on the CPU overhead of a RocksDB query, see this post from me.

I will begin using LSM tree shape to describe the state of the LSM tree where the state includes:

number of KV pairs in the active memtable
number of immutable memtables waiting to be flushed
number of sorted runs in the L0
number of levels beyond the L0, and their sizes

Just keep in mind that it is hard to interpret performance results for a CPU-bound and read-only workload without understanding the LSM tree shape. It is easy to make the LSM look better or worse -- on purpose or by accident.

Finally, LSM implementations can get better at this by adapting their LSM tree shape to become more read-friendly when a workload shifts to read-only or read-mostly phases.

Performance summary

The tests I do change the per-query CPU overhead. I directly measure and report throughput but it, per-query response times and per-query CPU overhead are correlated.

Flushing the memtable has no impact on throughput
The impact from too many trees to search is worse for workloads that don't use bloom filters (like range queries, or any queries when bloom filters don't exist). The worst-case was ~5X for range and point without bloom vs ~2X for point with bloom.
There are incremental benefits from compacting the L0, compacting the L1 and doing full compaction. Each of these reduces the number of trees to search.
Bloom filters in an LSM aren't O(1). There is a search cost, O(logN), to find per-SST filter
Feature request for LSM trees - adapt to be more read-friendly when a workload is read-heavy
This is more of an issue for read-only benchmarks because the LSM tree shape is usually stuck in the same state for the duration of the test. That state might be good or bad and this can add noise or misleading results to your tests. For a test with writes the LSM tree is likely to have data in all places (memtable, L0, L1, ...) most of the time which might not help read performance but will reduce read response time variance.

Charts!

A spreadsheet with absolute and relative throughput is here. All results are from the configuration with the per-level fanout set to 8.

My tests

I ran tests for three types of queries: short range queries, point queries with bloom filters and point queries without bloom filters. I repeated the queries for four LSM tree shapes described below. By data is in the memtable I mean that the memtable is at least half full but there were no immutable memtables. By data is in L0 I mean there were six SSTs (sorted runs) in L0.

pre_flush - data is in the memtable, L0 and each level beyond L0
post_flush - memtable is empty while L0 and larger levels have data
compact_l0 - memtable and L0 are empty while L1 and larger levels have data
compact_l1 - memtable, L0 and L1 are empty while L2 and larger levels have data
compact_all - all data is in one sorted run on Lmax (the max level of the LSM tree)

The query tests are provided by db_bench. I used seekrandom for range queries with --seek_nexts=8 to fetch 8 values per query. I used readrandom for point queries with and without bloom filters. Elsewhere I call these tests range query, point with bloom and point without bloom.

Tests were repeated after loading 1M, 10M, 100M and 200M 128-byte values. All tests used one thread and were run on my small servers. The servers have 16G of RAM and RocksDB used an 8G block cache. The database fit in the block cache for all tests except for 200M values. Test output, command lines, and a diff to db_bench are here. It wasn't easy to get db_bench to cooperate. Some of the parts of the db_bench diff will be useful for future tests and I hope to improve and submit a PR for it. The useful features include new benchmarks (things that go in --benchmarks=...) including:

flush - flushes the memtable
compact0 - compacts L0 into L1 (and blocks until done)
compact1 - compacts L1 into L2 (and blocks until done)
waitforcompaction - a hacky attempt to block until compaction is done. Can be used as --benchmarks=fillrandom,compactall,waitforcompaction,readrandom because compactall doesn't block until compaction is done.

Without the options above it was hard to get the LSM tree shape into a deterministic state. Once I had the new features it took a few experiments to find the correct number of overwrites to do to make both the memtable and L0 almost full (275000), while also avoiding doing too many flushes and triggering an L0->L1 compaction.

One interesting problem was that immediately after fillrandom the number of sorted runs in the L0 was almost always zero regardless of the number of values written. The reason is that I run tests with fsync on commit disabled and writes were so fast that at the end of fillrandom there were always too many sorted runs in L0. Once fillrandom ended a compaction was scheduled and it merged all L0 sorted runs into L1. The workaround is to run db_bench twice -- once to do fillrandom, flush memtables and then compact L0. Then again to run overwrite to put some SSTs in L0 and then do query tests. This is in the r.sh script.

# Step 1
db_bench ... --benchmarks=fillrandom,flush,waitforcompaction,compact0

# Step 2
db_bench ... --use_existing_db=1 --benchmarks=overwrite,waitforcompaction,...

LSM tree shape

Below I describe the LSM tree shape using logical rather than physical names for the levels because an artifact of dynamic leveled compaction is that compaction stats might show L0 followed by L4 where L4 is logically L1.

The results show something that I hope can be improved. When configuring RocksDB you can reason about the read, write and space amplification tradeoffs but then the following happens as the LSM tree grows and your assumptions might be far from the truth. The problem is visible in the 10M and 100M values tests when the per-level fanout is 10 (the default). They use the same number of levels while the 100M test has 10X more values.

For all tests the target size for L1 is 64M and dynamic leveled compaction was enabled. But the interaction between dynamic leveled and my configuration means that the LSM trees for the 10M and 100M values tests use the same number of levels with a dramatically different sizeof(L1) / sizeof(L0) ratio. That ratio is <= 2 for the 10M values test vs ~10 for the 100M values test. The LSM tree uses one too many levels in the 10M values test.

LSM tree shape by number of values inserted when per-level fanout is 10 (default):

memtable is ~80% full with 38,239 values
L0 has 6 sorted runs and 33M of data
L1 has 87M of data

memtable is ~80% full with 38,199 values
L0 has 6 sorted run and 34M of data
L1 has 80M of data
L2 has 829M of data

100M

memtable is ~80% full with 38,240 values
L0 has 6 sorted runs and 39M of data
L1 has 792M of data
L2 has 7.77G of data

200M

memtable is ~80% full with 38,215 values
L0 has 6 sorted runs and 38M of data
L1 has 163M of data
L2 has 1.61G of data
L3 has 16.12G of data

I then repeated the test with the per-level fanout reduced to 8 via the max_bytes_for_level_multiplier option in db_bench. In this case the number of levels is what I expect.

memtable is ~80% full with 38,209 values
L0 has 6 sorted runs and 33M of data
L1 has 87M of data

memtable is ~80% full with 38,241 values
L0 has 6 sorted runs and 34M of data
L1 has 101M of data
L2 has 820M of data

100M

memtable is ~80% full with 38,264 values
L0 has 6 sorted runs and 39M of data
L1 has 126M of data
L2 has 1.0G of data
L3 has 7.9G of data

200M

memtable is ~80% full with 38,215 values
L0 has 6 sorted runs with 38M of data
L1 has 251M of data
L2 has 1.98G of data
L3 has 15.83G of data

Thursday, February 18, 2021

Still not finding CPU regressions in RocksDB

I repeated the search for CPU regressions in recent RocksDB releases using a m5ad.12xlarge server from AWS. I was happy to not find problems, just as I did not find any on a small servers.

I setup the AWS host using this script and then ran the all3.sh script for 1, 4, 8 and 16 concurrent clients. I forgot to save the command line, but the only thing I changed was the number of background threads that RocksDB can use (increased from 2 to 8).

The AWS host has two direct attached NVMe drives and I used one for the database directory with XFS and discard enabled. I also disabled hyperthreads when I setup the host.

Results are in github for one, four, eight and sixteen threads. The output is created by the rep_all4.sh script and has 3 sections. The first is the response time/operation per test. The second is the throughput per test. The third is the relative throughput per test (relative to RocksDB version 6.4). The versions tested are (v64=6.4, v611=6.11, ..., v617=6.1.7).

Below I only show the output from the third section. A value > 1 means the new version has more throughput than RocksDB version 6.4. The values tend to be close to 1 and I don't see significant regressions.

My focus is on the first two and last 5 tests and the others are read-only and can suffer from noise because the state of the LSM tree isn't deterministic (data in memtable and L0 can vary). In the long run I hope for options in db_bench to flush the memtable and compact L0 into L1 to reduce this problem.

Results for 1 thread

v64 v611 v612 v613 v614 v615 v616 v617 test
1 0.99 0.99 0.99 0.99 1.00 0.98 1.00 fillrandom
1 0.97 0.96 0.97 0.96 0.97 0.97 0.97 overwrite
1 0.89 0.94 0.95 1.01 0.94 0.90 0.91 readseq
1 0.79 0.77 0.82 0.80 0.81 0.69 0.76 readrandom
1 0.98 1.00 0.99 0.98 1.02 0.67 0.99 seekrandom
1 0.89 0.98 0.96 0.95 0.96 0.89 0.94 readseq
1 0.85 0.79 0.80 0.79 0.81 0.80 0.82 readrandom
1 0.92 0.97 0.96 0.96 1.00 1.09 0.98 seekrandom
1 0.96 0.98 0.96 0.91 0.97 1.10 0.97 seekrandom
1 0.96 0.96 0.96 0.97 1.00 1.07 0.96 seekrandom
1 1.01 1.04 1.05 1.00 1.07 1.09 0.99 seekrandom
1 1.06 1.01 0.87 0.98 0.95 1.03 1.00 readwhilewriting
1 0.84 0.94 0.98 1.03 0.96 0.94 0.97 seekrandomwhilewriting
1 0.97 1.03 0.97 0.98 0.99 0.97 0.96 seekrandomwhilewriting
1 0.93 0.97 0.93 0.97 0.94 0.97 0.99 seekrandomwhilewriting
1 0.95 0.97 1.02 0.95 1.01 1.01 1.01 seekrandomwhilewriting

Results for 4 threads

v64 v611 v612 v613 v614 v615 v616 v617 test
1 0.99 0.98 0.99 0.98 0.99 0.98 0.98 fillrandom
1 0.99 1.07 1.09 1.03 1.06 1.02 1.02 overwrite
1 0.92 0.94 0.94 0.97 0.96 0.95 0.93 readseq
1 0.84 0.90 0.90 0.90 0.74 0.74 0.99 readrandom
1 0.72 1.01 0.98 1.01 0.72 0.84 1.04 seekrandom
1 0.90 0.97 0.90 0.89 0.88 0.90 0.90 readseq
1 0.87 0.94 0.94 0.95 1.03 0.86 0.95 readrandom
1 0.75 1.06 0.99 1.01 1.11 0.86 1.03 seekrandom
1 0.73 1.02 0.97 0.97 1.07 0.79 0.95 seekrandom
1 0.76 1.02 0.96 1.02 1.12 0.85 1.03 seekrandom
1 0.78 0.95 0.96 0.96 1.05 0.82 0.96 seekrandom
1 0.94 0.93 0.91 0.94 0.95 0.94 0.96 readwhilewriting
1 0.96 0.92 0.94 0.94 0.94 0.92 0.97 seekrandomwhilewriting
1 0.98 0.94 0.98 0.96 0.98 0.97 0.97 seekrandomwhilewriting
1 0.97 0.93 1.04 0.98 0.98 0.95 0.96 seekrandomwhilewriting
1 0.96 1.02 1.00 1.02 0.98 1.01 0.98 seekrandomwhilewriting

Results for 8 threads

v64 v611 v612 v613 v614 v615 v616 v617 test
1 1.01 0.96 0.97 0.99 0.97 0.97 0.99 fillrandom
1 0.74 0.84 1.07 1.01 1.00 0.99 0.99 overwrite
1 0.89 0.89 1.09 1.06 1.10 1.08 1.08 readseq
1 1.01 1.01 0.89 0.97 1.01 0.72 0.88 readrandom
1 1.17 0.99 1.02 0.94 0.98 0.80 0.94 seekrandom
1 1.01 1.06 1.01 1.01 0.99 0.98 0.97 readseq
1 0.98 0.99 0.91 0.96 0.92 0.81 0.92 readrandom
1 1.13 0.92 0.95 0.88 0.92 0.74 0.89 seekrandom
1 1.12 0.94 0.96 0.91 0.96 0.77 0.91 seekrandom
1 1.15 0.96 1.00 0.94 0.95 0.79 0.92 seekrandom
1 1.04 0.96 0.91 0.92 0.92 0.81 0.90 seekrandom
1 0.97 0.97 0.99 0.99 0.98 1.02 1.00 readwhilewriting
1 0.98 1.01 0.98 0.96 0.97 0.97 0.98 seekrandomwhilewriting
1 0.95 0.94 0.95 0.96 0.95 0.92 0.94 seekrandomwhilewriting
1 0.94 0.97 0.96 0.96 0.95 0.95 0.98 seekrandomwhilewriting
1 0.98 1.01 0.98 0.96 0.96 0.94 0.92 seekrandomwhilewriting

Results for 16 threads

v64 v611 v612 v613 v614 v615 v616 v617 test
1 0.99 0.98 0.98 0.95 0.97 0.96 0.99 fillrandom
1 0.93 0.95 1.41 1.29 1.32 1.30 1.32 overwrite
1 0.95 0.99 1.12 1.13 1.12 1.09 1.12 readseq
1 1.07 0.85 0.97 0.95 0.88 0.95 0.91 readrandom
1 1.03 1.00 1.00 1.03 0.96 1.02 1.02 seekrandom
1 0.98 1.06 0.97 0.97 0.96 0.97 0.94 readseq
1 1.04 0.88 1.03 1.01 0.97 0.96 1.05 readrandom
1 0.99 1.00 0.98 0.99 0.94 0.97 0.99 seekrandom
1 1.00 1.00 0.99 1.00 0.93 0.98 1.00 seekrandom
1 1.00 0.99 0.97 1.01 0.93 0.97 0.98 seekrandom
1 0.98 0.98 0.98 0.97 0.93 0.94 0.95 seekrandom
1 0.93 0.97 0.97 0.97 0.94 0.96 0.93 readwhilewriting
1 0.97 0.97 0.92 0.96 0.99 0.97 0.98 seekrandomwhilewriting
1 0.95 0.97 0.96 0.97 0.95 0.98 0.94 seekrandomwhilewriting
1 0.97 1.00 0.99 0.97 0.99 0.99 0.94 seekrandomwhilewriting
1 0.97 1.01 1.00 1.02 1.00 0.98 0.99 seekrandomwhilewriting

Monday, February 15, 2021

Not finding CPU regressions across RocksDB releases

I looked for CPU regressions in recent RocksDB releases and was happy to not find them.

My workload was low concurrency (1 or 2 threads), an in-memory workload and a small server. I tested RocksDB versions 6.4 and 6.11 through 6.17. My servers use Ubuntu 20.04 which has g++ version 9.3. I wasn't able to compile versions prior to 6.4 because there were compiler errors that I didn't try to resolve. Such is the cost of using modern C++.

Running the tests

I used the all3.sh, run3.sh and rep_all3.sh scripts from my github repo. The scripts do two things for me. First, they handles changes to the db_bench options across RocksDB versions. Second, all3.sh runs tests in a sequence that is interesting to me. I need to update all3.sh for changes in the 6.X branch with respect to the db_bench options.

I forgot to do ulimit -n 50000 prior to running tests and repeated tests after fixing that. I have forgotten to do that many times in the past.

I ran the test in two modes: not-cached and cached. By not-cached I mean that the database is larger than the RocksDB block cache and cached means the database fits in the RocksDB block cache. In both cases all data is in the OS page cache.

For the benchmarks:10M KV pairs were inserted in the initial load, each test step was run for 300 seconds and the LSM tree was made small (8M write buffer, 32M L1) to get more levels in the LSM tree.

Command lines that are useful to me, but the scripts might be inscrutable to you:

# To run for cached
bash run3.sh 100000000 64 300 8 32 $(( 10 * 1024 * 1024 ))
# To run for not-cached
bash run3.sh 100000000 64 300 8 32 $(( 1 * 1024 * 1024 ))
# To generate summaries for response time and throughput
bash rep_all3.sh v64 v611 v612 v613 v614 v615 v616 v617

Results

While there are many test steps (test step == one run of db_bench) the most interesting are the first two (fillrandom, overwrite) and the last 5 (readwhilewriting, seekrandomwhilewriting with different range sizes). Results can be misleading for the read-only tests that are done in between these as the performance on them depends on the shape of the LSM tree. The amount of data in the memtable, L0 and L1 isn't deterministic and can have a large impact on the CPU overhead for queries. In MyRocks I reduce the impact from this by flushing the memtable and compacting the L0 but db_bench doesn't have options for that (yet). It does have an option to do a full compaction but that is too much for me.

So I will share the results for all test steps but focus on the first 2 and last 5. There aren't significant regressions from v64 to v617. Results with a larger font and numbers for both response time and throughput are in github for cached and not-cached.

This has the QPS from the cached test:

v64 v611 v612 v613 v614 v615 v616 v617 test

357763 339811 329584 342793 339205 347598 339990 348218 fillrandom
344379 344567 340444 330834 325129 333669 331436 347029 overwrite
2817650 2779196 2877354 2886832 2887711 2774030 2710212 2823124 readseq
159063 129527 121106 123190 121325 134453 120146 137377 readrandom
77047 73334 64849 71194 49592 93552 64414 98480 seekrandom
3760630 3319862 3436593 3424777 3416794 3468542 3348936 3419843 readseq
216189 168233 170755 167659 177929 194189 170752 197241 readrandom
77176 74307 67099 73279 83014 95671 66752 97165 seekrandom
76688 73207 65771 71620 80992 93068 64924 94364 seekrandom
67994 65093 59306 65372 72698 80427 59296 83996 seekrandom
35619 33270 32388 34049 36375 38317 32091 38652 seekrandom
155204 151360 150730 151218 149980 150653 150261 148748 readwhilewriting
57080 54777 56317 55931 55271 55564 55581 56334 seekrandomwhilewriting
56184 53540 54838 54445 54450 54633 54465 54410 seekrandomwhilewriting
51143 49391 50092 50338 49548 50055 49486 50481 seekrandomwhilewriting
29553 27373 28491 28734 28586 28242 27853 28267 seekrandomwhilewriting

And this has the QPS from the not-cached test:

v64 v611 v612 v613 v614 v615 v616 v617 test
349918 349072 341224 347164 348470 340888 347850 334909 fillrandom
344040 327776 334852 332857 336480 343888 339678 332415 overwrite
2888170 2704291 2869560 2838685 2708847 2630220 2743535 2634374 readseq
167660 133981 130999 112923 120657 120273 87018 121126 readrandom
79615 58025 66542 49269 71643 71525 94862 71959 seekrandom
3784203 3284938 3411521 3404096 3414857 3409997 3448335 3366118 readseq
222893 165198 172372 175113 169132 174096 190337 166636 readrandom
80397 59224 67565 83540 73345 73666 94354 73855 seekrandom
78815 58232 65396 81865 72491 71938 92648 72689 seekrandom
70153 52933 60468 73907 64593 64626 80768 65317 seekrandom
36654 29389 32587 36881 34236 33753 38028 33618 seekrandom
154127 150561 150021 151168 148856 149113 149967 150643 readwhilewriting
57050 55440 55498 55576 55258 56255 55440 55162 seekrandomwhilewriting
56178 54348 55160 54893 54251 54699 54177 54770 seekrandomwhilewriting
51651 49415 50537 50627 49488 49281 49268 50172 seekrandomwhilewriting
29567 27303 28346 28359 28376 27969 27932 28204 seekrandomwhilewriting