Tuesday, August 23, 2022

How I do RocksDB performance tests, part 2

This extends my previous post. This post isn't specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work, so it might have an audience of one (me), but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.

Points

  • How did I get here?
    • Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development and spent much less time fixing bugs, but I still spend time reporting them. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing while others specialize in theirs. It is hard to be an expert in too many things.
  • Benchmarks are what you make of them
    • They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more valuable. Benchmarks informed by production are even better.
  • How does code age?
    • Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems, and spotting them early is a big deal because removing them long after they arrive is too expensive. This isn't just a technical problem: how do you reject new features that help a fraction of the user base when the cost is more CPU overhead for all users?
  • Needs improvement
    • I hope to get better at using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use adaptive workloads and enforce response time constraints.
  • Build a network of smart peers
    • I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
  • Explain things
    • Explain your results. But find a balance because explaining everything will slow you down.
  • Testing an LSM is complicated
    • Old posts on this are here and here.
    • The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially read-heavy ones. While this is still a work in progress, there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction).
    • Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is --benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    • Try to get the LSM tree into a steady state before read-heavy tests. Beware of states that look clean but are not steady: after fillseq there is no overlap between SSTs, and after a full compaction there is only one level (or one SST). These are usually not steady states.
    • For a load + query benchmark it is easy for the LSM (or B-Tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won't be fragmented with a B-Tree and the SSTs won't overlap with an LSM, which hides some of the overhead that occurs during query processing. When storage is a locally attached SSD and the workload is IO-heavy then you need to worry about non-determinism from the SSD: you either want no SSD GC or you want SSD GC to reach a steady state (by running the test for long enough and using database files that are large enough, something between 50% and 90% of the device capacity).
  • Make the DBMS unhappy
    • Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
  • Define your goals
    • Do you care about average throughput or outliers (p99, p99.9, p99.99)? I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matter in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs of throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time.
  • Statistical rigor
    • Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I usually settle for less statistical rigor because I frequently repeat benchmark runs after making a mistake or deciding to measure one more thing. Another way to think of this is: assume B units of HW capacity where each benchmark has a warmup cost of W and a runtime of R, then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. For example, with B = 24 hours of machine time, W = 1 hour and R = 5 hours, N = 4. A larger value for N implies a smaller value for R, and the confidence interval is a function of both N and R.
  • Coordinated omission
    • Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don't support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to set up as I need to discover an arrival rate that the DBMS can handle, or I need a more complicated client that can discover it for me. One day I will use open-loop clients.
  • Response time constraints
    • The db_bench benchmark client for RocksDB doesn't have an option to enforce a response time constraint (ignore responses slower than X ms). That is a problem because computing throughput without such a constraint is misleading: more concurrency usually means more throughput, but it also means worse response time and more response time outliers, and those slow responses should not be counted. Most of the benchmark clients that I use don't enforce a response time SLA. An SLA is more work because you need to select a reasonable value, but I hope to add one to db_bench.
  • Single node
    • Most of my testing runs the client and the DBMS on the same host. While I prefer to use separate hosts for the client and the server when the DBMS supports it, that introduces the risk of perf variance because I would be sharing the network.
  • Stable platform
    • I use HW at work, in the public cloud and my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost. Alas, it is also the least capable — 4 CPUs, 16G of RAM.
  • Compare things 
    • I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
  • Measure things
    • Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly, and most DBMS do much work in the background (excluding SQLite, which doesn't have background threads). Don't forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Watch disk space usage because growth there can lead to out-of-space errors. When something is slower, search top down: look at iostat metrics to see if IO/query has changed, look at vmstat to see if CPU/query has changed, and look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics (IO/query, CPU/query, context switches/query); there is a sketch of this after the list. I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks.
  • Summarize things
    • One practice I have is to create one-line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage and normalized HW usage (CPU/query, IO/query). One-line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
  • Name & archive things
    • A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming conventions reduce the chance of this. I try to archive the test scripts and command lines via GitHub, and all of my important scripts are in GitHub. Saving RocksDB LOG files is also important.
  • Adaptive benchmark clients
    • I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad.
  • Proactive vs reactive
    • Is it a bug when it has yet to happen in production? That is an interesting question and the answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding, and other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.
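
The normalization mentioned in Measure things can be scripted. The sketch below is an illustration rather than my actual tooling: the device name (nvme0n1), the 10-second sample interval and passing QPS as an argument are all assumptions.

#!/usr/bin/env bash
# Sketch: normalize IO by query rate. Usage: norm.sh <qps>
# Fields 4 and 6 of /proc/diskstats are reads completed and sectors read.
QPS=${1:?usage: $0 qps}
DEV=nvme0n1
SECS=10
read r1 s1 < <(awk -v d=$DEV '$3 == d {print $4, $6}' /proc/diskstats)
sleep $SECS
read r2 s2 < <(awk -v d=$DEV '$3 == d {print $4, $6}' /proc/diskstats)
# reads/query and KB read/query (a sector is 512 bytes)
echo "reads/query:  $(echo "($r2 - $r1) / ($SECS * $QPS)" | bc -l)"
echo "readKB/query: $(echo "(($s2 - $s1) / 2) / ($SECS * $QPS)" | bc -l)"
# vmstat can be sampled the same way to get CPU/query and context switches/query.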

Friday, August 19, 2022

When is a secondary index-only query not index-only for InnoDB and Postgres?

For Postgres and MySQL/InnoDB an index-only query plan that uses a covering secondary index might still require access to the base row for some of the index entries read from the secondary index. Most of the time index-only really is index-only. But in some cases there will still be fetches of the indexed row and power users should know when this can happen.

This doesn't happen with MySQL/MyRocks.

Postgres

One reason for this is recent write activity. The issue for Postgres is well-documented and widely known -- the visibility map. I won't explain it because experts have done a better job than I can. I don't know whether Postgres has performance counters to monitor the frequency of this.

One interesting variant of this that I encountered with the Insert Benchmark was for insert-only workloads. The problem was that prior to PG 13 autovacuum wasn't triggered by insert-only workloads, thus visibility map bits were not set for pages full of new inserts and index-only scans weren't really index-only for deployments that relied on autovacuum.

The result is that for some index entries read from the secondary index for an index-only query plan, Postgres might have to fetch the row by using the tuple id to read from the heap-organized table. This can result in unexpected random IO. 
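
One way to see this for a single query is EXPLAIN (ANALYZE), which reports Heap Fetches for an Index Only Scan node. The sketch below is an illustration and the table, column and index names are hypothetical:

# Assume a table t with a covering index on (a, b).
psql -c "EXPLAIN (ANALYZE, BUFFERS) SELECT a, b FROM t WHERE a BETWEEN 1 AND 1000"
# The plan shows something like:
#   Index Only Scan using t_a_b on t ...
#     Heap Fetches: 12345
# Heap Fetches > 0 means some index entries still required a visit to the heap.
# A vacuum sets visibility map bits and should drive the count toward zero:
psql -c "VACUUM t"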

InnoDB

The issue for InnoDB is less widely known and not well documented. I don't think there are performance counters to monitor the frequency of this. Long ago, when I supported InnoDB in production, I added some to the FB patch.

The best doc I found in a quick search is this slide deck (thanks Marko from MariaDB) and I will quote parts from it:

  1. secondary indexes have a PAGE_MAX_TRX_ID
  2. PK index entries have a DB_TRX_ID per entry, secondary index entries do not
  3. See here for a description of DB_TRX_ID
  4. Secondary indexes may contain multiple (indexed_col, pk) records pointing to a single pk, one for each value of indexed_col
  5. All versions except at most one must be delete-marked
  6. If a secondary index record is delete-marked, MVCC must look up pk in the PRIMARY index and attempt to construct a version that matches indexed_col, to determine if (indexed_col, pk) exists in the read view
  7. If the PAGE_MAX_TRX_ID is too new, for each record in the secondary index leaf page, we must look up pk
  8. For old read views, secondary index scan can be very expensive!
InnoDB looks up the base row by searching the clustered PK index using the PK columns stored in the secondary index entry.

While the InnoDB docs have some description of this, they aren't as clear as the slide deck above. 

When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.

That page also explains another detail about secondary indexes that many InnoDB users might not know. Secondary index entries are not updated in place. An update is implemented as delete-mark the existing entry, insert a new entry. Eventually purge will physically remove the delete-marked entry after it is no longer visible to current snapshots. Thus InnoDB secondary indexes risk suffering from bloat and churn, which is worse with long-open snapshots that block purge because those delete-marked entries can linger, leading to excessive page splits.
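
A simple way to watch whether purge is keeping up during a benchmark is the history list length reported by SHOW ENGINE INNODB STATUS. This is a sketch; the connection options and the 60-second interval are assumptions.

# Print InnoDB purge lag once per minute while a benchmark runs.
# A growing history list length means purge is falling behind (often due to
# a long-open snapshot) and delete-marked index entries will linger.
while sleep 60; do
  mysql -u root -e "SHOW ENGINE INNODB STATUS\G" | grep "History list length"
done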

MyRocks

I wrote above that MyRocks doesn't have this problem and that is true. But it can have the problems with bloat/churn from secondary index maintenance that I mention above for InnoDB. Fortunately the SingleDelete optimization helps a lot to avoid that as described in the VLDB paper. SingleDelete allows for faster removal of tombstones as normal (Delete) tombstones are frequently not removed until reaching the max level of the LSM tree, and MyRocks uses SingleDelete for secondary index maintenance.

The reason that tombstones are frequently not removed until reaching the max level of the LSM tree is that the removal check chooses to be fast and has many false positives. During compaction when a tombstone is found, if the key for the tombstone is covered by the key ranges of SSTs in larger levels, then it is not dropped because the key might exist in those larger levels. The number of false positives could be reduced at the cost of more CPU, and possibly some IO, if after confirming overlap the bloom filters in those larger levels were probed.

Thursday, August 18, 2022

How I do performance tests for RocksDB

My primary tool for doing RocksDB performance tests is db_bench, but as always there are layers of shell scripts to automate the process. First, I will define two terms. A benchmark is a sequence of benchmark steps. A common pattern for big data benchmarks is load and then query where load and query are the steps. I do small data (OLTP) so the common pattern for me is load and then several read+write steps. To save time the load step can be replaced by copying from a backup or snapshot.
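
A minimal sketch of the snapshot approach for the load step, where the paths and the key count are placeholders rather than values from my scripts:

# Load once, save a copy of the database directory, then restore the copy
# before each run of the read+write steps so every run starts from the same state.
DB=/data/rocksdb; SNAP=/data/rocksdb.snap
./db_bench --benchmarks=fillseq --db=$DB --num=100000000
cp -r $DB $SNAP
# later, before each read+write benchmark:
rm -rf $DB && cp -r $SNAP $DB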

I have shared notes on how to do benchmarks for an LSM with a focus on RocksDB and LevelDB. These are based on the many mistakes I have made and occasionally learned from: see here and here.

The layers of shell scripts are:

  • tools/benchmark.sh - runs one benchmark step by invoking db_bench
  • tools/benchmark_compare.sh - runs a benchmark by invoking a sequence of benchmark steps
  • x.sh - selects configuration options based on HW size then calls benchmark_compare.sh
  • x3.sh - selects configuration options based on workload (IO-bound vs cached) then calls x.sh
I also forked the scripts above to work on older versions of RocksDB (versions 4 and 5). The forks are here and were done to reduce the complexity in the scripts above.

The scripts also let me run a benchmark for many versions of RocksDB. When I test many versions it might take more than a week for x3.sh to finish (tmux and screen help here), but it runs unattended, which saves me time, and in the end I get ASCII-formatted reports that make it easy to compare results across versions.

RocksDB has many configuration options. The benchmark.sh script has good default values for many options, but the values for some options depend on HW size (amount of RAM, number of CPUs) which is one reason there are layers of scripts.

The benchmark tool, db_bench, also has many configuration options to select different benchmark steps (load in key order, point queries, etc). The primary purpose for benchmark_compare.sh is to run the benchmark steps (and configure db_bench to do that) in an order that is useful to me and minimizes noise. For example, if benchmark steps were to follow overwrite then they might inherit a large and varying amount of compaction debt which would cause variance. The sequence of benchmark steps begins here.
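
The pattern can be sketched as a loop that runs one benchmark step at a time in a fixed order. This is an illustration rather than the real benchmark_compare.sh and the step list is simplified:

# Run benchmark steps in a fixed order so read-heavy steps don't inherit a
# varying amount of compaction debt. The real order lives in benchmark_compare.sh.
# benchmark.sh reads its configuration (database directory, etc.) from environment variables.
for step in fillseq readrandom overwrite readwhilewriting; do
  ./tools/benchmark.sh $step
done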

The x.sh script selects values for RocksDB configuration options based on HW capacity. For now I have hardwired this based on known servers that I use (see here) but eventually this will accept CPU and memory sizes as input and go from there.

The x3.sh script is invoked by me. It defines four workloads:
  • byrx
    • byrx is short for cached by RocksDB and the database fits in the RocksDB block cache
  • byos
    • byos is short for cached by OS and the database fits in the OS page cache but is larger than the RocksDB block cache. This simulates fast storage for reads and lets me increase stress on the RocksDB block cache code.
  • iobuf
    • iobuf is short for IO-bound with buffered IO. The database is larger than RAM and RocksDB uses buffered IO.
  • iodir
    • iodir is short for IO-bound with O_DIRECT. The database is larger than RAM and RocksDB uses O_DIRECT for user reads and compaction.
An example command line is:

nohup bash x3.sh 22 no 1800 c30r240 40000000 2000000000 iodir &

Benchmark steps

Note that I have pending changes for benchmark.sh and benchmark_compare.sh that are not yet pushed upstream.

The benchmark steps are:
  • fillseq - loads the database in key order. There is not much compaction debt when this finishes.
  • read-only
    • revrange - reverse range scans, the real use for this is to let compaction catch up
    • fwdrange - forward range scans, this has too much noise, which I have yet to explain
    • readrandom - point queries
    • multireadrandom - more point queries, but enhanced by io_uring
  • fragment the LSM tree, prior to this keys of SST files do not overlap
    • overwritesome - does overwrite with --num set to 10% of the keys
    • flush_mt_l0 - flushes the memtable, flushes the L0 then waits for compaction to catch up
  • read+write - perf is reported for reads, the background writer has a 2MB/s rate limit
    • revrangewhilewriting - short, reverse range scans
    • fwdrangewhilewriting - short, forward range scans
    • readwhilewriting - point queries
  • write-only
    • overwrite - the writer does not have a rate limit. If there were more write-only tests that followed then I would use overwriteandwait which waits for compaction to finish when overwrite ends.
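
As an illustration of the read+write steps, the following approximates what benchmark.sh generates for readwhilewriting: point queries from several threads while a background writer is limited to about 2 MB/s. The key count, duration and thread count are placeholders.

# Point queries with a rate-limited background writer (~2 MB/s)
./db_bench --benchmarks=readwhilewriting \
  --num=40000000 --duration=1800 --threads=16 \
  --benchmark_write_rate_limit=$(( 2 * 1024 * 1024 ))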