Monday, September 19, 2022

Understanding some jemalloc stats for RocksDB

I have been struggling to understand the root cause for RSS being larger than expected during my usage of db_bench. In this case db_bench used jemalloc. Statistics for jemalloc are written to LOG every stats_dump_period_sec seconds and this is an abbreviated example of that output -- I hacked db_bench to disable the mutex related stats in the call to malloc_stats_print.

To keep this post short and because I am not a jemalloc expert I will only try to explain a few things in this output that might help you, and me in a few years when I have to understand jemalloc again.

The first is the number of arenas listed on the opt.arenas line. In this case there were 320 and I assume that jemalloc uses 4 arenas per CPU (the server had 80 CPUs). There is also a line with Arenas: 321 which is confusing but the difference isn't a big deal to me. Arenas are a way to shard the allocator to reduce mutex contention, with threads (hopefully) assigned to different arenas. In my usage with db_bench the threads are long-lived and I assume the mapping of thread to arena doesn't change. I don't know whether that is a potential source of perf problems.

The next interesting line starts with Allocated: and displays a few counters for currently allocated memory. The value next to Allocated: is ~51GB which is reasonable given I configured RocksDB to use 50GB for the block cache. 

This is followed by an interesting section that starts with Merged arena stats. This has various counters aggregated over all 320 arenas. Start with the assigned threads: 10 line. That means that 10 threads are currently assigned to arenas. Per-arena stats might also be useful, but with this many arenas they would require a lot of output to print and read, and I have not been able to get malloc_stats_print to return results for each of the 320 arenas (it stops after 4), so I prefer using malloc_stats_print(..., opts="a") to disable per-arena stats. If you use opts="xa" then per-arena and mutex counters are not displayed which makes the output easier to read.

Next are lines that start with small: and large: to show the amount of memory currently allocated for small and large allocation sizes. Most of the allocations in this case are small and <= 8kb.

The next section has counters for active, mapped, retained and more. These are frequently interesting but not so much in this case.

Given that most allocations are small the following section for bins: is the most interesting. The bin with size 8 is used for all allocations of <= 8 bytes. The bin with size 16 is used for all allocations between 9 and 16 bytes. This section has long lines and clicking on Raw improves readability but I can't link to lines in the Raw view.

I wanted to know the bins that accounted for the largest amount of currently allocated memory. Sorting these lines by curslabs * pgs would provide that. Alas I can't sort that in the gist so I just look for the lines with the largest value in curslabs. The curslabs column is the number of slabs (slab is better explained by an expert, not me) and pgs is the size of a slab in pages. A page is 4kb in this case so curslabs * pgs * 4096 is the number of bytes currently mapped for a given bin.
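
That arithmetic, as a sketch (assuming the 4kb page size from this output; the function name and example values are mine):

```python
PAGE_SIZE = 4096  # the jemalloc page size on this server

def bin_mapped_bytes(curslabs, pgs):
    # curslabs is the number of slabs for the bin and pgs is the
    # slab size in pages
    return curslabs * pgs * PAGE_SIZE

# a hypothetical bin with 1000 slabs that are 2 pages each
print(bin_mapped_bytes(1000, 2))  # 8192000 bytes (~8MB)
```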

In this case the bins using the most memory have sizes 32, 96, 448 and 8192. Note that this is from db_bench configured to use Integrated BlobDB with 400-byte blob values and blob values share the block cache:

  • bins of size 32, 80 (not in this case though) and 96 are commonly used. Their usage was described to me by an expert: the 32-byte bucket corresponds to the blob-owning object, the 80-byte bucket to the block-owning object, and the 96-byte bucket to the cache metadata (an overhead that applies to both blocks and blobs)
  • the bin of size 448 is used by the 400-byte blob values stored in the RocksDB block cache
  • the bin of size 8192 is used by the uncompressed data blocks that are approximately 8192 bytes each because the block_size option was set to 8192.
The util column is also interesting. This lists the utilization fraction, a value between 0 and 1. Multiply by 100 to get the utilization percentage. I read the source code to learn how it is computed. It is curregs / (regs * curslabs) where curregs is the number of currently used regions, regs is the number of regions per slab and curslabs is the number of slabs.
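
A sketch of that computation, with names that mirror the stats output (the example values are hypothetical):

```python
def bin_util(curregs, regs, curslabs):
    # curregs = regions currently used, regs = regions per slab,
    # curslabs = number of slabs; the result is a fraction in [0, 1]
    return curregs / (regs * curslabs)

# a hypothetical bin: 512 regions per slab, 100 slabs, 35840 regions in use
print(bin_util(35840, 512, 100))  # 0.7
```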

In this example the util value is low (~0.7) for bins of size 32, 96 and 448 but large (1) for the bin of size 8192. To me this indicates there is fragmentation for the bins of size 32, 96 and 448, and removing that fragmentation would reduce RSS and improve memory efficiency. Alas I have yet to figure out how to make that happen. The root cause with respect to RocksDB is that at test start the blob values get a larger fraction of the block cache and over time that fraction is reduced. The difference between the peak memory usage near startup and the usage over time became significant (by my standards) as RSS was ~1.15X the size of the RocksDB block cache.

Many of the columns are explained in the man page, however some of the column names used in the malloc_stats_print output differ from what is used in the source code and explained in the man page. One example is slab_size in the man page vs pgs in the malloc_stats_print output, where pgs * 4096 = slab_size. The source for printing per-arena metrics is in stats_arena_print, but you can find curslabs in the man page.

Thursday, September 15, 2022

Configuring the RocksDB block cache

I learn by making mistakes and the rate at which RocksDB improves makes it easier for me to make mistakes. My mistake here was reading that RocksDB has high and low priority classes for the block cache and assuming there was a medium priority. There is a third priority but its name is bottom and it is lower than low priority.

RocksDB has had an option to do key-value separation for many years. That was recently rewritten so it could be feature complete with the RocksDB API. The official name for the feature is Integrated BlobDB but I just call it BlobDB.

I am excited about BlobDB and have begun testing it after adding support for it to tools/benchmark.sh. But it brings a new challenge with respect to the RocksDB block cache that stores uncompressed data, index and filter blocks in memory. 

Thanks to a great intern project, with BlobDB there is now another thing that can be stored in the block cache -- blob values. The db_bench option named use_shared_block_and_blob_cache shows how that is enabled. There are 3 priority classes that can be used by the block cache: high, low and bottom.

The options that control this are:

  • high_pri_pool_ratio
    • The fraction of the cache devoted to filter and index blocks.
  • low_pri_pool_ratio 
    • When zero then data blocks and blob values have the same priority. Otherwise, this is the fraction of the cache reserved for data blocks, and blob values get the remainder (1 - high_pri_pool_ratio - low_pri_pool_ratio). The sum of high_pri_pool_ratio and low_pri_pool_ratio should be <= 1.
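
As an illustration of how these two ratios split the cache, here is a sketch (pool_bytes is a hypothetical helper, not a RocksDB API; the real accounting is inside the LRU cache implementation):

```python
def pool_bytes(capacity_gb, high_pri_pool_ratio, low_pri_pool_ratio):
    # split the cache capacity into high, low and bottom priority pools
    assert high_pri_pool_ratio + low_pri_pool_ratio <= 1
    high = round(capacity_gb * high_pri_pool_ratio)
    low = round(capacity_gb * low_pri_pool_ratio)
    return {"high": high, "low": low, "bottom": capacity_gb - high - low}

# a 50GB cache: 10% for index/filter, 70% for data blocks, rest for blobs
print(pool_bytes(50, 0.1, 0.7))  # {'high': 5, 'low': 35, 'bottom': 10}
```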

The priority for storing things in the cache is one of the following depending on how you configure it:

  • everything has the same priority
    • when high_pri_pool_ratio=0 and low_pri_pool_ratio=0
  • index & filter blocks have high priority, data blocks and blob values have low priority
    • when high_pri_pool_ratio > 0 and low_pri_pool_ratio=0
  • index & filter blocks have high-pri, data blocks have low-pri, blob values have bottom-pri
    • when high_pri_pool_ratio > 0 and low_pri_pool_ratio > 0
  • data, index & filter blocks have low priority, blob values have bottom priority
    • when high_pri_pool_ratio=0 and low_pri_pool_ratio > 0
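
The four cases above can be sketched as a decision table (illustrative only; the real behavior is implemented in the RocksDB block cache code):

```python
def cache_priorities(high_pri_pool_ratio, low_pri_pool_ratio):
    # map each block type to its priority class for the four cases
    high, low = high_pri_pool_ratio > 0, low_pri_pool_ratio > 0
    if not high and not low:
        return {"index_filter": "same", "data": "same", "blob": "same"}
    if high and not low:
        return {"index_filter": "high", "data": "low", "blob": "low"}
    if high and low:
        return {"index_filter": "high", "data": "low", "blob": "bottom"}
    return {"index_filter": "low", "data": "low", "blob": "bottom"}

print(cache_priorities(0.5, 0.2))
```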

How to ask questions

This is a follow-up to my post on how to answer questions. The audience for that post was people doing something in production that others have a strong opinion about. The audience for this post is people with strong opinions about what others are doing in production.

Only one of these is a great way to start a discussion:

  • You are doing X, you should be doing Y.
    • This isn't even a question and it isn't a productive way to start a discussion. With respect to supporting a social graph OLTP workload I have been told on separate occasions that I should be using an in-memory DBMS, an OLAP DBMS, a graph DBMS and a NoSQL DBMS (ScyllaDB). In two of those cases the people offering that advice were quite famous. Alas, nobody offering that advice followed up and submitted a PR for Linkbench to show the benefits of such a change.
  • Why are you doing X!
    • Still not a question. Still not a great way to start a discussion.
  • Why are you doing X?
    • We have a winner. I still remember being quizzed by Peter Bailis, while at HPTS, about the social graph OLTP workload I supported. I did my best to answer his questions. It was a great discussion and I was impressed by his curiosity.

Wednesday, September 14, 2022

Using SPIN to validate the InnoDB read-write lock

I hope to begin learning how to use TLA+ and that reminded me of prior work I did with SPIN.

I have written two blog posts (here and here) about my usage of SPIN to validate concurrent code. The most important usage was for the rewrite of the InnoDB read-write lock that my team did at Google. It was important because that model helped to convince the InnoDB team to accept the change.

Within Google the project was mostly done by Ben Handy and I helped. This was ~15 years ago so my memory is vague. I remember reading the SPIN book to learn how to write a model for the algorithm. I also remember finding a machine with a lot of RAM and then doing what I could to reduce the memory requirements to run the model. Reducing the amount of concurrency tested was one such change. Finally, I remember enjoying SPIN.

I didn't share the model before but a friend, and InnoDB/TiDB expert, provided me with a copy. That model can now be read here.

Tuesday, August 23, 2022

How I do RocksDB performance tests, part 2

This extends my previous post and isn't specific to RocksDB. It also has more opinions and might serve as speaker notes were I to write slides. I am writing this prior to speaking to peers about my work so it might have an audience of one (me) but that is typical of many of my posts. Regardless, I appreciate that people read and engage with some of my posts.

Points

  • How did I get here?
    • Long ago I worked on DBMS internals full time - I added features, fixed bugs and created bugs. Then I moved to web-scale MySQL at Google and started to spend time in production. Production is a great education but it came at the cost of less time for new features. I still spent much time finding and fixing bugs. After a few years I moved to Facebook and the trend continued. Over time I stopped doing feature development and spent much less time fixing bugs, but I still spend time reporting bugs. I read a lot of code when trying to explain things, but I don't write much that makes it upstream. I have enjoyed the change. I don't need to write code because I am surrounded by talented developers. I can specialize in my thing, and others specialize in theirs. It is hard to be an expert in too many things.
  • Benchmarks are what you make of them
    • They are far from perfect but they are quite useful. Testing by comparing things and explaining the results makes them more valuable. Benchmarks informed by production are even better.
  • How does code age?
    • Single-thread performance on CPUs isn't improving like it used to. Long-lived code tends to attract CPU regressions. This combination is a problem. Good regression tests help spot the problems, and spotting them early is a big deal because removing them long after they arrive is too expensive. This isn't just a technical problem. How do you reject new features that help a fraction of the user base when the cost is more CPU overhead for all users?
  • Needs improvement
    • I hope to get better about using benchmarks that avoid coordinated omission, have more statistical rigor, expand beyond single-node tests, use benchmark workloads that are adaptive and use benchmarks that enforce response time constraints.
  • Build a network of smart peers
    • I have been fortunate to have many talented peers. I engage with Postgres experts on Twitter and have also met smart people who work on upstream projects via bug report discussions. 
  • Explain things
    • Explain your results. But find a balance because explaining everything will slow you down.
  • Testing an LSM is complicated
    • Old posts on this are here and here.
    • The shape of an LSM tree has more variance than the shape of a B-tree. This can be a source of variance in benchmarks, especially in read-heavy ones. While this is still a work in progress there are db_bench commands to make the LSM shape more deterministic (flush memtable, compact L0, compact L1, wait-for-compaction).
    • Another problem is a test that inherits a non-deterministic amount of compaction debt. If the sequence is --benchmarks=write-heavy,read-heavy then the read-heavy step might suffer from compaction debt inherited from write-heavy. The impact of reducing this debt during the read-heavy step can vary and produce confusing results for the read-heavy step.
    • Try to get the LSM tree into a steady state before read-heavy tests. For example, after fillseq there is no overlap between SSTs. After a full compaction there is only one level (or one SST). These are usually not steady states.
    • For a load + query benchmark it is easy for the LSM (or B-Tree) to not be in a steady state after the load and many benchmarks suffer from this. If the load is done in key order then the PK index won’t be fragmented with a B-Tree and the SSTs won’t overlap with an LSM — which hides some of the overhead that occurs during query processing. When storage is a local attached SSD and the workload is heavy on IO then you need to worry about non-determinism from the SSD — you either want no SSD GC or to get SSD GC into a steady state (by running the test for long enough and having database files that are large enough, something between 50% and 90% of the device capacity).
  • Make the DBMS unhappy
    • Find ways to make the DBMS unhappy and see if it falls over. The challenge is that there are more and less realistic ways to make a DBMS fall over. An easy way to make a DBMS unhappy is to provide it with too many concurrent requests, especially a DBMS that doesn’t provide admission control (like RocksDB). But some problems are best fixed elsewhere because fixes have an opportunity cost. It might be nice to have an optional RocksDB API that implements admission control. 
  • Define your goals
    • Do you care about average throughput or outliers (p99, p99.9, p99.99)? I have a post on this. Average throughput is easy to measure but p99 and beyond (p99.9, p99.99) matters in production because outliers determine user experience and capacity planning is based on p99. While single-valued metrics like p99 are easy to share, graphs of throughput over time at 1-second intervals make it easier to spot problems like stalls, cyclic behavior or throughput that degrades over time.
  • Statistical rigor
    • Statistical rigor is great but can be expensive. Repeating every benchmark 3 times increases the accuracy of your results at the cost of 3X more HW. I am usually less rigorous because I frequently repeat benchmark runs anyway, after I make a mistake or need to measure one more thing. Another way to think of this is: assume B units of HW capacity, each benchmark has a warmup cost of W and a runtime of R. Then solve for N in B = N(W+R) where N is the number of times the benchmark is repeated. A larger value for N implies a smaller value for R, and the confidence interval is a function of both N and R.
  • Coordinated omission
    • Coordinated omission is a real problem. All of the benchmark clients that I use suffer from it, yet they are still useful. Two things prevent me from doing open-loop benchmarks. First, the benchmark clients I use don't support it and it takes work to implement a new benchmark client and incorporate it into my workflow. Second, an open-loop benchmark takes more work to set up as I need to discover an arrival rate that the DBMS can handle -- or I need a more complicated client that can discover it for me. One day I will use open-loop clients.
  • Response time constraints
    • The db_bench benchmark client for RocksDB doesn't have an option to enforce a response time constraint (ignore responses slower than X ms). Another problem is computing throughput without a response time constraint: more concurrency usually means more throughput, but it also means worse response time and more response time outliers, and those slow responses should not be counted. Most of the benchmark clients that I use don't enforce a response time SLA. Enforcing an SLA is more work because you need to select a reasonable value, but I hope to improve at this and to add support to db_bench.
  • Single node
    • Most of my testing runs the benchmark client and the DBMS on the same host. While I prefer to use separate hosts for client and server when the DBMS supports it, that introduces the risk of perf variance because I will be sharing the network.
  • Stable platform
    • I use HW at work, in the public cloud and my home test cluster. My work HW has value-added services that consume a variable and occasionally significant amount of compute and storage so I am wary of using it for low-concurrency benchmarks. Public cloud HW means I am using a VM and might be sharing compute and storage with noisy neighbors so I found a way to reduce the CPU variance by using the largest number of CPUs for a given instance type and disabling HT. From quick tests with fio there wasn't much variance in the cloud block storage I chose. My home HW is the most stable after I disabled HT and turbo boost. Alas, it is also the least capable — 4 CPUs, 16G of RAM.
  • Compare things 
    • I rarely test one thing in isolation because the results are hard to interpret. So I do A/B or even A/B/C/D/... testing where these represent different DBMS, different versions of the same DBMS or different configurations for one version of one DBMS.
  • Measure things
    • Start with throughput, then add variance, then add CPU, IO and memory. Foreground CPU and IO can remain constant while background CPU and IO change significantly, and most DBMS do much work in the background (excluding SQLite, which doesn't have background threads). Don't forget to watch VSZ/RSS for the DBMS processes because increases there might lead to OOM. Watch disk space usage because increases there can lead to out of space errors. When something is slower, search top down. Look at iostat metrics to see if IO/query has changed. Look at vmstat to see if CPU/query has changed. Look at vmstat to see if context switches/query has changed (mutex contention?). Normalize your metrics -- IO/query, CPU/query, context switches/query. I frequently have scripts running that scrape output from ps and top. To watch for disk space issues I have a script that runs du and ls in a loop during benchmarks.
  • Summarize things
    • One practice I have is to create one-line performance summaries with useful numbers for throughput, HW (CPU/storage/memory/disk space) usage and normalized HW usage (CPU/query, IO/query). One-line summaries make it easy to compare performance when A/B or A/B/C/D/... testing is done. They also make it easy to spot regressions that don't directly hurt throughput but are a concern -- larger RSS for the DBMS process, more disk space used, more CPU consumed by background threads. The summaries also provide a starting point when I try to explain a performance change. An example is here.
  • Name & archive things
    • A mistake I have made many times is starting a benchmark, getting interrupted for a week and forgetting an important detail about the benchmark when I return. Naming patterns reduce the chance of this. I try to archive the test scripts and command lines via Github. Saving RocksDB LOG files is also important. All of my important scripts are in Github.
  • Adaptive benchmark clients
    • I often have to assume things when configuring a benchmark client. The number of threads that db_bench uses for clients is currently fixed. It would be nice to have some benchmarks that increase the request rate or number of request clients over time or until a response time constraint is violated. I currently do this manually and my solution is sad.
  • Proactive vs reactive
    • Is it a bug when it has yet to happen in production? That is an interesting question and the answer requires nuance. Some bugs do happen but have yet to be noticed, some bugs might happen and are worth avoiding, and other bugs just aren't worth fixing. It isn't always easy to classify a bug into one of these groups.
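
The HW budget arithmetic from the statistical rigor point can be sketched as follows (a minimal sketch; the function name and example numbers are mine):

```python
def repetitions(budget, warmup, runtime):
    # B = N * (W + R) -> N = B / (W + R), rounded down to whole runs
    return int(budget // (warmup + runtime))

# 24 hours of HW budget, 1 hour of warmup, 3 hours of runtime per run
print(repetitions(24, 1, 3))  # 6
```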

Friday, August 19, 2022

When is a secondary index-only query not index-only for InnoDB and Postgres?

For Postgres and MySQL/InnoDB an index-only query plan that uses a covering secondary index might still require access to the base row for some of the index entries read from the secondary index. Most of the time index-only really is index-only. But in some cases there will still be fetches of the indexed row and power users should know when this can happen.

This doesn't happen with MySQL/MyRocks.

Postgres

One reason for this is recent transactions. The issue for Postgres is well-documented and widely known -- the visibility map. I won't explain it because experts have done a better job than I can. I don't know whether Postgres has performance counters to monitor the frequency of this.

One interesting variant of this with Postgres was for insert-only workloads that I encountered with the Insert Benchmark. The problem was that prior to PG 13 autovacuum wasn't triggered by insert-only workloads, thus visibility map bits were not set for pages full of new inserts and index-only scans weren't really index only for deployments that relied on autovacuum.

The result is that for some index entries read from the secondary index for an index-only query plan, Postgres might have to fetch the row by using the tuple id to read from the heap-organized table. This can result in unexpected random IO. 

InnoDB

The issue for InnoDB is less widely known and not well documented. I don't think there are performance counters to monitor the frequency of this. Long ago when I supported InnoDB in production I added some to the FB patch.

The best doc I found in a quick search is this slide deck (thanks Marko from MariaDB) and I will quote parts from it:

  1. secondary indexes have a PAGE_MAX_TRX_ID
  2. PK index entries have a DB_TRX_ID per entry, secondary index entries do not
  3. See here for a description of DB_TRX_ID
  4. Secondary indexes may contain multiple (indexed_col, pk) records pointing to a single pk, one for each value of indexed_col
  5. All versions except at most one must be delete-marked
  6. If a secondary index record is delete-marked, MVCC must look up pk in the PRIMARY index and attempt to construct a version that matches indexed_col, to determine if (indexed_col, pk) exists in the read view
  7. If the PAGE_MAX_TRX_ID is too new, for each record in the secondary index leaf page, we must look up pk
  8. For old read views, secondary index scan can be very expensive!
InnoDB looks up the base row by searching the clustered PK index using the PK columns stored in the secondary index entry.
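
A sketch of when the fallback happens, based on the list above (the names are mine, not InnoDB's, and the real logic is more subtle):

```python
def must_lookup_pk(rec_delete_marked, page_max_trx_id, view_limit_trx_id):
    # An index-only read falls back to the clustered PK index when the
    # secondary index record is delete-marked, or when the leaf page was
    # modified by a transaction too new to be visible to this read view.
    return rec_delete_marked or page_max_trx_id >= view_limit_trx_id

print(must_lookup_pk(False, 50, 100))   # False: truly index-only
print(must_lookup_pk(False, 150, 100))  # True: page is too new
```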

While the InnoDB docs have some description of this, they aren't as clear as the slide deck above. 

When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.

That page also explains another detail about secondary indexes that many InnoDB users might not know. Secondary index entries are not updated in place. An update is implemented as delete-mark existing entry, insert new entry. Eventually purge will physically remove the delete-marked entry after it is no longer visible to current snapshots. Thus InnoDB secondary indexes risk suffering from bloat and churn which is worse with long-open snapshots that block purge -- because those delete-marked entries can linger leading to excessive page-splits.

MyRocks

I wrote above that MyRocks doesn't have this problem and that is true. But it can have problems with bloat/churn after secondary index maintenance, similar to what I described above for InnoDB. Fortunately the SingleDelete optimization helps a lot to avoid that, as described in the VLDB paper. SingleDelete allows for faster removal of tombstones, as normal (Delete) tombstones are frequently not removed until reaching the max level of the LSM tree, and MyRocks uses SingleDelete for secondary index maintenance.

The reason that tombstones are frequently not removed until reaching the max level of the LSM tree is that the removal check chooses to be fast and has many false positives. During compaction when a tombstone is found, if the key for the tombstone is covered by the key ranges of SSTs in larger levels, then it is not dropped because the key might exist in those larger levels. The number of false positives could be reduced at the cost of more CPU, and possibly some IO, if after confirming overlap the bloom filters in those larger levels were probed.
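
That check can be sketched as follows (simplified; the function name is mine and real compaction logic considers more state):

```python
def can_drop_tombstone(key, larger_level_ranges):
    # The fast check: keep the tombstone if any SST in a larger level has a
    # key range covering this key, because the key might exist there. This
    # is the source of the false positives described above.
    return not any(lo <= key <= hi for lo, hi in larger_level_ranges)

print(can_drop_tombstone("m", [("a", "f"), ("p", "z")]))  # True: no overlap
print(can_drop_tombstone("c", [("a", "f"), ("p", "z")]))  # False: might exist
```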

Thursday, August 18, 2022

How I do performance tests for RocksDB

My primary tool for doing RocksDB performance tests is db_bench, but as always there are layers of shell scripts to automate the process. First, I will define two terms. A benchmark is a sequence of benchmark steps. A common pattern for big data benchmarks is load and then query where load and query are the steps. I do small data (OLTP) so the common pattern for me is load and then several read+write steps. To save time the load step can be replaced by copying from a backup or snapshot.

I have shared notes on how to do benchmarks for an LSM with a focus on RocksDB and LevelDB. These are based on the many mistakes I have made and occasionally learned from: see here and here.

The layers of shell scripts are:

  • tools/benchmark.sh - runs one benchmark step by invoking db_bench
  • tools/benchmark_compare.sh - runs a benchmark by invoking a sequence of benchmark steps
  • x.sh - selects configuration options based on HW size then calls benchmark_compare.sh
  • x3.sh - selects configuration options based on workload (IO-bound vs cached) then calls x.sh
I also forked the scripts above to work on older versions of RocksDB (versions 4 and 5). The forks are here and were done to reduce the complexity in the scripts above.

The scripts also let me run a benchmark for many versions of RocksDB. When I test many versions it might take more than a week for x3.sh to finish (tmux and screen help here), but it runs unattended which saves me time and in the end I get ascii formatted reports that make it easy to compare results across versions.

RocksDB has many configuration options. The benchmark.sh script has good default values for many options, but the values for some options depend on HW size (amount of RAM, number of CPUs) which is one reason there are layers of scripts.

The benchmark tool, db_bench, also has many configuration options to select different benchmark steps (load in key order, point queries, etc). The primary purpose for benchmark_compare.sh is to run the benchmark steps (and configure db_bench to do that) in an order that is useful to me and minimizes noise. For example, if benchmark steps were to follow overwrite then they might inherit a large and varying amount of compaction debt which would cause variance. The sequence of benchmark steps begins here.

The x.sh script selects values for RocksDB configuration options based on HW capacity. For now I have hardwired this based on known servers that I use (see here) but eventually this will accept CPU and memory sizes as input and go from there.

The x3.sh script is invoked by me. It defines four workloads:
  • byrx
    • byrx is short for cached by RocksDB and the database fits in the RocksDB block cache
  • byos
    • byos is short for cached by OS and the database fits in the OS page cache but is larger than the RocksDB block cache. This simulates fast storage for reads and lets me increase stress on the RocksDB block cache code.
  • iobuf
    • iobuf is short for IO-bound with buffered IO. The database is larger than RAM and RocksDB uses buffered IO.
  • iodir
    • iodir is short for IO-bound with O_DIRECT. The database is larger than RAM and RocksDB uses O_DIRECT for user reads and compaction.
An example command line is:

nohup bash x3.sh 22 no 1800 c30r240 40000000 2000000000 iodir &

Benchmark steps

Note that I have pending changes for benchmark.sh and benchmark_compare.sh that are not yet pushed upstream.

The benchmark steps are:
  • fillseq - loads the database in key order. There is not much compaction debt when this finishes.
  • read-only
    • revrange - reverse range scans, the real use for this is to let compaction catch up
    • fwdrange - forward range scans, this has noise that I have yet to explain
    • readrandom - point queries
    • multireadrandom - more point queries, but enhanced by io_uring
  • fragment the LSM tree; prior to this the keys of SST files do not overlap
    • overwritesome - does overwrite with --num set to 10% of the keys
    • flush_mt_l0 - flushes the memtable, flushes the L0 then waits for compaction to catch up
  • read+write - perf is reported for reads, the background writer has a 2MB/s rate limit
    • revrangewhilewriting - short, reverse range scans
    • fwdrangewhilewriting - short, forward range scans
    • readwhilewriting - point queries
  • write-only
    • overwrite - the writer does not have a rate limit. If there were more write-only tests that followed then I would use overwriteandwait which waits for compaction to finish when overwrite ends.