Thursday, July 6, 2023

Dynamic leveled compaction in RocksDB

Dynamic leveled compaction became the default in RocksDB 8.4.0. It improves on the alternative, which I will call non-dynamic leveled in this post, but it can also create some confusion.

The target sizes for the levels of the LSM tree are computed ...

  • from largest to smallest with dynamic leveled
  • from smallest to largest with non-dynamic leveled

LSM Math

Assume:

  • the target size for L1 is 256M
  • the per-level fanout (max_bytes_for_level_multiplier) is 10
  • there is 30G (30720M) of data in the max level (Lmax)

With non-dynamic leveled the target sizes are computed from smallest to largest, so the LSM tree has target sizes of 256M for L1, 2560M for L2, 25600M for L3 and 256000M for L4. With 30G of data in Lmax the typical tree shape will be:

  • 256M for L1
  • 2560M for L2
  • 25600M for L3
  • 30720M for L4
The problem with the LSM tree shape above is that it has too much space-amplification. The expected space-amp for leveled compaction is about 1.1, meaning you use about 10% more space than optimal. But in the example above the space-amp is (30720 + 25600 + 2560 + 256) / 30720 = 1.93.
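The arithmetic above can be sketched in a few lines of Python. This is my own illustration with a made-up function name, not RocksDB code:

```python
# Sketch of non-dynamic leveled target sizes: computed from smallest
# to largest, each level fanout times larger than the previous one.
# (My own illustration, not RocksDB code.)
def non_dynamic_targets(l1_target_mb, fanout, num_levels):
    return [l1_target_mb * fanout ** i for i in range(num_levels)]

targets = non_dynamic_targets(256, 10, 4)
print(targets)  # [256, 2560, 25600, 256000]

# With 30720M (30G) of data in Lmax the typical shape and space-amp are:
sizes = targets[:-1] + [30720]
print(sum(sizes) / sizes[-1])  # prints 1.925
```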

Dynamic leveled was created to avoid that problem. With it, the target sizes are computed from the largest to the smallest level. When Lmax has 30G of data the target sizes are 30720M for L4, 3072M for L3, 307M for L2 and 256M for L1, and the typical LSM tree shape will be:

  • 256M for L1
  • 307M for L2
  • 3072M for L3
  • 30720M for L4
This solves the issue with too much space-amp but it comes at the cost of an extra level. The extra level (307M for L2) is there because 3072 / 256 is larger than 10, and the size ratio between any pair of adjacent levels must be <= the per-level fanout (10), so there must be another level between L1 and L3. The worst-case write-amp will be ~26.2: +1 for the WAL, +1 for L0, +2 for L1, +2.2 for L2, +10 for L3 and +10 for L4. The extra level also increases read-amp, because there are more places to check for data and thus more CPU overhead.
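The top-down computation can be approximated with a simplified sketch. This is my own code, not RocksDB's actual implementation, which differs in its details:

```python
# Simplified sketch of dynamic leveled target sizes (my own code, not
# RocksDB's actual implementation): divide down from Lmax by the fanout
# while the result stays at or above the smallest-level target, then
# keep the smallest level at its configured target (256M here).
def dynamic_targets(lmax_size_mb, base_target_mb, fanout):
    targets = [lmax_size_mb]
    while targets[0] / fanout >= base_target_mb:
        targets.insert(0, targets[0] / fanout)
    targets.insert(0, base_target_mb)
    return targets

print([round(t) for t in dynamic_targets(30720, 256, 10)])
# [256, 307, 3072, 30720] -- the 307M level is the extra level
```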

Perhaps it is possible to avoid this extra level. One possible solution is to increase the per-level fanout (max_bytes_for_level_multiplier) as needed, with some limit on the increase to avoid an overly large value.

In this case sizeof(Lmax) = 30720M and sizeof(L1) = 256M, so their ratio is 30720 / 256 = 120. To get from 256M in L1 to 30720M in L3 with one level in between, the per-level fanout must be sqrt(120), or about 10.95. If that were used then the per-level target sizes would be 30720M for L3, 2805M for L2 and 256.2M for L1. The worst-case write-amp will be ~26: +1 for the WAL, +1 for L0, +2 for L1, +11 for L2 and +11 for L3.
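The adjusted fanout can be computed directly. A sketch of the idea, with a made-up function name:

```python
# Sketch of the proposed fix (my own code): choose the per-level fanout
# so the levels between L1 and Lmax fill exactly, avoiding the extra
# level. n_between = levels strictly between L1 and Lmax (1 here: L2).
def fanout_needed(lmax_size_mb, l1_target_mb, n_between):
    ratio = lmax_size_mb / l1_target_mb       # 30720 / 256 == 120
    return ratio ** (1 / (n_between + 1))     # sqrt(120), about 10.95

f = fanout_needed(30720, 256, 1)
targets = [30720 / f ** i for i in range(2, -1, -1)]  # L1, L2, L3
print(round(f, 2), [round(t, 1) for t in targets])
# 10.95 [256.0, 2804.3, 30720.0]
```

(The exact math gives 2804.3M for L2; the 2805M above comes from rounding the fanout to 10.95 before dividing.)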

This might not be a trivial change and it might not be feasible. But on paper it solves the problem. I created a feature request for this.

The Confusion

One more source of confusion is that with dynamic leveled there is a difference between a logical level and a physical level. If I configure RocksDB to use dynamic leveled and also to use 8 levels, then there might be data in L0, L5, L6, L7 and L8. When more data is added, L4 will be used as well. When there is data in L0, L4, L5, L6, L7 and L8, then L4 (the physical name for the level) is acting as L1 (the logical name for it).
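The mapping from physical to logical names can be illustrated with a small sketch (my own hypothetical helper, not a RocksDB API):

```python
# Sketch (my own hypothetical helper): with dynamic leveled, the
# smallest non-empty level below L0 acts as the logical L1, the next
# as the logical L2, and so on.
def logical_names(nonempty_physical_levels):
    below_l0 = sorted(l for l in nonempty_physical_levels if l > 0)
    return {phys: f"L{i + 1}" for i, phys in enumerate(below_l0)}

print(logical_names({0, 5, 6, 7, 8}))     # {5: 'L1', 6: 'L2', 7: 'L3', 8: 'L4'}
print(logical_names({0, 4, 5, 6, 7, 8}))  # physical L4 is now the logical L1
```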

An example where L4 is the logical L1 is here.

