Monday, January 27, 2020

Deletes are fast and slow in an LSM

In an LSM deletes are fast for the deleter but can make queries that follow slower. The problem is that too many tombstones can get in the way of a query especially a range query.

A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3 then that tombstone cannot be removed. It is hard to make the check (does live key exist below me) efficient.

I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard to find such details. Maybe this is something that LSM engine developers should explain more in public.

Confirming whether a tombstone can be dropped

This is based on code that I read ~2 years ago. Maybe RocksDB has changed today.

Tombstones are dropped during compaction. The question is how much work (CPU and IO) you are willing to spend to determine whether a tombstone can be dropped. LevelDB and RocksDB favor performance over exactness. By this I mean they spend less CPU and no IO on the "can I drop this tombstone" check. The check is simple today. If there is an LSM level (or sorted run) below (older) that has a key range which overlaps the tombstone key then the tombstone won't be dropped. And in most workloads this means that tombstones aren't dropped until they reach the max LSM level -- because there usually is overlap.

An LSM could spend a bit more CPU and check bloom filters for the levels that overlap with the tombstone. That might allow tombstones to be dropped earlier. An even more expensive check would be to use more CPU and possibly IO to confirm whether the level that overlaps with the tombstone really has that key. This can make compaction much slower.

Fast path

The SingleDelete API in RocksDB makes it easier to drop tombstones. If you respect what the API requires then tombstones can be dropped quickly -- without spending IO or CPU. SingleDelete makes it easy to drop tombstones for a given key when those tombstones meet during compaction. They don't do anything for the case above where the tombstone is on one level and the live key might be on a lower level.

MyRocks magic

MyRocks has some clever code that does extra work to remove tombstones from an SST when the SST has too many tombstones. Configuration options for this are mentioned in this post. Percona has some docs on rocksdb_compaction_sequential_deletes. And I am sure there is a slide deck from my favorite MyRocks expert. Maybe he will share that with me.


  1. Do too many tombstones ever get in the way of a point query?

    It seems like tombstones could be handled efficiently if every tombstone contained the next non-deleted value in the same LSM level (by "non-deleted", I mean "not deleted by any tombstone in this level".) Then as you do the merge of different levels in the range query, you can efficiently skip contiguous runs of tombstones. If there actually is a key in a lower level that falls within the the run of tombstones, you have to do a point query into that level to find out if there is a relevant tombstone, but I suspect you can amortize that point query against something else.

    1. That might help. Long runs of tombstones in one LSM have been a problem.

  2. For partition-level tombstones, something like Cassandra's "garbagecollect" can help, at the expense of additional i/o and cpu, of course. It iterates through a table's sstables (using the compaction code path) and clears shadowed data from sstables. (It also drops tombstones that shadow entire partitions while it's at it, iirc.)

    Clearing tombstones that shadow rows or cells is harder, since you have to not only check whether the partition exists in other sstables, but also read the sstables to see whether the row or cell has data in them.

    Of course, this is meant for cases where there are "too many" tombstones in a table, not for continuous background use. (Or to free disk space when there are very many updates.)

    1. Thanks for the update. I agree that most users need something running all of the time to preserve QoS on reads.