In an LSM deletes are fast for the deleter but can make queries that follow slower. The problem is that too many tombstones can get in the way of a query especially a range query.
A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3 then that tombstone cannot be removed. It is hard to make the check (does live key exist below me) efficient.
I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard to find such details. Maybe this is something that LSM engine developers should explain more in public.
Confirming whether a tombstone can be dropped
This is based on code that I read ~2 years ago. Maybe RocksDB has changed today.
Tombstones are dropped during compaction. The question is how much work (CPU and IO) you are willing to spend to determine whether a tombstone can be dropped. LevelDB and RocksDB favor performance over exactness. By this I mean they spend less CPU and no IO on the "can I drop this tombstone" check. The check is simple today. If there is an LSM level (or sorted run) below (older) that has a key range which overlaps the tombstone key then the tombstone won't be dropped. And in most workloads this means that tombstones aren't dropped until they reach the max LSM level -- because there usually is overlap.
An LSM could spend a bit more CPU and check bloom filters for the levels that overlap with the tombstone. That might allow tombstones to be dropped earlier. An even more expensive check would be to use more CPU and possibly IO to confirm whether the level that overlaps with the tombstone really has that key. This can make compaction much slower.
The SingleDelete API in RocksDB makes it easier to drop tombstones. If you respect what the API requires then tombstones can be dropped quickly -- without spending IO or CPU. SingleDelete makes it easy to drop tombstones for a given key when those tombstones meet during compaction. They don't do anything for the case above where the tombstone is on one level and the live key might be on a lower level.
MyRocks has some clever code that does extra work to remove tombstones from an SST when the SST has too many tombstones. Configuration options for this are mentioned in this post. Percona has some docs on rocksdb_compaction_sequential_deletes. And I am sure there is a slide deck from my favorite MyRocks expert. Maybe he will share that with me.