Comments on Small Datum: Throttling writes: LSM vs B-Tree

Bradley C. Kuszmaul (2019-11-26 07:37):
How much of the write variance in LSMs is due to overcommitted throughput? For every insert into an LSM, one must perform about log N inserts at higher levels of the LSM, where N is the size of the data set. I have a feeling (unsupported by any evidence) that many LSM implementations simply blast as much as they can into L0 and don't worry about the write debt they have accumulated. How much of the variance would be solved if, for every insert, one really did those log N operations at the other levels of the LSM, that is, paid the debt as you go?

That still leaves GC stalls and TRIM stalls.

Mark Callaghan (2019-11-26 08:30):
RocksDB has strong throttling enabled by default. That creates a few bad experiences for some users who encounter write stalls. Other users quietly benefit, perhaps unaware, by getting less variance on reads. I assume that RocksDB can do better and make throttling smoother. If nothing else, this would lead to a few interesting research papers.

AFAIK the SQLite4 LSM has a solution for this. SQLite doesn't allow background threads, so the foreground threads help with compaction. But I need to revisit their design docs.

Bradley C. Kuszmaul (2019-11-26 11:16):
What is strong throttling?

hingo (2019-11-27 00:09):
I sometimes explain the InnoDB change buffer as "it's kind of like a 2-level LSM". How right or wrong do you think such a characterisation is?

Mark Callaghan (2019-11-30 09:36):
"Strong throttling" is too vague. This post explains it:
https://github.com/facebook/rocksdb/wiki/Write-Stalls
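
For reference, a minimal sketch of the main options behind those stalls, assuming a recent RocksDB. The trigger values and the /tmp path are illustrative, not recommendations; defaults vary by version.

    // Sketch: RocksDB options that govern write slowdowns and stops,
    // per the Write-Stalls wiki page linked above.
    #include <cassert>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // Slow down writes when L0 accumulates this many files...
      options.level0_slowdown_writes_trigger = 20;
      // ...and stop them entirely at this many.
      options.level0_stop_writes_trigger = 36;

      // Pending compaction bytes: the soft limit slows writes,
      // the hard limit stops them.
      options.soft_pending_compaction_bytes_limit = 64ULL << 30;   // 64 GB
      options.hard_pending_compaction_bytes_limit = 256ULL << 30;  // 256 GB

      // While slowed down, writes are paced to roughly this rate.
      options.delayed_write_rate = 16 << 20;  // 16 MB/s

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/throttle_demo", &db);
      assert(s.ok());
      delete db;
      return 0;
    }

Raising the triggers trades stall frequency for a larger backlog of compaction debt, which is the variance-versus-throughput tension discussed above.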
Mark Callaghan (2019-11-30 09:42):
I like it, but it is also like the merge operator in RocksDB. A 2-level LSM is a great fit when the data:RAM ratio won't be too big (maybe <= 5:1). Sophia might have been a 2-level LSM, and I know of a successful in-house implementation.

Bradley C. Kuszmaul (2019-11-30 17:23):
It seems like the write throttling in RocksDB is kind of ad hoc. Here is what I would consider a principled way to do write throttling:

Suppose that the write amplification is W. This covers the general case: for an LSM that doubles run sizes, merging two equal-sized runs into a larger run whenever they exist, W is log N; for an LSM whose size factor is 10 and that merges at each level, W is 10 log_10 N. We parameterize by W.

Now, every time the application writes K bytes into the LSM, we do K*W bytes' worth of rebalancing writes. It turns out that for all of these LSM variants you can find a schedule that achieves this. For the simple power-of-two LSM, you simply merge K bytes at each level.

You might want to support a higher burst rate, feeling that paying log N on every write is too much. I'm not sure how much value there really is in supporting a higher burst rate. Anyway, to support one, you pick a size S that you are willing to get behind by. For example, you might set S to 1 GB, and then you are allowed to write 1 GB without doing all of the rebalancing work.

S then gives you a straightforward tuning parameter: a bigger S means you need to allocate more storage but can tolerate longer bursts.
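
A minimal sketch of that scheme, not code from any real engine: WritePacer and DoCompactionWork are hypothetical names, with DoCompactionWork stubbed where a real engine would merge runs. W and S are the parameters described above.

    // Debt-based pacing: each K-byte application write accrues K*W bytes
    // of compaction debt; the writer retires debt down to the burst
    // allowance S before proceeding ("pay the debt as you go").
    #include <cstdint>

    class WritePacer {
     public:
      WritePacer(double write_amp, uint64_t burst_bytes)
          : W_(write_amp), S_(burst_bytes) {}

      // Call before applying a K-byte write to the LSM.
      void OnWrite(uint64_t k_bytes) {
        debt_bytes_ += static_cast<uint64_t>(k_bytes * W_);
        while (debt_bytes_ > S_) {
          debt_bytes_ -= DoCompactionWork(debt_bytes_ - S_);
        }
      }

     private:
      // Hypothetical hook: merge runs, moving up to max_bytes, and return
      // the number of bytes actually compacted. Stubbed for illustration.
      uint64_t DoCompactionWork(uint64_t max_bytes) { return max_bytes; }

      const double W_;    // write amplification, e.g. 10 * log10(N)
      const uint64_t S_;  // burst allowance, e.g. 1 GB
      uint64_t debt_bytes_ = 0;
    };

    int main() {
      WritePacer pacer(/*write_amp=*/30.0, /*burst_bytes=*/1ULL << 30);
      pacer.OnWrite(4096);  // a 4 KB write accrues ~120 KB of debt
      return 0;
    }

With S = 0 this degenerates to strict pay-as-you-go; a larger S permits bursts at the cost of a bounded backlog, matching the tuning knob described in the comment.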