Comments on Small Datum: Benchmarking the leveldb family (newest first)

hyc (2014-07-10 22:08:32):

Glad you found the whitepaper interesting. I still lament the fact that it needed to be trimmed down so much; practically none of the assertions in the paper have any backing details included. (But I guess if you're reading it next to a web browser, you can get some of it through the linked references. And probably no one has the patience to read anything longer. But also if you were an attendee at the LDAPCon where it was debuted, you were probably already intimately familiar with the problems being addressed.)

The LDAP use case dictated a lot of our testing direction too - e.g., most LDAP entries are 1K-4K in size; benchmarks with 100 byte values really don't tell us anything. (This is why our HyperDex benchmark initially used 4K records - we wanted to evaluate HyperDex for suitability as an OpenLDAP backend.)

hyc (2014-07-10 21:31:30):

Yes, OpenLDAP is the reason LMDB exists and the consumer of the majority of "special features" in LMDB. (The exception being nested transactions, which we added for SQLite/MySQL/etc.) In that context, all that matters is that we're superior to BerkeleyDB.

Interesting link; it appears to spell out the tradeoffs of the two compaction approaches pretty well. Sounds like your LinkBench-style workload does few/no deletes.

Mark Callaghan (2014-07-10 13:11:37):

I am all for application-specific benchmarking, but I like to start with micro-benchmarks. I assume OpenLDAP also serves as such an application-specific benchmark, one on which LMDB is a great fit. Your whitepaper on that was very interesting. LinkBench has been the benchmark that I need to care about.

I have workloads for which leveled compaction as done by LevelDB/RocksDB does much worse than an update-in-place b-tree on write-amp. Switching to size-tiered compaction, which is a RocksDB option, gets me write-amp that is much less than the update-in-place b-tree.
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
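As a rough illustration of the "size-tiered is a RocksDB option" point above, this is a minimal sketch of opening a RocksDB database with universal (size-tiered) compaction selected instead of the default leveled compaction. The option and constant names come from the public rocksdb C++ API; the database path is a placeholder and no attempt is made to reproduce the tuning actually used in these benchmarks.

    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // Default is leveled compaction (kCompactionStyleLevel); universal
      // compaction is RocksDB's size-tiered strategy, trading extra space-amp
      // for lower write-amp.
      options.compaction_style = rocksdb::kCompactionStyleUniversal;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s =
          rocksdb::DB::Open(options, "/tmp/rocksdb_universal_demo", &db);
      assert(s.ok());

      // ... load and query the database ...

      delete db;
      return 0;
    }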
re: wri...re: merge_stats - ok, missed that before.<br /><br />re: write-amp - the difference is that in those other two cases it's a one-time cost for a given write. With an LSM you pay the cost over and over again as you flush from one level to the next and when you do compactions.<br /><br />re: response variance - ok, no argument there. LevelDB response time is clearly a wildcard.<br /><br />re: zero-copy - fair enough. In that case it should be sufficient to just reference the first byte of a returned value. (Or every PAGESIZE'th byte, if the values are larger.) If you force a walk over all of the value bytes then you negate the zero-copy benefit and mislead in the other direction. Either way, it doesn't tell you what performance a real application will get, and you still need higher level benchmarking.hychttps://www.blogger.com/profile/15473250487285924085noreply@blogger.comtag:blogger.com,1999:blog-9149523927864751087.post-76996742978368655802014-07-10T07:36:12.215-07:002014-07-10T07:36:12.215-07:00Stats aren't merged into thread 0 for RocksDB,...Stats aren't merged into thread 0 for RocksDB, so I don't think it has the bug. They are merged into thread 0 in your version of db_bench and in upstream.<br />From https://github.com/facebook/rocksdb/blob/master/db/db_bench.cc<br /> // Stats for some threads can be excluded.<br /> Stats merge_stats;<br /> for (int i = 0; i < n; i++) {<br /> merge_stats.Merge(arg[i].thread->stats);<br /> }<br /><br />Write-amp is a fact of life for db engines, not just ones that use an LSM. When an update-in-place tree writes a 4k page back to disk because a 100 byte row is dirty, then write-amp is ~40X. Forcing a redo log to disk after appending a 100 byte record writes back at least one disk sector of data (512 bytes today, 4k tomorrow), and that is more write-amp. We pay to get endurance with flash, so write-amp matters for some workloads.<br /><br />For #6 the placement of FinishedSingleOp is only an issue when there is more than 1 put per write batch. This doesn't hurt averages in any case. It does hurt response time histograms and percentiles. Response variance is important for some benchmarks and LevelDB is prone to it on write-heavy workloads.<br /><br />For #8 computing a sum of the returned value bytes isn't realistic client processing, it is done to confirm that the db engine really read the value bytes. An mmap-based engine can avoid any read (disk or memory) for the value when the client doesn't access it. If mmap is used and the index isn't clustered so the values are in a heap organized file separate from the index, the engine stores a pointer to the value in the index. Assuming the Get call returns that pointer without copying (zero-copy is good), then we have an optimization or a way to get misleading results because the disk pages for values never get faulted in.Mark Callaghanhttps://www.blogger.com/profile/09590445221922043181noreply@blogger.comtag:blogger.com,1999:blog-9149523927864751087.post-7493915877378454792014-07-10T07:35:29.128-07:002014-07-10T07:35:29.128-07:00and re: (1) those stats are used to generate the t...and re: (1) those stats are used to generate the throughput graphs in e.g. 
hyc (2014-07-10 07:35:29):

And re: (1), those stats are used to generate the throughput graphs in e.g. http://symas.com/mdb/inmem/ and http://symas.com/mdb/inmem/large.html

hyc (2014-07-10 01:50:58):

By the way, the RocksDB readwhilewriting stats aren't reporting what the authors think they are. They set "exclude_from_stats" for the writer thread, to supposedly report only the reader statistics. But in fact the final stats are generated by merging all of the threads' stats into thread 0's stats, and guess what, the writer is thread 0. I fixed this a while ago too: https://github.com/hyc/leveldb/commit/355b9cbdfaf2939cbc5178963895d59d42f47bf1

hyc (2014-07-10 01:47:08):

10) I disable compression in my tests, for reasons similar to those for avoiding 3rd-party malloc libraries. Most projects now support multiple compression libraries, but if they don't all support the same ones then you wind up benchmarking compressors, not DB engines. Also, you can't reasonably benchmark a compressor without domain-relevant input data. The "50% compressible random data" that db_bench claims to provide is really just a guess, and that 50% target is also compressor-sensitive. For comparative benchmarking you have to control and minimize variables.

12) Yeah, this was annoying because the original db_bench code doesn't give any indication that it wrote fewer keys than you requested. I noted this a while ago (https://groups.google.com/d/msg/leveldb/9Ol2Gi4Yv6I/ZI5tH1ThfoUJ) and the WiredTiger folks confirmed it here: https://github.com/wiredtiger/wiredtiger/wiki/LevelDB-Benchmark

In combination with the RNG seed issue (2) you can get some ridiculously optimistic results. I provided a simple fix in my version (using Shuffle) and also added du output after each test so that we can tell when something fishy like this has happened. (I.e., if the disk use after a test is much smaller than the number of records would suggest, you probably got bitten by this.)
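On the "50% compressible random data" point: the target is a property of how the input is constructed, not a measured compression ratio. A db_bench-style generator looks roughly like the sketch below: build a random fragment of (fraction x len) bytes and tile it to len bytes, so a compressor can at best squeeze out the repetition, and the ratio it actually achieves depends on the compressor. The function and variable names here are illustrative, not the actual db_bench helpers.

    #include <algorithm>
    #include <random>
    #include <string>

    // Generate a value of `len` bytes where roughly `compressible_fraction`
    // of the content is unique random data and the rest repeats it.
    std::string MakeCompressibleValue(size_t len, double compressible_fraction,
                                      std::mt19937& rng) {
      std::uniform_int_distribution<int> byte_dist(0, 255);
      size_t fragment_len = static_cast<size_t>(len * compressible_fraction);
      if (fragment_len == 0) fragment_len = 1;

      // Random fragment: the "incompressible" portion of the value.
      std::string fragment;
      fragment.reserve(fragment_len);
      for (size_t i = 0; i < fragment_len; ++i) {
        fragment.push_back(static_cast<char>(byte_dist(rng)));
      }

      // Tile the fragment out to the full length; the repetition is what a
      // compressor is expected to remove.
      std::string value;
      value.reserve(len);
      while (value.size() < len) {
        value.append(fragment, 0, std::min(fragment_len, len - value.size()));
      }
      return value;
    }

With compressible_fraction = 0.5 this matches the "50% compressible" intent: about half the bytes are unique and the rest are a repeat of them, but how close any given compressor gets to 2:1 on such data is exactly the compressor-sensitivity being described above.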
hyc (2014-07-10 01:46:40):

6) Seems inconsequential, since the FinishedSingleOp placement doesn't affect the final calculation of # Ops / Total Time.

7) Choice of malloc library is certainly important; I've studied this extensively: http://highlandsun.com/hyc/malloc/ But for comparative benchmarks this needs to be kept uniform. When we test DB engines we want to know what the engine itself can do. We don't want numbers obscured by the influence of different 3rd-party components, especially when projects have conflicting recommendations - tcmalloc, jemalloc, etc. But at least in this case we can go back again using LD_PRELOAD and test all of the engines with the malloc library of our choice. It's just that there's usually already plenty to test, and adding any variable increases testing time exponentially.

8) Client-side processing of data could be more "realistic", but then it becomes a benchmark of your calling application, not of the DB engine. For a microbenchmark you really want to test the DB engine in isolation, even if the results won't be directly applicable to the real world. This just means that microbenchmarking can't be your only activity; you also need to test in the full context of a real application.

9) Readahead - yes, in DB-larger-than-RAM workloads I've found that readahead just needs to be turned off completely.

hyc (2014-07-10 01:46:07):

Some pretty good points there; they triggered several responses from me:

LevelDB is not small or clean code, by any definition. It is demonstrably complex and bug-ridden, causing no end of teeth-gnashing for its users: https://github.com/bitcoin/bitcoin/issues/2770 etc.

RocksDB is interesting, but the complexity takes it far out of the realm of what's expected in a "lightweight embedded database". Config complexity is one of the serious downsides to BerkeleyDB as well; this path is well-trodden and leads to disappointment.

Editorial #2, or "all about write amplification" - I've not been shy about my disdain for LevelDB and LSM designs in general. The world is waking up to the fact that write amplification is a fact of life for LSMs, and this is hugely detrimental for the direction storage is going. https://www.usenix.org/conference/hotstorage14/workshop-program/presentation/marmol

Notes on db_bench:

1) Good stuff. I've merged the statistics reporting from RocksDB's version into all of the other drivers that I've written: https://github.com/hyc/leveldb/commit/c803f54d5cd7c3d394e9470161e34799e463f60e

2) RNG seed - yeah, and what's worse is that each individual test restarts with the same seed, so even if you do multiple tests in the same run you're not getting new numbers. I fixed this latter aspect recently as well: https://github.com/hyc/leveldb/commit/b7f0db701653cb23d2e1b4935e4b4f8b6f92169e

3) I should probably merge the RocksDB 64-bit RNG soon...
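To make the failure mode in (2) concrete: if every test re-creates its RNG with the same constant seed, each test in a run replays an identical key sequence, so later "random" tests revisit exactly the keys the earlier tests touched. A minimal sketch of the problem and one common remedy, deriving a distinct seed per test from a base seed plus a counter; this is only an illustration, not the actual db_bench code or the fix in the linked commit, and the base seed value is a placeholder.

    #include <cstdint>
    #include <cstdio>
    #include <random>

    int main() {
      const uint64_t kBaseSeed = 301;  // placeholder base seed

      for (int test = 0; test < 3; ++test) {
        std::mt19937_64 same_seed(kBaseSeed);        // buggy: identical stream every test
        std::mt19937_64 per_test(kBaseSeed + test);  // remedy: distinct stream per test

        printf("test %d: same-seed first key=%llu, per-test first key=%llu\n",
               test,
               (unsigned long long)(same_seed() % 1000000),
               (unsigned long long)(per_test() % 1000000));
      }
      return 0;
    }

Running this prints the same "same-seed" key for every test but a different "per-test" key each time, which is the difference between re-reading cached data and actually exercising new keys.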