Monday, August 27, 2018

Review of "Concurrent Log-Structured Memory" from VLDB 2018.

Space-amplification matters for in-memory stores too.

This is a review of Concurrent Log-Structured Memory for Many-Core Key-Value Stores from VLDB 2018 and the engine is named Nibble. The paper is worth reading - the ideas are interesting and the performance results are thorough. Nibble is an example of index+log. Their focus is on huge many-core servers. I wonder how RocksDB would do a a server with 240 cores and many TB of DRAM. I assume there might be a few interesting performance problems to make better. Start with table 2 for a condensed summary. The design overview is:

  • partitioned, resizable hash index - the hash index uses open addressing and linear probing. Each bucket has 15 entries, 64-bits/entry and a version counter. The counter is incremented twice/update -- at start and end. Readers retry if counter is changed or unchanged but odd.
  • per-socket log-structured memory allocators with per-core log heads. There is a log instance per core and each log instance has multiple heads (locations where log inserts are done). They call this write local, read global because reads might be from a log instance written by another core. A cost-based approach is used to select the next log segment for GC.
  • thread-local epochs - after GC there might still be threads reading from a log segment. Epochs are used to determine when that is safe. The CPU time stamp counter is used for the epoch value and each thread writes its epoch value, using a cache line per thread, to a fixed memory location.
Things that are interesting to me:
  • When needed, a partition of the hash index is doubled in size. Rather than allocate 2X more memory, they use the VM to extend the current allocation so only ~1/2 of the objects in the partition must be relocated. I am curious whether there are stalls during resize.
  • What is the range of load factors that the Nibble hash index can sustain? A b-tree with leaf pages 2/3 full is one source of fragmentation, a hash index with a load factor less than 100% is another form of fragmentation. Fortunately, with a load factor of 80%, 20% of memory isn't wasted because the log segments should use much more memory than the hash index. Figure 14 shows throughput as a function of memory utilization.
  • index+log has a large CPU cost during GC from tree index probes, but Nibble uses a hash index which has a lower probe cost. Nibble maintains a live bytes counter per log segment. Segments with zero live bytes can be collected without probing the index to find live entries. Otherwise an index probe per entry is required to determine whether an entry is live.
  • I am curious about the results in Figures 1 and 9b on memory fragmentation per allocator. The results for jemalloc and tcmalloc are similar while ptmalloc2 does the best.  The microbenchmarks are from the Rumble paper. I don't think much can be concluded from such simple allocator workloads -- but I still like the paper. RocksDB or LevelDB with db_bench would be a better test for fragmentation. It would also be good to know which versions of the allocators (jemalloc, tcmalloc, etc) were used and whether any tuning was done.
I have published posts on the benefits from using jemalloc or tcmalloc compared to glibc malloc for RocksDB. A RocksDB/MyRocks instance has ~2X larger RSS with glibc malloc because jemalloc and tcmalloc are better at avoiding fragmentation. See posts one, two, three, four and five. RocksDB does an allocation per block read and puts a lot of stress on the allocator especially when using a fast storage device. An allocation remains cached until it reaches the LRU end in the block cache or the block's SST gets unlinked. I expect blocks in the block cache to have vastly different lifetimes.

2 comments:

  1. Should the following:

    "The counter is incremented twice/update -- at start and end. Readers retry if counter is changed or unchanged but odd."

    be changed to:

    "The counter is incremented twice/update -- at start and end. Readers retry if counter is unchanged or changed but odd."

    ReplyDelete
    Replies
    1. I paused when writing that and considered whether it was correct or confusing. Maybe I didn't pause enough. You are best off reading the paper. It is worth reading.

      Delete

Fixing some of the InnoDB scan perf regressions in a MySQL fork

I recently learned of Advanced MySQL , a MySQL fork, and ran my sysbench benchmarks for it. It fixed some, but not all, of the regressions f...