The open issues for index+log include:
- Is GC multi-threaded?
- Does value log GC require index queries?
- Must the index be cached?
- Is block compression possible?
- With an LSM as the index how is relocating an entry in the value log supported?
Is GC multi-threaded?
- I hope so.
- It hasn't been in some of the index+log papers/systems.
Does value log GC require index queries?
- Yes, in most of the index+log papers/systems this is required to determine whether a value is live when scanning the value log during GC
- HashKV proposed a solution that doesn't require such queries. It is a wonderful paper but I think there are more things to figure out. The general idea is to hash KV pairs into buckets such that all keys for a bucket will fit in memory during GC. GC then reads the value log segments for a bucket twice -- first to get the keys, second to copy the live KV pairs into a new log segment. I wonder if managing the hash bucket ranges has things in common with linear and extendible hashing.
Must the index be cached?
- The index for index+log might be 10X larger than for an LSM because an LSM uses a block index while index+log uses a row index. A block index has a key+pointer per block and a row index has that per row. An LSM only needs a block index because rows are clustered by key.
- To do at most one IO per point query for a database larger than memory the index must be cached for index+log to avoid index IO as the IO is spent reading the value log. If the <= 1 IO / query constraint is then cache-amp is larger for index+log compared to LSM because the index+log index is larger (see #1).
- If value log GC queries the index to determine whether each value is live then this query shouldn't do IO or GC will be too slow and inefficient. This is another reason for caching the index+log index.
- If the index must be cached then maybe an LSM isn't the best choice. Consider an index structure optimized for a main-memory DBMS.
Is block compression possible?
- I hope so but this hasn't been explained in some of the index+log papers/systems.
- Per-record compression can be done instead of block compression. That will have a lower compression rate but less CPU overhead when decompressing on read.
- It might be hard to do block compression the first time a KV pair is written to a value log. One option is to write to a redo log until enough data is available to do block compression and then write to the value log. Another option is to defer block compression until KV pairs are rewritten during GC.
When the index is an LSM what happens when values are moved?
- This is an issue that I learned about via CockroachDB. I should have figured it out long ago but mistakes happen.
- The LSM tree is read in order -- top to bottom with leveled compaction and left to right with tiered compaction. This guarantees that the correct result is returned with respect to visibility. If the first entry for a key is a tombstone then the search can stop (ignoring snapshot reads).
- Value log GC moves live KV pairs to new log segments. To find the KV pair after the move either the index entry must be updated or the index entry must have a logical value log key and then another index is needed to map the logical value log key to a physical value log (filename, offset).
- Updating the LSM index entry to reference the new value log location (filename, offset) can be done by inserting a new KV pair into the LSM but that either breaks read consistency semantics or complicates read processing. It would break LSM read processing because inserting back into the LSM implies this is a new write, but it just a move for GC. Something other than an LSM that supports update-in-place makes this easier.
- Details on using a logical value log key are explained in a CockroachDB github issue.