Thursday, April 5, 2018

Index Structures, Access Methods, whatever

I prefer to use Index Structures when writing about algorithms for indexing data stored on disk and SSD but Access Methods is another popular name. Recently I have been working on a high-level comparison (lots of hand waving) of index structures in terms of read, write, space and cache amplification.

I started by dividing the index structures into tree-based and hash-based. Tree-based structures support range queries while hash-based structures do not. Most of my experience is with tree-based approaches. There was a lot of research on hash-based structures in the 1980s (extendible, dynamic and linear hashing); they are in production today, though not as prominent as b-trees, and there is recent research on them. This paper by Seltzer and Yigit is a great overview of hash-based index structures.

The next classification is by algorithm - page-based, index+log and LSM. I use this for both tree-based and hash-based so LSM means Log Structured Merge rather than Log Structured Merge Tree. Most DBMS implement one or two of these. Tarantool has more algorithmic diversity, but I don't have much experience with it.

For tree-based:
  • page-based - I used to call this update-in-place but think that page-based is a better name because it includes update-in-place (UiP) and copy-on-write-random (CoW-R) b-trees. Great implementations include InnoDB for UiP and WiredTiger for CoW-R.
  • LSM - leveled and tiered compaction are examples (RocksDB, LevelDB, Cassandra). Note that the original LSM design by O'Neil et al. didn't have an L0 and I suspect that the L0 might be a mistake, but that is for another post. I don't want to insult the LevelDB authors: the L0 makes sense when you want to limit memory consumption for a database embedded in client apps, but I am not sure it makes sense for serious OLTP with RocksDB. O'Neil also did fundamental work on bitmap indexes. I worked on both so he made my career possible (thank you).
  • index+log - use a tree-based index with data in log segments. The index points into the log segments. GC is required. It scans the log segments and probes the index to determine whether a record is live or can be deleted (see the sketch after this list). There are fewer examples of this approach but interesting systems include WiscKey and ForestDB. Range queries will suffer unless the index is covering (this needs more discussion, but not here).
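
To make that GC loop concrete, here is a minimal Python sketch. It assumes an index that maps key -> (segment id, offset) for the live copy of each record and log segments that are lists of (key, value) records; the names and layout are made up for illustration and don't match any particular system.

    def gc_segment(index, segments, victim_id):
        # Don't collect the segment that is currently being appended to.
        new_id = max(segments)
        assert victim_id != new_id

        # A record in the victim is live only if the index still points
        # at this exact copy.
        live = []
        for offset, (key, value) in enumerate(segments[victim_id]):
            if index.get(key) == (victim_id, offset):
                live.append((key, value))

        # Copy live records into the newest segment and repoint the index.
        for key, value in live:
            segments[new_id].append((key, value))
            index[key] = (new_id, len(segments[new_id]) - 1)

        # Dropping the victim reclaims the space used by dead records.
        del segments[victim_id]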

For hash-based:
  • page-based - see dynamic, extendible and linear hashing. TIL that ZFS implements extendible hashing. BerkeleyDB supports linear hashing. The dbm implementation on many Linux/Unix systems also implements some variant of update-in-place persistent hash tables.
  • LSM - today I only know of one example of this - SILT. That is a great paper (read it). I include it as an example of hash-based even though one of the levels is ordered.
  • index+log - BitCask is one example but the index wasn't durable and it took a long time (scan all log segments) to rebuild it on process start (see the sketch after this list). Faster might be another example, but I am still reading the paper. I hope someone names a future system Efficienter or Greener.
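
The slow rebuild is easy to see in a sketch. This toy Python version assumes each segment is a list of (key, value) records and that a delete leaves a tombstone with value None; every record in every segment must be read before the process can serve requests.

    def rebuild_index(segments):
        # segments: dict of segment id -> list of (key, value) records.
        # Scan from oldest to newest so later copies of a key win.
        index = {}
        for segment_id in sorted(segments):
            for offset, (key, value) in enumerate(segments[segment_id]):
                if value is None:                  # tombstone from a delete
                    index.pop(key, None)
                else:
                    index[key] = (segment_id, offset)
        return index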

Finally, I have a few comments on performance. I will be brief; maybe there will be another post with a lot more detail on my opinions:
  • b-tree - provides better read efficiency at the cost of write efficiency. The worst case for write efficiency is writing back a page for every modified row. Cache efficiency is better for a clustered index than a non-clustered index -- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads. Space efficiency for a b-tree is good (not great, not bad) -- the problem is fragmentation.
  • LSM - provides better write efficiency at the cost of read efficiency. Leveled compaction provides amazing space efficiency. Tiered compaction gets better write efficiency at the cost of space efficiency. Compaction does many key comparisons and this should be considered as part of the CPU overhead for an insert (maybe 4X more comparisons/insert than a b-tree).
  • index+log - provides better write efficiency. Depending on the choice of index structure this doesn't sacrifice read efficiency like an LSM. But the entire index must remain in RAM (just like a non-clustered b-tree) or GC will fall behind and/or do too many storage reads. GC does many index probes and the cost of this is greatly reduced by using a hash-based solution. These comparisons should be considered as part of the CPU overhead of an insert. There is also a write vs space efficiency tradeoff. Increasing the amount of dead data that can be tolerated in log segments makes GC less frequent, which improves write-amplification but hurts space-amplification (see the numbers after this list). There are variants that don't need GC, but they are not general purpose.
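
Some back-of-the-envelope numbers for the write vs space tradeoffs above, as a Python sketch. The page size, row size and GC thresholds are assumptions chosen for illustration, not measurements from any system.

    page_bytes, row_bytes = 8192, 128

    # b-tree worst case: one dirty page written back per modified row.
    print("b-tree worst-case write-amp: %dx" % (page_bytes // row_bytes))   # 64x

    # index+log: if a segment is collected once a fraction `dead` of it is
    # garbage, each byte of new data costs about 1/dead bytes of writes
    # (the new data plus live data copied forward), while segments carry up
    # to `dead` garbage so space-amp is about 1/(1 - dead).
    for dead in (0.2, 0.5):
        print("dead=%.0f%%: write-amp ~%.1fx, space-amp ~%.2fx"
              % (dead * 100, 1 / dead, 1 / (1 - dead)))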


8 comments:

  1. I'm not 100% sure but I believe Amazon Aurora is also based on index+log.
    I've referred to this talk - https://www.youtube.com/watch?v=-TbRxwcux3c

    1. This is a paper that describes the Aurora for MySQL implementation - https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

      I think it is amazing but I am not sure where to put it. AFAIK InnoDB continues to think it is doing update-in-place, but underneath it is log structured. Maybe it is similar to running InnoDB on top of a log-structured file system. Those aren't popular in the open source space but NetApp has many customers with WAFL.

      Unfortunately, this means you pay for write amp from a b-tree doing update-in-place where the worst case write amp is size(page) / size(row). Then you pay again from doing GC on the log segments which can increase write amp by a factor of 2 or 5 depending on how much garbage you tolerate in the log segments.
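
      For a rough sense of how those two factors compose, a quick calculation (the sizes and GC factors are illustrative assumptions, not Aurora measurements):

        page_bytes, row_bytes = 8192, 128
        btree_write_amp = page_bytes / row_bytes      # worst case: 64x
        for gc_factor in (2, 5):
            # update-in-place write-amp multiplied by GC on the log segments
            print("total write-amp: ~%dx" % (btree_write_amp * gc_factor))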

      But that cost, in write amp or space amp, might not be a big deal to Amazon. That cost can be passed on to the user and the user will be happy because they are paying for durability and availability, not for the best write and space efficiency.

    2. > AFAIK InnoDB continues to think it is doing update-in-place, but underneath it is log structured

      My understanding is that underneath it's still page-based (i.e. updating a record in the index updates the whole page). So the storage write amp is the same as B-tree (e.g. if 100 records are updated that reside on different pages, 100 pages will get updated in the storage).

      The benefit that Aurora provides is that write amp happens locally on each storage node, which avoids inter-node traffic amp. When a write happens, only log records are sent to storage nodes, so for 100 record updates 100 log records are sent to each storage node (the replication factor is 6, so there is a necessary 6x amp due to replication). Then each storage node does page recovery (i.e. applies the records to the corresponding pages), which could potentially update up to 100 pages in storage.

      If the replication happened at the file system level, then the B-tree write amp would translate to inter-node traffic amp (i.e. for 100 record updates up to 100 pages could be replicated 6 ways), which Aurora avoids.

      It's not entirely clear to me how much the storage nodes are actually aware of the index structure, it could be that the semantics of the B-tree index is actually maintained in InnoDB, which would just tell the storage nodes which page it wants to apply the record to (e.g. update record xyz on page 1234 at index 42). Given how they shard volumes across storage nodes, I suspect that the storage nodes just operate on physical pages.

      In any case, it seems like Aurora is somewhat orthogonal to your classification -- Aurora is a way to implement a replicated B-tree index structure, so the underlying properties for storage read/write amp are those of the B-tree. If we consider replicated index structures, we get an additional layer of trade-offs with respect to inter-node traffic amp. Running a vanilla InnoDB on top of a replicated file system would translate B-tree write amp into inter-node traffic amp. Aurora avoids that by making the storage nodes do physical page recovery, so B-tree write amp happens locally. This still causes inter-node traffic read amp for primary key lookups and writes: InnoDB would need to read B-tree pages to locate the record position in the B-tree.

      The next level of inter-node traffic optimization would be to make storage nodes do logical record recovery (i.e. InnoDB would talk to storage nodes in terms of logical records rather than physical pages). Then InnoDB could do blind writes without any reads (even though each storage node would locally do read-modify-write), but the downside is that the index is harder to shard across storage nodes (a logical index can become lopsided and require rebalancing, while physical pages can be easily hashed by pgno) and InnoDB won't be able to use page-based caching, as the storage would provide a record-oriented interface. Etc.

    3. The network efficiency gains from Aurora are interesting but I don't want that to distract from the basics. One way to explain this is as a b-tree using a log-structured file system. I previously wrote that logically it is a b-tree and physically the storage is log structured. Neither description is perfect; they are meant to be brief.

      A b-tree on a log-structured FS has worse space-amp and worse write-amp than a b-tree running on an update-in-place file system. The b-tree has space-amp from fragmentation on leaf pages. The log-structured FS adds more space-amp because there is space used by dead pages. The b-tree worst-case write-amp occurs when each row change results in writing back a page. The log-structured FS adds write-amp from doing GC -- copying live pages out of old segments into new segments. Aurora + b-tree will be worse than a plain b-tree, but that doesn't matter: the customers will pay for it, and Aurora is a big deal for many other reasons.

      But what I wrote in the previous paragraph also applies to anything running on an SSD -- which is another form of log-structured storage. Space-amp comes in two forms there -- space wasted by dead data in uncollected segments, and space left unused (over-provisioning) when the file system is formatted to help with SSD endurance. Write-amp comes from flash GC copying live pages into new blocks.

  2. Why is a log-structured DB faster for updates compared to update-in-place?
    A log-structured DB also needs to do a select first to find the pointers to the data being updated and point them to the new data, so the time taken for the select is not reduced in a log-structured DB.
    So what exactly makes it faster for updates?

    1. You are right that I ignore the cost of finding the matching data when briefly explaining the cost of an update above. In my less brief description, which has yet to be shared with the public, I include that as a disclaimer. Here I am only describing the cost of performing the write.

      And it can get even more complicated because with a write-optimized solution like an LSM or a log-structured approach you can do a blind write where the search you mention is not done. That isn't possible with update-in-place.

      The write-optimized approaches, LSM and log-structured, frequently benefit from doing write-back faster (less write-amplification means less is written back per logical change). So the write-back is faster and uses less IO capacity, which leaves more IO capacity for useful things like reading data for users' queries.

    2. I am new to LSM. Could you briefly explain how an LSM avoids the cost of finding data while updating?


    3. First we must determine whether we are discussing a read-modify-write, a blind write or a commutative write. For more on types of writes see http://smalldatum.blogspot.com/2014/04/types-of-writes.html

      A blind write just means overwriting the value for a key, so you don't care about the old value. With an LSM that is easy to do -- first write to the redo log, then to the memtable (write buffer). Eventually the memtable is full and flushed to make a new L0 or L1 file. Eventually compaction is done to move data down the tree. Compaction reads from and writes to the LSM tree using large IO, so we usually ignore the cost of it -- it isn't free, but these reads are amortized over many KV pairs so the cost is much smaller than the per-page reads done for update-in-place.

      With update-in-place the page to be written back must have been read into memory to be modified, so even if you do a blind write it isn't read-free.
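
      A toy Python sketch of that blind-write path, to show where the work goes. The names and thresholds are made up; this is not how RocksDB is implemented, just the shape of the workflow.

        class ToyLSM:
            def __init__(self, memtable_limit=4):
                self.wal = []          # redo log, appended to first
                self.memtable = {}     # in-memory write buffer
                self.runs = []         # flushed, immutable sorted runs
                self.memtable_limit = memtable_limit

            def put(self, key, value):
                # Blind write: the old value is never read.
                self.wal.append((key, value))
                self.memtable[key] = value
                if len(self.memtable) >= self.memtable_limit:
                    self.flush()

            def flush(self):
                # The full memtable becomes a new sorted run (a new L0 file
                # in RocksDB terms) and the redo log can be truncated.
                self.runs.append(sorted(self.memtable.items()))
                self.memtable.clear()
                self.wal.clear()

            def get(self, key):
                # Reads check the memtable, then runs from newest to oldest.
                # This is the search cost that a blind write avoids.
                if key in self.memtable:
                    return self.memtable[key]
                for run in reversed(self.runs):
                    for k, v in run:
                        if k == key:
                            return v
                return None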

      For more on LSM write workflow see Google results for "LSM compaction"
      https://www.google.com/search?q=lsm+compaction

