Comments on Small Datum: Index Structures, Access Methods, whatever

The network efficiency gains from Aurora are inter...

2018-04-15T08:48:30.225-07:00

The network efficiency gains from Aurora are interesting but I don't want that to distract from the basics. One way to explain this is a b-tree using a log structured file system. I previously wrote that logically it is a b-tree and physically the storage is log structured. Neither description is perfect, they are mean to be brief.

A b-tree on log-structured FS has worse space-amp and worse write-amp than a b-tree running on an update-in-place file system. The b-tree has space-amp from fragmentation on leaf pages. The log-structured FS adds more space-amp because there is space used by dead pages. The b-tree worst-case write-amp occurs when each row change results in write back for a page. The log structured FS adds write-amp from doing GC -- copy live pages out of old segments into new segments. Aurora + b-tree will be worse than b-tree but that doesn't matter as the customers will pay for it, and it doesn't matter because Aurora is a big deal for many other reasons.

But what I wrote in the previous paragraph also applies to anything running with an SSD -- which is another form of log structured storage. Space-amp comes in two forms there -- space wasted by dead data in uncollected segments, and space not used when the file system is formatted to help with SSD endurance. Write-amp comes from flash GC copying live pages into new blocks.

> AFAIK InnoDB continues to think it is doing u...

2018-04-14T23:02:26.359-07:00

> AFAIK InnoDB continues to think it is doing update-in-place, but underneath it is log structured

My understanding is that underneath it's still page-based (i.e. updating a record in the index updates the whole page). So the storage write amp is the same as B-tree (e.g. if 100 records are updated that reside on different pages, 100 pages will get updated in the storage).

The benefit that Aurora provides is that write amp happens locally on each storage node, which avoids inter-node traffic amp. So when a write happens, only log records are sent to storage nodes, so for 100 records there are 100 records that are sent to each storage node (the replication factor is 6, so there is necessary 6x amp due to replication), then each storage node will do page recovery (i.e. apply records to the corresponding pages) which could potentially update up to 100 pages in the storage.

If the replication happened at the file system level, then the B-tree write amp would translate to inter-node traffic amp (i.e. for 100 record updates up to 100 pages could be replicated 6 ways), which Aurora avoids.

It's not entirely clear to me how much the storage nodes are actually aware of the index structure, it could be that the semantics of the B-tree index is actually maintained in InnoDB, which would just tell the storage nodes which page it wants to apply the record to (e.g. update record xyz on page 1234 at index 42). Given how they shard volumes across storage nodes, I suspect that the storage nodes just operate on physical pages.

In any case, it seems like Aurora is somewhat orthogonal to your classification -- Aurora is a way to implement a replicated B-tree index structure, so the underlying properties for the storage read/write amp are the ones of the B-tree. If we consider replicated index structures, we'd get an additional layer of trade-offs with respect to inter-node traffic amp. Running a vanilla InnoDB on top of replicated file system would translate B-tree write amp into inter-node traffic amp. Aurora avoids that by making storage node do physical page recovery, so B-tree write amp happens locally. This still causes inter-node traffic read amp for primary key lookups and writes: InnoDB would need to read B-tree pages to locate the record position in the B-tree. The next level of inter-node traffic optimization would be to make storage nodes do logical record recovery (i.e. InnoDB would talk to storage nodes in terms of logical records rather than physical pages), then InnoDB could do blind writes without any reads (even though each storage node would locally do read-modify-write), but the downside is that the index is harder to shard across storage nodes (logical index can become lopsided and require rebalancing, while physical pages can be easily hashed by pgno) and InnoDB won't be able to use page-based caching, as the storage would provide record-oriented interface. Etc.

First must determine whether we are discussing rea...

2018-04-08T14:29:23.208-07:00

First must determine whether we are discussing read-modify-write, blind-write or commutative write. For more on types of writes see http://smalldatum.blogspot.com/2014/04/types-of-writes.html

Blind write just means to overwrite the value for a key. So you don't care about the old value. With an LSM that is easy to do -- first write to redo log, then to memtable (write buffer). Eventually the memtable is full and flushed to make a new L0 or L1 file. Eventually compaction is done to move data down the tree. Compaction reads from and writes to the LSM tree using large IO so we usually ignore the cost of it -- it isn't free but these reads are amortized over many KV pairs so it is much smaller than the per-page reads done for update-in-place.

With update-in-place the page to be written back must have been read into memory to be modified. So even were you to do a blind-write it isn't read-free.

For more on LSM write workflow see Google results for "LSM compaction"
https://www.google.com/search?q=lsm+compaction

I am new to LSM, could you briefly explain how LSM...

2018-04-08T13:16:42.452-07:00

I am new to LSM, could you briefly explain how LSM avoid the cost of finding data while updating?

You are right that I ignore the cost of finding th...

2018-04-08T12:42:18.118-07:00

You are right that I ignore the cost of finding the matching data when briefly explaining the cost of an update above. In my less brief description, which has yet to be shared with the public, I include that as a disclaimer. Here I am only describing the cost of performing the write.

And it can get even more complicated because with a write-optimized solution like LSM or log-structured you can do a blind write where that search that you mention is not done. That isn't possible with update-in-place.

The write-optimized approaches, LSM and log-structured, frequently benefit from doing write-back faster (less write-amplification means less is written back per logical change). So the write-back is faster, and uses less IO-capacity, which saves more IO-capacity for useful things like reading data for user's queries.

Why is log structured based DB faster in terms of ...

2018-04-08T12:33:39.339-07:00

Why is log structured based DB faster in terms of update with respect to update-in-place.
Log structured based DB also needs to do a select first to find the pointers of the data to be updated and point to the new data, hence time taken for select is not reduced in log structured based DB.
So what exactly makes it faster for updates?

This is a paper that describes the Aurora for MySQ...

2018-04-05T23:12:10.234-07:00

This is a paper that describes the Aurora for MySQL implementation - https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

I think it is amazing but I am not sure where to put it. AFAIK InnoDB continues to think it is doing update-in-place, but underneath it is log structured. Maybe it is similar to running InnoDB on top of a log structured file system. Those aren't popular in open source space but NetApp has many customers with WAFL.

Unfortunately, this means you pay for write amp from a b-tree doing update-in-place where the worst case write amp is size(page) / size(row). Then you pay again from doing GC on the log segments which can increase write amp by a factor of 2 or 5 depending on how much garbage you tolerate in the log segments.

But that cost, in write amp or space amp, might not be a big deal to Amazon. That cost can be passed on to the user and the user will be happy because they are paying for durability and availability, not for the best write and space efficiency.

I'm not 100% sure but I believe Amazon Aurora ...

2018-04-05T22:40:37.428-07:00

I'm not 100% sure but I believe Amazon Aurora is also based on index+log.
I've referred to this talk - https://www.youtube.com/watch?v=-TbRxwcux3c