Wednesday, September 28, 2022

Storage engines, efficiency and large documents, rows, objects

Storage engines for OLTP focus on storing small objects and by small object I mean sizeof(object) << sizeof(page). They can store larger objects but the engines are not optimized for that and both efficiency and performance will suffer. By larger objects I mean sizeof(object) >= sizeof(page). 

Given the growth of the document model we need some OLTP storage engines to be efficient and performant for larger objects. What might that mean? The rest of this post is speculation by me. I am far from an expert on this topic and will try to be careful. The topic would make for an interesting conference paper.

This post is more truthy then true. It has speculation, unanswered questions and likely has several mistakes or things I omitted because I was not aware of them.

I will focus on four index structures that are popular in my world:

  • Clustered b-tree (InnoDB)
  • Heap-organized b-tree (Postgres)
  • Cow-R b-tree (WiredTiger)
  • LSM (MyRocks)
At a high-level I want to understand where the storage engine consumes CPU & IO while reading and writing large objects. The writes can be small updates (changing a few fields in an object) or big updates/inserts (changing all or most of an object).

The context that I care about is a workload where the working set isn't in memory (because RAM is expensive and fast storage is here).

One more bit of context - with JSON documents stored in a document DBMS like MongoDB or a SQL DBMS like MySQL, I hope to learn about support for partial updates. If only a few fields are updated in a large document does the DBMS have optimizations to reduce the overhead of persistent and replicating such updates?

Document vs Relational

I started to think about this blog post when considering the impact of document vs relational approaches -- or larger documents vs smaller rows -- starting with the scenario where you want to update a few fields in that large document vs updating a few of the small rows in the relational approach. And I assume the working set doesn't fit in cache.

At a high-level there is a big difference between a b-tree and LSM. With an LSM if a 3KB object is modified without support for partial updates, then 3KB is written to the memtable and then compaction does its thing. We can estimate the total write overhead based on average write-amp and sizeof(updated object). The predicted cost is largely determined by sizeof(updated object). But this is less true with a b-tree given my working-set assumption above. With a b-tree whether the updated object is only a few bytes or a few KB, if that object fits in one page and page write-back is usually done for pages with only one modified object then the write-back cost is less dependent on object size -- because the write-back cost is 1 page.

A rule of thumb from the previous paragraph can be that a b-tree write-back overhead is less sensitive than an LSM to large objects as long as sizeof(object) <= sizeof(page).

From the read perspective, if you need all or most of the fields of a large document then it will be more efficient to keep them in a large document and avoid the joins. It will also make better use of the cache memory because the SQL DBMS approach might end up with pages in cache where only one row of such pages are getting accessed.

Efficiency overview

I focus my speculation on efficiency. But being more efficient frequently implies being more performant. The efficiency areas are:
  • storage - how are the larger objects stored in the index structure? In general it is another source of (reasonable) complexity to understand if you want to be a power user.
  • write-back - for a b-tree this means the work (CPU & IO) to write back dirty pages. For an LSM this is the work done by compaction and memtable flush. The overhead for this is measured in the amount of CPU, random page writes and KB written per transaction. The relative importance of random page writes vs KB written depends on your storage technology (random page writes hurt more with spinning disk than with SSD). Note that with compression the CPU overhead includes the cost to compress the data being written back or compacted. There can also be an overhead from reading data -- obviously with compaction, but also with read-modify-write for a b-tree as pages can't be modified unless they are in-memory.
  • redo log - the redo log lets a write transaction do a small amount of writing to the log to defer a larger amount of writing needed by write-back
  • replication - the amount of data that must be written to a replication log and sent over the network per transaction. If log shipping is used then this is just a repeat of the redo log analysis. Postgres does log shipping and there has been work to support statement-based. MySQL started with statement-based and then moved to row-based. My knowledge of MongoDB internals is weaker but I assume it is document-based, which is similar to MySQL's row-based. '
  • partial updates - when a large document gets a partial updates (a few fields are changed) does the DBMS optimize for that case WRT to replication, redo log and write-back?
Efficiency analysis

Disclaimer, this section is more truthy than true. I am curious to learn which of the DBMS (Postgres, MySQL, MongoDB) have optimizations for partial updates to large documents but not curious enough to do the research.

Clustered b-tree (InnoDB)

  • storage - objects are stored inline in the clustered index leaf pages if they are small enough. The definition of small enough depends on the row format but certainly it must be smaller than the InnoDB page size. I wonder if a clustered index is the right structure for large objects unless the large objects are always stored in overflow pages. Percona has a great summary and the reference manual has more info. But large objects are likely to be stored in overflow pages (meaning there can be more random IO to read them) and overflow pages are not shared by multiple blobs (meaning there will be more space amplification). Compression is possible via table or page compression. I am not a fan of the page compression approach but table compression worked great for me long ago. There are many recent improvements to InnoDB for large objects done to support JSON docs including support for partial updates, per-LOB indexes and compressed LOBs. I have yet to figure out whether I can enable compression for large JSON documents stored in LOB overflow pages without enabling table or page compression.
  • write-back - assuming overflow pages are used then write-back likely means writing back both the clustered index leaf page that references the overflow pages along with the overflow pages. This likely means more write amplification. Assuming the doublewrite buffer is enabled then there is extra write amplification from the doublewrite buffer when previously written pages are to be written again.
  • redo log - InnoDB doesn't write page images to the redo log, with one exception (compression) that will ignore for now. But the entire row must be written to the redo log unless there is something clever for partial updates.
  • replication - read up on the binlog_row_image option. But I am not sure if MySQL does anything clever with respect to the partial updates mentioned above. Can it avoid sending the full document when only a few fields have changed?
  • partial updates - yes, see here
Heap-organized b-tree (Postgres)
  • storage - objects are stored in the heap-organized table pages if they fit. When they don't read up on TOAST. I like that there compression can be enabled/disabled just for TOAST.
  • write-back - similar to InnoDB there might be extra write amplification from writing back both the heap table pages that reference TOAST columns and the out-of-line storage used to store the TOAST columns.
  • redo log - the first time a page is modified for a given redo log then that page is written to the redo log. For large objects the entire object must be written to the log.
  • replication
  • partial updates - I don't know whether that is optimized for JSON/JSONB.
Cow-R b-tree (WiredTiger, mostly unanswered because my internals knowledge has faded)
  • storage
  • write-back
  • redo log
  • replication
  • partial updates - I don't know whether MongoDB does anything special
LSM (MyRocks)
  • storage - large objects are stored inline unless you use Integrated BlobDB to get key-value separation. Compression is trivial.
  • write-back - the write-back cost is average-write-amp * sizeof(object)
  • redo log - the entire large object is written to the RocksDB redo log
  • replication - see binlog_row_image
  • partial updates - yes via the merge operator if you use pure RocksDB. Optimizations for large objects have yet to arrive.


  1. JSON partial updates in mysql are supported if you follow the docs but that requires very explicit use of JSON_SET or JSON_REPLACE so from what I've seen in practice requires ver explicit support in the SQL or the ORM. In practice I've seen the JSON column being updated in full so no optimisation is performed by MySQL. For large JSON blobs the lower layers could potentially optimise this further.

    1. It will be interesting if this is expanded. Hopefully we can leverage the RocksDB merge operator in MyRocks for partial updates.

  2. MySQL JSON partial updates yes, but .... you have to use the JSON_SET, JSON_REPLACE or JSON_REMOVE functions. If you don't nothing gets done partially.

    I've seen that a lot of code I've seen just modifies the JSON and UPDATEs it so no optimisation would take place. To get the wanted optimisations you need support from your ORM or directly in the SQL you generate. So in reality you have to make a huge effort to get this optimised via replication.

    Ideally or optionally the server could see the "before json" and "after json" and generate a "json diff" and send that diff and as I've seen a lot of JSON in my machines being relatively large that might be CPU intensive.

    I would certainly like to see such an optional optimisation as this JSON bloats the binlogs which are otherwise pretty efficient in compressed minimal RBR mode.

  3. Thanks for your interesting article.

    Please refer to the following patent where we allow partial updates to large posting lists using large object trees, which will be also used for your purpose.