Storage engines for OLTP focus on storing small objects and by small object I mean sizeof(object) << sizeof(page). They can store larger objects but the engines are not optimized for that and both efficiency and performance will suffer. By larger objects I mean sizeof(object) >= sizeof(page).
Given the growth of the document model we need some OLTP storage engines to be efficient and performant for larger objects. What might that mean? The rest of this post is speculation by me. I am far from an expert on this topic and will try to be careful. The topic would make for an interesting conference paper.
This post is more truthy then true. It has speculation, unanswered questions and likely has several mistakes or things I omitted because I was not aware of them.
I will focus on four index structures that are popular in my world:
- Clustered b-tree (InnoDB)
- Heap-organized b-tree (Postgres)
- Cow-R b-tree (WiredTiger)
- LSM (MyRocks)
A rule of thumb from the previous paragraph can be that a b-tree write-back overhead is less sensitive than an LSM to large objects as long as sizeof(object) <= sizeof(page).
From the read perspective, if you need all or most of the fields of a large document then it will be more efficient to keep them in a large document and avoid the joins. It will also make better use of the cache memory because the SQL DBMS approach might end up with pages in cache where only one row of such pages are getting accessed.
- storage - how are the larger objects stored in the index structure? In general it is another source of (reasonable) complexity to understand if you want to be a power user.
- write-back - for a b-tree this means the work (CPU & IO) to write back dirty pages. For an LSM this is the work done by compaction and memtable flush. The overhead for this is measured in the amount of CPU, random page writes and KB written per transaction. The relative importance of random page writes vs KB written depends on your storage technology (random page writes hurt more with spinning disk than with SSD). Note that with compression the CPU overhead includes the cost to compress the data being written back or compacted. There can also be an overhead from reading data -- obviously with compaction, but also with read-modify-write for a b-tree as pages can't be modified unless they are in-memory.
- redo log - the redo log lets a write transaction do a small amount of writing to the log to defer a larger amount of writing needed by write-back
- replication - the amount of data that must be written to a replication log and sent over the network per transaction. If log shipping is used then this is just a repeat of the redo log analysis. Postgres does log shipping and there has been work to support statement-based. MySQL started with statement-based and then moved to row-based. My knowledge of MongoDB internals is weaker but I assume it is document-based, which is similar to MySQL's row-based. '
- partial updates - when a large document gets a partial updates (a few fields are changed) does the DBMS optimize for that case WRT to replication, redo log and write-back?
- storage - objects are stored inline in the clustered index leaf pages if they are small enough. The definition of small enough depends on the row format but certainly it must be smaller than the InnoDB page size. I wonder if a clustered index is the right structure for large objects unless the large objects are always stored in overflow pages. Percona has a great summary and the reference manual has more info. But large objects are likely to be stored in overflow pages (meaning there can be more random IO to read them) and overflow pages are not shared by multiple blobs (meaning there will be more space amplification). Compression is possible via table or page compression. I am not a fan of the page compression approach but table compression worked great for me long ago. There are many recent improvements to InnoDB for large objects done to support JSON docs including support for partial updates, per-LOB indexes and compressed LOBs. I have yet to figure out whether I can enable compression for large JSON documents stored in LOB overflow pages without enabling table or page compression.
- write-back - assuming overflow pages are used then write-back likely means writing back both the clustered index leaf page that references the overflow pages along with the overflow pages. This likely means more write amplification. Assuming the doublewrite buffer is enabled then there is extra write amplification from the doublewrite buffer when previously written pages are to be written again.
- redo log - InnoDB doesn't write page images to the redo log, with one exception (compression) that will ignore for now. But the entire row must be written to the redo log unless there is something clever for partial updates.
- replication - read up on the binlog_row_image option. But I am not sure if MySQL does anything clever with respect to the partial updates mentioned above. Can it avoid sending the full document when only a few fields have changed?
- partial updates - yes, see here
- storage - objects are stored in the heap-organized table pages if they fit. When they don't read up on TOAST. I like that there compression can be enabled/disabled just for TOAST.
- write-back - similar to InnoDB there might be extra write amplification from writing back both the heap table pages that reference TOAST columns and the out-of-line storage used to store the TOAST columns.
- redo log - the first time a page is modified for a given redo log then that page is written to the redo log. For large objects the entire object must be written to the log.
- partial updates - I don't know whether that is optimized for JSON/JSONB.
- redo log
- partial updates - I don't know whether MongoDB does anything special
- storage - large objects are stored inline unless you use Integrated BlobDB to get key-value separation. Compression is trivial.
- write-back - the write-back cost is average-write-amp * sizeof(object)
- redo log - the entire large object is written to the RocksDB redo log
- replication - see binlog_row_image
- partial updates - yes via the merge operator if you use pure RocksDB. Optimizations for large objects have yet to arrive.