Monday, November 30, 2020

Blind-writes and an LSM

Blind-writes for an LSM can be a challenge in the presence of secondary indexes. This post was inspired by an interesting but unpublished paper based on Tarantool and a blog post on SAI in DataStax Cassandra. Some of this restates ideas from the Tarantool paper.

Contributions of this post are:

  1. Explains blind-writes
  2. Explains a new way to do blind-writes for some SQL-ish statements. The new way doesn't require readers to validate secondary index entries, which is required with the Tarantool approach. The new approach only works for some types of statements while the Tarantool approach had fewer limits.

Longer introduction

Non-unique secondary index maintenance is read-free for MyRocks during a write (insert, update, delete). By this I mean that nothing is read from the secondary index so there is no chance of doing a storage read during index maintenance. This makes secondary indexes cheaper with MyRocks than the typical DBMS that uses a B-Tree. The InnoDB change buffer also reduces the chance of storage reads during index maintenance, which is nice but not as significant as the MyRocks benefit. I ignore the reads that MyRocks might do during LSM compaction, so index maintenance is not strictly read-free and I ask you to allow me to be truthy. Regardless, with an LSM there is (almost always) less read and write IO per write compared to a B-Tree.

If you only want to provide SQL semantics and support write-heavy workloads with secondary indexes, then it is hard to be more efficient than MyRocks as it doesn't do reads other than those required for SQL -- fetching the row(s). There are several reasons for fetching the row(s) during processing of an UPDATE or DELETE statement including returning the affected row count and validating constraints. Getting the row requires a read from the table (if heap organized) or the PK index (if using a clustered PK index like MyRocks and InnoDB).

If you are willing to give up SQL semantics then it is possible to be read-free -- no reads for secondary index maintenance, no reads to get the base row. This is explained in the Tarantool paper but that solution has a cost -- all secondary index entries must be validated by reading the base row because they can be stale (invalid). The validation isn't that different from what can happen in Postgres and InnoDB when vacuum and purge get behind and index-only queries might not be index-only.
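The validation cost can be made concrete with a minimal sketch. This is not the Tarantool implementation; the names `validated_lookup`, `rows` and `xa_entries` are hypothetical, and plain Python containers stand in for the PK index and the secondary index. It only shows why a reader must fetch the base row when index entries can be stale.

```python
def validated_lookup(index_entries, rows, value):
    """Return ids whose base row still has a == value, skipping stale entries."""
    live = []
    for a, row_id in index_entries:
        if a != value:
            continue
        row = rows.get(row_id)  # the base-row read that validation costs
        if row is not None and row["a"] == value:
            live.append(row_id)
    return live


# Row 1 was updated from a=4 to a=5 by a blind write, so (4, 1) was never removed.
rows = {1: {"a": 5}, 2: {"a": 4}}
xa_entries = [(4, 1), (5, 1), (4, 2)]
result = validated_lookup(xa_entries, rows, 4)
```

Every candidate entry costs one base-row lookup, which is why an index-only query stops being index-only under this scheme.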

There was also a feature in TokuDB for blind writes but I am not sure whether that avoided reads to get the base row.

Blind writes

A blind write is done without hidden or explicit reads. But I need to add an exception for some reads, because an LSM has to search the memtable to insert key-value pairs. So I limit this post to disk-based DBMS and allow a blind-write to read in-memory search structures. The Put operation in the LevelDB and RocksDB API is a blind write.
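As a rough illustration of that definition, here is a toy key-value store, assuming a dict for the memtable and a second dict that stands in for storage. `ToyLSM` and its methods are hypothetical names and not the LevelDB or RocksDB API. The point is only that a put touches the in-memory structure while a read-modify-write may have to read storage.

```python
class ToyLSM:
    """Toy store: an in-memory memtable over a 'disk' dict that counts reads."""

    def __init__(self, on_disk=None):
        self.memtable = {}
        self.disk = dict(on_disk or {})
        self.storage_reads = 0

    def put(self, key, value):
        # Blind write: only the in-memory memtable is touched.
        self.memtable[key] = value

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        self.storage_reads += 1  # fall through to storage
        return self.disk.get(key)

    def increment(self, key):
        # Read-modify-write: needs the current value, so it may read storage.
        self.put(key, self.get(key) + 1)


db = ToyLSM(on_disk={"k": 10})
db.put("k", 20)
reads_after_put = db.storage_reads

db2 = ToyLSM(on_disk={"k": 10})
db2.increment("k")
reads_after_rmw = db2.storage_reads
```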

An explicit read is done explicitly by the user or implied by the operation. A read-modify-write sequence includes an explicit read and an example is SELECT ... FOR UPDATE followed by UPDATE. Reads done to process the WHERE clause for UPDATE and DELETE statements are explicit reads.

A hidden read is a read that isn't explicit and is done while processing an operation. Examples include:

  • Read an index to validate a unique constraint. A bloom filter makes this fast for an LSM.
  • Determine the number of affected rows to provide SQL semantics for UPDATE & DELETE
  • Read B-Tree secondary index leaf pages to maintain them after a write

Secondary Index Maintenance

An index entry is composed of (indexed column values, row pointer). The row pointer is usually a row ID or the value of the row's primary key columns. The typical index maintenance after a write consists of one or both of: remove the old index entry, insert the new index entry. To perform these steps both the indexed column values and the row pointer must be known. When a DBMS has the row then it has the required information, but a read must be done to get the row and we are discussing ways to avoid that read.
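A small sketch of these steps for an index on a single column, assuming index entries are (column value, row id) tuples. The function name `index_maintenance_ops` is hypothetical, and `None` is used here to mean "no row" rather than SQL NULL. Note that the old column value is an input: without it the old entry cannot be named, which is why the row is normally read first.

```python
def index_maintenance_ops(row_id, old_a, new_a):
    """Return the index writes needed when the indexed value changes.

    old_a is None for an insert and new_a is None for a delete.
    """
    ops = []
    if old_a == new_a:
        return ops  # indexed column unchanged, nothing to maintain
    if old_a is not None:
        ops.append(("delete", (old_a, row_id)))  # remove old index entry
    if new_a is not None:
        ops.append(("put", (new_a, row_id)))  # insert new index entry
    return ops


update_ops = index_maintenance_ops(row_id=3, old_a=4, new_a=5)
insert_ops = index_maintenance_ops(row_id=3, old_a=None, new_a=5)
delete_ops = index_maintenance_ops(row_id=3, old_a=5, new_a=None)
```

With an LSM both the delete and the put are themselves blind writes to the index, so the only read left is the one that supplies the old value.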

SQL

The rest of this post assumes a SQL DBMS and a simple schema. 

create table T (id int primary key, a int, b int)
create index xa on T(a)

I am interested in whether blind writes can be done for the statements listed below. The first 3 statements use the PK index to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the PK index to find the qualifying row, and then perform secondary index maintenance. Secondary index maintenance with a B-Tree means doing a read-modify-write on index leaf pages. With an LSM it can be read-free.

  • P1 - UPDATE T set a = 5 WHERE id=3
  • P2 - UPDATE T set a = a + 1 WHERE id=3
  • P3 - DELETE from T WHERE id=3

The next 3 statements use the secondary index, xa, to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the secondary index to find qualifying index entries, then use the row pointer in each index entry to find the base row and finally perform index maintenance.

  • S1 - UPDATE T set b = 5 WHERE a=4
  • S2 - UPDATE T set b = b + 1 WHERE a=4
  • S3 - DELETE from T WHERE a=4

Truly read-free

I promised an approach that was truly read-free and after many words have yet to deliver. An LSM makes this possible, but since this is LSM-based there will still be reads during compaction so I am truthy here rather than true. Also, I assume reads can be done from an index to evaluate the WHERE clause. 

With a SQL DBMS that uses an LSM, the base row doesn't have to be fetched for statements S1, S2 and S3 above if SQL semantics aren't required. The base row still has to be fetched for statements P1, P2 and P3. For S1, S2 and S3 I assume that the secondary index xa is scanned to find matching index entries. For each matching index entry the remaining work is:

  • S1 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=5.
  • S2 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=b+1.
  • S3 - Put a tombstone for the clustered PK index to delete the row

I have no idea about the effort required to get this into MySQL. I assume it always reads the base row for UPDATE and DELETE statements.
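
The per-entry work for S1, S2 and S3 can be sketched with a toy model. Everything here is an assumption for illustration: `ToyPKIndex`, `ids_where_a_equals` and the use of Python callables as deltas are hypothetical and are not the RocksDB merge operator API. Deltas and tombstones are queued per key without reading the base row, and they are only applied when a later read (or, in a real LSM, compaction) resolves them.

```python
TOMBSTONE = object()


class ToyPKIndex:
    """Toy clustered PK index: base rows plus per-key queues of blind writes."""

    def __init__(self, base_rows):
        self.base = {k: dict(v) for k, v in base_rows.items()}
        self.pending = {}  # row id -> list of deltas and tombstones
        self.base_reads = 0

    def merge(self, row_id, delta):
        # Blind write: queue a delta without fetching the row.
        self.pending.setdefault(row_id, []).append(delta)

    def delete(self, row_id):
        # Blind write: queue a tombstone without fetching the row.
        self.pending.setdefault(row_id, []).append(TOMBSTONE)

    def get(self, row_id):
        # Reads pay the cost: fetch the base row, then apply queued writes in order.
        self.base_reads += 1
        row = self.base.get(row_id)
        row = dict(row) if row is not None else None
        for op in self.pending.get(row_id, []):
            if op is TOMBSTONE:
                row = None
            elif row is not None:
                row = op(row)
        return row


def ids_where_a_equals(xa, value):
    """Scan the secondary index, a list of (a, id) entries, for matching ids."""
    return [row_id for a, row_id in xa if a == value]


pk = ToyPKIndex({1: {"a": 4, "b": 0}, 2: {"a": 4, "b": 9}, 3: {"a": 7, "b": 1}})
xa = [(4, 1), (4, 2), (7, 3)]

for row_id in ids_where_a_equals(xa, 4):  # S1: UPDATE T set b = 5 WHERE a=4
    pk.merge(row_id, lambda r: {**r, "b": 5})
for row_id in ids_where_a_equals(xa, 4):  # S2: UPDATE T set b = b + 1 WHERE a=4
    pk.merge(row_id, lambda r: {**r, "b": r["b"] + 1})
reads_during_writes = pk.base_reads
row1_after_updates = pk.get(1)

for row_id in ids_where_a_equals(xa, 4):  # S3: DELETE from T WHERE a=4
    pk.delete(row_id)
row1_after_delete = pk.get(1)
```

Because b is not indexed, S1 and S2 need no secondary index changes in this sketch. The sketch leaves the xa entries untouched after S3 for brevity.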

1 comment:

  1. The reads during compaction are different than the other reads. They are just part of a write. It's true that they are reads at some level, but those reads are needed even if all the updates are very simple ones. The reads are just part of the cost of the write. Also the reads are of big blocks of data, not of rows. I've always argued that if you are actually using up the bandwidth due to writes and compaction, you just need more disks, and those disks are cost effective. You are using a significant fraction of the disks' storage capacity.

