Monday, November 30, 2020

Blind-writes and an LSM

Blind-writes for an LSM can be a challenge in the presence of secondary indexes. This post was inspired by an interesting but unpublished paper based on Tarantool and a blog post on SAI in DataStax Cassandra. Some of this restates ideas from the Tarantool paper.

Contributions of this post are:

  1. Explains blind-writes
  2. Explains a new way to do blind-writes for some SQL-ish statements. The new way doesn't require readers to validate secondary index entries, which is required with the Tarantool approach. The new approach only works for some types of statements while the Tarantool approach had fewer limits.
Longer introduction

Non-unique secondary index maintenance is read-free for MyRocks during a write (insert, update, delete). By this I mean that nothing is read from the secondary index so there is no chance of doing a storage read during index maintenance. This makes secondary indexes cheaper with MyRocks than the typical DBMS that uses a B-Tree. The InnoDB change buffer also reduces the chance of storage reads during index maintenance, which is nice but not as significant as the MyRocks benefit. I ignore the reads that MyRocks might do during LSM compaction so index maintenance is eventually not read-free so allow me to be truthy. Regardless, with an LSM there is (almost always) less read and write IO per write compared to a B-Tree.

If you only want to provide SQL semantics and support write-heavy workloads with secondary indexes, then it is hard to be more efficient than MyRocks as it doesn't do reads other than those required for SQL -- fetching the row(s). There are several reasons for fetching the row(s) during processing of an UPDATE or DELETE statement including returning the affected row count and validating constraints. Getting the row requires a read from the table (if heap organized) or the PK index (if using a clustered PK index like MyRocks and InnoDB).

If you are willing to give up SQL semantics then it is possible to be read-free -- no reads for secondary index maintenance, no reads to get the base row. This is explained in the Tarantool paper but that solution has a cost -- all secondary index entries must be validated by reading the base row because they can be stale (invalid). The validation isn't that different from what can happen in Postgres and InnoDB when vacuum and purge get behind and index-only queries might not be index-only.

There was also a feature in TokuDB for blind writes but I am not sure whether that avoided reads to get the base row.

Blind writes

A blind write is done without hidden or explicit reads. But I need to add an exception for some reads, because an LSM has to search the memtable to insert key-value pairs. So I limit this post to disk-based DBMS and allow a blind-write to read in-memory search structures. The Put operation in the LevelDB and RocksDB API is a blind write.

An explicit read is done explicitly by the user or implied by the operation. A read-modify-write sequence includes an explicit read and an example is SELECT ... FOR UPDATE followed by UPDATE. Reads done to process the WHERE clause for UPDATE and DELETE statements are explicit reads.

A hidden read is a read that isn't explicit and done while processing an operation. Examples include: 

  • Read an index to validate a unique constraint. A bloom filter makes this fast for an LSM.
  • Determine the number of affected rows to provide SQL semantics for UPDATE & DELETE
  • Read B-Tree secondary index leaf pages to maintain them after a write

Secondary Index Maintenance

An index entry is composed (indexed column values, row pointer). The row pointer is usually a row ID or the value of the row's primary key columns. The typical index maintenance after a write consists of some of remove old index entry, insert new index entry. To perform these steps both the index column values and row pointer must be known. When a DBMS has the row then it has the required information, but a read must be done to get the row and we are discussing ways to avoid that read.


The rest of this post assumes a SQL DBMS and a simple schema. 

create table T (id primary key, a int, b int)
create index xa on T(a)

I am interested in whether blind writes can be done for the following statements. The example statements are listed below. The first 3 statements use the PK index to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the PK index to find the qualifying row, and then perform secondary index maintenance. Secondary index maintenance with a B-Tree means doing a read-modify-write on index leaf pages. With an LSM it can be read-free.

  • P1 - UPDATE T set a = 5 WHERE id=3
  • P2 - UPDATE T set a = a + 1 WHERE id=3
  • P3 - DELETE from T WHERE id=3

The next 3 statements use the secondary index, xa, to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the secondary index to find qualifying index entries, then using the row pointer in each index entry to find the base row and finally performing index maintenance.
  • S1 - UPDATE T set b = 5 WHERE a=4
  • S2 - UPDATE T set b = b + 1 WHERE a=4
  • S3 -  DELETE from T WHERE a=4
Truly read-free

I promised an approach that was truly read-free and after many words have yet to deliver. An LSM makes this possible, but since this is LSM-based there will still be reads during compaction so I am truthy here rather than true. Also, I assume reads can be done from an index to evaluate the WHERE clause. 

With a SQL DBMS that uses an LSM, the base row doesn't have to be fetched for statements S1, S2 and S3 above if SQL semantics aren't required. The base row still has to be fetched for statements P1, P2 and P3. For S1, S2 and S3 I assume that the secondary index xa is scanned to find matching index entries, for each matching index entry the remaining work is:
  • S1 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=5.
  • S2 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=b+1.
  • S3 - Put a tombstone for the clustered PK index to delete the row
I have no idea about the effort required to get this into MySQL. I assume it always reads the base row for UPDATE and DELETE statements.

Friday, November 6, 2020

Max row per group, sad answers only

Today I learned that frequently asked questions on StackOverflow get their own tag. The tag greatest-n-per-group is for answers to questions about writing SQL to find the max row per group. By max row I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group. By sad answers only I mean there is a lot of confusion about this, StackOverflow has over 3000 posts, and the query is harder to write than it needs to be.

I am writing about SQL rather than SQL engines and I am not an expert on writing SQL queries, but that might be appropriate given that I am writing about something that confuses users and could be easier. My motivation for writing this was a slow query plan for MySQL 8.0.22 while implementing the Time Series Benchmark Suite (TSBS) for MySQL.

Reproduction SQL is here and here and output from MySQL 8.0.22 is here and here. I updated the output for the queries with SHOW STATUS printed after each query.

The easy way

The answer is easy if you only want the aggregated and group by columns:

SELECT MAX(agg_col), gb_col FROM t GROUP BY gb_col

But it isn't easy when you want other columns -- columns that are not used for group by or aggregation.  This would be easy had MySQL continued on the ANY_VALUE theme by adding FIRST_VALUE and LAST_VALUE as MongoDB does via the $first and $last aggregation accumulator operators. Well, MySQL has FIRST_VALUE and LAST_VALUE, but for window functions and they don't provide the desired semantics. If they existed with semantics similar to MongoDB then the following query would work and is easy to write:

SELECT MAX(agg_col), gb_col, FIRST_VALUE(other_col) FROM t GROUP BY gb_col

I am not an expert. Perhaps one day I will learn why there isn't an easy way to do this. MySQL docs have a useful page on this type of query. I have yet to try a variant that uses LATERAL.

The less easy ways

There are many ways to write this query. None of them are easy as the example I wrote above that isn't valid SQL. I tried: left join, correlated subquery, uncorrelated subquery and a rank() window function. The performant solutions were uncorrelated subquery and rank() window function. I was surprised that the rank() window function approach was performant because the explain output looked slow. But the runtime and slow log output were OK. By performant I mean a query plan that examines few rows, and loose index scan is an example of that.

This table shows the response time for each query type and the number of rows examined from the slow log. For the window function approach I am confused both by the low value for rows examined and the plan that shows a table scan. I am curious if there is a bug.

ApproachResponse time (secs)Rows examined
Uncorrelated 0.00 20
Window function 0.10 10
Left join 1.07 245750
Correlated 134.11 671170560

Loose index scan

Update - when I published this I claimed the index was on (j DESC, pk) but that was a mistake.

Before I show the queries, my goal is to get a plan that uses the loose index scan optimization. The test table is: create table tq(pk int primary key, j int, k int), there is an index on (j, pk DESC) and a NOT NULL constraint on j. This query uses a loose index scan, however it can't provide the value for the column k. The loose index scan is performant because it fetches one entry from the index per distinct value for (j,pk).

SELECT max(pk), j FROM tq GROUP BY j
EXPLAIN: -> Group aggregate (computed in earlier step): max(
    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10) 

Uncorrelated subquery

This plan is performant because t2 is materialized via a loose index scan and the result from that does one point query per distinct value in j.

SELECT, t1.j, t1.k
FROM tq t1, (SELECT max(pk) as maxpk, j FROM tq GROUP BY j) t2
WHERE t2.maxpk =

EXPLAIN: -> Nested loop inner join
    -> Filter: (t2.maxpk is not null)
        -> Table scan on t2  (cost=3.62 rows=10)
            -> Materialize
                -> Group aggregate (computed in earlier step): max(
                    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10)
    -> Single-row index lookup on t1 using PRIMARY (pk=t2.maxpk)  (cost=0.26 rows=1)

Rank window function

This query is slightly slower than the uncorrelated subquery above. I didn't expect that given the plan that has a table scan on tq. Adding hints to use the index on (j,pk) don't change the query plan. I wonder if this explain outpout is correct as the query doesn't do a full scan when run. Also the query is almost as fast as the uncorrelated approach.

WITH t1 AS (SELECT pk, j, k,
    RANK() OVER (PARTITION by j ORDER BY pk DESC) AS myrank FROM tq)
SELECT pk, j, k from t1 WHERE myrank=1

EXPLAIN: -> Index lookup on t1 using <auto_key0> (myrank=1)
    -> Materialize CTE t1
        -> Window aggregate: rank() OVER (PARTITION BY tq.j ORDER BY desc )
            -> Sort: tq.j, DESC  (cost=8273.25 rows=82170)
                -> Table scan on tq  (cost=8273.25 rows=82170)
Left join

I don't expect this query to be performant because there isn't an equality predicate on pk. This might be a useful approach when there isn't an index on (j,pk), but that is not the case here and this plan examines too many rows.

SELECT, t1.j, t1.k FROM tq t1
LEFT JOIN tq t2 ON t1.j = t2.j AND <

EXPLAIN: -> Filter: (t2.j is null)  (cost=76209940.12 rows=750212100)
    -> Nested loop antijoin  (cost=76209940.12 rows=750212100)
        -> Table scan on t1  (cost=8273.25 rows=82170)
        -> Filter: ( <  (cost=14.38 rows=9130)
            -> Index lookup on t2 using x (j=t1.j)  (cost=14.38 rows=9130)

Correlated subquery

The correlated subquery isn't performant. It examines too many rows. That isn't a surprise.

SELECT pk, j, k FROM   tq t1
WHERE  pk=(SELECT MAX( FROM tq t2 WHERE t1.j = t2.j)

EXPLAIN: -> Filter: ( = (select #2))  (cost=8273.25 rows=82170)
    -> Table scan on t1  (cost=8273.25 rows=82170)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: max(
            -> Index lookup on t2 using x (j=t1.j)  (cost=927.37 rows=9130)

Monday, October 26, 2020

InnoDB, fsync and fdatasync - reducing commit latency

MySQL has an commit penalty compared to Postgres and MongoDB. If you want commit to be durable with the binlog enabled then MySQL does two fsyncs per commit -- one for the binlog, one for the InnoDB redo log. But Postgres and MongoDB only need one fsync per commit in the same setup. Note that I am ignoring the benefit of group commit for now, and all of these can benefit from that.

Better performance is the reason for considering fdatasync for the InnoDB redo log. The response time for fdatasync is likely to be less than for fsync because fdatasync does less work. The InnoDB redo log is pre-allocated so the size doesn't change. Therefore InnoDB and fdatasync seem like a good fit and can reduce the commit penalty of MySQL. 

Postgres supports fdatasync for redo (redo == WAL in Postgres) via the wal_sync_method option. InnoDB has the  innodb_flush_method option, but fdatasync isn't one of the choices and the choices are fsync or nothing.

I did a few simple tests to understand fsync vs fdatasync latency on my home NUC servers. The servers use XFS on a Samsung 970 EVO NVMe SSD. First I have results from the pg_test_fsync binary that is part of Postgres and fsync latency was about 2X larger than fdatasync. The average latency was 1353 & 3625 usecs for fdatasync & fsync for one 8kb write and then 1506 & 3346 usecs for two 8kb writes. Full results are here.

Next I did a simple benchmark with mysqlslap and MySQL 8.0.21 with the binlog disabled and a commit per update statement using the upstream binary and then one changed to use fdatasync. The average latency per update statement was 4040 usecs & 6520 usecs for fdatasync & fsync. On my hardware, fsync is clearly slower and that isn't a surprise. I might run this on ec2 hosts using local storage and EBS, but that won't happen today.

I hope InnoDB is updated to support fdatasync as an option. MDEV-21382 is open in MariaDB to support such a change. Percona has perf results for fsync vs fdatasync on different devices along with some history on the use of fsync by InnoDB. In their tests the latency for fsync is larger and is expected. I am confused by some of the Percona docs as they mention fdatasync, but that is tech docs debt as old versions of MySQL used fdatasync as the option value but that meant "use fsync".

I created a MySQL feature request for this -- see 101326.

Overloading the meaning of "copy-on-write" for index structures

I previously used copy-on-write to explain B-Tree index structures that don't overwrite pages during writeback. One benefit of copy-on-write is avoiding the risk of partial page writes as that has overhead including write-amplification. InnoDB uses a doublewrite buffer to recover from partial page writes. Postgres writes full pages to the redo log once per checkpoint. 

This can be called page-based copy-on-write (page-based CoW) and there are at least two types - CoW-R and CoW-S. CoW-R overwrites dead versions of pages and -R means it is likely to do small random writes. CoW-S writes pages to the tail of a log file, -S means sequential IO, and then background garbage collection is needed to reclaim space from old log segments. WiredTiger and LMDB are examples of CoW-R. Amazon Aurora for MySQL might be an example of CoW-S, although the B-Tree (InnoDB) isn't aware that it is CoW.

Row-based Copy-on-Write

But copy-on-write can also be used to explain what happens at the row level on inserts, updates and deletes. Lets call this row-based copy-on-write (row-based CoW). For InnoDB, changes to primary key index entries, where the base row is stored, are update-in-place (UiP) while changes to secondary indexes are CoW. For Postgres, changes to the base row and index entries are CoW except when the HOT optimization is used.

When row-based CoW is used then something must eventually remove the old versions of rows and/or index entries when they are no longer visible to an open or future snapshot. For an MVCC storage engine that something is called MVCC garbage collection (MVCC GC). For a log-structured merge tree (LSM) that something is called compaction. MVCC GC is called purge by InnoDB and vacuum by Postgres.

InnoDB uses update-in-place for updates to the base row stored in the PK index. But delete processing is special in that the base row is delete-marked (a flag is set in the row header) and purge will eventually remove delete-marked rows to reclaim space.

Friday, October 23, 2020

LSM local secondary indexes (LSM LSI)

This expands on my previous post about RID-lists for RocksDB. RocksDB doesn't provide secondary indexes, nor does it know about schemas but applications that use RocksDB can add both and MyRocks is an example of that.

Many DBMS use the concept of local vs global secondary indexes and partitioned tables are one place where this matters. Assuming a B-Tree is used for the index, then a local secondary index with a partitioned table has a B-Tree index per-partition. The benefit of this is that DROP PARTITION is fast -- delete the table and index files for that partition. The cost from this is that a secondary index query might have to probe indexes per-partition and this can use more CPU and IO.

Without defining a local secondary index for an LSM I assert that secondary indexes with MyRocks are global. Global secondary index entries for an LSM use the PK value to reference the base row. Side effects of this include larger secondary index entries and CPU overhead when an LSM point read must be done to find the base row. Any DBMS with a clustered PK index, like InnoDB, also has these costs.

For the following I haven't done an extensive literature search to understand whether this idea has been proposed. Regardless, I hope that LSM LSI (see below) is the name to be be used.

Local secondary indexes for an LSM

The concept of local secondary indexes also applies to a Log Structured Merge tree. An LSM global secondary index (LSM GSI) uses the PK value because the base row is in the primary key index and is likely to be relocated over time by compaction. Therefore, either secondary index entries must be updated with a new location (rowid) during compaction of the PK index or the secondary index entries must use the PK value. There are more complex solutions that I will ignore.

There are other costs when secondary index entries uses the PK value rather than a rowid to find the base row -- it is harder to implement RID-lists and bitmap indexes. But I think this can be fixed by an LSM local secondary index (LSM LSI).

I define an LSM LSI to be local to a sorted run. When compaction merges sorted runs into a larger sorted run, then a new LSM LSI will be created for that new sorted run. The cost from this is fanout for queries that have a predicate on the secondary index and cannot be pruned to a few sorted runs. Pruning to a few sorted runs is most likely to be enabled via predicates on the PK.

The lifetime of the LSM LSI is the same as the lifetime of the data it indexes. Therefore, LSM LSI entries can use something like a rowid rather than the PK to reference the base row. This makes it much easier to implement RID-lists and bitmap indexes for an LSM. The rowid has two parts -- (block#, offset) where block# is a value from 0 to N-1 when an SST has N blocks and the offset is the position of the row in the block. The rowid usually needs <= 4 bytes.

Fanout is less of a concern for a bitmap index because many analytic queries expect to scan a large part of the index. But there will be more merging of bitmap index entries when LSM LSI is used, just as there would be with local bitmap indexes on a partitioned table with a famous SQL DBMS.


  • Cassandra SAI (Storage Attached Index) is an LSM local secondary index. It uses a trie for text keys and a kd-tree for numeric keys. I think the index entries use the row offset within the SST as the row pointer.
  • Cuckoo Index might be relevant, although it is a lighter weight index
  • There is a paper in SIGMOD 2018 on LSM secondary indexes
  • There is a paper from the AsterixDB project on LSM secondary indexes
  • Another paper from AsterixDB and Chen Luo on LSM secondary indexes

Wednesday, October 21, 2020

Self-aware DBMS, RID-lists for LSM trees

This is a short writeup of two ideas I might never have the time to implement -- self-aware DBMS and RID-lists for LSM tree.

Self-aware DBMS

Tuning a B-Tree index structure is hard and tuning an LSM tree is usually harder. Fortunately it will get easier over time. We tune a DBMS today via knobs and there is clever work in progress to automate changing of those knobs. But we can do even better. I hope one day for self-aware DBMS algorithms that understand the impact of their knobs, accept declarative goals & priorities from a user and are able to improve performance with respect to an objective function. Lets move the cleverness into the DBMS.

RID-lists for LSM trees

tl;dr - can I use the merge operator to implement RID-lists for an LSM?

I worked on bitmap indexes at Oracle. It would be fair to call them compressed RID-lists. Many smart people worked on that code in Oracle prior to me and the inventor, Gennady Antoshenkov, has a strong record of innovation. IBM DB2 has RID-lists, but I don't know whether they can be compressed. All of these are useful for analytics whether this is a bitmap index, compressed RID-list or uncompressed RID-list.

Postgres 13 has index deduplication which looks like an uncompressed RID-list and I look forward to new query execution features to do the equivalent of bitmap-AND and bitmap-OR.

But this is about a RID-list for an LSM tree. A RID-list is useful for an index that can have many duplicates for a key -- meaning some secondary indexes and not for a primary index. The LSMs I care about don't know much about schema and secondary index maintenance requires knowledge of schema but I will ignore that for now. MyRocks uses RocksDB for primary and secondary indexes, so this is something that can be solved by an application. WiredTiger supports schemas and secondary indexes without also providing a query language and someone could add the same to RocksDB.

I am curious whether the merge operator can be used to implement RID-lists? Disclaimer - I am far from an expert on the merge operator.

  • the rowid is the value of the primary key. Secondary index entries for an LSM and for Oracle IOT use the primary key value as the rowid because the base row can be relocated (during compaction with an LSM, during B-Tree leaf node splits and merges with IOT). Bitmap indexes for Oracle IOT solves this via a mapping table. If that is not done then the problem is that the primary key value might be large and a list of them might not be compression friendly.
  • the delta that is written by the merge operator specifies whether a rowid is added or removed for a given secondary key value. I wonder if something like SingleDelete is needed.
  • the callback function that combines merge operator entries during compaction will merge entries to create larger RID-lists

Tuesday, October 20, 2020

MyRocks in MySQL 5.6 and 8.0

I am in the process of upgrading my home NUC cluster from Ubuntu 18.04 to 20.04 and that means I get to use newer versions of gcc. It also means I lose easy access to older versions of gcc. I assumed that I wouldn't be able build MyRocks and MySQL 5.6 from source in Ubuntu 20.04 but I never confirmed it and I might be wrong. I have been using gcc 7 on Ubuntu 18.04 which is still available in 20.04, if I am willing to manage multiple versions of gcc. Regardless it is time for me to begin using MyRocks with MySQL 8.

I have results for the insert benchmark to compare MySQL 5.6 and 8.0 where both use MyRocks and the results are what I expected -- I lose between 10% and 20% of throughput (queries or inserts /second) because 8.0 has more CPU overhead. Note that this is for a workload with low-concurrency and my NUC servers have 4 cores with HT disabled. This is similar to what I see when comparing InnoDB between MySQL 5.6 and 8.0. I prefer no new overhead, but new overhead is inevitable and the amounts here seem reasonable given these are two versions apart -- 5.6 to 5.7 to 8.0.

Update - the numbers above include the CPU from the benchmark client and mysqld that share the same host. When I limit the CPU measurement to mysqld then the difference is 1.4X during load without secondary indexes, that is mysqld in MySQL 8.0 uses 1.4X more CPU to do the same amount of work during the load. While I am OK with 1.2X more CPU (~20% mentioned above) I think that 1.4X more is too much.

Background reading that has more detail:


Three workloads were used -- in-memory, IO-bound and extra IO-bound. The in-memory workload loads 20m rows without secondary indexes, creates 3 secondary indexes, loads 20m more rows and then does 12 30-minute read+write tests where the writer is rated limited to 100, 200, 400, 600, 800 and then 1000 inserts/s. The IO-bound workload is similar but loads 100m rows without secondary indexes and then 10m with secondary indexes. The extra IO-bound workload is similar but loads 500m rows without secondary indexes and then 10m with secondary indexes. For my sake, here are the command lines for my helper scripts.

The load and create index test steps are single-threaded. The read+write test steps use one query thread and one rate-limited writer thread.

MyRocks was used with MySQL 5.6.35 and 8.0.17 from the FB branch. The 5.6 version was compiled and run using Ubuntu 18.04 while the 8.0 version was compiled and run with Ubuntu 20.04. The my.cnf contents are here for 5.6 and for 8.0.

Performance summaries that can be interpreted after reading this are here for the in-memory, IO-bound and extra IO-bound workloads. The most important columns are qps and ips for throughput (queries/s, inserts/s). The throughput differences are usually explained by cpupq (cpu/query or cpu/insert depending on the test step).

The table below lists the throughput ratio for each test step as (8.0 value / 5.6 value) and a number less than 1 means that 8.0 is slower than 5.6. I didn't list values for all of the test steps to avoid long lines. The test steps are:
  • l.i0 - load without secondary indexes
  • l.x - create secondary indexes
  • l.i1 - load with secondary indexes
  • qN.2 - read+write where N is the number of inserts/s

                l.i0    l.x     l.i1    q100.2  q200.2  q800.2  q1000.2
in-memory       .79     .99     .87     .89     .89     .88     .88
io-bound        .82     .99     .87     .99     .89     .90     .91
extra io-bound  .80     1.03    .87     .92     .90     .91     .92