Monday, November 30, 2020

Blind-writes and an LSM

Blind-writes for an LSM can be a challenge in the presence of secondary indexes. This post was inspired by an interesting but unpublished paper based on Tarantool and a blog post on SAI in DataStax Cassandra. Some of this restates ideas from the Tarantool paper.

Contributions of this post are:

  1. Explains blind-writes
  2. Explains a new way to do blind-writes for some SQL-ish statements. The new way doesn't require readers to validate secondary index entries, which is required with the Tarantool approach. The new approach only works for some types of statements while the Tarantool approach has fewer limits.

Longer introduction

Non-unique secondary index maintenance is read-free for MyRocks during a write (insert, update, delete). By this I mean that nothing is read from the secondary index so there is no chance of doing a storage read during index maintenance. This makes secondary indexes cheaper with MyRocks than the typical DBMS that uses a B-Tree. The InnoDB change buffer also reduces the chance of storage reads during index maintenance, which is nice but not as significant as the MyRocks benefit. I ignore the reads that MyRocks might do during LSM compaction, so index maintenance is not read-free once that deferred work is counted. Allow me to be truthy. Regardless, with an LSM there is (almost always) less read and write IO per write compared to a B-Tree.

If you only want to provide SQL semantics and support write-heavy workloads with secondary indexes, then it is hard to be more efficient than MyRocks as it doesn't do reads other than those required for SQL -- fetching the row(s). There are several reasons for fetching the row(s) during processing of an UPDATE or DELETE statement including returning the affected row count and validating constraints. Getting the row requires a read from the table (if heap organized) or the PK index (if using a clustered PK index like MyRocks and InnoDB).

If you are willing to give up SQL semantics then it is possible to be read-free -- no reads for secondary index maintenance, no reads to get the base row. This is explained in the Tarantool paper but that solution has a cost -- all secondary index entries must be validated by reading the base row because they can be stale (invalid). The validation isn't that different from what can happen in Postgres and InnoDB when vacuum and purge get behind and index-only queries might not be index-only.
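
A minimal sketch of why readers must validate index entries when writes are fully blind. The structures and function names here (rows, index_a, blind_update_a, lookup_by_a) are hypothetical toys and are not taken from Tarantool. The row holds only column a to keep the example small.

```python
# Hypothetical toy structures; this is not Tarantool's implementation.
rows = {}          # primary key -> row dict (stands in for the PK index)
index_a = set()    # secondary index on column a: (a value, pk) entries


def blind_update_a(pk, new_a):
    """Write the new row and the new index entry without reading the old row."""
    rows[pk] = {"a": new_a}
    # The old (old_a, pk) entry is unknown without a read, so it is left
    # behind in the index and becomes stale.
    index_a.add((new_a, pk))


def lookup_by_a(value):
    """Readers validate each matching index entry against the base row."""
    result = []
    for a, pk in sorted(index_a):
        if a != value:
            continue
        row = rows.get(pk)                      # the validation read
        if row is not None and row["a"] == a:   # entry still matches the row
            result.append(pk)
    return result


blind_update_a(3, 4)
blind_update_a(3, 5)           # leaves the stale entry (4, 3) in the index
stale_hits = lookup_by_a(4)    # the stale entry is filtered out by validation
live_hits = lookup_by_a(5)
```

The cost moves from the writer to the reader: every index lookup pays for a base row read, so an index-only query is no longer index-only.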

There was also a feature in TokuDB for blind writes but I am not sure whether that avoided reads to get the base row.

Blind writes

A blind write is done without hidden or explicit reads. But I need to add an exception for some reads, because an LSM has to search the memtable to insert key-value pairs. So I limit this post to disk-based DBMS and allow a blind-write to read in-memory search structures. The Put operation in the LevelDB and RocksDB API is a blind write.
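
A minimal sketch of the distinction, using a toy LSM that I made up for illustration (ToyLSM and its methods are not the LevelDB or RocksDB API). The counter only tracks lookups that touch a flushed run, which stands in for storage.

```python
class ToyLSM:
    """Toy LSM: an in-memory memtable plus a list of flushed runs."""

    def __init__(self):
        self.memtable = {}       # in-memory, a blind write may search it
        self.runs = []           # flushed runs, newest first (stand-in for storage)
        self.storage_reads = 0   # lookups that touched a flushed run

    def put(self, key, value):
        # Blind write: only the memtable is touched.
        self.memtable[key] = value

    def flush(self):
        if self.memtable:
            self.runs.insert(0, dict(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            self.storage_reads += 1
            if key in run:
                return run[key]
        return None

    def read_modify_write(self, key, fn):
        # Not blind: the current value must be fetched first.
        self.put(key, fn(self.get(key)))


db = ToyLSM()
db.put("k", 1)
db.flush()
db.put("k", 2)                        # overwrites without reading the flushed run
reads_after_put = db.storage_reads
db.flush()
db.read_modify_write("k", lambda v: v + 1)
reads_after_rmw = db.storage_reads
```

In this toy the put never increments the counter, while the read-modify-write has to search a flushed run before it can write.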

An explicit read is done explicitly by the user or implied by the operation. A read-modify-write sequence includes an explicit read and an example is SELECT ... FOR UPDATE followed by UPDATE. Reads done to process the WHERE clause for UPDATE and DELETE statements are explicit reads.

A hidden read is a read that isn't explicit and is done while processing an operation. Examples include:

  • Read an index to validate a unique constraint. A bloom filter makes this fast for an LSM.
  • Determine the number of affected rows to provide SQL semantics for UPDATE & DELETE
  • Read B-Tree secondary index leaf pages to maintain them after a write

Secondary Index Maintenance

An index entry is composed of (indexed column values, row pointer). The row pointer is usually a row ID or the value of the row's primary key columns. The typical index maintenance after a write consists of some of: remove the old index entry, insert the new index entry. To perform these steps both the index column values and row pointer must be known. When a DBMS has the row then it has the required information, but a read must be done to get the row and we are discussing ways to avoid that read.
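
A minimal sketch of the typical maintenance steps, assuming a clustered PK index and one secondary index on column a. The names pk_index, xa and update_a are made up for illustration.

```python
pk_index = {3: {"a": 4, "b": 7}}   # clustered PK index: id -> row
xa = {(4, 3)}                      # secondary index on a: (a, id) entries


def update_a(row_id, new_a):
    """Roughly UPDATE T SET a = new_a WHERE id = row_id, with index maintenance."""
    old_row = pk_index[row_id]            # the read that supplies the old value of a
    xa.discard((old_row["a"], row_id))    # remove the old index entry
    xa.add((new_a, row_id))               # insert the new index entry
    pk_index[row_id] = {**old_row, "a": new_a}


update_a(3, 5)
```

Without the read of old_row, the old entry (4, 3) cannot be named and so cannot be removed.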

SQL

The rest of this post assumes a SQL DBMS and a simple schema. 

create table T (id int primary key, a int, b int)
create index xa on T(a)

I am interested in whether blind writes can be done for the statements listed below. The first 3 statements use the PK index to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the PK index to find the qualifying row, and then perform secondary index maintenance. Secondary index maintenance with a B-Tree means doing a read-modify-write on index leaf pages. With an LSM it can be read-free.

  • P1 - UPDATE T set a = 5 WHERE id=3
  • P2 - UPDATE T set a = a + 1 WHERE id=3
  • P3 - DELETE from T WHERE id=3

The next 3 statements use the secondary index, xa, to evaluate the WHERE clause. The traditional way to evaluate these statements is to use the secondary index to find qualifying index entries, then use the row pointer in each index entry to find the base row and finally perform index maintenance.

  • S1 - UPDATE T set b = 5 WHERE a=4
  • S2 - UPDATE T set b = b + 1 WHERE a=4
  • S3 - DELETE from T WHERE a=4

Truly read-free

I promised an approach that was truly read-free and after many words have yet to deliver. An LSM makes this possible, but since this is LSM-based there will still be reads during compaction so I am truthy here rather than true. Also, I assume reads can be done from an index to evaluate the WHERE clause. 

With a SQL DBMS that uses an LSM, the base row doesn't have to be fetched for statements S1, S2 and S3 above if SQL semantics aren't required. The base row still has to be fetched for statements P1, P2 and P3 because maintaining xa requires the old value of a. For S1, S2 and S3 I assume that the secondary index xa is scanned to find matching index entries. For each matching index entry the remaining work is:

  • S1 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=5.
  • S2 - Put a delta for the clustered PK index via the merge operator. The delta encodes b=b+1.
  • S3 - Put a tombstone for the clustered PK index to delete the row

I have no idea about the effort required to get this into MySQL. I assume it always reads the base row for UPDATE and DELETE statements.
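
The per-entry work for S1, S2 and S3 can be sketched with a toy merge operator. This is a sketch under my own assumptions: ToyPKIndex, TOMBSTONE and matching_ids are hypothetical names, deltas are modeled as Python functions, and nothing here is the RocksDB or MyRocks API. The delete uses a=6 rather than a=4 so that all three statements can run against the same data, and maintenance of xa is not shown.

```python
TOMBSTONE = object()


class ToyPKIndex:
    """Clustered PK index where writes append operands and never read the row."""

    def __init__(self, base):
        self.base = dict(base)   # stands in for rows already on storage
        self.pending = {}        # key -> deltas and tombstones, oldest first

    def merge(self, key, delta):
        self.pending.setdefault(key, []).append(delta)       # blind

    def delete(self, key):
        self.pending.setdefault(key, []).append(TOMBSTONE)   # blind

    def get(self, key):
        # Operands are applied at read time (or during compaction in an LSM).
        row = self.base.get(key)
        for op in self.pending.get(key, []):
            if op is TOMBSTONE:
                row = None
            elif row is not None:
                row = op(row)
        return row


xa = [(4, 1), (4, 2), (6, 3)]   # secondary index entries: (a, id)
pk = ToyPKIndex({1: {"a": 4, "b": 0}, 2: {"a": 4, "b": 9}, 3: {"a": 6, "b": 1}})


def matching_ids(a_value):
    """Scan xa to evaluate WHERE a = a_value; only the index is read."""
    return [row_id for a, row_id in xa if a == a_value]


for row_id in matching_ids(4):   # S1: UPDATE T SET b = 5 WHERE a = 4
    pk.merge(row_id, lambda row: {**row, "b": 5})
for row_id in matching_ids(4):   # S2: UPDATE T SET b = b + 1 WHERE a = 4
    pk.merge(row_id, lambda row: {**row, "b": row["b"] + 1})
for row_id in matching_ids(6):   # S3: DELETE FROM T WHERE a = 6
    pk.delete(row_id)
```

The writes never call get, so the base row is only materialized when a later reader (or compaction) applies the pending operands.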

Friday, November 6, 2020

Max row per group, sad answers only

Today I learned that frequently asked questions on StackOverflow get their own tag. The tag greatest-n-per-group is for answers to questions about writing SQL to find the max row per group. By max row I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group. By sad answers only I mean there is a lot of confusion about this, StackOverflow has over 3000 posts, and the query is harder to write than it needs to be.

I am writing about SQL rather than SQL engines and I am not an expert on writing SQL queries, but that might be appropriate given that I am writing about something that confuses users and could be easier. My motivation for writing this was a slow query plan for MySQL 8.0.22 while implementing the Time Series Benchmark Suite (TSBS) for MySQL.

Reproduction SQL is here and here and output from MySQL 8.0.22 is here and here. I updated the output for the queries with SHOW STATUS printed after each query.

The easy way

The answer is easy if you only want the aggregated and group by columns:

SELECT MAX(agg_col), gb_col FROM t GROUP BY gb_col

But it isn't easy when you want other columns -- columns that are not used for group by or aggregation. This would be easy had MySQL continued on the ANY_VALUE theme by adding FIRST_VALUE and LAST_VALUE as MongoDB does via the $first and $last aggregation accumulator operators. Well, MySQL has FIRST_VALUE and LAST_VALUE, but only as window functions and they don't provide the desired semantics. If they existed as aggregate functions with semantics similar to MongoDB then the following query would work and is easy to write:

SELECT MAX(agg_col), gb_col, FIRST_VALUE(other_col) FROM t GROUP BY gb_col

I am not an expert. Perhaps one day I will learn why there isn't an easy way to do this. MySQL docs have a useful page on this type of query. I have yet to try a variant that uses LATERAL.

The less easy ways

There are many ways to write this query. None of them are as easy as the example I wrote above that isn't valid SQL. I tried: left join, correlated subquery, uncorrelated subquery and a rank() window function. The performant solutions were uncorrelated subquery and rank() window function. I was surprised that the rank() window function approach was performant because the explain output looked slow. But the runtime and slow log output were OK. By performant I mean a query plan that examines few rows, and loose index scan is an example of that.

This table shows the response time for each query type and the number of rows examined from the slow log. For the window function approach I am confused both by the low value for rows examined and the plan that shows a table scan. I am curious if there is a bug.

Approach          Response time (secs)   Rows examined
Uncorrelated      0.00                   20
Window function   0.10                   10
Left join         1.07                   245750
Correlated        134.11                 671170560

Loose index scan

Update - when I published this I claimed the index was on (j DESC, pk) but that was a mistake.

Before I show the queries, my goal is to get a plan that uses the loose index scan optimization. The test table is: create table tq(pk int primary key, j int, k int). There is an index on (j, pk DESC) and a NOT NULL constraint on j. This query uses a loose index scan, however it can't provide the value for the column k. The loose index scan is performant because it fetches one entry from the index per distinct value of j.

SELECT max(pk), j FROM tq GROUP BY j
EXPLAIN: -> Group aggregate (computed in earlier step): max(tq.pk)
    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10) 
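
A minimal sketch of the idea behind a loose index scan, assuming the index is a sorted list of (j, -pk) tuples to emulate an index on (j, pk DESC). The function loose_index_scan is my own toy and says nothing about how MySQL implements the optimization.

```python
import bisect


def loose_index_scan(index):
    """Return ((max pk, j) per group, number of index entries fetched)."""
    results, examined, pos = [], 0, 0
    while pos < len(index):
        j, neg_pk = index[pos]      # first entry in a group has the max pk
        examined += 1
        results.append((-neg_pk, j))
        # Seek to the first entry of the next group instead of scanning
        # the rest of this one.
        pos = bisect.bisect_right(index, (j, float("inf")), lo=pos)
    return results, examined


rows = [(pk, pk % 3) for pk in range(1, 31)]   # (pk, j): 30 rows in 3 groups
index = sorted((j, -pk) for pk, j in rows)
results, examined = loose_index_scan(index)
```

The counter only tracks entries that are fetched. Each seek still costs a search of the index, much like a B-Tree seek, but the other entries in each group are never visited.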

Uncorrelated subquery

This plan is performant because t2 is materialized via a loose index scan, and then the join does one point query on the PK index of t1 per row in t2, which is one per distinct value of j.

SELECT t1.pk, t1.j, t1.k
FROM tq t1, (SELECT max(pk) as maxpk, j FROM tq GROUP BY j) t2
WHERE t2.maxpk = t1.pk

EXPLAIN: -> Nested loop inner join
    -> Filter: (t2.maxpk is not null)
        -> Table scan on t2  (cost=3.62 rows=10)
            -> Materialize
                -> Group aggregate (computed in earlier step): max(tq.pk)
                    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10)
    -> Single-row index lookup on t1 using PRIMARY (pk=t2.maxpk)  (cost=0.26 rows=1)

Rank window function

This query is only slightly slower than the uncorrelated subquery above. I didn't expect that given that the plan has a table scan on tq. Adding hints to use the index on (j,pk) doesn't change the query plan. I wonder if this explain output is correct as the query doesn't do a full scan when run.

WITH t1 AS (SELECT pk, j, k,
    RANK() OVER (PARTITION by j ORDER BY pk DESC) AS myrank FROM tq)
SELECT pk, j, k from t1 WHERE myrank=1

EXPLAIN: -> Index lookup on t1 using <auto_key0> (myrank=1)
    -> Materialize CTE t1
        -> Window aggregate: rank() OVER (PARTITION BY tq.j ORDER BY tq.pk desc )
            -> Sort: tq.j, tq.pk DESC  (cost=8273.25 rows=82170)
                -> Table scan on tq  (cost=8273.25 rows=82170)

Left join

I don't expect this query to be performant because there isn't an equality predicate on pk. This might be a useful approach when there isn't an index on (j,pk), but that is not the case here and this plan examines too many rows.

SELECT t1.pk, t1.j, t1.k FROM tq t1
LEFT JOIN tq t2 ON t1.j = t2.j AND t1.pk < t2.pk
WHERE t2.j IS NULL

EXPLAIN: -> Filter: (t2.j is null)  (cost=76209940.12 rows=750212100)
    -> Nested loop antijoin  (cost=76209940.12 rows=750212100)
        -> Table scan on t1  (cost=8273.25 rows=82170)
        -> Filter: (t1.pk < t2.pk)  (cost=14.38 rows=9130)
            -> Index lookup on t2 using x (j=t1.j)  (cost=14.38 rows=9130)

Correlated subquery

The correlated subquery isn't performant. It examines too many rows. That isn't a surprise.

SELECT pk, j, k FROM   tq t1
WHERE  pk=(SELECT MAX(t2.pk) FROM tq t2 WHERE t1.j = t2.j)

EXPLAIN: -> Filter: (t1.pk = (select #2))  (cost=8273.25 rows=82170)
    -> Table scan on t1  (cost=8273.25 rows=82170)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: max(t2.pk)
            -> Index lookup on t2 using x (j=t1.j)  (cost=927.37 rows=9130)
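
A small check that the four query shapes return the same rows. This is a sketch that runs on SQLite via Python's sqlite3 module rather than on MySQL, so it says nothing about MySQL query plans or response times. It assumes a SQLite build with window function support (3.25 or newer), and the 40-row data set is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tq (pk INTEGER PRIMARY KEY, j INT NOT NULL, k INT)")
conn.execute("CREATE INDEX x ON tq (j, pk DESC)")
conn.executemany(
    "INSERT INTO tq VALUES (?, ?, ?)",
    [(pk, pk % 4, pk * 10) for pk in range(1, 41)],   # 40 rows in 4 groups
)

queries = {
    "uncorrelated": """
        SELECT t1.pk, t1.j, t1.k
        FROM tq t1, (SELECT max(pk) AS maxpk, j FROM tq GROUP BY j) t2
        WHERE t2.maxpk = t1.pk""",
    "window": """
        WITH t1 AS (SELECT pk, j, k,
            RANK() OVER (PARTITION BY j ORDER BY pk DESC) AS myrank FROM tq)
        SELECT pk, j, k FROM t1 WHERE myrank = 1""",
    "left_join": """
        SELECT t1.pk, t1.j, t1.k FROM tq t1
        LEFT JOIN tq t2 ON t1.j = t2.j AND t1.pk < t2.pk
        WHERE t2.j IS NULL""",
    "correlated": """
        SELECT pk, j, k FROM tq t1
        WHERE pk = (SELECT MAX(t2.pk) FROM tq t2 WHERE t1.j = t2.j)""",
}

# Sort each result so the comparison doesn't depend on output order.
results = {name: sorted(conn.execute(sql).fetchall())
           for name, sql in queries.items()}
```

Comparing results["uncorrelated"] with the other three entries is a quick way to confirm a rewrite before comparing plans on the real server.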
