Friday, November 6, 2020

Max row per group, sad answers only

Today I learned that frequently asked questions on StackOverflow get their own tag. The tag greatest-n-per-group is for answers to questions about writing SQL to find the max row per group. By max row I mean the aggregated columns, group by columns and other columns for the row that has the max or min value in a group. By sad answers only I mean there is a lot of confusion about this, StackOverflow has over 3000 posts, and the query is harder to write than it needs to be.

I am writing about SQL rather than SQL engines and I am not an expert on writing SQL queries, but that might be appropriate given that I am writing about something that confuses users and could be easier. My motivation for writing this was a slow query plan for MySQL 8.0.22 while implementing the Time Series Benchmark Suite (TSBS) for MySQL.

Reproduction SQL is here and here and output from MySQL 8.0.22 is here and here. I updated the output for the queries with SHOW STATUS printed after each query.

The easy way

The answer is easy if you only want the aggregated and group by columns:

SELECT MAX(agg_col), gb_col FROM t GROUP BY gb_col

But it isn't easy when you want other columns -- columns that are not used for group by or aggregation.  This would be easy had MySQL continued on the ANY_VALUE theme by adding FIRST_VALUE and LAST_VALUE as MongoDB does via the $first and $last aggregation accumulator operators. Well, MySQL has FIRST_VALUE and LAST_VALUE, but for window functions and they don't provide the desired semantics. If they existed with semantics similar to MongoDB then the following query would work and is easy to write:

SELECT MAX(agg_col), gb_col, FIRST_VALUE(other_col) FROM t GROUP BY gb_col

I am not an expert. Perhaps one day I will learn why there isn't an easy way to do this. MySQL docs have a useful page on this type of query. I have yet to try a variant that uses LATERAL.

The less easy ways

There are many ways to write this query. None of them are easy as the example I wrote above that isn't valid SQL. I tried: left join, correlated subquery, uncorrelated subquery and a rank() window function. The performant solutions were uncorrelated subquery and rank() window function. I was surprised that the rank() window function approach was performant because the explain output looked slow. But the runtime and slow log output were OK. By performant I mean a query plan that examines few rows, and loose index scan is an example of that.

This table shows the response time for each query type and the number of rows examined from the slow log. For the window function approach I am confused both by the low value for rows examined and the plan that shows a table scan. I am curious if there is a bug.

ApproachResponse time (secs)Rows examined
Uncorrelated 0.00 20
Window function 0.10 10
Left join 1.07 245750
Correlated 134.11 671170560

Loose index scan

Update - when I published this I claimed the index was on (j DESC, pk) but that was a mistake.

Before I show the queries, my goal is to get a plan that uses the loose index scan optimization. The test table is: create table tq(pk int primary key, j int, k int), there is an index on (j, pk DESC) and a NOT NULL constraint on j. This query uses a loose index scan, however it can't provide the value for the column k. The loose index scan is performant because it fetches one entry from the index per distinct value for (j,pk).

SELECT max(pk), j FROM tq GROUP BY j
EXPLAIN: -> Group aggregate (computed in earlier step): max(tq.pk)
    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10) 

Uncorrelated subquery

This plan is performant because t2 is materialized via a loose index scan and the result from that does one point query per distinct value in j.

SELECT t1.pk, t1.j, t1.k
FROM tq t1, (SELECT max(pk) as maxpk, j FROM tq GROUP BY j) t2
WHERE t2.maxpk = t1.pk

EXPLAIN: -> Nested loop inner join
    -> Filter: (t2.maxpk is not null)
        -> Table scan on t2  (cost=3.62 rows=10)
            -> Materialize
                -> Group aggregate (computed in earlier step): max(tq.pk)
                    -> Index range scan on tq using index_for_group_by(x)  (cost=13.00 rows=10)
    -> Single-row index lookup on t1 using PRIMARY (pk=t2.maxpk)  (cost=0.26 rows=1)

Rank window function

This query is slightly slower than the uncorrelated subquery above. I didn't expect that given the plan that has a table scan on tq. Adding hints to use the index on (j,pk) don't change the query plan. I wonder if this explain outpout is correct as the query doesn't do a full scan when run. Also the query is almost as fast as the uncorrelated approach.

WITH t1 AS (SELECT pk, j, k,
    RANK() OVER (PARTITION by j ORDER BY pk DESC) AS myrank FROM tq)
SELECT pk, j, k from t1 WHERE myrank=1

EXPLAIN: -> Index lookup on t1 using <auto_key0> (myrank=1)
    -> Materialize CTE t1
        -> Window aggregate: rank() OVER (PARTITION BY tq.j ORDER BY tq.pk desc )
            -> Sort: tq.j, tq.pk DESC  (cost=8273.25 rows=82170)
                -> Table scan on tq  (cost=8273.25 rows=82170)
Left join

I don't expect this query to be performant because there isn't an equality predicate on pk. This might be a useful approach when there isn't an index on (j,pk), but that is not the case here and this plan examines too many rows.

SELECT t1.pk, t1.j, t1.k FROM tq t1
LEFT JOIN tq t2 ON t1.j = t2.j AND t1.pk < t2.pk
WHERE t2.j IS NULL

EXPLAIN: -> Filter: (t2.j is null)  (cost=76209940.12 rows=750212100)
    -> Nested loop antijoin  (cost=76209940.12 rows=750212100)
        -> Table scan on t1  (cost=8273.25 rows=82170)
        -> Filter: (t1.pk < t2.pk)  (cost=14.38 rows=9130)
            -> Index lookup on t2 using x (j=t1.j)  (cost=14.38 rows=9130)

Correlated subquery

The correlated subquery isn't performant. It examines too many rows. That isn't a surprise.

SELECT pk, j, k FROM   tq t1
WHERE  pk=(SELECT MAX(t2.pk) FROM tq t2 WHERE t1.j = t2.j)

EXPLAIN: -> Filter: (t1.pk = (select #2))  (cost=8273.25 rows=82170)
    -> Table scan on t1  (cost=8273.25 rows=82170)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: max(t2.pk)
            -> Index lookup on t2 using x (j=t1.j)  (cost=927.37 rows=9130)