Tuesday, May 28, 2024

The Insert Benchmark, Postgres 17beta1, a small server and cached database

This post has results for the Insert Benchmark on a small server with a cached workload. The goal is to compare new Postgres releases with older ones to determine whether they get better or worse over time. The results here are from the newest small server in my test cluster -- an ASUS PN53.

This work was done by Small Datum LLC.

The workload here has low concurrency (1 or 4 clients) and the database is cached. The results might be different when the workload is IO-bound or has more concurrency. Results were recently shared from tests run on an older small server. 

tl;dr

  • There are no regressions from Postgres 16.3 to 17beta1 for this benchmark
  • All Postgres versions have too much variance on the insert rate for the l.i1 and l.i2 benchmark steps. While "tuning will fix that" is a too-frequent response, I spent much time last year trying to fix this via tuning, without success. This has not been a problem for modern MySQL and one example is here (equivalent Postgres graphs are here). The issue is explained below.
  • The builtin locale provider might have improved performance on two of the benchmark steps (l.i1, l.i2) when comparing Postgres 17beta1 with and without it. But I need to repeat the benchmark to confirm that the result isn't just noise.

Build + Configuration

This post has results from Postgres versions 10.23, 11.22, 12.19, 13.15, 14.12, 15.7, 16.3 and 17beta1. All were compiled from source. I used configurations that are as similar as possible, but I won't have access to the test machines for a few days, so for now the config for 16.3 is here.

For Postgres 17beta1 I have results for two different configs (cx9a2_c8r32, cx9a2lc_c8r32). But they really are the same config. The difference is in the flags to initdb:
  • cx9a2_c8r32 - --data-checksums
  • cx9a2lc_c8r32 - --data-checksums --locale-provider=builtin --builtin-locale=C.UTF-8
This was done to see if the changes in 17beta1 for the builtin locale provider impact performance. The output from running initdb for Postgres 17beta1 is here.
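
To make the flag difference concrete, here is a sketch of the two initdb invocations. The data directory paths are hypothetical; the flags are taken from the list above.

initdb -D /data/pg17_cx9a2 --data-checksums
initdb -D /data/pg17_cx9a2lc --data-checksums --locale-provider=builtin --builtin-locale=C.UTF-8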

The Benchmark

The benchmark is run with 1 client, a cached workload and 1 table. It is explained here.

The test server was named v8 here and is an ASUS PN53. It has 8 cores, 32G RAM, Ubuntu 22.04 and XFS using 1 m.2 device.

The benchmark steps are:

  • l.i0
    • insert 50 million rows in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts 40M rows and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. A hypothetical sketch of this insert+delete pattern follows this list.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and 10M rows are inserted and deleted.
    • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
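
As promised above, a hypothetical sketch of the l.i1 pattern via psql, assuming a database named ib and a table t with an indexed auto-increment column pk and a payload column val. The real benchmark client is more elaborate; this only shows the shape of one insert transaction and one delete transaction, where deletes remove the rows with the smallest pk values:

psql -d ib -c "INSERT INTO t (val) SELECT md5(random()::text) FROM generate_series(1, 50)"
psql -d ib -c "DELETE FROM t WHERE pk IN (SELECT pk FROM t ORDER BY pk LIMIT 50)"
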
Results

The performance report is here.
    The summary in each performance report has 3 tables. The first shows absolute throughput for each DBMS tested per benchmark step. The second shows throughput relative to the version in the first row of the table, which makes it easy to see how performance changes over time. The third shows the background insert rate for the benchmark steps that have background inserts, which makes it easy to see which DBMS+configs failed to meet the SLA; here all systems sustained the target rates.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.
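
    For reference, a hypothetical shell helper that computes relative QPS and assigns the same buckets as the colors. The function name and the example numbers are mine, not something from the benchmark scripts:

    relqps() {
      awk -v me="$1" -v base="$2" 'BEGIN {
        r = me / base
        c = (r <= 0.95) ? "red" : (r >= 1.05) ? "green" : "grey"
        printf "%.2f (%s)\n", r, c
      }'
    }
    relqps 1180 1000    # prints 1.18 (green)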

    Results: Postgres

    The base case is Postgres 10.23 with the cx9a2_c8r32 config (pg1023_def.cx9a2_c8r32). It is compared with:
    • Postgres 16.3 with the cx9a2_c8r32 config (pg163_def.cx9a2_c8r32)
    • Postgres 17beta1 with the cx9a2_c8r32 config (pg17beta1_def.cx9a2_c8r32)
    • Postgres 17beta1 with the cx9a2lc_c8r32 config (pg17beta1_def.cx9a2lc_c8r32) that uses the builtin locale provider
    tl;dr
    • Postgres 16.3 and 17beta1 have similar performance
    • It isn't clear that the builtin locale provider helps this benchmark
    • The results for l.i1 and l.i2 should be revisited to explain why 17beta1 without the builtin locale provider didn't do better
    From the summary the relative throughput per benchmark step is:
    • l.i0
      • relative QPS is 1.14 in PG 16.3
      • relative QPS is 1.18 in PG 17beta1 without the builtin locale
      • relative QPS is 1.18 in PG 17beta1 with the builtin locale
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 2.23, 2.27 in PG 16.3
      • relative QPS is 1.99, 2.00 in PG 17beta1 without the builtin locale
      • relative QPS is 2.29, 2.22 in PG 17beta1 with the builtin locale
    • qr100, qr500, qr1000
      • relative QPS is 1.03, 1.08, 1.08 in PG 16.3
      • relative QPS is 1.04, 1.07, 1.08 in PG 17beta1 without the builtin locale
      • relative QPS is 1.05, 1.07, 1.07 in PG 17beta1 with the builtin locale
    • qp100, qp500, qp1000
      • relative QPS is 0.98, 0.98, 0.97 in PG 16.3
      • relative QPS is 0.97, 0.98, 0.97 in PG 17beta1 without the builtin locale
      • relative QPS is 0.96, 0.97, 0.96 in PG 17beta1 with the builtin locale
    Metrics

    Metrics for each of the benchmark steps start here. By metrics I mean mostly counters from iostat and vmstat normalized by the insert or query rates. These help to explain differences in performance and to spot where something is using more or less HW per operation. Here I focus on the metrics for the l.i1 and l.i2 benchmark steps because that is where there is a difference for Postgres 17beta1 with and without the builtin locale provider.
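
    As a sketch of what normalized means here: take a counter rate from vmstat or iostat and divide it by the operation rate. The exact formulas in the report may differ; this hypothetical example turns the vmstat context switch rate (the cs column, 12th in default output) into context switches per insert, assuming 20000 inserts/s and vmstat output saved to vmstat.out:

    awk -v ips=20000 '$12 ~ /^[0-9]+$/ {cs += $12; n++} END {if (n) printf "cspq ~= %.4f\n", (cs / n) / ips}' vmstat.out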

    The thing to explain is that the insert (and delete) rates sustained by Postgres 17beta1 with the builtin locale provider are better than without it. Note that the benchmark client is set up so that the insert and delete rates are coupled -- they are programmed to run at the same rate. The difference in the results here is about 15%. The most interesting difference in the metrics for l.i1 is in the CPU/operation (cpupq) column, where it is about 8% larger without the builtin locale provider (127 vs 117).

    The other interesting result is that the number of context switches per operation (cspq) is about 15% smaller in Postgres 17beta1 vs 16.3. This is a good change, but I want to see more results before I am certain about it.

    Variance

    All Postgres versions have too much variance on the insert rate for the l.i1 and l.i2 benchmark steps -- see the graphs for insert/s vs time here (IPS == inserts/s, DPS == deletes/s). While "tuning will fix that" is a too-frequent response, I spent much time last year trying to fix this via tuning, without success. This has not been a problem for modern MySQL and one example is here (equivalent Postgres graphs are here).

    The response time histograms for l.i1 make this easier to understand. The histograms are great for inserts but lousy for deletes. The problem is explained in this Twitter thread and blog post. The issue is that the Postgres optimizer probes an index to figure out the min value for a column in some cases, and while this part of the benchmark isn't written in the best way, I still prefer that Postgres do better here and get_actual_variable_range is the culprit.

    I updated the Insert Benchmark last year to add deletes at the same rate as inserts for the l.i1 and l.i2 benchmark steps to avoid growing the database. And Postgres was much less happy with that change when compared to MySQL. Note that MySQL with InnoDB often probes indexes to determine predicate selectivity. The real issue here is the number of dead index entries that must be skipped to find the first live entry while figuring out the min value.
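
    To illustrate with the hypothetical table t from the sketch above: after a burst of deletes there can be many dead index entries at the low end of the pk index, and a query like the one below -- or the optimizer's own get_actual_variable_range probe while planning -- must skip all of them to find the first live entry.

    psql -d ib -c "SELECT min(pk) FROM t"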











    Friday, May 24, 2024

    Sysbench on a new small server: MariaDB vs MySQL

    This has results from the sysbench benchmark for MariaDB and MySQL on a new, small server with a cached and low-concurrency workload. For MariaDB I tested LTS releases from 10.2 through 11.4. For MySQL I tested 5.6, 5.7 and 8.0. The results from MySQL here are a good reason to use changepoint detection to spot regressions early, like that provided by Nyrkiö.

    This work was done by Small Datum LLC and sponsored by the MariaDB Foundation. Previous posts shared results from older, small servers: here and here. The results here are similar.

    My standard disclaimer is that sysbench with low-concurrency is great for spotting CPU regressions. However, a result with higher concurrency from a larger server is also needed to understand things. Results from IO-bound workloads and less synthetic workloads are also needed. But low-concurrency, cached sysbench is a great place to start.

    tl;dr

    • MySQL suffers more from performance regressions than MariaDB
    • Modern MariaDB is 10% to 30% faster than modern MySQL on most microbenchmarks
    • The update-index microbenchmark is an exception to the "MariaDB is faster" result because throughput there for modern MySQL improved a lot more than it did for modern MariaDB

    Builds and configuration

    For MariaDB I used the latest point releases from LTS versions: 10.2.44, 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7 and the upcoming LTS 11.4.1. 

    For MySQL I used 5.6.51, 5.7.44 and 8.0.36.

    The first GA release for MariaDB 10.2 was in 2017. The first GA release for MySQL 5.6 was in 2013. So while I cannot claim that my testing covers MySQL and MariaDB from the same time period, I can claim that I am testing old versions of both.

    Everything was compiled from source with similar CMake command lines and CMAKE_BUILD_TYPE set to Release. It is much easier to compile older MariaDB releases than older MySQL releases. For MariaDB I did not have to edit any source files. Notes on compiling MySQL are here: for 5.6, for 5.7 and for 8.0. A note on using cmake is here.
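
    As a sketch of the build, assuming an in-source build of MariaDB with a hypothetical install prefix (the real command lines are in the notes linked above):

    cmake . -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/d/ma1141
    make -j8 && make install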

    The my.cnf files are here.

    Benchmarks

    I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement.

    Tests were run on my newest small server, an ASUS PN53 with 8 cores, 32G of RAM, an NVMe SSD with XFS and Ubuntu 22.04. The server is described here.

    The benchmark is run with 3 setups. Some of the microbenchmarks suffer from contention and that is worse with 4u.1t than 4u.4t:
    • 1u.1t - 1 connection, 1 table with 50M rows
    • 4u.1t - 4 connections, 1 table with 50M rows
    • 4u.4t - 4 connections, 4 tables with 48M rows (12M per table)
    In all cases
    • each microbenchmark runs for 300 seconds if read-only and 600 seconds otherwise
    • prepared statements were enabled
    The command lines for my helper scripts were:

    bash r.sh 1 50000000 300 600 nvme1n1 1 1 1
    bash r.sh 1 50000000 300 600 nvme1n1 1 1 4
    bash r.sh 4 12000000 300 600 nvme1n1 1 1 4
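
    For reference, a stock sysbench command with roughly the shape of the 1u.1t setup. My tests use custom Lua scripts driven by the helper scripts above, so this is only illustrative and the connection options depend on the server under test:

    sysbench oltp_point_select --mysql-user=root --tables=1 --table-size=50000000 --threads=1 prepare
    sysbench oltp_point_select --mysql-user=root --tables=1 --table-size=50000000 --threads=1 --time=300 run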

    Results

    For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data and charts is here. For each group I present a chart and a table with summary statistics for the three setups: 1u.1t, 4u.1t and 4u.4t.

    All of the charts have relative throughput on the y-axis where that is (QPS for $me) / (QPS for $base), where $me is some DBMS version (for example MariaDB 11.4.1) and $base is the DBMS version for the base case. The base version is specified below depending on what I am comparing. The y-axis doesn't start at 0 to improve readability. When the relative throughput is > 1 then the throughput for some DBMS version is greater than the throughput for the base case.

    The legend under the x-axis truncates the names I use for the microbenchmarks and I don't know how to fix that other than sharing links (see above) to the Google Sheets I used.

    Results: MariaDB from old to new

    This section uses MariaDB 10.2.44 as the base version and then compares that with MariaDB versions 10.3.39, 10.4.33, 10.5.24, 10.6.17, 10.11.7 and 11.4.1. The goal is to determine how throughput (QPS) changes from older releases like 10.2 to the latest release (11.4). 

    These tables have summary statistics of the throughput for MariaDB 11.4.1 relative to 10.2.44 for each microbenchmark group. A value greater than one means the throughput for MariaDB 11.4.1 is better than for 10.2.44. From the results here, new MariaDB (11.4.1) gets at least 92% with 1u.1t, at least 93% with 4u.1t and at least 92% with 4u.4t of the throughput relative to old MariaDB (10.2.44), using the median relative throughput per microbenchmark group. New features added to MariaDB don't get in the way of performance because there aren't significant regressions over time.

    Summary statistics for 1u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.98            0.98            0.98            0.96        0.92
    median            0.99            0.99            0.94            0.94        0.92
    min               0.89            0.93            0.93            0.90        0.81
    max               1.03            1.03            1.17            1.04        1.00
    stddev            0.04            0.04            0.08            0.06        0.05

    Summary statistics for 4u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.96            0.97            0.97            0.96        1.04
    median            0.97            0.99            0.94            0.93        1.04
    min               0.91            0.92            0.93            0.91        0.93
    max               0.99            1.01            1.16            1.03        1.15
    stddev            0.03            0.04            0.08            0.06        0.08

    Summary statistics for 4u.4t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.97            0.96            0.96            0.94        1.00
    median            0.97            0.99            0.92            0.92        1.00
    min               0.91            0.91            0.90            0.90        0.77
    max               1.00            0.99            1.14            1.00        1.16
    stddev            0.03            0.04            0.08            0.05        0.13

    There are three graphs per microbenchmark group - for 1u.1t, 4u.1t and 4u.4t. The y-axis doesn't begin at zero to improve readability.

    Graphs for point queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t.  
    Graphs for point queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t. 
    Graphs for range queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t. Full table scan gets much faster in modern MariaDB as it does in modern MySQL.
    Graphs for range queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t. 

    Graphs for writes. Results are similar across 1u.1t, 4u.1t and 4u.4t. Modern MariaDB is much faster at the update-index microbenchmark for 4u.1t and 4u.4t, which is similar to what happens with MySQL although the improvements for MySQL are larger. Alas, the speedup in update-index decreases from MariaDB 10.5 through 11.4 as it does for most of the microbenchmarks. That decrease is not visible for the read-heavy microbenchmarks above.

    Results: MySQL from old to new

    This section uses MySQL 5.6.51 as the base version and then compares that with MySQL versions 5.7.44 and 8.0.36. The goal is to determine how throughput (QPS) changes from older to newer releases. 

    These tables have summary statistics of the throughput for MySQL 8.0.36 vs 5.6.51 for each of the microbenchmark groups. A value greater than one means the throughput for MySQL 8.0.36 is better than for 5.6.51. From the results here, new MySQL (8.0.36) gets between 66% and 84% of the throughput relative to old MySQL (5.6.51) using the median relative throughput per microbenchmark group. New features in modern MySQL come at the cost of much CPU overhead.

    Summary statistics for 1u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.71            0.75            0.69            0.86        0.68
    median            0.71            0.72            0.69            0.81        0.66
    min               0.66            0.69            0.67            0.73        0.55
    max               0.76            0.84            0.71            1.03        0.90
    stddev            0.03            0.07            0.01            0.14        0.11

    Summary statistics for 4u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.72            0.75            0.68            0.87        1.07
    median            0.72            0.72            0.68            0.82        0.82
    min               0.67            0.69            0.67            0.74        0.58
    max               0.77            0.84            0.69            1.04        3.59
    stddev            0.03            0.07            0.01            0.14        0.90

    Summary statistics for 4u.4t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           0.72            0.76            0.68            0.88        0.83
    median            0.72            0.72            0.69            0.84        0.81
    min               0.68            0.70            0.67            0.76        0.65
    max               0.78            0.85            0.70            1.04        1.08
    stddev            0.03            0.07            0.01            0.13        0.13

    There are three graphs per microbenchmark group - for 1u.1t, 4u.1t and 4u.4t. The y-axis doesn't begin at zero to improve readability.

    Graphs for point queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t.  There are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0.
    Graphs for point queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t.  There are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0.
    Graphs for range queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t.  There are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0.
    Graphs for range queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t.  There are large regressions from 5.6 to 5.7 and then again from 5.7 to 8.0.
    Graphs for writes. Results are mostly similar across 1u.1t, 4u.1t and 4u.4t with one exception. The graphs truncate the relative throughput values for the update-index microbenchmark because MySQL 5.7 and 8.0 are much faster than 5.6. The values are in the spreadsheet: the relative throughput for 4u.1t is (2.49, 3.59) for (5.7, 8.0) and for 4u.4t is (1.45, 1.08) for (5.7, 8.0). So the improvement is larger for 4u.1t than for 4u.4t, which suggests something was done to greatly reduce contention on that microbenchmark.

    Results: MariaDB vs MySQL

    This section uses MySQL 8.0.36 as the base version and then compares that with MariaDB 11.4.1. The goal is to determine which DBMS gets more throughput (or uses less CPU/query).

    These tables have summary statistics of the throughput for MariaDB 11.4.1 vs MySQL 8.0.36 for each of the microbenchmark groups. A value greater than one means the throughput for MariaDB 11.4.1 is better than for MySQL 8.0.36. From the results here, modern MariaDB (11.4.1) gets between 111% and 123% of the throughput relative to modern MySQL (8.0.36) using the median relative throughput per microbenchmark group. Modern MariaDB is faster than modern MySQL (on this setup) because MySQL has more performance regressions over time.

    There is one outlier from the update-index microbenchmark. For update-index, modern MySQL is faster than modern MariaDB because MySQL 5.7 and 8.0 are much faster than MySQL 5.6, especially in the 4u.1t setup where there is more contention. But the performance for modern MariaDB didn't improve as much as it did for modern MySQL. The result is that modern MySQL is up to 3X faster than modern MariaDB for update-index.

    Summary statistics for 1u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           1.16            1.17            1.18            1.17        1.21
    median            1.18            1.18            1.18            1.11        1.23
    min               1.10            1.12            1.11            1.09        1.06
    max               1.28            1.20            1.37            1.30        1.30
    stddev            0.05            0.03            0.08            0.10        0.08

    Summary statistics for 4u.1t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           1.15            1.18            1.18            1.18        1.15
    median            1.15            1.20            1.18            1.13        1.18
    min               1.06            1.13            1.11            1.12        0.35
    max               1.23            1.23            1.37            1.30        1.60
    stddev            0.05            0.04            0.08            0.09        0.32

    Summary statistics for 4u.4t

                 point, part 1   point, part 2   range, part 1   range, part 2   writes
    average           1.15            1.17            1.16            1.16        1.14
    median            1.15            1.19            1.17            1.11        1.16
    min               1.08            1.11            1.07            1.09        0.92
    max               1.27            1.20            1.33            1.27        1.31
    stddev            0.06            0.04            0.08            0.09        0.11

    There are three graphs per microbenchmark group - for 1u.1t, 4u.1t and 4u.4t. The y-axis doesn't begin at zero to improve readability.

    Graphs for point queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t. And modern MariaDB is faster than modern MySQL in all of the microbenchmarks. 
    Graphs for point queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t. And modern MariaDB is faster than modern MySQL in all of the microbenchmarks.
    Graphs for range queries, part 1. Results are similar across 1u.1t, 4u.1t and 4u.4t. And modern MariaDB is faster than modern MySQL in all of the microbenchmarks.
    Graphs for range queries, part 2. Results are similar across 1u.1t, 4u.1t and 4u.4t. And modern MariaDB is faster than modern MySQL in all of the microbenchmarks.

    Graphs for writes. Results are similar across 1u.1t, 4u.1t and 4u.4t. And modern MariaDB is faster than modern MySQL in all of the microbenchmarks except for update-index with 4u.1t and 4u.4t. The reason is that both modern MariaDB and modern MySQL have gotten faster on that microbenchmark but the improvement for MySQL is larger.



