Thursday, September 26, 2024

InnoDB code bloat in MySQL 8.0 makes me sad

InnoDB uses a lot more CPU per query in MySQL 8.0, which makes it a lot slower for low-concurrency workloads that are CPU-bound. That is offset by improvements that reduce mutex contention, so it might not be slower for high-concurrency workloads.

It would be great to get the reduction in mutex contention without the increase in CPU overhead. Alas, I don't think that is ever going to happen and this makes me sad. I doubt the performance regressions for InnoDB in MySQL 8.0 will ever be fixed and the workarounds are: MyRocks, MariaDB, MySQL 5.7 forever, Postgres.

I explained in a recent post that the problems are code bloat:

  • MySQL 8.0 uses more instructions per query
  • MySQL 8.0 wastes more time on cache and TLB activity per query 
Another recent post shows that the MySQL binary is growing at an alarming rate. This post has more details with a focus on InnoDB. Again, this makes me sad. InnoDB was very good to me for a long time but that time is ending.

I have spent much time trying to find a way to undo some of the bloat and have learned, multiple times, that I can't undo the damage via compile-time options, even though there are a few options that might make things 5% to 20% faster. But some of those options also make older MySQL faster, so they don't close the gap.

Updates
  • v1 of this post claimed that InnoDB started to use Boost in MySQL 8.0.2. That is incorrect; it started to use STL (unordered_map, array) in 8.0.2
  • see Bloat in 8.0.2 from STL below for details on the impact of the diff that adds STL in 8.0.2
Bloat in 8.0.2 from STL

While I suspect that the new usage of STL by InnoDB in fil0fil.cc contributes to the code bloat that I report below, and there is a big jump in the size of fil0fil.o from 8.0.1 to 8.0.2, it is difficult to prove that STL is one of the causes. Even the results here don't prove it because the diff that adds STL in 8.0.2 is a large squash merge with many other changes.

I compiled MySQL at the diff that first uses STL in fil0fil.cc and the one prior to it:
  • 201b2b20d1
    • adds the usage of STL. But it is a squash merge that combines many commits. So STL remains a suspect but I am still not certain. Regardless, the results show that things grow a lot from this commit.
  • 817379925c
    • the diff prior to 201b2b20d1. 
The patches that I had to apply to get these to compile and the CMake command line are here. Too bad it is hard to automate the search for bloat because all of these builds need patches to compile. I won't blame C++ for this because I don't need patches to compile old versions of RocksDB.
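If the builds didn't need per-commit patches, automating the search would be a small loop like this sketch. The commit hashes are the ones listed below; the build options and the location of libinnobase.a and fil0fil.cc.o inside the build tree are assumptions:

  for hash in 817379925c 201b2b20d1; do
    git checkout $hash
    # the per-commit patches would have to be applied here, otherwise these versions don't compile
    cmake -S . -B build-$hash -DCMAKE_BUILD_TYPE=Release
    cmake --build build-$hash -j16
    # report the sizes that matter for the bloat comparison
    find build-$hash -name libinnobase.a -o -name fil0fil.cc.o | xargs ls -l
  done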

The conclusion, from 817379925c to 201b2b20d1:
  • libinnobase.a grows by ~5M from 26492K to 31518K
  • fil0fil.o grows by 2.6X from 548K to 1440K
A few things that don't have a big impact

A short summary of things I tried:
  • MySQL 5.7 started to use -fPIC while compiling the server. That was not used in 5.6. Alas, switching from -fPIC to -fpic doesn't have a big impact.
  • Avoiding -fPIC via -DDISABLE_SHARED (in the cases where that doesn't break the build) also doesn't have a big impact. 
  • Compiling with -DWITH_LTO=ON makes things ~5% faster (a hedged build sketch follows this list)
  • PGO can be a big deal, but needs more evaluation to understand whether I need a PGO build per workload, which is costly to maintain
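A hedged sketch of a Release build with LTO enabled; whether -DWITH_LTO and -DDISABLE_SHARED are supported depends on the MySQL version, so treat the flags and paths as assumptions:

  cmake /path/to/mysql-src -DCMAKE_BUILD_TYPE=Release \
    -DWITH_LTO=ON \
    -DCMAKE_C_FLAGS="-O2 -fno-omit-frame-pointer" \
    -DCMAKE_CXX_FLAGS="-O2 -fno-omit-frame-pointer"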
Measuring code bloat

InnoDB started to use STL in MySQL 8.0.2-dmr and that is likely part of the code-bloat problem (see the update above).

Here I measure bloat using several methods. I am sure some of them are more flawed than others but they all show much bloat in MySQL 8.0. The methods are below, with a sketch of the measurement commands after the list:
  • lines of code (yes, this is far from perfect)
  • size of libinnobase.a for MySQL compiled with CMAKE_BUILD_TYPE=Release
  • size of object files for MySQL compiled with CMAKE_BUILD_TYPE=Release
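To make the three methods concrete, this is a minimal sketch of how each can be measured. The source and build paths are assumptions and cloc is just one of several line-counting tools:

  # lines of code for InnoDB
  cloc storage/innobase
  # size of libinnobase.a from a CMAKE_BUILD_TYPE=Release build directory
  find build -name libinnobase.a -exec ls -lh {} \;
  # size of individual object files, for example the one for fil0fil.cc
  find build -name 'fil0fil.cc.o' -exec size {} \;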
From the files I checked, fil0fil.cc grew the most
Up and up, keep on growing!
Based on the size of object files there is much innovation in fil0fil.cc. There is a big jump from 8.0.1 to 8.0.2 -- STL (unordered_map, array) is first used in 8.0.2, and there were many other changes.
There is small growth up to 8.0.28 and then things take off. If 8.0.40 really fixes bug 111538 then some of this will be undone.

Tuesday, September 24, 2024

Configuration options that make writes faster or slower with MySQL and Postgres

This post is inspired by a comment about a recent blog post I shared. While I don't focus on MySQL vs Postgres comparisons, I sometimes share them. In my benchmarks with CPU-bound workloads, and with fsync disabled (which makes them even more CPU-bound), the general result was that insert throughput was much better with Postgres than with MySQL+InnoDB.

The commenter stated that Postgres and MySQL+InnoDB get similar insert rates when the MySQL binlog is disabled. I rarely test such a config because I never ran MySQL in production without the binlog enabled. Regardless, the commenter is a smart person so I decided to run a few more tests.

Some of the configurations I test here are great for performance but lousy for durability so don't take anything here as tuning advice.

tl;dr

  • MySQL+InnoDB suffers more than Postgres from updates that require index maintenance
  • For MySQL with the binlog disabled, insert throughput gets close to the Postgres results, but not many deployments run primaries with the binlog disabled. The gap is much larger, in favor of Postgres, for updates
  • When fsyncs are enabled the difference is much larger, in favor of Postgres. One reason is that MySQL must do an fsync for the binlog and InnoDB redo while Postgres only does it for the WAL. But there is more going on here and the fsync impact doesn't explain all of the performance differences.

Configurations tested

I used MySQL 8.0.39 and Postgres 17rc1 both compiled from source. The config files are here.

For MySQL I used 12 configurations:

name            binlog  fsync   InnoDB doublewrite
bl0_sync0_dw0   off     off     off
bl0_sync0_dw1   off     off     on
bl0_sync0_dwDO  off     off     DETECT_ONLY
bl0_sync1_dw0   off     on      off
bl0_sync1_dw1   off     on      on
bl0_sync1_dwDO  off     on      DETECT_ONLY
bl1_sync0_dw0   on      off     off
bl1_sync0_dw1   on      off     on
bl1_sync0_dwDO  on      off     DETECT_ONLY
bl1_sync1_dw0   on      on      off
bl1_sync1_dw1   on      on      on
bl1_sync1_dwDO  on      on      DETECT_ONLY

For Postgres I used 12 configurations on the medium server and 8 on the small server (I skipped the 4 that use LZ4 for wal_compression):

name                    fsync   wal_compression full_page_writes
sync0_comp0_fpi0        off     off             off
sync0_comp0_fpi1        off     off             on
sync0_comp1_fpi0        off     pglz            off
sync0_comp1_fpi1        off     pglz            on
sync0_compLZ4_fpi0      off     lz4             off
sync0_compLZ4_fpi1      off     lz4             on
sync1_comp0_fpi0        on      off             off
sync1_comp0_fpi1        on      off             on
sync1_comp1_fpi0        on      pglz            off
sync1_comp1_fpi1        on      pglz            on
sync1_compLZ4_fpi0      on      lz4             off
sync1_compLZ4_fpi1      on      lz4             on

And a few notes:
  • the high durability configuration for MySQL is bl1_sync1_dw1 but that means there will be two fsyncs per commit (one for binlog, one for InnoDB redo)
  • the InnoDB doublewrite options (disabled, enabled, DETECT_ONLY) don't impact performance here. They will matter on workloads that are heavier on IO.
  • by fsync enabled I mean fsync (or fdatasync) is done for any logs that are enabled (Postgres WAL, InnoDB redo, MySQL binlog); enabled for InnoDB means innodb_flush_log_at_trx_commit=1 while disabled means =2 (see the config excerpts after this list)
  • for Postgres I use wal_sync_method=fdatasync but below I often refer to that as fsync
  • for Postgres fsync enabled/disabled is the value for synchronous_commit
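Hedged excerpts of the relevant options, assuming typical option names for these versions (the full config files are linked above):

# MySQL (my.cnf)
innodb_flush_log_at_trx_commit=1    # 1 = fsync redo at commit, 2 = fsync disabled
sync_binlog=1                       # 1 = fsync binlog at commit, 0 = fsync disabled
skip_log_bin                        # only for the bl0_* configs
innodb_doublewrite=DETECT_ONLY      # or ON / OFF

# Postgres (postgresql.conf)
synchronous_commit=on               # off for the sync0_* configs
wal_compression=lz4                 # off, pglz or lz4
full_page_writes=on                 # off for the fpi0 configs
wal_sync_method=fdatasync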

Hardware

I used two server types:

  • small
    • Beelink SER7 7840HS with Ryzen 7 7840HS, 8 AMD cores, AMD SMT disabled, 32G RAM, 1T m.2 SSD for OS, an NVMe device for the database using ext4 and Ubuntu 22.04.
  • medium
    • A c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.
I haven't measured it on either the small or medium server but fdatasync latency might not be excellent on the medium server and definitely isn't excellent on the small server.

Benchmark

I used sysbench and my usage is explained here. But I only ran a few of the tests from it rather than the full 40+. The ones for which I report performance are insert, update-index and update-noindex. Note that update-index means that index maintenance is required and update-noindex means that it is not. 

For the small server the test is run with 1 thread, 1 table and 50M rows. For the medium server it is run with 12 threads, 8 tables and 10M rows per table.
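For reference, this is a rough approximation of the three tests using the stock sysbench Lua scripts (my scripts differ), shown for the small-server shape; the connection options are assumptions:

sysbench oltp_insert           --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 prepare
sysbench oltp_insert           --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run
sysbench oltp_update_index     --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run
sysbench oltp_update_non_index --mysql-user=root --mysql-db=test --tables=1 --table-size=50000000 --threads=1 --time=600 run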

Results: small server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (57%, 83%, 54%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync0_dw1 to bl0_sync0_dw1. The result for update-index is the most interesting; it looks like the issue is the low QPS for update-index even without the binlog.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (80%, 20%, 62%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. The low value for update-index with MySQL supports what I wrote above in the first bullet point.
  • MySQL with the binlog on gets (46%, 17%, 33%) of the throughput relative to Postgres using bl1_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) cuts throughput in half for the update tests and by ~30% for the insert test when wal_compression uses pglz. The impact is much less from lz4, so just use lz4.

When fsync is enabled

  • It is much harder to spot CPU regressions when fsync is enabled, which is why most of my tests disable it
  • MySQL throughput with the binlog on is ~50% relative to the binlog off. This is expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets ~27% of the throughput relative to Postgres and with the binlog on that drops to ~14%, using bl0_sync1_dw1 and bl1_sync1_dw1 vs sync1_comp0_fpi1. These differences are larger than I expect.
  • The impact from full_page_writes=on for Postgres is smaller here than it is above, but again the real issue is the CPU overhead from pglz and the solution is to use lz4.
--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
20413    5662   19837   my8039 bl0_sync0_dw0
20458    5881   20271   my8039 bl0_sync0_dw1
20252    5790   19742   my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
11814    4935   11051   my8039 bl1_sync0_dw0
11614    4910   10992   my8039 bl1_sync0_dw1
11649    4896   11062   my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
25590   28835   32951   pg17rc1 sync0_comp0_fpi0
25471   28840   32837   pg17rc1 sync0_comp0_fpi1
26452   29196   32702   pg17rc1 sync0_comp1_fpi0
18537   14276   16033   pg17rc1 sync0_comp1_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit: InnoDB)
ins     upd-ix  upd-nix dbms
  139     135     142   my8039 bl0_sync1_dw0
  140     134     142   my8039 bl0_sync1_dw1
  140     134     142   my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit: binlog + InnoDB)
ins     upd-ix  upd-nix dbms
   72      72      73   my8039 bl1_sync1_dw0
   73      71      71   my8039 bl1_sync1_dw1
   73      68      70   my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit: WAL)
ins     upd-ix  upd-nix dbms
  512     513     516   pg17rc1 sync1_comp0_fpi0
  505     507     517   pg17rc1 sync1_comp0_fpi1
  512     514     516   pg17rc1 sync1_comp1_fpi0
  438     422     387   pg17rc1 sync1_comp1_fpi1

Results: medium server

The columns list the throughput for the insert (ins), update-index (upd-ix) and update-noindex (upd-nix).

When fsync is disabled

  • MySQL throughput with the binlog on is (44%, 79%, 43%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync0_dw1 to bl0_sync0_dw1. The result for update-index is the most interesting.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here. They will have an impact on workloads that do more write-back and I am happy that InnoDB now supports DETECT_ONLY.
  • MySQL with the binlog off gets (96%, 16%, 70%) of the throughput relative to Postgres using bl0_sync0_dw1 vs sync0_comp0_fpi1. Once again the result for update-index with MySQL is odd.
  • MySQL with the binlog on gets (42%, 13%, 30%) of the throughput relative to Postgres using bl1_sync0_dw1 vs sync0_comp0_fpi1. Most MySQL primaries will run with the binlog enabled.
  • Using Postgres with full_page_writes=on (recommended) reduces throughput by ~5% when wal_compression uses pglz and the solution is lz4.

When fsync is enabled

  • MySQL throughput with the binlog on is (34%, 58%, 41%) of throughput with the binlog off for (insert, update-index, update-noindex), comparing bl1_sync1_dw1 to bl0_sync1_dw1. This is (mostly) expected because the bottleneck is fsync and there are 2 fsyncs/write with the binlog on vs 1 fsync/write with it off.
  • The various choices for the InnoDB doublewrite buffer (off, on, DETECT_ONLY) don't have an impact here.
  • MySQL with the binlog off gets (133%, 68%, 132%) of the throughput relative to Postgres using bl0_sync1_dw1 vs sync1_compLZ4_fpi1.
  • MySQL with the binlog on gets (45%, 39%, 54%) of the throughput relative to Postgres using bl1_sync1_dw1 vs sync1_compLZ4_fpi1. Most MySQL primaries will run with the binlog enabled.
  • The impact from full_page_writes=on for Postgres is large for updates (reduces throughput by ~50%) and smaller for the insert test because the insert test loads in PK order.

--- without fsync

- no binlog for MySQL
ins     upd-ix  upd-nix dbms
117078   17620  104215  my8039 bl0_sync0_dw0
118038   19821  103625  my8039 bl0_sync0_dw1
117373   19582  103625  my8039 bl0_sync0_dwDO

- binlog for MySQL
ins     upd-ix  upd-nix dbms
 52602   15201   44727  my8039 bl1_sync0_dw0
 52193   15736   44372  my8039 bl1_sync0_dw1
 52456   16293   45195  my8039 bl1_sync0_dwDO

- Postgres
ins     upd-ix  upd-nix dbms
127864  138330  154650  pg17rc1 sync0_comp0_fpi0
123512  123825  148386  pg17rc1 sync0_comp0_fpi1
127456  139634  155127  pg17rc1 sync0_comp1_fpi0
119504  117795  136421  pg17rc1 sync0_comp1_fpi1
127885  139301  154869  pg17rc1 sync0_compLZ4_fpi0
125192  125532  149092  pg17rc1 sync0_compLZ4_fpi1

--- with fsync

- no binlog for MySQL (1 fsync per commit for InnoDB redo)
ins     upd-ix  upd-nix dbms
 20602    6824   15804  my8039 bl0_sync1_dw0
 20350    6640   15095  my8039 bl0_sync1_dw1
 20677    6836   16134  my8039 bl0_sync1_dwDO

- binlog for MySQL (2 fsync per commit, one for binlog, one for InnoDB redo)
ins     upd-ix  upd-nix dbms
  6972    3863    6238  my8039 bl1_sync1_dw0
  6963    3829    6183  my8039 bl1_sync1_dw1
  6922    3904    6266  my8039 bl1_sync1_dwDO

- Postgres (1 fsync per commit, for WAL)
ins     upd-ix  upd-nix dbms
 19754   24293   26250  pg17rc1 sync1_comp0_fpi0
 15089    9507   10963  pg17rc1 sync1_comp0_fpi1
 19732   24335   26280  pg17rc1 sync1_comp1_fpi0
 13031    8571    9204  pg17rc1 sync1_comp1_fpi1
 19698   24629   26163  pg17rc1 sync1_compLZ4_fpi0
 15309    9712   11420  pg17rc1 sync1_compLZ4_fpi1

The size of the mysqld binary as a proxy for innovation

I have been documenting performance regressions over time in MySQL. The regressions mean that some workloads get less throughput because the server uses more CPU per SQL operation. From perf stat I see there is more instruction cache and TLB activity per query. I also see that it takes more instructions per query. A recent blog post from me has more details.

This is bloat if you are a pessimist and a side-effect of innovation if you are an optimist. I must repeat that the issue is more serious for low-concurrency workloads because there has been much great work to reduce mutex contention.

The table below shows the size of the mysqld binary both as-is (not stripped, includes debug symbols) and stripped. The size of the stripped binary (almost) doubled from the last 5.6 release (5.6.51) to the last 5.7 release (5.7.44). It doubled again from 5.7.44 to a somewhat recent 8.0 release (8.0.28).
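A minimal sketch of how the two sizes can be measured; the path to mysqld is an assumption and the stripping is done on a copy so the original binary is untouched:

ls -lh bin/mysqld                 # as-is: not stripped, includes debug symbols
cp bin/mysqld /tmp/mysqld.stripped
strip /tmp/mysqld.stripped
ls -lh /tmp/mysqld.stripped       # stripped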

Assuming the binary size is a proxy for innovation then there is much innovation (2X per major release). But I am not sure that ends well given the impact on CPU overhead.

So I ask two things:

  1. Be more careful about innovation going forward
  2. Start using Nyrkio to detect regressions early in the development cycle

        -- size in MB --
version  as-is  stripped
5.6.51    28     16
5.7.1     29     15
5.7.3     31     16
5.7.5     40     19
5.7.7     69     26
5.7.9     71     27
5.7.10    72     27
5.7.19    72     27
5.7.27    73     28
5.7.35    80     30
5.7.44    81     30
8.0.0    117     39
8.0.1    109     39
8.0.2    159     47
8.0.4    176     47
8.0.12   179     50
8.0.13   190     52
8.0.14   192     52
8.0.15   192     52
8.0.16   194     54
8.0.17   196     55
8.0.18   207     57
8.0.19   209     57
8.0.20   213     58
8.0.21   219     58
8.0.22   221     59
8.0.23   223     60
8.0.24   225     60
8.0.25   225     60
8.0.26   242     62
8.0.27   232     61
8.0.28   233     61

Monday, September 23, 2024

Why do table scans get slower with MySQL from 5.6 to 8.0?

As I search to explain why MySQL is getting (too much) slower over time I collect a lot of data from many servers and then get overwhelmed trying to summarize it. Here I focus on one microbenchmark (table scan) on one server (ax162-s from Hetzner) using upstream MySQL with InnoDB and FB MyRocks.

While my primary goal is to explain the regressions in MyRocks, many of them are inherited from upstream MySQL so I must first explain them in upstream MySQL with InnoDB.

My focus is on a test that does full table scans so the ratio of (time spent in storage engine / time spent elsewhere) is much larger for that test than many of the other tests I use with sysbench.

The basic problem is that the table scan done by my tests is

  • ~1.5X faster for InnoDB in MySQL 5.6.51 vs 8.0.39
  • ~1.5X faster for MyRocks in FB MySQL 5.6.35 vs 8.0.32

The problem is spread over many releases, but the summary is:

  • Code bloat - more instructions are executed per SQL statement
    • InnoDB 8.0.28 uses ~1.3X more instructions/query than InnoDB 5.6.51
    • MyRocks 8.0.32 uses ~1.3X more instructions/query than MyRocks 5.6.35

  • Memory system bloat - more TLB and cache misses per SQL statement
    • InnoDB 8.0.28 has a lot more cache & TLB loads & misses relative to InnoDB 5.6.51. The difference ranges from ~1.6X more cache-misses to ~1700X more iTLB-loads.
    • MyRocks 8.0.32 has a lot more cache & TLB loads & misses relative to MyRocks 5.6.35. The difference ranges from ~1.6X more cache-misses to ~22X more iTLB-loads.

    The benchmark

    I used sysbench but not as described here. Instead I just did load, a few point-query tests and then the full scan test. Here I focus on the full scan. I ran a subset of the tests to get more results in less time.

    Tests are run with 1 thread, 1 table and 50M rows. While modern MySQL suffers from new CPU overheads it also benefits from a reduction in mutex contention. Alas, I don't highlight the improvements here. And modern MySQL would be even faster if it could reduce mutex contention without increasing CPU overheads.

    I ran things 3 times:

    1. plain - just run the test, don't use perf to get HW counters or flamegraphs
    2. stat - run perf stat to collect HW counters
    3. svg - run perf record -e cycles -F 1999 -g -p $pid to generate flame graphs
    For stat the counters are sampled over 10-second intervals for several groups of counters via:

    perf stat -e cpu-clock,cycles,bus-cycles,instructions,branches,branch-misses -p $pid -- sleep 10
    perf stat -e cache-references,cache-misses,stalled-cycles-backend,stalled-cycles-frontend -p $pid -- sleep 10
    perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores -p $pid -- sleep 10
    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -p $pid -- sleep 10
    perf stat -e iTLB-load-misses,iTLB-loads,L1-icache-loads-misses,L1-icache-loads -p $pid -- sleep 10
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetches -p $pid -- sleep 10
    perf stat -e alignment-faults,context-switches,migrations,major-faults,minor-faults,faults -p $pid -- sleep 10

    Database versions tested

    For upstream MySQL I tested InnoDB from 5.6.51, 5.7.1, 5.7.3, 5.7.5, 5.7.7, 5.7.9, 5.7.10, 5.7.19, 5.7.35, 5.7.44, 8.0.0, 8.0.2, 8.0.4, 8.0.11, 8.0.13, 8.0.15, 8.0.17, 8.0.19, 8.0.20, 8.0.21, 8.0.23, 8.0.25, 8.0.27, 8.0.28, 8.0.39.

    For MyRocks I tested:

    • 5635.old - MyRocks 5.6.35 from Apr 7, 2021 at git hash f896415f with RocksDB 6.19.0
    • 5635.new - MyRocks 5.6.35 from Oct 16, 2023 at git hash 4f3a57a1 with RocksDB 8.7.0
    • 8028.old - MyRocks 8.0.28 from Aug 29 2022 at git hash a35c8dfe with RocksDB 7.5.2
    • 8028.new - MyRocks 8.0.28 from Dec 2 2023 at git hash 4edf1eec with RocksDB 8.7.0
    • 8032.old - MyRocks 8.0.32 from Dec 4 2023 at git hash e3a854e8 with RocksDB 8.7.0
    • 8032.new - MyRocks 8.0.32 from May 29 2024 at git hash 49b37dfe with RocksDB 9.2.1

    Charts: InnoDB

    The charts use relative QPS which is: (QPS for my version / QPS for InnoDB 5.6.51). When it is less than 1.0 there is a performance regression. The spreadsheet with charts is here.

    The regressions during 5.7 are small.
    The regressions during 8.0 are larger. The worst occurs after 8.0.28 which hopefully is resolved in the soon-to-be released 8.0.40 -- see bug 111538.
    Charts: MyRocks

    The charts use relative QPS which is: (QPS for my version / QPS for MyRocks 5635.old). When it is less than 1.0 there is a performance regression. The spreadsheet with charts is here.

    The large regression occurs during the 5.6.35 to 8.0.28 transition.
    Hardware counters

    Above I explained how perf stat is used to collect HW counters over 10-second intervals. It is important to remember that measurements are made over a fixed amount of time rather than a fixed amount of work (queries). I want to know things like how many instructions are executed per query and for that I have to normalize the results from perf stat by QPS.
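    As a worked example of the normalization, if a 10-second perf stat sample counts 122 billion instructions while the server sustains a hypothetical 20,000 QPS, then:

      instructions/query = instructions in sample / (QPS * sample seconds)
                         = 122,000,000,000 / (20,000 * 10)
                         = 610,000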

    But before I spend too much time evaluating results from perf stat I confirm that the clock rate during the samples is similar. I learned this the hard way when using the schedutil CPU frequency governor with MyRocks (see here). The information for this comes from perf stat -e cycles via the line for cycles:

             10,010.49 msec cpu-clock                 #    1.001 CPUs utilized
        27,526,557,704      cycles                    #    2.750 GHz
       <not supported>      bus-cycles
       122,334,672,524      instructions              #    4.44  insn per cycle
        23,595,569,084      branches                  #    2.357 G/sec
             9,662,382      branch-misses             #    0.04% of all branches

          10.002641688 seconds time elapsed

    I assume that the 10-second sample is representative and that is more true for read-heavy tests and less true for write-heavy tests because write-heavy has intermittent background work (b-tree write back, LSM compaction).

    Using relative values

    I focus on using relative values for the counters which is: (value for my version) / (value for base case). This makes it easy to see how things change over time from old to new releases but it doesn't show the significance of the change. For example, in many cases the value for context switches during a table scan grows by a large amount (as in 40X) but that isn't significant because the change might be from 10 context switches per second to 400. And I doubt such a change has any impact on the performance regressions.

    Unfortunately, it isn't trivial to understand the significance of changes for other metrics. For example, when should I care more about cache vs TLB misses?

    The data

    At last, I can share some of the data collected by perf stat. The data is here for:
    • InnoDB 5.6.51 to 5.7.44
      • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.
    • InnoDB 5.7.44 to 8.0.39
      • see here. The gist has three parts -- first is the counters (as relative values) that might explain the regressions, the second is the absolute values for those counters and the third is all of the counters as relative values.
    • MyRocks 5.6.35 to 8.0.32
      • see here. The gist has two parts -- first is the counters that might explain the regression and the second is all of the counters. Everything uses relative values.
    First up is InnoDB:
    • MySQL 8.0.28 uses ~1.3 times more instructions/query than 5.6.51
    • MySQL 8.0.28 has 1.6X or more cache & TLB misses and loads per query than 5.6.51. The worst ratio is for iTLB-loads (over 1700X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MySQL 8.0.28) / (value/query for MySQL 5.6.51)

    instructions            (.991 / .109) / (1 / .135) =    1.227
    cache-misses            (1.301 / .109) / (1 / .135) =   1.611
    cache-references        (1.504 / .109) / (1 / .135) =   1.862
    dTLB-loads              (2.903 / .109) / (1 / .135) =   3.595
    iTLB-load-misses        (2.011 / .109) / (1 / .135) =   2.490
    iTLB-loads              (1412.0 / .109) / (1 / .135) =  1748.903
    L1-dcache-load-misses   (1.723 / .109) / (1 / .135) =   2.134
    L1-icache-loads         (3.101 / .109) / (1 / .135) =   3.840
    L1-icache-loads-misses  (15.607 / .109) / (1 / .135) =  19.330
    stalled-cycles-backend  (1.427 / .109) / (1 / .135) =   1.767

    Next up is MyRocks:
    • MyRocks 8032.new uses ~1.3X more instructions/query vs 5635.old
    • MyRocks 8032.new has 1.6X or more cache & TLB misses and loads per query than 5635.old. The worst ratio is for iTLB-loads (over 22X) but I don't know whether the absolute cost from that is more significant than for the other counters.
    (value/query for MyRocks 8032.new) / (value/query for 5635.old)

    instructions            (0.883 / .049) / (1 / .073) =   1.315
    cache-references        (1.410 / .049) / (1 / .073) =   2.100
    dTLB-loads              (1.253 / .049) / (1 / .073) =   1.866
    iTLB-load-misses        (1.104 / .049) / (1 / .073) =   1.644
    iTLB-loads              (15.068 / .049) / (1 / .073) =  22.449
    L1-dcache-load-misses   (1.275 / .049) / (1 / .073) =   1.899
    L1-icache-loads         (1.432 / .049) / (1 / .073) =   2.133
    L1-icache-loads-misses  (5.306 / .049) / (1 / .073) =   7.905
    stalled-cycles-frontend (1.429 / .049) / (1 / .073) =   2.128

    Friday, September 20, 2024

    Ubuntu 22.04, Linux kernels and me

    This is a summary of interesting performance issues I have had with the HWE kernels (6.5.x) on Ubuntu 22.04, mostly this year. This is mostly a note to my future self.

    • Use ext4 instead of XFS
      • I have used XFS since pre-2010 because it avoided a few perf problems that show up in the ext family. But I finally found a reason to switch to ext4. The problem is that with 6.5 kernels (when HWE is used) there are perf problems and many error messages (WQ_UNBOUND, CPU hogging). Some details are here. I am curious if XFS needs a few updates to match whatever changed in the 6.5 kernels.
    • Switch from schedutil to performance CPU frequency governor
      • This issue hurt MyRocks a lot more than Postgres or InnoDB. I don't know the root cause but MyRocks perf was lousy with the schedutil CPU frequency governor. See here for more details from me, and this LWN post might be relevant (commands to switch the governor and check max_sectors_kb are after this list).
    • readahead requests are dropped
      • RocksDB changed how readahead is done when O_DIRECT is not used and the change uses the readahead call. It looks like readahead requests are ignored when the request size is larger than the value of max_sectors_kb. Alas, to figure out the value of max_sectors_kb you must read /sys/block/$device/queue/max_sectors_kb and I prefer to avoid that (whether I have to do it, or RocksDB code does it). Surely there is a better way to handle this.
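    For the governor and readahead items above, the relevant commands are short. This is a minimal sketch; the device name and the presence of the cpupower utility are assumptions:

      # switch every CPU from schedutil to the performance governor
      sudo cpupower frequency-set -g performance
      # check the limit that caps the size of readahead() requests
      cat /sys/block/nvme0n1/queue/max_sectors_kb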


    Saturday, September 14, 2024

    MySQL and Postgres vs the Insert Benchmark on a large server

    This has benchmark results for MySQL and Postgres vs the Insert Benchmark on a large server. My intent is to document how performance changes over time more so than start a Postgres vs MySQL argument.

    MySQL has accumulated large regressions from 5.6 to 8.0 that are obvious on low-concurrency benchmarks. While they are less obvious on high-concurrency benchmarks, and there have been significant improvements to make MySQL 8 better at high-concurrency, the regressions that hurt low-concurrency results also reduce throughput at high-concurrency.

    tl;dr

    • For Postgres 17rc1 vs 15.8
      • Performance is mostly the same
    • For MySQL 8.0.39 vs 5.6.51
      • Writes are much faster in 8.0.39
      • Read throughput is mixed: 8.0.39 is slower than, similar to, and faster than 5.6.51 depending on the context
    • For MySQL vs Postgres
      • MySQL is faster on point queries
      • Postgres is faster on writes and range queries
    Builds, configuration and hardware

    I compiled from source:
    • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer
    • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE=Release and -O2 -fno-omit-frame-pointer
    The server is an ax162-s from Hetzner with 48 AMD cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The Postgres configuration file is here.
    The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39

    The Benchmark

    The benchmark is explained here and is run with 20 clients and a table per client with three workloads:
    • cached - database fits in the Postgres buffer pool
    • less IO-bound - database is larger than memory with 64M rows per table
    • more IO-bound - database is larger than memory with 200M rows per table
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 20 for cached, 64 for less IO-bound and 200 for more IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for less/more IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for less/more IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here:
    The summary section in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is MySQL 5.6.51 for MySQL vs Postgres, MySQL 5.6.51 for only MySQL and Postgres 15.8 for only Postgres.

    When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: cached

    From the summary for only Postgres:
    • Write-heavy steps are up to 11% faster in 17rc1 vs 15.8
    • Read-heavy steps are unchanged in 17rc1 vs 15.8
    From the summary for only MySQL:
    • Write-heavy steps are ~2.5X to ~4X faster in 8.0.39 vs 5.6.51
    • Read-heavy steps are between 12% and 27% slower in 8.0.39 vs 5.6.51 
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
    • MySQL 8.0.39 and Postgres 17rc1 get similar QPS on point queries
    Results: less IO-bound

    From the summary for only Postgres:
    • Write-heavy steps are up to 13% faster in 17rc1 vs 15.8 except for l.i2 which suffers from an intermittently high CPU overhead from get_actual_variable_range
    • Read-heavy steps are mostly similar between 17rc1 and 15.8
    From the summary for only MySQL:
    • Write-heavy steps are ~1.9X to ~4X faster in 8.0.39 vs 5.6.51
    • Read-heavy steps are up to 24% slower in 8.0.39 vs 5.6.51
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
      • The range queries are covering and the secondary indexes are mostly cached so the tests are CPU bound for both Postgres and MySQL (see rpq here)
    • MySQL 8.0.39 is faster on point queries
      • MySQL does ~0.1 reads/query while Postgres does ~3 (see rpq here). The table is index organized with MySQL and heap organized with Postgres. So with InnoDB the point queries just access the PK index while with Postgres they first access the PK index and then the heap table to get all of the columns.
    Results: more IO-bound

    From the summary for only Postgres:
    • Write-heavy steps have similar throughput between 17rc1 and 15.8 ignoring index create (l.x) and variance from l.i2 (problems with get_actual_variable_range)
    • Read-heavy steps are up to 6% faster in 17rc1 vs 15.8
    From the summary for only MySQL:
    • Write-heavy steps are up to ~4X faster in 8.0.39 vs 5.6.51
    • Point queries are up to ~3X faster in 8.0.39 vs 5.6.51
    • Range queries are up to 27% slower in 8.0.39 vs 5.6.51
    From the summary for MySQL vs Postgres:
    • Postgres 17rc1 is faster on write-heavy and range queries
      • See the less IO-bound section above
      • MySQL does more read IO per range query (see rpq here)
    • MySQL 8.0.39 is faster on point queries
      • See the less IO-bound section above
      • Postgres does ~2X more read IO per point query (see rpq here)








    Thursday, September 12, 2024

    HTAP benchmarks: trying out HATtrick with MySQL

    For a few years I have only used sysbench and the Insert Benchmark for work but now I need an HTAP benchmark. There are several to choose from but I will start with HATtrick. The source is here. It uses C++ and ODBC. I prefer Java (and JDBC), golang or Python but let's see how this goes.

    My current OS is Ubuntu 22.04 and I install MySQL in custom locations so I assume that I need to install MySQL's Connector/ODBC from source. That depends on an ODBC driver manager, and for Ubuntu I will try iodbc. I have no experience with ODBC and now I am worried about pointless complexity (I am looking at you Log4J, CORBA and EJB).

    The benchmark looks interesting and useful but I'd rather not deal with setting up ODBC again, nor with core dumps. So I will move on and try CH-benchmark from benchbase (Java + JDBC).

    The details

    So the first step is to install iodbc via: sudo apt install libiodbc2 libiodbc2-dev

    Then configure MySQL Connector/ODBC. I didn't expect to need ODBC_INCLUDES= because CMake was able to find iodbc without it, but I do because that path wasn't added to CXXFLAGS unless I added it:

    cmake .. -G "Unix Makefiles" \
      -DMYSQLCLIENT_STATIC_LINKING=ON \
      -DBUNDLE_DEPENDENCIES=ON \
      -DMYSQL_CONFIG_EXECUTABLE=/path/to/mysql_config \
      -DDISABLE_GUI=1 \
      -DCMAKE_INSTALL_PREFIX=/install/path/to/myodbc \
      -DODBC_INCLUDES=/usr/include/iodbc

    make; make install


    Then I compile HATtrick after downloading it from GitHub. I had to edit the Makefile:


    diff --git a/Makefile b/Makefile
    index b55f767..fea8860 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -1,6 +1,7 @@
     CC=g++
    -CXXFLAGS=-c -std=c++2a
    -LDFLAGS=-lodbc -pthread
    +CXXFLAGS=-c -std=c++2a -I/usr/include/iodbc
    +LDFLAGS=-L/home/me/d/myodbc/lib -lmyodbc8a -pthread
    +#LDFLAGS=-lodbc -pthread
     SRCS= src/DataGen.cpp src/DataSrc.cpp src/UserInput.cpp src/Driver.cpp \
     src/DBInit.cpp src/SQLDialect.cpp src/Barrier.cpp src/AnalyticalClient.cpp \
     src/TransactionalClient.cpp src/Workload.cpp src/Globals.cpp \


    Then I compile HATtrick. The scale factor is hardcoded in src/UserInput.h so you probably need to edit it and compile once per scale factor you want to try. Compiling just means: make


    Then generate data. I assume that ":" is a good value for the delimiter. I had to add LD_LIBRARY_PATH to find the MySQL ODBC libraries that I compiled above.

    LD_LIBRARY_PATH=/home/me/d/myodbc/lib \
    ./HATtrickBench -gen -pa /home/me/hatrick.data -d :


    Then load the database. I added secure_file_priv="" to etc/my.cnf for this to work
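    This is the my.cnf excerpt I mean; setting the option to an empty string removes the restriction on which directories LOAD DATA INFILE can read from:

    [mysqld]
    secure_file_priv=""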


    LD_LIBRARY_PATH=/home/me/d/myodbc/lib \
    ./HATtrickBench -init -pa /home/me/hatrick.data -usr root -pwd pw -db mysql


    All 4 steps completed OK but there was a core dump at the end. While it might be easy to fix I will move on to CH-benchmark which is in benchbase as that means Java + JDBC so I don't have to deal with ODBC, nor must I worry about core dumps.

    Program received signal SIGABRT, Aborted.
    __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:44
    44      ./nptl/pthread_kill.c: No such file or directory.
    (gdb) where
    #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:44
    #1  __pthread_kill_internal (signo=6, threadid=140737352611648) at ./nptl/pthread_kill.c:78
    #2  __GI___pthread_kill (threadid=140737352611648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
    #3  0x00007ffff6e42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007ffff6e287f3 in __GI_abort () at ./stdlib/abort.c:79
    #5  0x00007ffff6e2871b in __assert_fail_base (fmt=0x7ffff6fdd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff6fd9599 "mutex->__data.__owner == 0", file=0x7ffff6fd956e "pthread_mutex_lock.c", line=94, function=<optimized out>)
        at ./assert/assert.c:92
    #6  0x00007ffff6e39e96 in __GI___assert_fail (assertion=assertion@entry=0x7ffff6fd9599 "mutex->__data.__owner == 0", file=file@entry=0x7ffff6fd956e "pthread_mutex_lock.c", line=line@entry=94,
        function=function@entry=0x7ffff6fe1840 <__PRETTY_FUNCTION__.0> "___pthread_mutex_lock") at ./assert/assert.c:101
    #7  0x00007ffff6e980d0 in ___pthread_mutex_lock (mutex=<optimized out>) at ./nptl/pthread_mutex_lock.c:94
    #8  0x0000555555579fdf in __gthread_mutex_lock(pthread_mutex_t*) ()
    #9  0x000055555557a034 in std::mutex::lock() ()
    #10 0x000055555557a30f in std::unique_lock<std::mutex>::lock() ()
    #11 0x000055555557a1a3 in std::unique_lock<std::mutex>::unique_lock(std::mutex&) ()
    #12 0x00007ffff770caca in ENV::remove_dbc(DBC*) () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #13 0x00007ffff770cf2e in DBC::~DBC() () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #14 0x00007ffff770d610 in my_SQLFreeConnect(void*) () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #15 0x00007ffff770e4bb in SQLFreeHandle () from /home/markcallaghan/d/myodbc/lib/libmyodbc8a.so
    #16 0x000055555556ff64 in Driver::disconnectDB(void*&) ()
    #17 0x000055555558a6c8 in main ()


    MySQL + InnoDB vs sysbench on a large server

    This has benchmark results for MySQL 5.6.51, 5.7.44 and 8.0.39 using sysbench on a large server. I also add a few results comparing Postgres 17rc1 with MySQL 8.0.39. My goal with that is to highlight things that can be made better in MySQL (and in Postgres).

    One of the responses to my claims that MySQL is getting too many performance regressions over time is that this is true for low-concurrency tests but not for high-concurrency tests. Alas, that claim is more truthy than true and fixing some of these regressions would help make modern MySQL not look so slow when compared to modern Postgres.

    tl;dr

    • MySQL 8.0 is faster than 5.6 for point queries and writes but slower for range queries
    • Postgres 17rc1 is a lot faster than MySQL 8.0 for point queries and writes. For range queries Postgres was still faster but the difference was usually not as large.

    Builds, configuration and hardware

    I compiled:
    • Postgres version 17rc1 from source using -O2 -fno-omit-frame-pointer
    • MySQL versions 5.6.51, 5.7.44 and 8.0.39 from source using -DCMAKE_BUILD_TYPE=Release and -O2 -fno-omit-frame-pointer
    The server is an ax162-s from Hetzner with 48 AMD cores, 128G RAM and AMD SMT disabled. It uses Ubuntu 22.04 and storage is ext4 using SW RAID 1 over 2 locally attached NVMe devices. More details on it are here. At list prices a similar server from Google Cloud costs 10X more than from Hetzner.

    The Postgres configuration file is here.
    The MySQL configuration files are here for 5.6.51, 5.7.44 and 8.0.39.

    Benchmark

    I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres.

    For the large server the tests run with 8 tables and 10M rows/table. There are 40 client threads, read-heavy microbenchmarks run for 180 seconds and write-heavy run for 300 seconds. The command line to run all tests was: bash r.sh 8 10000000 180 300 md2 1 1 40

    Results

    For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here.

    Values from iostat and vmstat divided by QPS are here for MySQL 5.6, 5.7, 8.0 and here for MySQL 8.0.39 vs Postgres 17rc1. These help to explain why something is faster or slower because it shows how much HW is used per request.

    The numbers in the spreadsheets are the relative QPS and $version is faster than the base case when it is greater than 1.0. When it is 3.0 then $version is 3X faster than the base case.

    The relative QPS is computed by: (QPS for $version) / (QPS for base case).

    For the comparison between MySQL 5.6.51, 5.7.44 and 8.0.39 the base case is 5.6.51 and $version is one of 5.7.44 or 8.0.39.

    For the comparison between MySQL 8.0.39 and Postgres 17rc1 the base case is MySQL and $version is Postgres.

    Results: MySQL 5.6.51 vs 5.7.44 vs 8.0.39

    Point queries, part 1
    • MySQL 8.0 is slower than 5.6 on 4 of 11 tests and faster on 7 of 11 tests. 
    • The 4 on which it is slower are all for secondary index lookups. From vmstat metrics, when MySQL 8.0 is slower the problem is CPU overhead (see cpu/o here).
    • The 7 on which it is faster are all for PK lookups. From two of those tests, it is faster because it uses less CPU per query (see cpu/o here).
    Point queries, part 2
    • MySQL 8.0 is faster than 5.6 in all tests here. All of these tests use a PK index. The perf difference is larger for queries that fetch more data. This suggests that new optimizer overheads prevent it from being faster in cases when it fetches less data and/or InnoDB in MySQL 8.0 doesn't have regressions for point queries relative to MySQL 5.6.
    • From vmstat metrics (see cpu/o here) MySQL 8.0 is faster because it uses less CPU
    Range queries, part 1
    • MySQL 5.7 and 8.0 are slower than 5.6 for range queries. This is a surprise and the regressions are getting larger over time (worse in 8.0 than 5.7). The queries here do range scans of various lengths without aggregation.
    • From vmstat metrics (see cpu/o here) MySQL 8.0 is slower because it uses more CPU per query
    Range queries, part 2
    • All of the tests here do some aggregation except for scan (the right most bars). And MySQL 8.0 is faster than 5.6 for all such tests.
    • When MySQL 8.0 is faster the primary reason is that it uses less CPU per query (see cpu/o here) and sometimes it also does fewer context switches per query (see cs/o here) where a high context switch rate often implies mutex contention.
    Writes
    • MySQL 8.0 is significantly faster than 5.6 for all tests. These show the benefit of the work to improve performance for high-concurrency workloads.
    • MySQL 8.0 is faster because it uses less CPU per SQL statement (see cpu/o here) and does fewer context switches per SQL statement (see cs/o here). A high context switch rate often implies mutex contention.
    Results: MySQL 8.0.39 vs Postgres 17rc1

    Point queries, part 1
    • The result for hot-points is an outlier, perhaps because prepared statements are better in Postgres than in MySQL. In MySQL, server-side prepared statements just save on parse. The difference is large because MySQL uses a lot more CPU than Postgres (see cpu/o here).
    • In most other tests Postgres is ~1.5X faster than MySQL and the root cause is that MySQL uses more CPU per query (see cpu/o here). For the context switch rate, sometimes it is larger for MySQL and sometimes it is smaller (see cs/o here).
    • Excluding hot-points, the largest difference occurs on tests that do non-covering secondary index lookups. InnoDB tables are index-organized and for non-covering secondary index lookups to fetch the missing columns InnoDB must traverse the PK index while Postgres just uses the rowID (tuple ID) to access the heap pages.
    Point queries, part 2
    • Postgres is ~1.5X faster for point queries, see above in Point queries, part 1 for more details
    Range queries, part 1
    • Postgres is much slower than InnoDB on scan (full table scan) and on range-notcovered-pk. I can only wave hands about the scan result; for range-notcovered-pk, Postgres must scan the PK index and then look up the missing columns from heap pages while InnoDB has all columns in the PK index because it is index-organized. For scan, the CPU overhead and context switch rates are much larger for Postgres (see cpu/o and cs/o here).
    • Postgres is much faster on the two tests that do non-covering secondary index scans because it fetches missing columns by following the tupleID to a heap page while InnoDB must traverse the PK index.
    • Excluding the 4 outliers, InnoDB and Postgres have similar performance
    Range queries, part 2
    • The performance difference here is smaller than it is above for point queries. In one case, MySQL is much faster, in three cases Postgres is much faster and in two cases perf is similar.
    • For tests run before writes (less b-tree fragmentation) or after writes (more b-tree fragmentation) the pattern is that the advantage for Postgres is larger when the length of the range query is shorter -- see results for read-only.pre_range=X and read-only_range=X for X in 10000, 100 and 10. My guess is that MySQL optimizer overhead hurts performance and that is more obvious for queries that take less time -- see the CPU overhead metrics here (cpu/o columns).
    Writes
    • Postgres is faster for all tests that do writes and is up to ~5X faster. The reason is that Postgres uses less CPU per statement (see cpu/o here) and does fewer context switches per statement (see cs/o here) where context switch rates often predict mutex contention.
    • The difference is smaller for read-write* tests because they do a mix of reads and writes.
