Tuesday, August 27, 2024

Postgres 17beta3 vs the Insert Benchmark on a medium server: looking good

This has benchmark results for Postgres 12 through 17beta3 using the Insert Benchmark and a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large. A recent result up to Postgres 17 beta2 from the same server is here

This work was done by Small Datum LLC.

    tl;dr

    • 17beta3 looks (mostly) good
    • There might be regressions in 17 beta1, beta2 and beta3 on the l.i1 and l.i2 benchmark steps related to get_actual_variable_range
    Builds, configuration and hardware

    I compiled Postgres versions 12.19, 12.20, 13.15, 13.16, 14.12, 14.13, 15.7, 15.8, 16.3, 16.4, 17beta1, 17beta2 and 17beta3 from source using -O2 -fno-omit-frame-pointer.

    The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

    The configuration file is in the pg* subdirectories here with the name conf.diff.cx10a_c24r64.

    The Benchmark

    The benchmark is explained here and is run with 8 clients and a table per client with two workloads:
    • cached - database fits in the Postgres buffer pool
    • IO-bound - database is larger than memory and there are many reads from disk
    The benchmark steps are:

    • l.i0
      • insert X million rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. The value of X is 10 for cached and 128 for IO-bound.
    • l.x
      • create 3 secondary indexes per table. There is one connection per client.
    • l.i1
      • use 2 connections/client. One inserts X rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. The value of X is 40M for cached and 4M for IO-bound.
    • l.i2
      • like l.i1 but each transaction modifies 5 rows (small transactions) and X rows are inserted and deleted per table. The value of X is 10M for cached and 1M for IO-bound.
      • Wait for X seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of X is a function of the table size.
    • qr100
      • use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. This step is run for 1800 seconds. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested.
    • qp100
      • like qr100 except uses point queries on the PK index
    • qr500
      • like qr100 but the insert and delete rates are increased from 100/s to 500/s
    • qp500
      • like qp100 but the insert and delete rates are increased from 100/s to 500/s
    • qr1000
      • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
    • qp1000
      • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
    Results: overview

    The performance reports are here for cached and for IO-bound.

    The summary (for cached and for IO-bound) in each performance report has 3 tables. The first shows absolute throughput by DBMS tested X benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts and all systems sustained the target rates. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

    Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is my version and $base is the version of the base case. The base case here is Postgres 16.3 and the versions for $me are 17beta1 and 17beta2. When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures: 
    • insert/s for l.i0, l.i1, l.i2
    • indexed rows/s for l.x
    • range queries/s for qr100, qr500, qr1000
    • point queries/s for qp100, qp500, qp1000
    Below I use colors to highlight the relative QPS values with red for <= 0.95, green for >= 1.05 and grey for values between 0.95 and 1.05.

    Results: cached

    Summary:
    • Results for 17beta3 are similar to 16.4 with one exception. Results for the l.i2 benchmark step in 17 beta1, beta2 and beta3 are between 5% and 8% worse than in 16.4. I assume this is another problem related to get_actual_variable_range
      • This is confusing because 17 beta3 does better than the base case on l.i1 and the l.i1 workload is similar to l.i2 except there are more rows modified per transaction (so the optimizer overhead is amortized over more work).
      • For Postgres 17beta I see a slight increase in CPU per operation (cpupq) and a slight reduction in context switches per operation (cspq) in the metrics section relative to Postgres 16.4.
    From the summary the relative throughput per benchmark step is below. The base case is Postgres 12.19.
    • l.i0
      • relative QPS is 1.02 in PG 16.4
      • relative QPS is 0.98 in PG 17 beta3
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.06, 1.00 in PG 16.4
      • relative QPS is 1.090.92 in PG 17 beta3
    • qr100, qr500, qr1000
      • relative QPS is 1.031.041.04 in PG 16.4
      • relative QPS is 1.041.051.08 in PG 17 beta3
    • qp100, qp500, qp1000
      • relative QPS is 0.990.990.99 in PG 16.4
      • relative QPS is 0.980.980.98 in PG 17 beta3
    Results: IO-bound

    Summary:
    • Results for 17beta3 are similar to 16.4 with one exception. Results for the l.i1 and l.i2 benchmark steps in 17 beta1, beta2 and beta3 are mostly much worse. I assume this is another problem related to get_actual_variable_range
      • This is confusing because 17 beta3 does better than the base case on l.i1 and the l.i1 workload is similar to l.i2 except there are more rows modified per transaction (so the optimizer overhead is amortized over more work).
      • For Postgres 17beta I see a an increase in CPU per operation (cpupq) in the metrics section relative to Postgres 16.4.
    From the summary the relative throughput per benchmark step is below. The base case is Postgres 12.19.
    • l.i0
      • relative QPS is 1.00 in PG 16.4
      • relative QPS is 0.96 in PG 17 beta3
    • l.x - I ignore this for now
    • l.i1, l.i2
      • relative QPS is 1.011.45 in PG 16.4
      • relative QPS is 0.881.01 in PG 17 beta3
    • qr100, qr500, qr1000
      • relative QPS is 0.980.980.98 in PG 16.4
      • relative QPS is 1.001.00, 0.99 in PG 17 beta3
    • qp100, qp500, qp1000
      • relative QPS is 1.01, 1.011.01 in PG 16.4
      • relative QPS is 1.011.011.01 in PG 17 beta3

    Friday, August 16, 2024

    Postgres 17beta3 vs sysbench on a small server: looking good

    This has benchmark results for Postgres 17beta3 using sysbench and a small server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large.

    A recent result for Postgres 17beta3 on a medium server is here.

      tl;dr - 17beta3 looks great

      • There are no regressions
      • Throughput on write microbenchmarks is often ~5% to ~10% better than 16.x
      • Throughput on hot-points (read-only) is more than 2X faster in 17beta than 16.x
      Builds, configuration and hardware

      I compiled Postgres versions 10.23, 11.22, 12.20, 13.16, 14.13, 15.7, 15.8, 16.0, 16.1, 16.2, 16.3, 16.4, 17beta1, 17beta2 and 17beta3 from source.

      The server is named v3 or PN53 here and has 8 AMD cores with SMT disabled, 16 G of RAM and uses Ubuntu 22.04 and XFS with 1 m.2 device. I need to switch to ext-4 soon to match what I use elsewhere.

      The configuration files have the name conf.diff.cx10a_c8r32 and are in the pg* subdirectories here.

      Benchmark

      I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres with 1 table and 50M rows. There is 1 client thread (connection), read-heavy microbenchmarks run for 300 seconds and write-heavy run for 600 seconds.

      The command lines for my helper script was:
      bash r.sh 1 50000000 300 600 md0 1 1 1

      Results

      For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here

      Values from iostat and vmstat divided by QPS are here. This can help to explain why something is faster or slower because it shows how much HW is used per request.

      The numbers in the spreadsheets are the relative QPS as below. When the relative QPS is > 1 then $version is faster than Postgres 10.23.
      (QPS for $version) / (QPS for Postgres 10.23)

      Notes on the charts

      • the y-axis starts at 0.9 to make it easier to see differences
      • for the first graph, the y-axis truncates one outlier (hot-points) from Postgres 17beta where the relative QPS is greater than 2.0 for Postgres 17beta. That is a great improvement
      Point queries, part 1
      • The y-axis ends at 1.2 and truncates the (great) outliers for hot-points where it is ~2.2X faster in Postgres 17beta
      Point queries, part 2
      Range queries, part 1
      • While the scan benchmark has some variance that I don't yet understand the trend is positive starting in Postgres 13.x and continuing through 17beta3
      Range queries, part 2
      Writes
      • Many of these are ~5% to ~10% faster in Postgres 17beta3 vs 16.x

      Monday, August 12, 2024

      Postgres 17beta3 vs sysbench on a medium server: looking good

      This has benchmark results for Postgres 17beta3 using sysbench and a medium server. By small, medium or large server I mean < 10 cores for small, 10 to 19 cores for medium, 20+ cores for large.

      A recent result for Postgres 17beta2 is here.

        tl;dr

        • 17beta3 looks good
        • Write microbenchmarks are much faster in 17beta1 and 17beta2 vs 16.3
        • There might be a regression in Postgres 16 for two of the update-only benchmarks. More work is in progress to explain this.
        • Read microbenchmarks have similar performance between 16.3, 17beta1 and 17beta2
        Builds, configuration and hardware

        I compiled Postgres versions 12.19, 12.20, 13.15, 13.16, 14.12, 14.13, 15.7, 15.8, 16.3, 16.4, 17beta1, 17beta2 and 17beta3 from source.

        The server is a c2d-highcpu-32 instance type on GCP (c2d high-CPU) with 32 vCPU, 64G RAM and SMT disabled so there are 16 cores. It uses Ubuntu 22.04 and storage is ext4 (data=writeback) using SW RAID 0 over 2 locally attached NVMe devices.

        The configuration files have the name conf.diff.cx10a_c24r64 and are in the pg* subdirectories here.

        Benchmark

        I used sysbench and my usage is explained here. There are 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by Postgres with 8 tables and 10M rows/table. There are 12 client threads, read-heavy microbenchmarks run for 300 seconds and write-heavy run for 600 seconds.

        The command lines for my helper script was:
        bash r.sh 8 10000000 300 600 md0 1 1 12

        Results

        For the results below I split the 42 microbenchmarks into 5 groups -- 2 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. The spreadsheet with all data is here.

        Values from iostat and vmstat divided by QPS are here. This can help to explain why something is faster or slower because it shows how much HW is used per request.

        The numbers in the spreadsheets are the relative QPS as below. When the relative QPS is > 1 then $version is faster than Postgres 12.19.
        (QPS for $version) / (QPS for Postgres 12.19)

        Notes on the charts

        • the y-axis starts at 0.8 to make it easier to see differences
        • for the first graph, the y-axis truncates one outlier (hot-points) from Postgres 17beta where the relative QPS is ~1.85. That is a great improvement
        Point queries, part 1
        • the value for hot-points with 17beta is an outlier at ~1.85 and truncated 
        • in most cases, there are no regressions and small improvements from 12.19 to 17beta

        Point queries, part 2

        • in most cases, there are no regressions and small improvements from 12.19 to 17beta
        • Range queries, part 1

          • in most cases, there are no regressions and small improvements from 12.19 to 17beta
          • for scan there were improvements in Postgres 13 that have not gone away

          Range queries, part 2

          • there are no regressions and some small improvements

          Writes

          • for Postgres 17beta there are no regressions and some small improvements
          • for Postgres 16 there might be small regressions in update-index and update-one. I have more tests in progress to figure that out. From the iostat and vmstat results the CPU overhead (cpu/o) and context switch rates (cs/o) increase from Postgres 15 to 16 and then drop in 17beta. See here for update-index and for update-one.


          Thursday, August 8, 2024

          MySQL regressions: delete vs InnoDB

          I started to look at CPU overheads in MyRocks and upstream InnoDB. While I am happy to file bugs for MyRocks as they are likely to be fixed, I am not sure how much energy I want to put into proper bug reports for upstream InnoDB. So I will just write blog posts about them for now.

          I created flamegraphs while running sysbench with cached databases (more likely to be CPU bound) and the problem here occurs on an 8-core PN53 where sysbench was run with 1 thread. Here I use perf record -e cycles to collect data for flamegraphs and then I focus on the percentage of samples in a given function (and its callees) as a proxy for CPU overhead.

          The numbers below are the percentage of perf samples from a function and its callees and this is a proxy for CPU time. Here I used cycles as the HW counter with perf but I have more tests in progress to get flamegraphs with other counters. The flamegraphs are here.

          I am curious about whether more data is written to the binlog per transaction with 8.0.37, but I have to repeat tests to get that data. While my helper scripts archive results from SHOW GLOBAL STATUS which provides the Binlog_bytes_written counter, that counter is only in MariaDB and Percona Server. For upstream MySQL I have to use this much more fun query to get the answer from the perf schema.
          select SUM_NUMBER_OF_BYTES_WRITE from file_summary_by_event_name where EVENT_NAME="wait/io/file/sql/binlog"
          Time is spent in background and foreground work. The background work are mostly purge and writeback. The foreground work is networking, SQL and binlog. At a high level the percentage of time in SQL + binlog is about 2X larger with MySQL 8.0.37 vs 5.6.51 from ~30% to ~60%. It will take time to understand why.

          5.6.51
          ---
          17.77   mysql_delete
             0.35   QUICK_RANGE_SELECT::reset
             1.69   SQL_SELECT::test_quick_select
             0.30   THD::binlog_query
             1.40   binlog_log_row
             4.52   handler::ha_delete_row
             0.60   lock_tables
             1.16   mysql_prepare_delete
             1.28   open_normal_and_derived_tables
             0.73   optimize_cond
             4.06   rr_quick

           9.82   ha_commit_trans
             5.61   MYSQL_BIN_LOG::commit
               1.83   MYSQL_BIN_LOG::process_commit_stage_queue
               0.86   MYSQL_BIN_LOG::process_flush_stage_queue
               2.26   my_b_flush_io_cache
           3.78   ha_prepare_low

          5.7.44
          ---
          36.79   Sql_cmd_delete::execute
             0.73   QUICK_RANGE_SELECT::reset
             1.58   Sql_cmd_delete::mysql_prepare_delete
             0.60   THD::binlog_query
             3.33   binlog_log_row
             9.59   handler::ha_delete_row
             1.47   lock_tables
             2.09   open_tables_for_query
             1.03   optimize_cond
             8.65   rr_quick
             3.45   test_quick_select

          18.04   ha_commit_trans
            16.28   MYSQL_BIN_LOG::commit
               3.81   MYSQL_BIN_LOG::flush_cache_to_file
               4.22   MYSQL_BIN_LOG::process_commit_stage_queue
               6.67   MYSQL_BIN_LOG::process_flush_stage_queue
                 1.70   MYSQL_BIN_LOG::flush_thread_caches
                 4.58   ha_flush_logs
             0.99   ha_prepare_low

          8.0.37
          ---
          37.60   Sql_cmd_dml::execute
            28.57   Sql_cmd_delete::delete_from_single_table
               6.80   IndexRangeScanIterator::Read
               0.83   THD::binlog_query
               6.89   binlog_log_row
               6.32   handler::ha_delete_row
               1.24   optimize_cond
               3.27   test_quick_select
             0.73   Sql_cmd_delete::check_privileges
             0.81   Sql_cmd_delete::restore_cmd_properties
             1.83   lock_tables
             4.04   open_tables_for_query

          20.26   ha_commit_trans
            17.41   MYSQL_BIN_LOG::commit
               0.75   MYSQL_BIN_LOG::change_stage
               3.09   MYSQL_BIN_LOG::flush_cache_to_file
               3.29   MYSQL_BIN_LOG::process_commit_stage_queue
               8.37   MYSQL_BIN_LOG::process_flush_stage_queue
                 5.42   MYSQL_BIN_LOG::fetch_and_process_flush_stage_queue
                 2.40   MYSQL_BIN_LOG::flush_thread_caches
             1.37   MYSQL_BIN_LOG::prepare

          Wednesday, August 7, 2024

          MySQL regressions: update-nonindex vs InnoDB

          I started to look at CPU overheads in MyRocks and upstream InnoDB. While I am happy to file bugs for MyRocks as they are likely to be fixed, I am not sure how much energy I want to put into proper bug reports for upstream InnoDB. So I will just write blog posts about them for now.

          I created flamegraphs while running sysbench with cached databases (more likely to be CPU bound) and the problem here occurs on an 8-core PN53 where sysbench was run with 1 thread. Here I use perf record -e cycles to collect data for flamegraphs and then I focus on the percentage of samples in a given function (and its callees) as a proxy for CPU overhead.

          The flamegraphs are here.

          The workload here is the update-nonindex microbenchmark. The throughput for a release relative to MySQL 5.6.51 is -> (QPS for $version) / (QPS for 5.6.51). The results below show that 8.0.37 gets about 62% of the QPS relative to 5.6.51.
          • 0.86 in 5.7.44
          • 0.79 in 8.0.11
          • 0.67 in 8.0.28
          • 0.62 in 8.0.37
          From the numbers above there will soon be a release that only gets 50% of the QPS relative to 5.6.51. From the flamegraphs there is time spent in purge, networking and processing update statements. I focus on the time spent processing update statements. The percentage of samples (a proxy for CPU time) in mysqld_stmt_execute and its callees is:
          • 32.77% in 5.6.51 from ~17% in mysql_update and ~12% from trans_commit_stmt
          • 65.89% in 5.7.44 from ~39% in Sql_cmd_update::execute and ~22% from trans_commit_stmt
          • 67.30% in 8.0.28 from ~38% in Sql_cmd_dml::execute and ~23% from trans_commit_stmt
          • 67.97% in 8.0.37 from ~40% in Sql_cmd_dml::execute and ~21% from trans_commit_stmt

          MySQL regressions: skip_concurrency_ticket

          I started to look at CPU overheads in MyRocks and upstream InnoDB. While I am happy to file bugs for MyRocks as they are likely to be fixed, I am not sure how much energy I want to put into proper bug reports for upstream InnoDB. So I will just write blog posts about them for now.

          I created flamegraphs while running sysbench with cached databases (more likely to be CPU bound) and the problem here occurs on an 8-core PN53 where sysbench was run with 1 thread. Here I use perf record -e cycles to collect data for flamegraphs and then I focus on the percentage of samples in a given function (and its callees) as a proxy for CPU overhead.

          The problem here is that during the scan benchymark the skip_concurrency_ticket function accounts for ~3% of CPU in 8.0.37, half that in 5.7.44 and the function doesn't exist in 5.6.51. It is called from innobase_srv_conc_enter_innodb which was a bit simpler in 5.6.

          The flamegraphs (*.svg files) are here.

          Also visible in those flamegraphs, the percentage of samples (CPU overhead prox) accounted for by row_sel_store_mysql_rec and callees
          • 27.20% in 5.6.51
          • 25.46% in 5.7.44
          • 31.17% in 8.0.28
          • 34.99% in 8.0.37

          MySQL regressions: binlog_log_row

          I started to look at CPU overheads in MyRocks and upstream InnoDB. While I am happy to file bugs for MyRocks as they are likely to be fixed, I am not sure how much energy I want to put into proper bug reports for upstream InnoDB. So I will just write blog posts about them for now.

          I created flamegraphs while running sysbench with cached databases (more likely to be CPU bound) and the problem here occurs on an 8-core PN53 where sysbench was run with 1 thread. For the insert microbenchmark the insert rate ...

          • in MySQL 5.7.44 is 82% of the rate in MySQL 5.6.51
          • in MySQL 8.0.11 is 72% of the rate in MySQL 5.6.51
          • in MySQL 8.0.37 is 57% of the rate in MySQL 5.6.51 
          Here I use perf record -e cycles to collect data for flamegraphs and then I focus on the percentage of samples in a given function (and its callees) as a proxy for CPU overhead.

          The percentage of samples that binlog_log_row and its children account for is ...

          • 3.04% in MySQL 5.6.51
          • 4.28% in MySQL 5.7.44
          • 10.29% in MySQL 8.0.37
          I don't know what happened to make this so much more expensive in 8.0. The numbers below are the percentage samples in the function and the functions it calls. The flamegraph (*.svg) files are here.

          For MySQL 5.6.51 with InnoDB

          3.04  binlog_log_row
            2.23  THD::binlog_write_table_map
              0.41  Table_map_log_event::Table_map_log_event
              0.52  binlog_cache_data::write_event
              0.92  binlog_start_trans_and_stmt
            0.59  Write_rows_log_event::binlog_row_logging_function
              0.51 THD::binlog_write_row

          For MySQL 5.7.44 with InnoDB

          4.28  binlog_log_row
            3.13  THD::binlog_write_table_map
            0.82  Write_rows_log_event::binlog_row_logging_function

          For MySQL 8.0.37 with InnoDB

          10.29  binlog_log_row
            5.79  THD::binlog_write_table_map
              2.53  Table_map_log_event::Table_map_log_event
              0.60  binlog_cache_data::write_event
              1.74  binlog_start_trans_and_stmt
            2.97  Write_rows_log_event::binlog_row_logging_function
              2.87 THD::binlog_write_row
            0.49  add_pke

          Evaluating vector indexes in MariaDB and pgvector: part 2

          This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...