Tuesday, September 26, 2023

Trying out OrioleDB

I am curious about OrioleDB given my unsuccessful attempts to resolve some Postgres perf issues for the Insert Benchmark via tuning. 

The first step is to compile and install their Postgres fork from source.

  git clone https://github.com/orioledb/postgres.git oriole_pg
  cd oriole_pg
  git checkout patches16_18

  bash mk.pg.o3_native_lto ~/d/pg16oriole
  make install
  cd contrib/pgstattuple
  make install
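
The mk.pg.o3_native_lto wrapper is not shown here. Based on the o3_native_lto naming used throughout these posts (-O3 -march=native -mtune=native -flto), an equivalent manual build would be roughly the following; the wrapper's exact flags are an assumption.

  # roughly what the wrapper does (an assumption), with make install done separately as above
  ./configure --prefix=$HOME/d/pg16oriole CFLAGS="-O3 -march=native -mtune=native -flto"
  make -j$(nproc)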
  
The next step is to compile and install the OrioleDB extension from source. The build has to find pg_config, so I updated my PATH for that.
  PATH=$PATH:~/d/pg16oriole
  make USE_PGXS=1
  make USE_PGXS=1 install
  # installcheck requires testgres
  pip3 install testgres
  make USE_PGXS=1 installcheck

There is a test failure during make installcheck (see issue 275) but that doesn't matter to me today. I am not trying to use this in production.

The next step is to modify my Postgres init scripts so I can use OrioleDB. I lack many basic PG admin skills and stumbled my way through the locales issue. I hope OrioleDB updates their README to explain that.

  cd ~/d/pg16oriole

  # Add this to the Postgres configuration file
  # shared_preload_libraries = 'orioledb.so'

  # Add this to my init scripts. 
  #    First make the test database use the C locale because OrioleDB
  #        tables must use ICU, C or POSIX locales.
  #    For this to work I must use template0.
  bin/createdb ib -l "C" -T template0

  # Add this to my init scripts.
  bin/psql ib -c 'create extension orioledb'

  # Finally initialize Postgres
  bash ${insert-init-script-here}
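
The init script itself is elided above. End to end, the setup amounts to something like the sketch below; the data directory name, the log file and the way the config edit is applied are assumptions, not my actual script.

  # a sketch of the full initialization, not the real init script
  bin/initdb -D data
  echo "shared_preload_libraries = 'orioledb.so'" >> data/postgresql.conf
  bin/pg_ctl -D data -l logfile start
  bin/createdb ib -l "C" -T template0
  bin/psql ib -c 'create extension orioledb'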

The final step is to modify my benchmark clients to create OrioleDB tables, which is done by appending "using orioledb" to the end of the CREATE TABLE statements.
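
The Insert Benchmark schema is not shown here, so this hypothetical table just illustrates the change:

  # the table name and columns are made up; only the trailing "using orioledb" matters
  bin/psql ib -c "create table t1 (id bigint primary key, k bigint, c text) using orioledb"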

Configuration

The next step is to tune some of the OrioleDB options based on the advice here.

I am using this on a small (8 cores, 16G RAM) server.
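
I won't copy that advice here, but the changes end up in postgresql.conf. The option names below are how I remember them from the OrioleDB docs and the values are rough guesses for a 16G server, so treat all of this as an assumption rather than a tuned config.

  # sketch only -- names and values are assumptions
  # orioledb.main_buffers = 4GB
  # orioledb.undo_buffers = 256MB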

Checking MyRocks 5.6 for regressions with the Insert Benchmark and a large server, revisited

I previously shared results for MyRocks 5.6 and claimed there was a perf regression in the Insert Benchmark. I then shared a follow-up post as I searched for the source of the regression. Those results were bogus: there is no regression, and this post explains what happened.

The mistake I made is that all builds used FDO (feedback driven optimization), and while the builds tested span ~18 months (from early 2022 until mid 2023), they all used the same profile input for FDO. While I don't know much about FDO, I assume that reusing a stale profile is a bad idea. So I recompiled everything with FDO disabled and repeated the tests.
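
To make the stale-profile problem concrete, here is the generic GCC flow for FDO; FB MySQL's build system is different and is not shown in these posts, and app.c is a made-up example.

  # step 1: build an instrumented binary
  gcc -O2 -fprofile-generate -o app app.c
  # step 2: run a representative workload to write profile (.gcda) files
  ./app --typical-workload
  # step 3: rebuild using the collected profile
  gcc -O2 -fprofile-use -o app app.c
  # reusing old .gcda files after the source has changed means the profile
  # no longer matches the code, which is the mistake described above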

The results here use MyRocks builds from Feb 2022 to June 2023 which use RocksDB versions from 6.28 to 8.6. The goal is to determine whether there are perf regressions from the old to new versions. While MyRocks 5.6 did not change much in that time, RocksDB has changed significantly. So this is mostly a search for perf regressions in RocksDB.

tl;dr

  • There are no perf regressions for MyRocks from RocksDB 6.28.2 to 8.6.3
  • Throughput for the initial load (l.i0 benchmark step) is ~10% better starting with RocksDB 7.10.0

Builds

All builds used the Release build type and LTO (link time optimization) but disabled FDO (feedback driven optimization).

The builds are:

  • fbmy5635_20220307_6282_fdo0_lto1 - FB MySQL 5.6.35 at git hash e7d976ee (7 Mar 2022 tag) with RocksDB 6.28.2
  • fbmy5635_20220519_722_fdo0_lto1 - FB MySQL 5.6.35 at git hash d503bd77 (19 May 2022 tag) with RocksDB 7.2.2
  • fbmy5635_20220809_731_fdo0_lto1 - FB MySQL 5.6.35 at git hash 877a0e58 (9 Aug 2022 tag) with RocksDB 7.3.1
  • fbmy5635_20221011_731_fdo0_lto1 - FB MySQL 5.6.35 at git hash c691c716 (11 Oct 2022 tag) with RocksDB 7.3.1
  • fbmy5635_20230216_7100_fdo0_lto1 - FB MySQL 5.6.35 at git hash 21a2b0aa (16 Feb 2023 tag) with RocksDB 7.10.0
  • fbmy5635_20230412_7102_fdo0_lto1 - FB MySQL 5.6.35 at git hash 205c31dd (12 Apr 2023 tag) with RocksDB 7.10.2
  • fbmy5635_20230529_821_fdo0_lto1 - FB MySQL 5.6.35 at git hash b739eac1 (29 May 2023 tag) with RocksDB 8.2.1
  • fbmy5635_20230628_821_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) with RocksDB 8.2.1
  • fbmy5635_20230628_833_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.3.3
  • fbmy5635_20230628_844_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.4.4
  • fbmy5635_20230628_853_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.5.3
  • fbmy5635_20230628_863_fdo0_lto1 - FB MySQL 5.6.35 at git hash 7e40af67 (28 Jun 2023 tag) upgraded to RocksDB 8.6.3

The c5 configuration file (my.cnf) was used.

Benchmark

The Insert Benchmark was run in two setups:

  • cached by RocksDB - all tables fit in the RocksDB block cache
  • IO-bound - the database is larger than memory

The server has 80 HW threads, 40 cores, 256G of RAM and fast NVMe storage with XFS.

The benchmark is run with 24 clients, 24 tables and a client per table. The benchmark is a sequence of steps.

  • l.i0
    • insert X million rows across all tables without secondary indexes where X is 20 for cached and 500 for IO-bound
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 50 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail.
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 3600 seconds.

Results

Performance reports are here for Cached by RocksDB and for IO-bound.

From the Summary sections for cached and for IO-bound there are tables for absolute and relative QPS per benchmark step. The relative QPS is: (QPS for me) / (QPS for fbmy5635_20220307_6282_fdo0_lto1).
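For example, if a newer build were to get 33,000 inserts/s on a benchmark step while the base build gets 30,000, the relative QPS would be 33000 / 30000 = 1.10; values above 1.0 indicate improvements and values below 1.0 indicate regressions.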

For the cached benchmark steps, the QPS for the l.i0 benchmark step (initial load) in modern MyRocks is up to 10% better than the base case. Otherwise there are no regressions.

For the IO-bound benchmark steps, the QPS for the l.i0 benchmark step (initial load) in modern MyRocks is up to 10% better than the base case. Most other benchmark steps get 1% or 2% more QPS in modern MyRocks than in the base case.


Monday, September 25, 2023

Checking Postgres for perf regressions from 11.21 to 16.0 with sysbench and a small server

This has results for sysbench, a small server and Postgres 11, 12, 13, 14, 15 and 16. The goal is to see if there are perf regressions over time. This is a common problem: as code gets added over time, the cumulative impact is that more CPU is used to do the same amount of work. MySQL is an example where there are significant regressions over time.

Context - the workload here is a cached database with 1 client. I am searching for CPU regressions. I will repeat this on a medium server with more concurrency (8 clients).

tl;dr

  • Postgres avoids perf regressions over time. This is starting to get boring.
  • In many cases Postgres 16.0 is ~5% faster than 11.21
  • For full scans Postgres 16.0 is ~20% faster than 11.21
  • Postgres 16.0 is also ~20% faster than 11.21 on several of the update-only benchmark steps
  • See the results for range query, part 2 below. Postgres 16.0 does better than 11.21 for aggregation + short range scan, but worse than 11.21 for aggregation + long range scan. This benchmark step is a read-only variant of the traditional sysbench workload. This might be worth investigating. I am not sure whether it is a regression, but it might be one.

Builds

I compiled Postgres 11.21, 12.16, 13.12, 14.9, 15.4 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table with 20M rows. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 1 20000000 600 1200 nvme0n1 1 1 1
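
The wrapper is not documented here, so the arguments map roughly as shown below, inferred from this post and the medium-server post; the two arguments marked "?" are not explained in either post.

  # bash r.sh <tables> <rows-per-table> <read-test-secs> <other-test-secs> <storage-device> <?> <?> <clients>
  bash r.sh 1 20000000 600 1200 nvme0n1 1 1 1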

The benchmark used the a2 config and I commented out the usage of autovacuum_vacuum_insert_scale_factor for Postgres 11.21 and 12.16 as they do not support that option.

Results

A spreadsheet with all results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 11.21). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that do range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.9 rather than 0 to make it easier to see the changes.

The results below for range queries, part 2 have an interesting pattern. There are two sets of results: the first 3 groups of bars from the left are from tests run before subjecting Postgres to random writes, and the other 3 groups of bars are the same tests run after subjecting Postgres to random writes. All of the tests do range scans with some aggregation -- these are read-only variants of the traditional sysbench workload. The difference is the length of the range scans:

  • Groups 1 and 4 scan 10,000 rows
  • Groups 2 and 5 scan 1,000 rows
  • Groups 3 and 6 scan 10 rows

Postgres 16.0 looks best with groups 3 and 6 (scan only 10 rows) and worst with groups 1 and 4 (scan 10,000 rows).

Variance in peak RSS with jemalloc 5.2.1

Peak RSS for jemalloc 5.2.1 shows a lot of variance with the Insert Benchmark and MyRocks. The variance is a function of how you build and configure jemalloc. The worst case (largest peak RSS) is the jemalloc 5.2.1 provided by Ubuntu 22.04, and I have yet to figure out how to reproduce that result using jemalloc 5.2.1 compiled from source.

I previously shared results to show that jemalloc and tcmalloc are better than glibc malloc for RocksDB. That was followed by a post that shows the peak RSS with different jemalloc versions. This post has additional results for jemalloc 5.2.1 using different jemalloc config options. 

tl;dr

  • Peak RSS has large spikes with jemalloc 4.4 and 4.5, and smaller ones with 5.0 and 5.1. Tobin Baker suggested these might be from changes to the usage of MADV_FREE and MADV_DONTNEED. The spikes start to show during the l.i1 benchmark step and then are obvious during q100, q500 and q1000.
  • For tests that use Hyper Clock cache there is a large peak RSS with Ubuntu-provided jemalloc 5.2.1 that is obvious during the l.x and l.i1 benchmark steps. I can't reproduce this using jemalloc 5.2.1 compiled from source despite my attempts to match the configuration.
  • Benchmark throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3)

Builds

My previous post explains the benchmark and HW. 

To get the jemalloc config details I added malloc-conf="stats_print:true" to my.cnf, which causes the stats and config details to be written to the MySQL error log on shutdown.

I compiled many versions of jemalloc from source: 4.0.4, 4.1.1, 4.2.1, 4.3.1, 4.4.0, 4.5.0, 5.0.1, 5.1.0, 5.2.0, 5.2.1 and 5.3.0. All of these used the default jemalloc config and, while it isn't listed in the output below, the default value for background_thread is false.

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: false
  config.prof_libgcc: false
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false
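
The per-version builds follow the standard jemalloc autotools flow. The exact steps are not shown in this post, so the following is just a sketch, with 5.2.1 as the example tag.

  # a standard jemalloc source build with the default config (sketch)
  git checkout 5.2.1
  ./autogen.sh
  ./configure
  make -j$(nproc)
  sudo make install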

The config for the Ubuntu-provided 5.2.1 is below. This is also the config used by je-5.2.1.prof (see below), and it also gets background_thread=false. It differs from the default config above in two ways:

  • uses config.prof: true
  • uses config.prof_libgcc: true

  config.cache_oblivious: true
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: ""
  config.opt_safety_checks: false
  config.prof: true
  config.prof_libgcc: true
  config.prof_libunwind: false
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Finally, I tried one more config when compiling from source to match the config that is used at work. I get that via: 

configure --disable-cache-oblivious --enable-opt-safety-checks --enable-prof --disable-prof-libgcc --enable-prof-libunwind --with-malloc-conf="background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"

With that, the option values are as follows and the background thread is enabled. The build that uses this config is named je-5.2.1.prod below.

  config.cache_oblivious: false
  config.debug: false
  config.fill: true
  config.lazy_lock: false
  config.malloc_conf: "background_thread:true,metadata_thp:auto,abort_conf:true,muzzy_decay_ms:0"
  config.opt_safety_checks: true
  config.prof: true
  config.prof_libgcc: false
  config.prof_libunwind: true
  config.stats: true
  config.utrace: false
  config.xmalloc: false

Now I have results for variants of jemalloc 5.2.1 and the names here match the names I used on the spreadsheets that show peak RSS.

  • je-5.2.1.ub - Ubuntu-provided 5.2.1
  • je-5.2.1 - compiled from source with default options
  • je-5.2.1.prof - compiled from source with configure --enable-prof --enable-prof-libgcc to get a config that matches je-5.2.1.ub
  • je-5.2.1.prod - 5.2.1 compiled from source using the configure command line shown above

Benchmarks

I ran the Insert Benchmark using a 60G RocksDB block cache. The benchmark was repeated twice -- once using the (older) LRU block cache, once using the (newer) Hyper Clock cache.

The benchmark was run in the IO-bound setup and the database is larger than memory. The benchmark used a c2-standard-30 server from GCP with Ubuntu 22.04, 15 cores, hyperthreads disabled, 120G of RAM and 1.5T of storage from RAID 0 over 4 local NVMe devices with XFS.

The benchmark is run with 8 clients and 8 tables (client per table). The benchmark is a sequence of steps and the peak RSS problem is worst for the l.x benchmark step that creates indexes and allocates a lot of memory while doing so:

  • l.i0
    • insert 500 million rows per table
  • l.x
    • create 3 secondary indexes. I usually ignore performance from this step.
  • l.i1
    • insert and delete another 100 million rows per table with secondary index maintenance. The number of rows/table at the end of the benchmark step matches the number at the start with inserts done to the table head and the deletes done from the tail. 
  • q100, q500, q1000
    • do queries as fast as possible with 100, 500 and 1000 inserts/s/client and the same rate for deletes/s done in the background. Run for 1800 seconds.

Configurations

The benchmark was run with 2 my.cnf files: c5 and c7 edited to use a 40G RocksDB block cache. The difference between them is that c5 uses the LRU block cache (older code) while c7 uses the Hyper Clock cache.

Results: perf reports

My standard perf reports are here for both types of block caches: LRU and Hyper Clock.

  • Throughput is generally improving over time from old jemalloc (4.0) to modern jemalloc (5.3). See the tables with absolute and relative throughput in the Summary section for LRU and for Hyper Clock.
  • HW performance metrics are mostly similar regardless of the peak RSS spikes. See the tables for LRU and for Hyper Clock. The interesting columns include: cpupq has CPU per operation, cpups has the average value for vmstat's us + sy, csps has the average value for vmstat's cs and cspq has context switches per operation.

So the good news is that tests here don't find performance regressions, although the more interesting test would be on larger HW with more concurrency.

Results: peak RSS

I measured the peak RSS during each benchmark step. The spreadsheet is here.
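
The measurement method is not described above. One way to sample peak RSS for mysqld is to read VmHWM from /proc at the end of each benchmark step; this is a sketch of one approach, not necessarily the method used for these results.

  # VmHWM is the kernel's record of the peak resident set size for the process
  grep VmHWM /proc/$(pidof mysqld)/status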

Summary

  • The larger values for jemalloc 4.4, 4.5, 5.0 and 5.1 might be from changes in how MADV_FREE and MADV_DONTNEED were used.
  • The peak RSS is larger for je-5.2.1.ub during the l.x and l.i1 benchmark steps. I have been unable to reproduce that with jemalloc compiled from source despite matching the configuration.

Saturday, September 23, 2023

Postgres 16.0, sysbench and a medium server

This has results for Postgres 16.0 vs sysbench on a medium server and the news continues to be good. I previously shared results for this on a small server.

tl;dr

  • Postgres 16.0 is faster than 15.2
  • The improvement for reads is larger than for writes

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used a c2-standard-30 server from GCP with 15 cores, hyperthreads disabled, 120G of RAM, Ubuntu 22.04 and 1.5TB of NVMe SSD with XFS (SW RAID 0 over 4 local devices).

The benchmark is run with 8 clients and 4 tables with 20M rows per table. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 4 20000000 600 1200 md127 1 1 8

The benchmark used the x7a28 config.

Results

A spreadsheet with the results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 15.2). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that do range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.8 rather than 0 to make it easier to see the changes. In a few cases the y-axis range excludes outliers, values too big or too small. But everything can be read on the spreadsheet.

  • point queries, part 1 - Postgres 16.0 is faster than 15.2
  • point queries, part 2 - Postgres 16.0 is faster than 15.2
  • range queries, part 1 - Postgres 16.0 is faster than 15.2
  • range queries, part 2 - Postgres 16.0 is faster than 15.2
  • writes - Postgres 16.0 is faster than 15.2, but the improvement here is smaller than for reads

Postgres 16.0, sysbench and a small server

I recently shared results for Postgres 16rc1 vs sysbench on a small server. Now I have results for Postgres 16.0 and the news continues to be good.

tl;dr

  • insert QPS might have dropped by ~5%, otherwise there are no regressions for writes
  • for everything else QPS in Postgres 16.0 is within ~2% of 15.2, which is a great result

Builds

I compiled Postgres 15.2, 15.3, 15.4, 16 beta1, 16 beta2, 16 beta3, 16 rc1 and 16.0 from source. The builds are named o3_native_lto which is shorthand for using: -O3 -march=native -mtune=native -flto.

Benchmark

I used sysbench and my usage is explained here. Postgres was configured to cache all tables.

This benchmark used the Beelink server explained here that has 8 cores, 16G RAM and 1TB of NVMe SSD with XFS and Ubuntu 22.04. 

The benchmark is run with 1 client and 1 table with 20M rows. The read-only tests ran for 600 seconds each and the other tests ran for 1200 seconds each. The command line for my wrapper scripts is:

bash r.sh 1 20000000 600 1200 nvme0n1 1 1 1

The benchmark used the a1 config.

Results

A spreadsheet with the results is here. It has two sheets: one with absolute QPS for each version tested, the other with the relative QPS per version. The relative QPS is: (QPS for me) / (QPS for 15.2). With relative QPS it is easy to quantify improvements and regressions.

There are ~42 tests and each can be called a benchmark step or microbenchmark. I will call them benchmark steps. These are put into one of 5 groups based on the workload:

  • point query, part 1 - point queries that don't use the random-points benchmark step
  • point query, part 2 - point query variants that use the random-points benchmark step
  • range query, part 1 - range queries without aggregation
  • range query, part 2 - read-only variants of the original sysbench workload that do range queries with aggregation
  • writes - benchmark steps that do insert, update and delete

The y-axis starts at 0.8 rather than 0 to make it easier to see the changes.

  • point queries, part 1 - there are no regressions. There is too much variance for the second and third benchmark steps from the left -- point-query.pre_range=100 and point-query_range=100. I have yet to explain the cause.
  • point queries, part 2 - there are no regressions
  • range queries, part 1 - QPS might have dropped by ~2%, or it might just be noise
  • range queries, part 2 - there are no regressions
  • writes - insert QPS might have dropped by ~5%, otherwise there are no regressions