
Showing posts from June, 2023

Insert+delete benchmark, medium server and MyRocks

This has results for the new insert benchmark (with deletes enabled) for MyRocks on a medium server. I ended up repeating the benchmark 3 times (round 2 after realizing I needed to log the response time for potentially slow queries, round 3 because I needed to use a new MyRocks build). Results for the benchmark on a small server are here and here. tl;dr: MyRocks in 8.0.28 has better perf than in 5.6.35 for most of the benchmark steps. This wasn't true on the small server. I suspect that one reason for the change is that a server CPU was used here while the small server uses a mobile CPU -- the Beelink uses AMD Ryzen 7 4700u and /proc/cpuinfo from the c2 server shows Intel(R) Xeon(R) CPU @ 3.10GHz. There is a ~5 second write stall for the l.i1 benchmark step in one of the configurations; I have more work to do to explain it. Benchmarks: The medium server is c2-standard-30 from GCP with 15 cores, hyperthreads disabled, 120G of RAM, and 1.5T of XFS via SW RAID 0 over 4 local NVMe de

Universal Compaction in RocksDB and me

I almost always use leveled compaction with RocksDB and rarely use universal. It is useful for many workloads but will take more time for users to get working. One overview of RocksDB compaction is here and details on universal are here. A comparison of RocksDB universal with other implementations of tiered is here. If you are a pessimist this is a rant; if you are an optimist this is a short list of work that can be done to improve universal compaction. What RocksDB calls universal compaction is called tiered compaction by every other LSM. I wish RocksDB hadn't created a new name as that is a source of confusion. Updates: I added the Trivial Move section with stats to show write amp during fillseq. How it began: Universal started as a clever hack that reused much code from leveled. The clever hack was to keep all SSTs in the L0. Alas, this has a bad side-effect. Bloom filters and block indexes are per-SST and not paged (paged == split into ~4kb pieces) when cached. When the SST is huge th

Tuning Postgres and the new insert benchmark, round 4

After looking at the results from round 3 I reran the insert+delete benchmark for a new configuration that combines the best changes identified in the previous tests. The goal is to find a configuration that improves throughput and reduces variance. In this post I explain the impact of a new configuration (a18) that improves on the previous best configuration (a16). tl;dr: The new config (a18) improves on the previous best one (a16). There is still too much variance in the l.i1 benchmark step. Configurations: The new configuration is a18 and I use two variations of it: one with shared_buffers=70GB for the cached by Postgres and IO-bound workloads, and one with shared_buffers=4GB to simulate fast storage (reads are done from the OS page cache). It starts with the base config and then combines the changes from the a1, a10, a12 and a16 configs. The changes are: wal_compression=lz4, autovacuum_work_mem=256MB, wal_buffers=64MB, autovacuum_scale_factor=0.05 and autovacuum_insert_sc
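A minimal sketch of the a18 deltas follows, expressed with ALTER SYSTEM rather than the config files I actually publish; the full GUC names for the scale factors are assumed to be autovacuum_vacuum_scale_factor and autovacuum_vacuum_insert_scale_factor, and the exact a18 contents are in the linked config files.

    -- sketch of the a18 changes; shared_buffers and wal_buffers need a restart,
    -- the rest are picked up on reload
    ALTER SYSTEM SET shared_buffers = '70GB';   -- or '4GB' for the cached-by-OS variant
    ALTER SYSTEM SET wal_compression = 'lz4';
    ALTER SYSTEM SET wal_buffers = '64MB';
    ALTER SYSTEM SET autovacuum_work_mem = '256MB';
    ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;
    ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = 0.05;
    SELECT pg_reload_conf();                    -- applies the reloadable settings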

Tuning Postgres and the new insert benchmark on a small server

This has results for the updated insert benchmark with Postgres 15.2 on a small server. The goal is to search for a better configuration. Previous posts have results from a medium server: post 1, post 2, post 3. My focus is on the l.i1 benchmark step that is write-heavy -- inserts & deletes with a PK index and 3 secondary indexes in place. And the goal is to find a configuration that minimizes write stalls and response time variance. tl;dr: The 1-client tests are more sensitive to config options than the 4-client tests, at least for the l.i1 benchmark step. Most of the configs get about 10k inserts/s on l.i1 for the 1-client tests, except for the a16 and a18 configs which get >= 17k/s. But on the 4-client tests all of the configs get ~22k inserts/s for cached by Postgres and ~14k inserts/s for cached by OS. The largest improvement to the base config is from the a16 config that tunes autovacuum*scale_factor (see here). This is the second big improvement from tuning autovaccu

Tuning InnoDB for the insert benchmark on a medium server

This explains my attempt to improve the my.cnf settings that I use for InnoDB with the insert benchmark. Here I use the updated insert benchmark with deletes per insert to avoid growing tables. For MySQL 5.6, 5.7 & 8.0 I tested 3, 4 & 11 changes to the my.cnf options I have been using. tl;dr: In some benchmarks setting innodb_use_native_aio=OFF improves query performance while hurting insert performance. I assume the problem is that IO queues get overloaded with aio=ON. Many concurrent IO requests are OK for background activity (page writeback) but bad for foreground activity (queries). In some cases I would accept this trade-off and use aio=OFF, but I hope InnoDB adds a limit to concurrent IO requests with aio=ON. Other than the a1 config (innodb_use_native_aio=OFF) none of the configs consistently helps performance. The base config I have been using is good enough for now. While I don't focus on comparisons to other DBMS here, MySQL QPS on the read+write benchmark steps suffers a l
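For reference, a quick way to confirm which mode a server is running: innodb_use_native_aio is read-only at runtime, so the a1 config has to set it in my.cnf before startup and the value can only be inspected afterwards. This sketch also lists the stock knobs that bound background IO, since they are the closest existing limits on concurrent requests.

    -- is native AIO in use? (cannot be changed without a restart)
    SHOW GLOBAL VARIABLES LIKE 'innodb_use_native_aio';
    -- existing knobs that bound background IO concurrency and rate
    SHOW GLOBAL VARIABLES LIKE 'innodb_%io_threads';
    SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';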

Insert+delete benchmark: small server, MyRocks and IO-bound

This has results for an IO-bound, low-concurrency insert benchmark on a small server. The previous report for a CPU-bound workload is here. This used the --delete_per_insert option so that the write-heavy steps ran for a long time while the working set remained in memory. This work was delayed because I had to figure out the memory demand for create index to avoid OOM. tl;dr: 5.6.35 has better perf than 8.0.28 courtesy of new CPU overheads in upstream MySQL 8, and with 8.0.28 the throughput is up to 22% less than with 5.6.35. Variance is visible, but not horrible. Benchmark: See the previous report. MyRocks used the cy10a_bee config for 5.6.35 and for 8.0.28. The insert benchmark was repeated for 1 client and 4 clients. For 1 client the l.i0 benchmark step loaded 800M rows into 1 table. For 4 clients it loaded 200M rows per table and there were 4 tables. In both cases: the l.i1 benchmark step did 50M inserts matched by 50M deletes; the q100, q500 and q1000 benchmark steps each ran for
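A minimal sketch of the delete-per-insert pattern the write-heavy steps follow, using a hypothetical table named pi1 with made-up columns (the real benchmark client issues these from its own driver): each insert of new rows is matched by a delete of the same number of the oldest rows, so the table does not grow.

    -- hypothetical table, only to illustrate the pattern
    CREATE TABLE pi1 (
      transactionid BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      price FLOAT,
      customerid INT,
      INDEX (price),
      INDEX (customerid)
    ) ENGINE=ROCKSDB;

    -- one l.i1-style iteration: insert new rows, then delete as many of the oldest rows
    INSERT INTO pi1 (price, customerid) VALUES (1.5, 42), (2.5, 43);
    DELETE FROM pi1 ORDER BY transactionid ASC LIMIT 2;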

Create index, MyRocks and memory

I got OOM during index create for a few MyRocks benchmarks and I am somewhat ignorant about how MyRocks handles create index. This occurred on a server that has 16G of RAM, the RocksDB block cache was 10G and mysqld RSS was ~11G, so there wasn't much spare memory. I am still somewhat ignorant but will share what I learned. By OOM I mean that the Linux OOM killer decided to kill mysqld when the memory demand exceeded the memory supply. There is one approach for creating a secondary index: bulk. tl;dr: Plan for the increase to mysqld RSS as a function of rocksdb_merge_combine_read_size. Bulk approach: MyRocks can optimize bulk loading and those features have been extended to support secondary index creation. Documentation on bulk loading is here. My request to improve the docs is here. My possibly correct description of how this works: the PK is scanned to extract the columns needed for the secondary index and stored in memory; when a memory limit is reached the columns are sorted and writt
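A sketch of how one might bound the memory demand, assuming rocksdb_merge_combine_read_size can be set per session (check the MyRocks docs for its scope and default); the table is the hypothetical pi1 from the earlier sketch and the size is a made-up example.

    -- cap the merge/combine buffer before creating the secondary index;
    -- lowering it should reduce how much mysqld RSS grows during the sort/merge phase
    SET SESSION rocksdb_merge_combine_read_size = 256 * 1024 * 1024;  -- 256MB, hypothetical value
    ALTER TABLE pi1 ADD INDEX ix_customerid (customerid);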

Tuning Postgres and the new insert benchmark, round 3

This has results for the insert benchmark on a medium server when Postgres caches the database. The previous post is for a similar setup except the database was cached by the OS but not by Postgres. tl;dr: The base config I have been using is good once I tuned autovacuum to be more aggressive. The big improvement is from adding autovacuum_scale_factor=0.05 and autovacuum_insert_scale_factor=0.05. Adding wal_compression=lz4 has a small benefit. Several of the changes made performance much worse. Increasing autovacuum_work_mem didn't help here but will probably help on IO-bound tests. Benchmark: This continues to use Postgres 15.3, a c2-standard-30 server with 15 cores and the insert benchmark with delete_per_insert enabled to avoid growing the table so that write-heavy benchmark steps can run for a long time while keeping the working set in memory. The configurations all use shared_buffers=70G and the database is cached by Postgres. All of the config files are here. The base config f
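To judge whether autovacuum became aggressive enough, one option is to watch per-table vacuum activity between benchmark steps. A minimal sketch against the standard pg_stat_user_tables view; the table name is hypothetical.

    -- how often autovacuum has run and how much insert/dead-tuple debt remains
    SELECT relname, n_dead_tup, n_ins_since_vacuum, last_autovacuum, autovacuum_count
    FROM pg_stat_user_tables
    WHERE relname = 'pi1';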