Showing posts from 2021

Summarizing the different implementations of tiered compaction

This is my first attempt to summarize tiered compaction as implemented by RocksDB, Cassandra, ScyllaDB and HBase. I have limited experience with tiered compaction (that will change) and limited knowledge of Cassandra, ScyllaDB and HBase (that won't change soon) so I will make mistakes but enjoy the learning process. To keep this brief I ignore the variants optimized for time-series workloads and I don't explain compaction basics. Different DBMS use different names for tiered compaction: RocksDB has universal compaction; HBase has ExploringCompactionPolicy and RatioBasedCompactionPolicy, see here; Cassandra has STCS (Size Tiered Compaction Strategy); ScyllaDB has STCS and ICS (Incremental Compaction Strategy), see here. Context: the workload matters a lot when considering compaction algorithms. There are many workload patterns but I list only three below. Note that many real workloads are a mixture of patterns, an example is some overwrites, some inserts. Overwrite-mostly - all of my time in LSM-land has b…
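
As a rough illustration of the size-tiered idea, here is a minimal sketch that groups SSTs into buckets of similar size and picks a bucket with enough members to merge. This is not the actual selection logic of RocksDB, Cassandra, ScyllaDB or HBase; bucket_ratio and min_threshold are hypothetical stand-ins for the knobs those engines expose.

```python
# Minimal sketch of size-tiered compaction selection, assuming SSTs are
# described only by their size in bytes. Not the logic of any real engine.
def pick_compaction(sst_sizes, bucket_ratio=1.5, min_threshold=4):
    """Group SSTs into buckets of similar size, then return the first
    bucket with at least min_threshold members to merge into one SST."""
    buckets = []
    for size in sorted(sst_sizes):
        # Join an existing bucket if this SST is within bucket_ratio of
        # the bucket's average size, else start a new bucket.
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if size <= avg * bucket_ratio:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        if len(bucket) >= min_threshold:
            return bucket  # merge these similarly sized SSTs
    return None  # nothing to do

print(pick_compaction([10, 11, 12, 13, 100, 400]))  # -> [10, 11, 12, 13]
```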

Sysbench: MySQL 5.6, 5.7 & 8.0 on a small server

This has results for sysbench with MySQL versions 5.6.49, 5.7.35 and 8.0.2x using the same setup as previously used for the insert benchmark. The versions for 8.0.2x are 8.0.20, 8.0.22, 8.0.23, 8.0.26 and 8.0.27. Executive summary: For CPU-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 39 of the 42 microbenchmarks. For the microbenchmarks where it was slower, it gets on average 72% of the throughput of 5.6.49. For IO-bound setups, MySQL 8.0.27 is slower than 5.6.49 on 27 of the 42 microbenchmarks, faster on 13 and had the same throughput on 2. In most cases more CPU overhead in 8.0 is the reason it is slower than 5.6. Details: See the Postgres post for more details on how I run sysbench. The summary is that there are 43 invocations of sysbench and each is a microbenchmark, but a typo meant I only have results for 42 of the 43. Three configurations were tested: 10M rows without prepared statements, 10M rows with prepared statements, and 400M rows without prepared statements. The 10M row te…

Sysbench: PostgreSQL 12, 13 and 14 on a small server

This has results for sysbench with PostgreSQL versions 12.4, 13.4 and 14.0 using the same setup as previously used for the insert benchmark. Executive summary: Postgres continues to be boring. There are a few regressions and more improvements. There were ~40 tests and each ran for 5 minutes. Running each test for such a short time means there is more chance of confusing results. Alas, running each test for 30 minutes or more would have taken too long because I repeated the benchmark many times for different DBMS, configurations and database sizes. I started another round just for the Postgres IO-bound configuration that will take about 3 days to get results. Details: See the insert benchmark post for more details on the hardware and Postgres configurations. As always, there are layers upon layers of shell scripts. The file all_small.sh lists the sequence in which the test steps are run and each step is run for 300 seconds with 1 connection (low concurrency!). From the all_small.sh s…

Insert benchmark: MySQL 5.6, 5.7 & 8.0 on a small server

I have another performance report for the insert benchmark using MySQL 5.6, 5.7 and 8.0 on a small server. As usual, I had to repeat the tests a few times: I make mistakes and also want results for new my.cnf settings. The goal for the tests is to understand how low-concurrency performance changes in new MySQL releases. Low-concurrency tests are great for finding CPU regressions. High-concurrency tests are also great but I only have small servers at home. For the CPU-bound setup, new MySQL (8.0) gets between 10% and 37% less throughput than old MySQL (5.6) because 8.0 uses more CPU/query. The increase in CPU/query from 5.6 to 8.0 is large while PostgreSQL avoids CPU regressions from 12.4 to 14.0. Disclaimers: Be careful about interpreting these results, as the context here is a workload with simple queries. These results have yet to be reviewed by others so there can be mistakes. 8.0.20 and 8.0.22 were exciting releases; performance has been slowly improving since then. Details: Tests…

Insert benchmark: PostgreSQL 12, 13 and 14 on a small server

I have another performance report for the insert benchmark using PostgreSQL 12.4, 13.4 and 14.0 on a small server. As usual, I had to repeat the tests a few times: I make mistakes and sometimes need results for new tuning options. The goal for the tests is to understand how low-concurrency performance changes in new PostgreSQL releases. Low-concurrency tests are great for finding CPU regressions. High-concurrency tests are also great but I only have small servers at home. Executive summary: Postgres is boring. There were no regressions. Details: Tests were run for Postgres versions 12.4, 13.4 and 14.0. I used the same configuration file (see here). The script to build Postgres from source is here. The test servers have 4 cores and tests were run with either 1 insert connection or 2 connections (rate-limited inserts, not rate-limited queries). This page has links that explain the insert benchmark and the perf report format. The test was run in CPU-bound and IO-bound modes. For each mode…

Well actually, LSM and B-tree

Things I read about LSM that make me go well actually: LSM does sequential writes - well actually, writes are sequential logically (per-file) but not physically (per-device). From the physical perspective compaction writes are large & concurrent rather than sequential: compaction writes each output file sequentially, but compaction is usually concurrent (multi-threaded) with one file written sequentially per thread, and there can also be small writes to the WAL. With concurrent compaction threads the large (maybe 1MB+) writes are interleaved at the device. Compaction does merge-sort - well actually, that is a k-way merge, not a merge-sort. Leveled compaction suffers from write-amplification - well actually, write-amplification was 10X smaller with MyRocks when the first workload was moved to it from InnoDB. Write-amp with leveled compaction is larger than with tiered, but that is the (C)RUM Conjecture in action: leveled means less space-amp at the cost of more write-amp. One of the problems is t…
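
To make the k-way merge point concrete, here is a minimal sketch using Python's heapq.merge, which does a k-way merge over already-sorted inputs - the same shape of work compaction does over sorted SSTs. A merge-sort, by contrast, would first have to sort each input, and compaction's inputs are never unsorted.

```python
import heapq

# Each SST is already sorted by key; compaction's job is a k-way merge
# of the inputs, not a merge-sort.
sst1 = [("a", 1), ("d", 4), ("g", 7)]
sst2 = [("b", 2), ("e", 5)]
sst3 = [("c", 3), ("f", 6), ("h", 8)]

# heapq.merge keeps a heap with one cursor per input: O(n log k) total.
merged = list(heapq.merge(sst1, sst2, sst3, key=lambda kv: kv[0]))
print(merged)  # keys come out in order: a, b, c, d, e, f, g, h
```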

DeWitt Clause vs the public cloud

The DeWitt Clause is the name of a clause in the terms of service that prevents a customer from disclosing performance of a licensed product. Things turned out OK for David DeWitt and he even went on to work for at least one company that now uses a DeWitt Clause for their database products. I am not a lawyer but I have a strong opinion on the DeWitt Clause and non-compete agreements. I think both are anti-competitive and hope more is done to limit their usage. One of the arguments for DeWitt Clauses is that benchmarking is hard, others will get it wrong and that can mislead customers. While the first parts are correct (it is hard, people will get it wrong) the last part feels like a bad-faith argument to me. Vendors haven't always pursued truth while running their own benchmarks. I agree there will be plenty of benchmarketing in a world without DeWitt Clauses, but I think the benefits outweigh the problems. Database vendors: We used to think of this as a thing for legacy enterprise datab…

Compatible with MySQL or Postgres?

Open and closed source scale-out DBMS that are compatible with MySQL and Postgres have emerged on the market. This is great for the community but there will be much confusion about the meaning of compatible. This post doesn't yet cover the cloud vendors in China. They are doing impressive work but I don't know enough about it, and I am happy to update this post when I learn more. But first, the MySQL and Postgres teams. Team MySQL includes PlanetScale (Vitess), PingCAP (TiDB), Amazon (Aurora), MariaDB (Xpand), SingleStore, ClickHouse and Dolt. Team Postgres includes CockroachDB, Yugabyte, Google (Spanner), Amazon (Aurora & Redshift), Microsoft (CitusDB), TimescaleDB, CrateDB, YellowBrick, Greenplum and Materialize. It wouldn't be fair to exclude MongoDB, so Team MongoDB is MongoDB, Amazon (DocumentDB) and Microsoft (Cosmos DB API for MongoDB). But the rest of the post is about MySQL and PG. One way to describe compatibility is via three…

The other way to compress InnoDB: outsource it

There are at least three ways to do compression for InnoDB - classic, holepunch and outsource. The classic approach (table compression) was used and enhanced by the FB MySQL team. It might not have been widely used elsewhere. While it works, even for read/write workloads, the implementation is complicated and it isn't clear that it has a bright future. The holepunch approach (page compression) is much simpler than the classic approach. Alas, I am skeptical that a filesystem will be happy tracking the metadata from doing a holepunch for every (or most) pages written. I am also skeptical that unlink() response times of seconds to minutes (a consequence of the holepunch usage) will be good for a production DBMS. I wrote a few posts about my experience with the holepunch approach: here, here, here and here. The outsource approach is the simplest from the perspective of InnoDB - let the filesystem or storage do the compression for you. In this case InnoDB continues to do filesyst…
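
To show the holepunch mechanism (my sketch of the idea, not InnoDB's page compression code), the pattern is: write the compressed bytes at the page's logical offset, then punch a hole over the unused tail so the filesystem deallocates it. This assumes Linux with glibc and a filesystem that supports hole punching (e.g. XFS or ext4); the flag values come from linux/falloc.h.

```python
import ctypes, os

# Values from linux/falloc.h; PUNCH_HOLE must be combined with KEEP_SIZE.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def write_page_with_holepunch(fd, page_offset, page_size, compressed):
    # Write the compressed bytes at the page's logical offset.
    os.pwrite(fd, compressed, page_offset)
    # Deallocate the rest of the page; the logical file size is unchanged.
    # Note: holes are punched at filesystem block granularity, so a
    # sub-block hole may deallocate nothing -- part of why there is so
    # much per-page metadata for the filesystem to track.
    hole_start = page_offset + len(compressed)
    hole_len = page_size - len(compressed)
    if hole_len > 0:
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             ctypes.c_long(hole_start), ctypes.c_long(hole_len))
        if ret != 0:
            raise OSError(ctypes.get_errno(), "fallocate failed")
```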

Automating memory management in a DBMS

I recently read two papers on automating memory management - one for DB2, one for Oracle. While the papers aren't new (~2005), they are definitely worth reading. AFAIK both made it into the products. I have no idea how that turned out but the outcome doesn't change my opinion that the papers were excellent. Disclaimer - I worked on query execution for Oracle at the time and made a few changes to the row sources that I maintained to support the work described by the Oracle paper. Both papers describe methods to automate memory management with the goal of improving performance. In many DBMS memory management is difficult for several reasons: it isn't global, so the DBA must estimate good values for the sizes of various caches and other large memory consumers (sort and hash for order by, aggregation, join and index create); it isn't bounded when allocations are specified per instance of sort or join rather than as one limit on the memory used by all sorts or joins; too many concur…
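
As a toy illustration of the "global limit" idea (my sketch, not the design from either paper), a broker can hand out memory grants to sort and hash operators against one instance-wide budget instead of per-operator settings:

```python
import threading

class MemoryBroker:
    """Toy global memory broker: all sorts/joins draw grants from one
    instance-wide budget instead of per-operator settings."""
    def __init__(self, total_bytes):
        self.free = total_bytes
        self.lock = threading.Lock()

    def request(self, want_bytes, min_bytes):
        # Grant what we can between min_bytes and want_bytes. A real
        # implementation would queue the caller or force a spill to disk
        # rather than simply returning 0 when memory is short.
        with self.lock:
            grant = min(want_bytes, self.free)
            if grant < min_bytes:
                return 0  # caller must spill or wait
            self.free -= grant
            return grant

    def release(self, grant_bytes):
        with self.lock:
            self.free += grant_bytes

broker = MemoryBroker(total_bytes=8 << 30)  # one 8G budget for all operators
g = broker.request(want_bytes=2 << 30, min_bytes=256 << 20)
# ... run the sort or hash join with g bytes, spilling if g is small ...
broker.release(g)
```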

Developer experience, what about the other stakeholders?

While developer experience gets a lot of press, there are four stakeholders when you provide a database service. We can call these DX, UX, MX and OX for developer, user, management and operations experience. Developers want the DBMS to stay out of their way. Schema is one example because waiting for a schema change gets in the way. Note that NoSQL databases have less schema rather than being schema-less, because indexes are schema. I am curious whether less schema leads to a great developer experience in the long run for large-scale projects given the risk of an unmanaged schema and poorly understood data. The developers who were able to move fast early in the project can create much tech debt for those who arrive years later. Users of the services that depend on the database want great QoS - high uptime, low and predictable latency for queries, and a low chance of lost data. They just want to use your database-backed app without problems. Management wants to minimize cost while getting…

Review of DiffKV

This is a review of Differentiated Key-Value Storage Management for Balanced I/O Performance, published in ATC 2021. This is a wonderful paper that resolves several of my concerns about key-value separation (here and here). The authors have experience with key-value separation via Titan, which is part of TiDB. The key idea in the paper is to use leveled compaction for keys and tiered compaction for values. Write-amplification is reduced by not using leveled compaction for values. Read-amplification is reduced, relative to classic key-value separation (see WiscKey), by using tiered compaction for values rather than logs. With classic key-value separation the worst-case read-amp for a scan is a random read from the log for every qualifying key, and that random read can include a block decompression and a storage IO. With DiffKV the use of tiered compaction for values provides some ordering to avoid that worst case. While I enjoyed the paper, it didn't have…
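
For readers new to key-value separation, here is a minimal sketch of the classic WiscKey-style layout (my illustration, not DiffKV's design): the tree stores keys mapped to pointers into a value log, so a scan must chase one pointer per qualifying key - the worst case DiffKV's tiered value layout reduces.

```python
# Toy WiscKey-style key-value separation: the LSM stores key -> (offset, size)
# and the values live in an append-only log.
class KVSeparatedStore:
    def __init__(self):
        self.value_log = bytearray()
        self.index = {}  # a dict stands in for the LSM tree of keys

    def put(self, key, value):
        off = len(self.value_log)
        self.value_log += value
        self.index[key] = (off, len(value))

    def scan(self, lo, hi):
        # Keys are cheap to iterate in order, but every qualifying key
        # needs a (potentially random) read from the value log.
        for key in sorted(k for k in self.index if lo <= k <= hi):
            off, size = self.index[key]
            yield key, bytes(self.value_log[off:off + size])

store = KVSeparatedStore()
store.put(b"a", b"apple")
store.put(b"b", b"berry")
print(list(store.scan(b"a", b"b")))
```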

Happy 15th to Percona

I am thrilled to celebrate Percona's 15th anniversary. My time with MySQL began about the same time as the founding of Percona. Those years, the mid-2000s, were the dark ages for MySQL. There was doubt about the future of MySQL because there were many things that needed to be made better. Fortunately, upstream and the community, with much help from Percona, rallied to fix the problems and add needed features. Percona has been a huge part of the community. I don't have time to list everything, but examples include saving the user conference, providing an open-source hot backup solution, educating many of us (including me) via their blog posts and helping push the product to get so much better. I have also been a customer as they helped with the MyRocks effort and in the great patch migration (5.6 to 8.0). I like that the anniversary book starts by mentioning the multi-core scaling problem that InnoDB had in the mid-2000s. My MySQL deployment used 4-core, 1-socket CPUs at th…

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres, MongoDB and other OSS databases. There is so much innovation that we miss out on - FASTER is one example. All three (MySQL, MongoDB, Postgres) have storage engine APIs but there are not many OSS implementations of them. MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking. The API for Postgres is the most recent and perhaps it will become the most popular. The API for MySQL is the oldest and has become harder to implement as more requirements (like partitioning) were pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when the API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps. One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid conf…

Sequential IO and an LSM tree

I see statements that sequential IO is a benefit of using an LSM tree. That is a vague and truthy statement. It is correct that the writes done by compaction are sequential per-file. But I am interested in the IO from the perspective of the storage device. The truthier claim is that with an LSM there will be many streams of IO (read & write) that benefit from large IO requests. The reads and writes done by compaction are sequential per file, but there is much concurrency and the storage device will see many concurrent streams of IO. The IO patterns for an LSM with a busy workload are: N compaction threads where each thread is reading from ~10 files, merging the results and writing the output to create new SSTs. Each thread has one output file open at a time that is written sequentially, and the reads are sequential per-file. At any point in time there can be 10*N read streams and N write streams. The reads benefit from large IO requests and prefetch. The writes benefit from async write-back…
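
A tiny simulation of that claim (the stream counts and request size are made up, just to illustrate): each compaction stream issues logically sequential offsets, but the device sees the streams interleaved.

```python
# Each compaction output stream writes sequentially within its own file,
# but a busy LSM runs many streams at once, so the device-level request
# stream is interleaved. Stream count and request size are hypothetical.
def stream(name, n_requests, request_size=1 << 20):
    for i in range(n_requests):
        yield (name, i * request_size)  # (file, logically sequential offset)

streams = [stream(f"sst-{i}", 3) for i in range(4)]
# Round-robin approximates what concurrent writers submit to the device.
device_order = [req for batch in zip(*streams) for req in batch]
for fname, offset in device_order:
    print(fname, offset)
# sst-0..sst-3 alternate: sequential per file, interleaved at the device.
```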

How to submit a PR for RocksDB

I submitted my first PR to RocksDB as an external contributor and have advice on the process. Read the Contribution Guide. Run make format on your diffs; this depends on clang-format and you might need to install it. Edit HISTORY.md if there are user-facing changes. Optionally run make check to confirm tests pass. Some tests failed because my small servers have 16G of RAM, the tests use /dev/shm and needed more than the 8G that /dev/shm gets on my servers. Confirm that RocksDB LITE builds still work: LITE=1 make db_bench. Some of the CI tests are flaky now and have bogus failures (see here). To get CI tests to repeat I edit and push small changes to HISTORY.md; the team is fixing the intermittent test failures. Alas, one of the intermittent problems was internal gcc failures on ARM -- something I can't reproduce on my x86 hardware. The good news is that after three attempts everything passes. Google for advice on how to do this in git. I am not a git guru but the advice…

More on read-only benchmarks and an LSM

This adds to my last post on the impact of LSM tree shape on read performance. I updated the test script to add one more stage, l0_and_lmax, that runs after compact_all. A full compaction is done for compact_all to show the benefit (more QPS) of reducing the number of trees to search to 1. For the l0_and_lmax stage overwrites are done to get 6 SSTs in the L0. At this point there is data in the L0 and the max level of the LSM tree -- thus the name l0_and_lmax. The updated results and changes to my test scripts are here. The spreadsheet with data and charts is here. The summary is that query performance suffers if you do all of the work to fully compact the LSM tree and then put 6 SSTs in the L0. By suffers, I mean it is almost as bad as when the LSM tree has data everywhere (memtable, L0, all levels after the L0). This is visible in the charts -- compare l0_and_lmax with pre_flush and post_flush.

Read-only benchmarks with an LSM are complicated

I often see benchmark results for an LSM tree where the shape of the tree is not described. When the results are worse than desired I wonder how much of that is inherent to the LSM tree versus the configuration. It might be easier to make an LSM tree look bad than to make a B-Tree look bad. I have results from RocksDB for which QPS varies by up to 5X depending on the shape of the LSM tree. The workload is read-only and CPU-bound. The impact of the LSM tree shape on CPU overhead is known but not widely appreciated; hopefully this helps with that. Read amplification for an LSM is usually worse than for a B-Tree, but read-amp should be described separately for IO vs CPU because (usually) CPU read-amp is the problem. The unit for IO read-amp is pages read from storage (real storage, not a read from the OS page cache). The unit for CPU read-amp is less obvious, but some combination of cache lines read, keys compared and bloom filters checked per query might be a good proxy. The extra CPU read-amp…
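
As a toy version of that proxy (my sketch, with made-up tree shapes and costs), one can count bloom-filter checks and key comparisons per point lookup and watch both grow with the number of sorted runs:

```python
import math

# Toy proxy for CPU read-amp: count bloom checks and key comparisons per
# point lookup as a function of LSM shape. The shapes are hypothetical.
def lookup_cost(n_l0_ssts, level_key_counts, keys_per_l0_sst=100_000):
    bloom_checks = n_l0_ssts + len(level_key_counts)  # one filter per run
    # Worst case: every L0 SST is searched, plus one binary search per level.
    comparisons = n_l0_ssts * math.ceil(math.log2(keys_per_l0_sst))
    comparisons += sum(math.ceil(math.log2(n)) for n in level_key_counts)
    return bloom_checks, comparisons

# Fully compacted (one sorted run) vs 6 SSTs in the L0 plus two levels:
print(lookup_cost(0, [10_000_000]))              # (1, 24)
print(lookup_cost(6, [1_000_000, 10_000_000]))   # (8, 146)
```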