Wednesday, April 27, 2016

Fun with scons while building MongoDB from source

This post might not have a large audience as not many people build MongoDB from source. Fortunately, MongoDB has a thriving developer email list where my questions get answered quickly. Some of my builds must specify non-default paths for the compiler toolchain, including the location of binaries like gcc, the include paths and the library paths. Last year MongoDB added options to their scons build to make that possible and I appreciate that they move fast to make things better.

Build tools (scons, cmake, autoconf/automake) are like snowflakes: each one is different, and those differences are painful to me because I'd rather not invest the time to become an expert in all of them. Today's fun problem was figuring out how to specify multiple directories for include and library paths. I assumed this would work like LD_LIBRARY_PATH and that I could use a colon as the path separator. Alas, I was wrong; the separator is a space. The docs claim that a colon should work, so I am still confused, but I have a working build for MongoRocks!
  • This is OK: scons CPPPATH="/path/to/include1 /path/to/include2" mongod
  • This is not OK: scons CPPPATH="/path/to/include1:/path/to/include2" mongod
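
For completeness, here is a sketch of a fuller invocation for the case where the toolchain also lives in a non-default location. LIBPATH is the scons variable for library directories; the CC/CXX values and all of the paths below are placeholders rather than my real configuration:

scons CC=/path/to/toolchain/bin/gcc CXX=/path/to/toolchain/bin/g++ \
    CPPPATH="/path/to/include1 /path/to/include2" \
    LIBPATH="/path/to/lib1 /path/to/lib2" \
    mongod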

Monday, April 25, 2016

TRIM, iostat and Linux

I use iostat and vmstat to measure how much CPU and storage is used during my performance tests. Many of the database engines have their own counters to report disk IO but it is good to use the same measurement across engines. I use the "-k" option with iostat so it reports KB written per second per device.

The rate of writes to storage can be overstated by a factor of two in one case and I don't think this is widely known. When TRIM is done for an SSD, the Linux kernels that I use report it as bytes written. If I create an 8G file then I will see at least 8G of writes reported by iostat. If I then remove the file I will see an additional 8G of writes reported by iostat, assuming TRIM is used. But that second batch of 8G wasn't really writes; it was the TRIM for the deleted file counted as if it were writes.

One of the database engines that I evaluate, RocksDB, frequently creates and removes files. When TRIM is counted as bytes written, this overstates the amount of storage writes done by RocksDB. The other engines that I evaluate (InnoDB, WiredTiger, TokuDB, mmapv1) do not create and remove files as frequently.

The best way to figure out whether this happens for your favorite SSD and kernel is to test it yourself.
  1. If TRIM is done then iostat might report it as bytes written.
  2. If iostat reports TRIM as bytes written and your database engine frequently removes files then the iostat wKB/second numbers might be overstated.

Testing this:


My test case is:

# the output path is a placeholder; pick a filesystem on the SSD being tested
output=/path/to/many/GB/file/this/will/create

# write with direct IO; run this long enough to get a file that is many GB in size
dd if=/dev/zero of="$output" bs=1M oflag=direct &
dpid=$!
sleep 30
kill $dpid

# capture per-second iostat output while the file is removed
iostat -kx 1 >& o.io &
ipid=$!
sleep 3
rm -f "$output"; sync
sleep 10
kill $ipid
# look at the iostat data in o.io -- wKB/s after the rm is TRIM counted as writes

Example iostat output from a 4.0.9 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65538.00     0.00 8387632.00   255.96     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00

Example iostat output from a 3.10.53 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 31078.00     0.00 3977984.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 283935.00     0.00 36343552.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 288343.00     0.00 36907908.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 208534.00     0.00 26692352.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
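
Whether TRIM is issued at all when a file is removed depends on the SSD and on how the filesystem is configured, typically the discard mount option or periodic fstrim runs. A quick way to check, assuming util-linux is installed (the device names and output will vary):

# non-zero DISC-GRAN and DISC-MAX columns mean the device supports discard
lsblk --discard
# check whether any filesystem is mounted with the discard option
mount | grep discard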

Wednesday, March 2, 2016

Using jemalloc heap profiling with MySQL

I spent too much time figuring this out.

This works for me:
MALLOC_CONF="prof:true,prof_gdump:true,prof_prefix:/path/to/files/jez" \
libexec/mysqld ...

This does not work for me:
MALLOC_CONF="prof:true,prof_gdump:true" \
libexec/mysqld ...
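
My guess is that without prof_prefix the dump files default to jeprof.* in the server's working directory, which might not be writable, so an explicit prefix avoids that. To read the dumps I use jeprof, the pprof variant that ships with jemalloc. The library path below is a placeholder and the preload is only needed when mysqld is not already linked with jemalloc:

LD_PRELOAD=/usr/lib/libjemalloc.so \
MALLOC_CONF="prof:true,prof_gdump:true,prof_prefix:/path/to/files/jez" \
libexec/mysqld ...

# later, show the largest allocation call stacks (pick one dump if there are many)
jeprof --text libexec/mysqld /path/to/files/jez.*.heap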

MyRocks and sql-bench

MyRocks can now run sql-bench thanks to recently added support for tables without a primary key. I found one bug and two performance problems in MyRocks while running sql-bench.

While writing this I found a post that claims sql-bench will be removed from the MySQL repo. It is useful and I hope it remains in some repo.

I run sql-bench for MyRocks with this command line:
./run-all-tests --server=mysql --create-options="engine=rocksdb default collate latin1_bin"
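
For context, run-all-tests is a Perl script in the sql-bench directory of the MySQL source tree and connection options go on the same command line. A sketch with placeholder connection settings (the exact option names are listed in the script's usage text):

cd sql-bench
./run-all-tests --server=mysql --user=root --password=pw --host=127.0.0.1 \
  --create-options="engine=rocksdb default collate latin1_bin"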

Monday, February 22, 2016

Concurrent transaction performance in RocksDB

Support for transactions was added to RocksDB last year. Here I explain performance for the pessimistic transaction API with a concurrent workload. I used the db_bench client rather than MySQL but MyRocks reuses the RocksDB transaction API and gets the performance benefits. For pessimistic transactions:
  • there is not much overhead from the transaction API
  • throughput improves as the work done per transaction increases
  • throughput improves with the concurrent memtable

Configuration


The test server has 2 sockets, 24 CPU cores and 48 HW threads. The database used fast storage (tmpfs). The benchmark is db_bench --benchmarks=randomtransaction via this test script. It was run for 1 to 48 concurrent clients to understand the impact of concurrency. It was run without transactions and with pessimistic transactions to understand the impact of the transaction API. It was run with --batch_size=1 and --batch_size=4 to understand the impact of doing more work per transaction. The test database was cached by RocksDB.
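
The linked test script has the full details. A rough sketch of a single run is below; the flag names can vary across RocksDB versions and the database path, key count and client count are placeholders:

# pessimistic transactions, 24 clients; batch_size controls the work per transaction
./db_bench --benchmarks=randomtransaction --transaction_db=1 \
  --batch_size=4 --threads=24 --num=10000000 --db=/dev/shm/rocksdbtest \
  --allow_concurrent_memtable_write=1 --enable_write_thread_adaptive_yield=1
# drop --transaction_db=1 for the runs without transactions, and set
# --allow_concurrent_memtable_write=0 to disable the concurrent memtable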

Results


The first two graphs show the transaction throughput with (concurMT=1) and without (concurMT=0) the concurrent memtable enabled. Throughput is much larger with the concurrent memtable enabled. The benefit is larger with batch_size=4 than with batch_size=1 because more work is done per transaction with a larger batch_size and there is more mutex contention to avoid. Absolute throughput is larger with batch_size=1 because there is 4X more work done per transaction with batch_size=4.



The next two graphs show the efficiency of the transaction API: the ratio of throughput with pessimistic transactions to throughput without transactions. When the value is 1.0, the throughput with transactions matches the throughput without transactions. From the graphs below, efficiency is better with batch_size=1 than with batch_size=4 and it improves with concurrency.

Data for the graphs is here:

Wednesday, February 17, 2016

Less slow versus faster

I describe my work as making things less slow rather than making things faster. While making something less slow tends to make it faster I think there are two different tasks here and both are important:
  • By making things faster I mean a faster response time and more throughput for a single-threaded workload. For example, cache-friendly memory allocation might increase throughput from 100 QPS to 125 QPS and decrease response time from 10ms to 8ms.
  • By making things less slow I mean that the throughput gap - the difference between linear scaling and actual throughput - has been reduced. Assume the server can do 100 QPS at one thread and with linear scaling it can do 1000 QPS at 10 threads. But linear scaling is hard to achieve and the server might be limited to 400 QPS on a real workload. Eventually someone has time for performance debugging and finds things to make better with PMP and we get 500 QPS at 10 threads. This is making something less slow.
The USL (Universal Scalability Law) provides a formal model to reason about making software less slow for concurrent workloads. I learned about Dr. Gunther's work on the USL thanks to Baron.
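
For reference, the USL models throughput at N concurrent clients with a contention term and a coherency term. This is the standard form of the model, not something derived from my tests:

X(N) = \frac{\lambda N}{1 + \sigma (N - 1) + \kappa N (N - 1)}

Here \lambda is the throughput of one client, \sigma captures contention (the serialized fraction of work) and \kappa captures coherency costs (crosstalk between threads). Making something less slow means reducing \sigma and \kappa.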

Friday, February 12, 2016

RocksDB, InnoDB and TokuDB vs sysbench

This compares RocksDB and InnoDB storage engines using sysbench and extends the results I shared yesterday. I continue to get great performance from MyRocks and this is a bonus as my motivation for using it is better storage efficiency.

tl;dr - MyRocks is usually equal to or faster than InnoDB on these tests

Configuration


The test server for InnoDB and RocksDB has 2 sockets, 24 cores (48 HW threads), 256G of RAM and a fast SSD. I also ran tests for TokuDB on an older server, but that hardware is much slower so I won't share the results here. From past experience, TokuDB does poorly on these tests.

For all tests I disabled the binlog and sync-on-commit. The my.cnf files were similar to my previous post but included extra settings inherited from production.
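
As an illustration only, and not the production-derived my.cnf files that I used, disabling the binlog and sync-on-commit amounts to settings along these lines. The MyRocks variable name and the choice of value 2 are my assumptions about what "disabled" means here:

# binlog disabled by simply not setting log-bin
innodb_flush_log_at_trx_commit=2    # no fsync per commit for InnoDB
rocksdb_flush_log_at_trx_commit=2   # same idea for MyRocks (assumed variable name)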

I used my sysbench branch in three configurations:
  • single-threaded & CPU-bound - engines used a buffer pool large enough to cache the database. Writes were fast thanks to the fast SSD. Tests were run with one database client, one sysbench process and one database table. The table had 10M rows.
  • multi-threaded & CPU-bound - engines used a buffer pool large enough to cache the database. Writes were fast thanks to the fast SSD. Tests were run for 8, 16, 24, 32, 40 and 48 concurrent clients. Tests used 8 sysbench processes and 8 tables (table per process). Each table had 10M rows.
  • multi-threaded & IO-bound - engines used a 1G buffer pool and buffered IO so reads were fast from the OS page cache. Writes were fast thanks to the fast SSD. Tests were run for 8, 16, 24, 32, 40 and 48 concurrent clients. Tests used 8 sysbench processes and 8 tables (table per process). Each table had 10M rows.
Four tests were run for each configuration (a sample command line is sketched after this list):
  • point-query - fetch all columns of one row by primary key lookup (command line). Performance is measured as queries per second (QPS) and each query is an auto-commit SELECT statement.
  • read-only - the sysbench read only transaction (command line). Performance is measured as transactions per second (TPS).
  • read-write - the sysbench read write transaction (command line). Performance is measured as transactions per second (TPS).
  • increment - increment a non-indexed column of one row by primary key lookup (command line). Performance is measured as queries per second (QPS) and each query is an auto-commit UPDATE statement. The BEGIN and COMMIT statements are not used.
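
As a sketch of what one of these tests looks like with a stock modern sysbench rather than my branch: the bundled oltp_point_select Lua script is roughly the point-query test, and the connection settings below are placeholders:

# load 8 tables of 10M rows, then run point queries from 48 clients
sysbench oltp_point_select --mysql-user=root --mysql-password=pw --mysql-host=127.0.0.1 \
  --tables=8 --table-size=10000000 --threads=8 prepare
sysbench oltp_point_select --mysql-user=root --mysql-password=pw --mysql-host=127.0.0.1 \
  --tables=8 --table-size=10000000 --threads=48 --time=300 run
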
Finally, I tested the following binaries:
  • My56.Rocks - Facebook MySQL 5.6, RocksDB engine with zlib compression
  • My56.Inno - MySQL 5.6.29, InnoDB
  • My56.Inno.zlib - MySQL 5.6.29, InnoDB with zlib compression
  • My57.Inno - MySQL 5.7.10, InnoDB
  • My57.Inno.zlib - MySQL 5.7.10, InnoDB with zlib compression

Single-threaded & CPU-bound


The first graph is for QPS on the point-query and increment tests. The second graph is for TPS on the read-only and read-write tests. Data for the graphs is here.

Performance summary:
  • RocksDB is comparable with InnoDB for point queries, increment and read-only.
  • RocksDB is comparable with uncompressed InnoDB for read-write. Compressed InnoDB is slower for read-write and I didn't try to explain it.
  • Compressed InnoDB matches uncompressed InnoDB on the point-query test because compression code isn't hit during the test as the buffer pool is large enough to cache all pages as uncompressed.
  • Compressed InnoDB is slower on increment because some (de)compression operations are done in the foreground and increase response time.

Multi-threaded & CPU-bound


Tests were run for 8 to 48 concurrent clients using 8 sysbench processes and a different table per process. Data for the graphs is here.

Performance summary:

  • All engines were great on the point-query and read-only workloads but MySQL 5.6 is faster than 5.7. The difference for point-query is ~12% at 8 clients and ~4% at 48 clients. The difference for read-only is ~12% at 8 clients and ~10% at 48 clients.
  • InnoDB in MySQL 5.6 suffers on the increment workload. I did not debug it. InnoDB in MySQL 5.7 was ~15% faster than MyRocks in 5.6.
  • Compressed InnoDB in MySQL 5.6 and 5.7 suffers on the read-write workload. Given that MyRocks matched InnoDB in MySQL 5.6 I wonder if the 5.7 improvements are engine independent.


Multi-threaded & IO-bound


All engines used a 1G buffer pool for these tests with the database cached by the OS. Tests were run for 8 to 48 concurrent clients using 8 sysbench processes and a different table per process. All data for the graphs is here.

Performance summary:
  • For point-query and read-only, uncompressed InnoDB did best, MyRocks was next best and compressed InnoDB was the worst.
  • For read-write, uncompressed InnoDB in MySQL 5.7 and MyRocks were best. The other engines were similar to each other and not as good.
  • For increment, MyRocks was the best, followed by InnoDB in MySQL 5.7, followed by InnoDB in MySQL 5.6.