Wednesday, April 27, 2016

Fun with scons while building MongoDB from source

This post might not have a large audience as not many people build MongoDB from source. Fortunately MongoDB has a thriving developers email list where my questions get answered quickly. Some of my builds must specify non-default paths for the compiler toolchain including the location of binaries like gcc, include paths and library paths. Last year MongoDB added options to their scons build to make that possible and I appreciate that they move fast to make things better.

Build tools (scons, cmake, autoconf/automake) are like snowflakes. Everyone is different and those differences are painful to me because I'd rather not invest time to become an expert. Today's fun problem was figuring out how to specify multiple directories for include and library paths. I assumed this would be like LD_LIBRARY_PATH and I could use a colon as the path separator. Alas I was wrong and the path separator is a space. The docs claim that a colon should work. I am still confused, but I have a working build for MongoRocks!
  • This is OK: scons CPPPATH="/path/to/inc1ude1 /path/to/include2" mongod
  • This is not OK: scons CPPPATH="/path/to/include1:/path/to/incude2" mongod

Monday, April 25, 2016

TRIM, iostat and Linux

I use iostat and vmstat to measure how much CPU and storage is used during my performance tests. Many of the database engines have their own counters to report disk IO but it is good to use the same measurement across engines. I use the "-k" option with iostat so it reports KB written per second per device.

The rate of writes to storage can be overstated by a factor of two in one case and I don't think this is widely known. When TRIM is done for an SSD then the Linux kernels that I use report that as bytes written. If I create an 8G file then I will see at least 8G of writes reported by iostat. If I then remove the file I will see an additional 8G of writes reported by iostat assuming TRIM is used. But that second batch of 8G of writes wasn't really writes.

One of the database engines that I evaluate, RocksDB, frequently creates and removes files. When TRIM is counted as bytes written then this overstates the amount of storage writes done by RocksDB. The other engines that I evaluate do not create and remove files as frequently -- InnoDB, WiredTiger, TokuDB, mmapv1.

The best way to figure out whether TRIM is done for your favorite SSD is to test it yourself.
  1. If TRIM is done then iostat reports TRIM as bytes written. 
  2. If iostat reports TRIM as bytes written and your database engine frequently removes files then iostat wKB/second might be overstated.

Testing this:


My test case is:

output=/path/to/many/GB/file/this/will/create

# run this long enough to get a file that is many GB in size

dd if=/dev/zero of=$output bs=1M oflag=direct &
dpid=$!

sleep 30
kill $dpid


iostat -kx 1 >& o.io &
ipid=$!
sleep 3
rm -f $output; sync
sleep 10
kill $ipid
# look at iostat data in o.io

Example iostat output from a 4.0.9 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65538.00     0.00 8387632.00   255.96     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00

Example iostat output from a 3.10.53 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 31078.00     0.00 3977984.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 283935.00     0.00 36343552.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 288343.00     0.00 36907908.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 208534.00     0.00 26692352.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...