Friday, May 6, 2016

smartctl and Samsung 850 EVO

I have 3 small servers at home for performance testing. Each is an Intel NUC (5i3RYH) with 8G of RAM and a Core i3 CPU. They work quietly under my desk. I have been collecting results for MongoDB and MySQL to understand storage engine performance and efficiency. I use them for single-threaded workloads to learn when storage engines sacrifice too much performance at low concurrency to make things better at high concurrency.

Each NUC has one SATA disk and one SSD. Most tests use the SSD because the disk holds the OS install and I don't want to lose that install when too much testing makes a device unhappy. My current SSDs are 120G Samsung 850 EVOs and one of them became sick, as shown by these kernel messages.
[3062127.595842] attempt to access beyond end of device
[3062127.595847] sdb1: rw=129, want=230697888, limit=230686720
[3062127.595850] XFS (sdb1): discard failed for extent [0x7200223,8192], error 5

Other error messages were amusing.
[2273399.254789] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[2273399.254818] Do you have a strange power saving mode enabled?
[2273399.254840] Dazed and confused, but trying to continue

What does smartctl say? I am interested in Wear_Leveling_Count. The raw value is 1656 and, if that means what I think it means, this device can go to about 2000 thanks to 3D TLC NAND (aka 3D V-NAND). The normalized VALUE is 022 and it counts down from 100 to 0, so the device is roughly 80% through its rated wear and the raw Wear_Leveling_Count might reach about 2000 before VALUE hits 0. I created a new XFS filesystem on the device, rebooted the server and restarted my test. I don't think I need to replace this SSD today.

sudo smartctl -a /dev/sdb1
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4323
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
177 Wear_Leveling_Count     0x0013   022   022   000    Pre-fail  Always       -       1656
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   055   049   000    Old_age   Always       -       45
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       9
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       365781411804
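
To keep an eye on wear over time, the attribute can be scraped from the smartctl output. A minimal sketch, assuming the SSD is /dev/sdb and the column layout matches the output above:

# print the normalized value (column 4) and raw value (column 10) for Wear_Leveling_Count
# assumes the SSD is /dev/sdb and the smartctl column layout shown above
sudo smartctl -a /dev/sdb | awk '$2 == "Wear_Leveling_Count" { print "value:", $4, "raw:", $10 }'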

Wednesday, April 27, 2016

Fun with scons while building MongoDB from source

This post might not have a large audience as not many people build MongoDB from source. Fortunately MongoDB has a thriving developers email list where my questions get answered quickly. Some of my builds must specify non-default paths for the compiler toolchain including the location of binaries like gcc, include paths and library paths. Last year MongoDB added options to their scons build to make that possible and I appreciate that they move fast to make things better.

Build tools (scons, cmake, autoconf/automake) are like snowflakes. Each one is different and those differences are painful to me because I'd rather not invest the time to become an expert in all of them. Today's fun problem was figuring out how to specify multiple directories for include and library paths. I assumed this would work like LD_LIBRARY_PATH and I could use a colon as the path separator. Alas, I was wrong and the path separator is a space, even though the docs claim that a colon should work. I am still confused, but I have a working build for MongoRocks! A fuller command-line sketch follows the examples below.
  • This is OK: scons CPPPATH="/path/to/include1 /path/to/include2" mongod
  • This is not OK: scons CPPPATH="/path/to/include1:/path/to/include2" mongod
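
Here is that sketch. CPPPATH is from the examples above, while CC, CXX and LIBPATH are standard scons construction variables that may or may not be what a given MongoDB version expects, so treat this as a sketch rather than the exact incantation:

# hypothetical build with a non-default toolchain; all paths are placeholders
scons \
  CC=/path/to/toolchain/bin/gcc \
  CXX=/path/to/toolchain/bin/g++ \
  CPPPATH="/path/to/include1 /path/to/include2" \
  LIBPATH="/path/to/lib1 /path/to/lib2" \
  mongod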

Monday, April 25, 2016

TRIM, iostat and Linux

I use iostat and vmstat to measure how much CPU and storage is used during my performance tests. Many of the database engines have their own counters to report disk IO but it is good to use the same measurement across engines. I use the "-k" option with iostat so it reports KB written per second per device.

The rate of writes to storage can be overstated by a factor of two in one case and I don't think this is widely known. When TRIM is done for an SSD, the Linux kernels that I use report it as bytes written. If I create an 8G file then I will see at least 8G of writes reported by iostat. If I then remove the file I will see an additional 8G of writes reported by iostat, assuming TRIM is used. But that second 8G was not really written; it was TRIM being counted as writes.

One of the database engines that I evaluate, RocksDB, frequently creates and removes files, so counting TRIM as bytes written overstates the amount of storage writes done by RocksDB. The other engines that I evaluate (InnoDB, WiredTiger, TokuDB, mmapv1) do not create and remove files as frequently.

The best way to figure out whether this happens for your favorite SSD and kernel is to test it yourself (see the check below):
  1. If TRIM is done then iostat might report it as bytes written.
  2. If iostat reports TRIM as bytes written and your database engine frequently removes files then the iostat wKB/second rate might be overstated.
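
It can also help to confirm that TRIM is in play at all. A quick check, assuming the SSD is /dev/sdb and the filesystem is mounted with online discard:

# does the device advertise TRIM? non-zero DISC-GRAN / DISC-MAX means it does
lsblk -D /dev/sdb
cat /sys/block/sdb/queue/discard_max_bytes
# is any filesystem mounted with the discard option?
grep discard /proc/mounts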

Testing this

My test case is:

output=/path/to/many/GB/file/this/will/create

# write a large file with direct IO, run long enough to get a file that is many GB in size
dd if=/dev/zero of=$output bs=1M oflag=direct &
dpid=$!

sleep 30
kill $dpid

# capture per-second, per-device IO rates while the file is removed
iostat -kx 1 >& o.io &
ipid=$!
sleep 3
rm -f $output; sync
sleep 10
kill $ipid
# look at iostat data in o.io

Example iostat output from a 4.0.9 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65538.00     0.00 8387632.00   255.96     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00 65528.00     0.00 8387584.00   256.00     0.00    0.00   0.00   0.00

Example iostat output from a 3.10.53 Linux kernel after the rm command:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 31078.00     0.00 3977984.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 283935.00     0.00 36343552.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 288343.00     0.00 36907908.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00 208534.00     0.00 26692352.00   256.00     0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
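
To put a number on it, the wKB/s column can be summed across the per-second samples. A minimal sketch, assuming the device name and column layout from the 4.0.9 output above and that the first sample (which reports averages since boot) is skipped:

# sum wkB/s (field 7) for md2 over the 1-second samples in o.io
awk '$1 == "md2" { n++; if (n > 1) kb += $7 }
     END { printf "reported writes: %.1f GB\n", kb / (1024 * 1024) }' o.io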

Wednesday, March 2, 2016

Using jemalloc heap profiling with MySQL

I spent too much time figuring this out.

This works for me:
MALLOC_CONF="prof:true,prof_gdump:true,prof_prefix:/path/to/files/jez" \
libexec/mysqld ...

This does not work for me:
MALLOC_CONF="prof:true,prof_gdump:true" \
libexec/mysqld ...
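
Once heap profiles show up under that prefix they can be inspected with jeprof, the profile viewer that ships with jemalloc (older releases shipped the same script as pprof). This assumes jemalloc was built with profiling enabled and that jeprof is installed; the dump file name below is just a placeholder:

# symbolize one heap dump against the mysqld binary and print the top allocators
# the .heap file name is a placeholder, use a real dump from /path/to/files
jeprof --text libexec/mysqld /path/to/files/jez.12345.0.u0.heap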

MyRocks and sql-bench

MyRocks is now able to run sql-bench thanks to support recently added for tables that do not have a PK. I found one bug and two performance problems in MyRocks while running sql-bench.

While writing this I found a post claiming that sql-bench will be removed from the MySQL repo. sql-bench is useful and I hope it remains in some repo.

I run sql-bench for MyRocks with this command line:
./run-all-tests --server=mysql --create-options="engine=rocksdb default collate latin1_bin"

Monday, February 22, 2016

Concurrent transaction performance in RocksDB

Support for transactions was added to RocksDB last year. Here I explain performance for the pessimistic transaction API with a concurrent workload. I used the db_bench client rather than MySQL but MyRocks reuses the RocksDB transaction API and gets the performance benefits. For pessimistic transactions:
  • there is not much overhead from the transaction API
  • throughput improves as the work done per transaction increases
  • throughput improves with the concurrent memtable

Configuration


The test server has 2 sockets, 24 CPU cores and 48 HW threads. The database used fast storage (tmpfs). The benchmark is db_bench --benchmarks=randomtransaction via this test script. It was run for 1 to 48 concurrent clients to understand the impact of concurrency. It was run without transactions and with pessimistic transactions to understand the impact of the transaction API. It was run with --batch_size=1 and --batch_size=4 to understand the impact of doing more work per transaction. The test database was cached by RocksDB.
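
The command lines were roughly like the sketch below. The authoritative flags are in the test script, so treat this as an approximation; --transaction_db selects the pessimistic transaction API in db_bench and --allow_concurrent_memtable_write toggles the concurrent memtable, but flag names can change across RocksDB versions:

# approximate db_bench invocation; see the linked test script for the real flags
./db_bench --benchmarks=randomtransaction \
  --transaction_db=1 \
  --batch_size=4 \
  --threads=24 \
  --allow_concurrent_memtable_write=true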

Results


The first two graphs show transaction throughput with the concurrent memtable enabled (concurMT=1) and disabled (concurMT=0). Throughput is much larger with the concurrent memtable enabled. The benefit is larger with batch_size=4 than with batch_size=1 because a larger batch_size does more work per transaction and there is more mutex contention to avoid. Transaction throughput is larger with batch_size=1 because each transaction does 4X more work with batch_size=4.



The next two graphs show the efficiency of the transaction API as the ratio of throughput with pessimistic transactions to throughput without transactions. When the ratio is 1.0 the throughput with transactions matches the throughput without transactions. From the graphs below, the efficiency is better with batch_size=1 than with batch_size=4 and it improves with concurrency.

Data for the graphs is here:

Wednesday, February 17, 2016

Less slow versus faster

I describe my work as making things less slow rather than making things faster. While making something less slow tends to make it faster, I think these are two different tasks and both are important:
  • By making things faster I mean a faster response time and more throughput for a single-threaded workload. For example, cache-friendly memory allocation might increase throughput from 100 QPS to 125 QPS and decrease response time from 10ms to 8ms.
  • By making things less slow I mean that the throughput gap - the difference between linear scaling and actual throughput - has been reduced. Assume the server can do 100 QPS at one thread and with linear scaling it can do 1000 QPS at 10 threads. But linear scaling is hard to achieve and the server might be limited to 400 QPS on a real workload. Eventually someone has time for performance debugging and finds things to make better with PMP and we get 500 QPS at 10 threads. This is making something less slow.
The USL provides a formal model to reason about making software less slow for concurrent workloads. I learned about Dr. Gunther's work on the USL thanks to Baron.
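
For reference, the standard form of the USL models relative capacity at N threads with a contention term (sigma) and a coherency term (kappa); this is Gunther's general formula, not something specific to my tests:

C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}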