Saturday, August 9, 2014

Bug marketing - fixing MongoDB replication bugs in TokuMX

Asynchronous replication is lossy by nature. It is also useful in production, especially for replication across a WAN where the latency from one or two network roundtrips between replicas is too much. Latency can be reduced by placing replicas nearby, but that greatly increases cost because more replicas are needed. In MySQL land we have work-in-progress that uses enhanced semi-sync replication and binlog archivers to make the binlog 2-safe within a datacenter and greatly reduce the chance that a committed transaction is lost (some people call this lossless). Even better, it can be done without running another local replica, so it has a minor impact on the cost of the tier and the network latency overhead from it is very low.

In a perfect world we could always afford the network latency and/or extra hardware cost from synchronous replication. But production would be much less fun in a perfect world. Even for systems that have implemented sync replication, I am skeptical that the system behaves as advertised until it has aged in production for a few years.

Replication can also be lossy in practice because of bugs. MySQL replication slaves were not crash-safe for too long because the replication state was maintained in a file that wasn't kept consistent with InnoDB on crash recovery. This was first fixed in the Google patch for MySQL via rpl_transaction_enabled. It was fixed many years later in upstream MySQL.

Reputations from bugs and missing features can linger long after the bugs have been fixed and the features have been implemented. MySQL suffered from this because of MyISAM but some of the damage was self-inflicted. Compare the claims about integrity for atomic operations & MyISAM in an older version of the manual with the current version:
    In transactional terms, MyISAM tables effectively always operate in autocommit = 1 mode. Atomic operations often offer comparable integrity with higher performance.
MongoDB had a MyISAM moment with the initial version of its storage engine that was not crash safe. Fortunately they rallied and added journalling. I think they are having another MyISAM moment with replication. The TokuMX team has an excellent series of posts and a great white paper describing performance and correctness problems that exist in MongoDB and have been fixed in TokuMX. The problems include slow elections and unexpected data loss on master failover. 

The post by Aphyr is also worth reading. While there, read the posts on other systems. A lot of innovative correctness testing & debugging was done to produce those results.

Will these problems be fixed before the reputation sticks? This is a good reason to consider TokuMX.

Thursday, July 10, 2014

How to answer questions

Are you somehow responsible for operating a DBMS in production? Congratulations! You are going to get a lot of questions so learn how to be efficient while answering them. Questions can be cheap to ask because speculation is free. A non-speculative answer is expensive when it requires research and experiments. Good and bad questions will arrive faster than good answers can be provided. How do you remain productive assuming you don't ignore the questions? Make the answer as inexpensive as the question. Problem solved? Here is a short guide.

Preferred version

I prefer these questions over the ones that follow. I don't know whether the non-preferred variants are asked more frequently or whether I am more likely to remember them. My blogs, smalldatum and mysqlha, have a lot more detail.
  • Q: Why are you using Y?
  • Q: What would it take to get you to migrate?
  • A: One of the most important reasons, and one that doesn't get enough attention, is the value of operations expertise. That expertise is hard to find and takes time to build for new products, and it can make a good product great. Migration requires something significantly better to be worth the cost, time & risk. [insert many more reasons]. 

Less preferred variant #1

This is a variant of the first. It deserves a short answer. X is usually the system that the person sells. It is better to understand why Y is used before suggesting a replacement. But this might be a good way to get a potential customer talking. 
  • Q: Have you considered using X?
  • A: yes | no | maybe

Less preferred variant #2

Here the question morphs into an assertion. I tend to get this at conferences. There are a lot of new DBMS products in development. Some of them are very interesting. They are more likely to win new customers than migrate existing users of "legacy" products. Alas, no team has the resources to do proper benchmarks or even spend a few days thinking about each of them.
  • Q: Why aren't you using X?
  • Q: You should be using X!
  • A: Why do you think I should be using X?

Monday, July 7, 2014

Benchmarking the leveldb family

In the beginning there was LevelDB. It is wonderful to see so much functionality provided by a small amount of clean code. The benchmark results show very high performance for a small, cached database without concurrency. AFAIK, this is a great solution for at least one widely-used and important workload. But it isn't a server DBMS and I don't think the project owners want it to become one as that will make the code more complex and even add a few bugs.

Editorial #1

While LevelDB isn't a server DBMS, it is a good basis for one. So we have forks or projects based on it like RocksDB and HyperLevelDB and we are starting to get benchmark results that compare the forks with each other and with non-LevelDB DBMS products. Where there is benchmarking there might be benchmarketing so it helps to understand the client, db_bench, that is frequently used to generate benchmark results.

For starters, my faith in a particular benchmark result is inversely related to the number of products evaluated in the test. It is hard to find someone with expertise in multiple systems to vouch that the result for a specific product wasn't lousy because of misconfiguration or misuse. Results should be explained and that takes more time as more systems are included in a result. While recent changes have made RocksDB better, it is much harder to configure than LevelDB so I won't be surprised when lousy results are caused by misconfiguration. Fortunately, it is easy to share db_bench command lines.

Editorial #2

The RocksDB team has done a lot of great work to improve in-memory performance. When I was helping (a long time ago), we spent a lot of time on IO-bound & write-heavy workloads. Prior to the addition of universal compaction by Dhruba we added monitoring and learned that if you have a write-heavy workload then you are not going to have a good time with LevelDB. We aren't unique in learning that.

The problems described below become visible when you run a random-write workload on a large database and get the LSM into a steady state. You can fail to reproduce them by skipping some of these steps.

  1. LevelDB suffers from high write-amplification. Add 1 for the redo log, 1 for the memtable flush to the L0, 1 for L0 -> L1 compaction assuming sizeof(L0) == sizeof(L1) and then ~10 for each level beyond the L1 assuming there is no skew in keys to be written (uniform, not zipfian distribution). The above means you might end up with a write-amplification of 50 for a large database (the sketch after this list makes the arithmetic concrete). I don't believe a clever choice of compaction inputs leads to a significant reduction in write-amplification in the steady state unless you ignore space-amplification. And if we ignore space-amplification then disabling compaction is the best strategy as data is written once to the L0 and optionally once to the redo log and then we are done. RocksDB has monitoring to report the value and sources of write-amplification.
  2. LevelDB has but one thread to do compaction in the background. This thread does read, decompress, merge keys, compress and write in a loop. It is unlikely to keep up with even a few client threads, and sometimes not even one, on a write-intensive workload. When write-amplification is 50 it must repeat that loop 50 times for each byte put into the memtable. It can also stall on disk reads and writes. RocksDB supports many background compaction threads, which is a big improvement over LevelDB, and adds universal compaction to reduce write-amplification.
  3. Storage can also fall behind when write-amplification is large. If storage can sustain 200MB/second of writes and write-amplification is 50 then the peak ingest rate is 4 MB/second (peak-write-rate / write-amplification) and that assumes storage reads are free or not done because the database is cached. A b-tree that does random writes might be able to beat that. 
  4. The result of the above is that L0 -> L1 compaction will fall behind and likely trigger many stalls on Put operations. Note that LevelDB doesn't have features to stall Put operations when levels beyond the L0 (L1 -> Lmax) get too much data. All of this leads to more Put stalls and an LSM tree with too much data on the wrong levels. RocksDB has an option to stall writes when levels L1 -> Lmax-1 have too much data. But the best solution is to match the compaction algorithm to the workload -- universal compaction has less write-amplification.
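
To make the arithmetic in items 1 and 3 concrete, here is a minimal sketch (my own illustration, not LevelDB or RocksDB code) that estimates write-amplification for leveled compaction and the peak ingest rate it allows. The number of levels beyond the L1 is a hypothetical example.

    #include <cstdio>

    int main() {
      // Assumptions from item 1: +1 for the redo log, +1 for the memtable flush
      // to the L0, +1 for L0 -> L1 and ~10 for each level beyond the L1.
      const double levels_beyond_l1 = 5;   // hypothetical LSM with levels L2..L6
      const double per_level_amp = 10.0;
      const double write_amp = 1 + 1 + 1 + per_level_amp * levels_beyond_l1;

      // Item 3: peak ingest is bounded by peak-write-rate / write-amplification,
      // assuming storage reads are free because the database is cached.
      const double storage_write_mb_per_sec = 200.0;
      const double peak_ingest_mb_per_sec = storage_write_mb_per_sec / write_amp;

      // Prints a write-amplification of ~50 and a peak ingest rate of ~4 MB/s.
      std::printf("write-amp ~%.0f, peak ingest ~%.1f MB/s\n",
                  write_amp, peak_ingest_mb_per_sec);
      return 0;
    }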

Notes on db_bench

With all of that said, here is a short and incomplete guide to understanding the LevelDB benchmark client, db_bench. I am a big fan of db_bench. We use it a lot to understand and improve RocksDB. Some of the problems described below have been fixed in RocksDB but others remain.

  1. The db_bench client doesn't report performance metrics per time interval. This has been fixed in RocksDB. LevelDB's db_bench reports average throughput at the end of a test run. It can report percentiles and histograms for response times. For some tests I want to confirm that response time and throughput were stable and did not degrade continuously or intermittently over the duration of the test. That is hard to judge unless throughput and response time are also reported per time interval as the test runs. This is enabled in RocksDB via the --statistics, --stats_per_interval and --stats_interval options.
  2. Client threads reuse the same RNG seed. See the ThreadState constructor in LevelDB's db_bench. There is a --seed option for RocksDB's db_bench to prevent this. The problem is that a given thread generates the same sequence of keys each time the test is run. While that might help with reproducibility it can also ruin results for IO-bound tests in a few cases. By IO-bound I mean the database is much larger than RAM. When read performance is tested for an IO-bound database, if the db_bench client is run for a short period of time and then restarted, then during the second run all of the read data might still be in the OS filesystem cache. I use iostat during benchmarks to debug this problem and others. When write performance is tested for an IO-bound database and db_bench is run for a short time and then restarted, the same keys are written for each run and are more likely to fit in the upper levels of the LSM, leading to less write-amplification and better performance than might occur in production. I added compaction statistics to RocksDB to report how much IO is done (bytes read & written, read latencies) per level with leveled compaction and this data can be used to debug the problem.
  3. The db_bench RNG has a range from 0 to 2^31-1. If your db_bench test database has more than 2B distinct key values then you are not going to have a good time. This has been fixed in RocksDB's db_bench.
  4. db_bench doesn't have an option to enable checksum verification on reads from storage. This excludes a CPU overhead that is likely to be enabled on production deployments. See the constructor for ReadOptions which sets verify_checksum to false. Given the use of the default constructor elsewhere in LevelDB I wonder if there are places where it never verifies checksums on read. That would be a bug and I think Table::Open is one example of the bug when ReadBlock is called for the index block. Possible bugs from use of the default constructor have been fixed in RocksDB. RocksDB's db_bench has the --verify_checksum option as well, and the sketch after this list sets verify_checksum explicitly. I didn't ask the LevelDB team about this until I wrote this post.
  5. db_bench doesn't have an option to set the block size for database files, so the default of 4KB is used. RocksDB's db_bench has an option to set the size via the --block_size option.
  6. The statistics timer for Put operations is stopped before DB::Write is done. The timer is stopped in LevelDB's db_bench after the operation has been added to the write batch. It should be stopped after the write batch has been added to the memtable and optionally written to the redo log. See the call to FinishedOps in DoWrite. This means that response times don't measure the time to write the database. This was recently fixed in RocksDB. I didn't ask the LevelDB team about this until I wrote this post.
  7. Was a good malloc library used? I have measured significant differences in db_bench results between not using and using jemalloc. 
  8. There is no client-side processing of fetched data after a read. The read loop is very simple -- generate a key, call read, don't look at the returned value, repeat. To make sure that the fetched value was really read by the client, I think db_bench should compute the sum of the bytes in the returned value, if only to generate a bit more load on the memory system (see the sketch after this list). Otherwise a DBMS using mmap reads might avoid copying or reading the values. This has yet to be fixed in RocksDB. See the ReadRandom method in db_bench.
  9. LevelDB and RocksDB use buffered IO. A too large value for readahead (read_ahead_kb) can ruin your results on an IO-bound test. RocksDB's db_bench has the --advise_random_on_open option to use POSIX_FADV_RANDOM for non-compaction reads. It also has the --compaction_fadvice option to use POSIX_FADV_SEQUENTIAL on compaction reads & writes.
  10. When compression is enabled all files are compressed. I am not sure memtable flushes should be compressed because that can use a lot of CPU, especially with zlib, without providing a significant reduction in database size. Beyond the CPU overhead, compressing a memtable flush increases the chance that a Put operation will stall waiting for a memtable flush to finish (flush of the old memtable is done by a background thread when the current memtable is being written). There is an option in RocksDB to limit compression to the lower/larger levels of the LSM with leveled compaction. There is a similar option that works with universal compaction (called size-tiered by Cassandra). For a good overview of leveled and size-tiered compaction see the DataStax posts. The db_bench options are --universal_compression_size_percent and --min_level_to_compress. 
  11. Sequential insert tests can be very fast with LevelDB and with RocksDB when using leveled compaction because there is an optimization that avoids the need for compaction. Thus write-amplification is one (or two if the redo log is enabled). The optimization lets memtable flush output get pushed to a lower level of the LSM rather than first being written to the L0 and then slowly compacted from Ln to Ln+1. Querying a database after a sequential load can also benefit from this optimization. RocksDB (and I assume LevelDB) avoids checking a file for data when the desired key is less than the min key or greater than the max key stored in the file. This pruning is done before the bloom filter check. Thus it is likely that at most one file will be checked on a point query after a sequential load. Without the sequential load optimization one file per level would be checked.
  12. A random write test that puts 1M keys into an empty database is unlikely to result in a database with 1M distinct keys. If you do a random query test after this then some of the reads will be very fast assuming the bloom filter works because the desired data doesn't exist. 
  13. A short random write test done to a database after a sequential load might not get the database into a steady state. By steady state I mean that compaction should be in progress as it would be for a sustained write-heavy workload. For write-heavy tests I usually do a sequential load of X million rows, then do X million random writes, then consider the database to be ready for real tests.
  14. There are different types of writes. A write-heavy workload with db_bench usually means blind writes but we added the --benchmarks=updaterandom option to db_bench to match the read-modify-write pattern that is common with a SQL DBMS, except db_bench ignores the overhead from making the read-modify-write sequence atomic.
  15. Read QPS is also inflated in db_bench when the memtable is empty because little time is spent checking an empty memtable. So the pattern -- run db_bench to populate the database, re-run db_bench using a read-only test -- will get QPS much higher than is likely to occur in the real world. And if the load was done using leveled compaction (the style used by LevelDB) then the read QPS will be inflated again (see point 11 above).
  16. When RocksDB universal compaction is used (called size-tiered by Cassandra) and a sequential insert is done, then read QPS after the load can also be inflated because of key range pruning. The files after the load will be range partitioned by the index key, so at most one file must be read, and possibly at most one bloom filter check must be done, per query.
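
To illustrate points 4 and 8, here is a minimal sketch (not the db_bench source) of a read loop that enables checksum verification and sums the bytes of the fetched value so the client really reads it. The database path and key format are hypothetical.

    #include <cassert>
    #include <cstdio>
    #include <string>

    #include "leveldb/db.h"

    int main() {
      leveldb::DB* db = nullptr;
      leveldb::Options options;
      options.create_if_missing = true;
      leveldb::Status s = leveldb::DB::Open(options, "/tmp/dbbench_example", &db);  // hypothetical path
      assert(s.ok());

      leveldb::ReadOptions ropts;
      ropts.verify_checksum = true;  // point 4: the default constructor sets this to false

      unsigned long long sum = 0;
      for (int i = 0; i < 1000; i++) {
        char key[32];
        std::snprintf(key, sizeof(key), "%016d", i);  // hypothetical key format
        std::string value;
        s = db->Get(ropts, key, &value);
        if (!s.ok()) continue;
        for (unsigned char c : value) sum += c;  // point 8: touch every byte of the value
      }
      std::printf("sum of fetched bytes: %llu\n", sum);
      delete db;
      return 0;
    }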

Monday, June 23, 2014

Benchmark(et)ing

Benchmarking and benchmarketing both have a purpose. Both also have a bad reputation. A frequently expressed opinion is that benchmark results are useless. I usually disagree. I don't mind benchmarketing and think it is a required part of product development but I am not fond of benchmarketing disguised as benchmarking.

Benchmarketing is a common activity for many DBMS products whether they are closed or open source. Most products need new users to maintain viability and marketing is part of the process. The goal for benchmarketing is to show that A is better than B. Either by accident or on purpose, good benchmarketing results focus on the message that A is better than B rather than A is better than B in this context. Note that the context can be critical and includes the hardware, the workload, whether both systems were properly configured and some attempt to explain why one system was faster.

I spend a lot of time running benchmarks. They are useful when the context for the result is explained and the reader has sufficient expertise. Many benchmark results don't explain the context and not everyone has the time and ability to understand the results. Thus many benchmark results are misunderstood and perhaps benchmarks deserve a bad reputation. Another problem is that benchmarks usually focus on peak performance while peak efficiency is more important in the real world. Our services have a finite demand and we want to improve quality of service and reduce cost while meeting that demand. But that is hard to measure in a benchmark result.

One obvious result that a benchmark can provide is the peak rate of performance that won't be exceeded in the real world. This is useful when doing high level capacity planning or when debugging performance problems in production. Narrow benchmark tests, like read-only with point-lookups, help to provide simple performance models to which more complex workloads can be mapped.

Another observation is that performance comparisons are less likely to be useful as the number of systems compared increases. It takes a lot of time and expertise to explain benchmark results and to confirm that best practices were used for all of the systems compared. In my experience results that compare more than two systems tend to be benchmarketing rather than benchmarking.

For a few years my tests were limited to MySQL with InnoDB, but recently I have run tests to compare different products including MySQL/InnoDB, WiredTiger, RocksDB, LevelDB, TokuDB, TokuMX and MongoDB. I am comfortable publishing results for new releases of MySQL & InnoDB and they can be compared to results I previously published.  I am less willing to publish results that compare products from multiple vendors especially when small vendors are involved. I don't want to be a jerk to a small vendor and these take more time to evaluate.

I have some advice for people who run benchmarks even though I don't always follow all of it. Some of the advice takes a lot of time to follow.
  • Explain the results. This is the most important suggestion and I try to ignore results that are not explained. Why was one system faster than another (better algorithm, less code bloat, simple perf bug, etc)? Simple monitoring tools like vmstat and iostat are a start but you will eventually need to look at code. Run PMP to understand where threads are busy or waiting. Run Linux perf to see what consumes CPU time.
  • Explain whether you made any attempt to properly configure the systems tested. The quality of a benchmark result is inversely related to the number of systems tested because it is less likely that the people doing the test have expertise in all of the systems. Publish the configuration files.
  • Explain whether you have expertise in using the benchmark client.
  • Explain the context for the test. The result can be described as A is faster than B in this context so you need to explain that context. 
    • What was the workload? 
    • What hardware was used (CPU description, #sockets, cores per socket, clock rate, amount of RAM, type of storage)? What rate can the storage sustain independent of the DBMS for both IOPs and MB/second?
    • What product versions were used? Comparing your beta versus their GA release that runs in production might be bogus. It can also be bogus to compare production systems with research systems that have no overhead from monitoring, error logging, optimization, parsing and other features required in the real world.
    • How was the test deployed? Did clients share the same server as the DBMS? If not, what was the network distance & throughput between them?
    • Were the file structures and storage devices aged? A b-tree fragments from random updates over time. Write-optimized databases and flash accumulate garbage that must be collected. Doing a sequential load into a b-tree or LSM and then immediately running tests means the file structure isn't in a steady state.
    • How full was the storage device? Flash and spinning disk perform very differently when full than when empty. 
    • Was buffered or direct IO used? If buffered IO was used are you sure that...
      • a good value was used for filesystem readahead?
      • the right posix_fadvise calls were done to enable or disable readahead?

Friday, May 9, 2014

Write amplification from log writes

MongoDB, TokuMX and MySQL use log files with high-value data. For MongoDB this is the journal, which uses direct IO. For MySQL this includes the binlog, relay log and InnoDB redo log. All of these use buffered IO by default, the InnoDB redo log uses 512 bytes as the "page size" and the replication logs have no notion of page size.

The minimum size for a write to a storage device is either the sector size or filesystem page size. The sector size is either 512 or 4096 bytes today. In the future it will be 4096 or larger. The filesystem page size on my servers is 4096 bytes. When a DBMS tries to append 309 bytes to the end of a log file then more than 309 bytes are written to the storage device. Depending on the filesystem and choice of buffered or direct IO either a disk sector or a filesystem page will be written. This can explain why the bytes written rate as reported by iostat using the wsec/s column is higher than the rate reported by the DBMS.

MongoDB avoids the uncertainty by padding journal writes to a multiple of 8 KB. Perhaps the padding will be reduced to a multiple of 4 KB (see JIRA 13344). But the good thing is that the counters reported by MongoDB are correct. The OS and storage device will report the same value for bytes written to the journal file. Of course, this ignores any write-amplification from flash garbage collection.

MySQL and TokuMX do not avoid the uncertainty. I spent a few hours today looking at a busy host to explain the difference between the write rates reported by MySQL, iostat and flash storage physical bytes written counters. The sources of writes on the host include the following. Most of the bytes written were from the doublewrite buffer followed by dirty page write back.
  • InnoDB dirty page writeback - The host uses mostly 2X compression with an in-memory page size of 16 KB. So most disk writes are 8 KB but some are 16 KB.
  • InnoDB doublewrite buffer - Even though most pages are 8 KB, any page in the doublewrite buffer uses 16 KB. Domas has described this as triple-writing pages in his bug report.
  • InnoDB redo log - logically these writes are done as a multiple of 512 bytes with buffered IO. In the WebScaleSQL patch we have an option to round the write up to 4 KB to avoid reads from the filesystem when the to-be-written page is not in the OS filesystem cache. Alas, the InnoDB counter for bytes written to the log does not include the bytes added by the round-to-4KB.
  • MySQL binlog - the binlog makes no attempt to round up writes to the end of the log. The counter for bytes written to the binlog ignores the rounding up done by the filesystem when fsync is called.
We are working on fixing the InnoDB redo log bytes-written counter to include the padding added when rounding writes up to 4 KB. Until then you can use existing counters to estimate how much rounding was needed. These counters show the number of bytes written to the redo log and the number of fsyncs (or fdatasyncs) done. The size of the average write per fsync is less than 2 KB. With a 4 KB filesystem page the real rate is more than 2X the value reported by the Innodb_os_log_written counter.
    Innodb_os_log_fsyncs 370923966
    Innodb_os_log_written 656878594048

Counters for the binlog writes and fsyncs can also be used to understand whether the write rate to it is wrong. The size of the average write per fsync is again less than 2 KB and with a 4 KB filesystem page the real rate is more than 2X the value reported by the Binlog_bytes_written counter.
    Binlog_bytes_written 699055440580
    Binlog_fsync_count 385104140

The final set of counters were used to estimate the average size of a page writeback. The ratio of the values below is about 10 KB. Most pages written were 8 KB but some were 16 KB.
    Innodb_data_async_write_requests 286202895
    Innodb_data_async_write_bytes 2764395651584
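
A minimal sketch, assuming a 4 KB filesystem page, of the estimate described above using the counter values from this host. It prints the average bytes per fsync and what each fsync costs after rounding up to the page size.

    #include <cstdio>

    // Round a write up to a multiple of the filesystem page size.
    static double round_up(double bytes, double page) {
      return ((unsigned long long)((bytes + page - 1) / page)) * page;
    }

    int main() {
      const double page = 4096.0;  // assumed filesystem page size on this host

      // Counter values copied from above.
      const double redo_bytes = 656878594048.0, redo_fsyncs = 370923966.0;
      const double binlog_bytes = 699055440580.0, binlog_fsyncs = 385104140.0;
      const double dirty_bytes = 2764395651584.0, dirty_writes = 286202895.0;

      // The average write per fsync is under 2 KB, so with 4 KB rounding the real
      // write rate is more than 2X what the bytes-written counters report.
      std::printf("redo log:  %.0f bytes/fsync, ~%.0f after rounding\n",
                  redo_bytes / redo_fsyncs, round_up(redo_bytes / redo_fsyncs, page));
      std::printf("binlog:    %.0f bytes/fsync, ~%.0f after rounding\n",
                  binlog_bytes / binlog_fsyncs, round_up(binlog_bytes / binlog_fsyncs, page));
      std::printf("writeback: %.0f bytes/page write (~10 KB)\n", dirty_bytes / dirty_writes);
      return 0;
    }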

I am not sure whether all of the counters above are available in upstream MySQL. They are in WebScaleSQL. We have been adding extra monitoring for many years to support running MySQL at web scale.

Monday, May 5, 2014

Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

I used the MongoDB sysbench client to measure the overhead from skipping N attributes in a document to access the N+1th attribute, for N set to 1, 10, 100 and 1000. As expected the overhead grows with N; for small N the overhead is less for TokuMX and for large N the overhead is less for MongoDB.

My test used 1 collection with 1M documents and 1 client thread. The sysbench client was changed to add extra attributes immediately after the _id attribute. The extra attributes used names of the form x${K} for K in 1 to 1, 1 to 10, 1 to 100 and 1 to 1000. Each of these attributes was assigned the value of K (x1=1, x2=2, ...). Immediately after these another attribute was added with the name y and assigned the same value as the _id attribute. Each test query matches one document and returns the values of the _id and y attributes after matching the _id and y attributes. The purpose of this test is to determine the overhead from skipping the extra attributes to find the y attribute. For N=1 the results and test were kind of similar to what I did for a recent cached & read-only test.
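
To illustrate why the cost grows with the number of x${K} attributes, here is a minimal sketch (my own illustration, not MongoDB or TokuMX code) of a flat document layout where finding the y attribute requires examining and skipping every attribute before it, similar in spirit to walking BSON elements.

    #include <cstdio>
    #include <string>
    #include <vector>

    struct Attr { std::string name; long value; };

    // Linear scan: each attribute before the match must be examined and skipped.
    static long find_attr(const std::vector<Attr>& doc, const std::string& name) {
      for (const Attr& a : doc)
        if (a.name == name) return a.value;
      return -1;
    }

    int main() {
      std::vector<Attr> doc;
      doc.push_back({"_id", 42});
      for (int k = 1; k <= 1000; k++)      // x1 .. x1000, as in the test
        doc.push_back({"x" + std::to_string(k), (long)k});
      doc.push_back({"y", 42});            // y gets the same value as _id

      // The query projects _id and y, so 1001 attributes are skipped to reach y.
      std::printf("y=%ld\n", find_attr(doc, "y"));
      return 0;
    }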

This is the data for the chart above. Note that TokuMX does much better for small N but MongoDB does much better for large N. I repeated the tests with the y attribute before the x${K} attributes and the results are about the same as below, so my assumption about the overhead of attribute searches was wrong and the real overhead is from BSON parsing. Note that I ran extra tests for 200, 400, 500 and 750 attributes to understand the rate at which QPS decreases.

queries per second by #attributes
    1     10    100   200   400   500   750  1000   number of x${K} attributes
14326  13643  10317  8638  6543  5832  4515  3306   tokumx141
 9866   9398   8473  8305  7242  6903  6164  4954   mongo249
 9119   9530   8547  8613  7890  7577  6602  5289   mongo260

Overhead for 1000 attributes

I looked at top when the tests were running and in all cases mongod was using ~90% of 1 CPU core and Java was using ~10% of another. So MongoDB was using CPU at the same rate as TokuMX while serving many more QPS. I then used the Linux perf utility to understand what was using CPU in mongod. The CPU overhead for TokuMX is from calls to free & malloc. If I had an easy way to get hierarchical profiles I would include them; alas, I don't. I do have the -g option with perf but the output isn't very interesting. It would be nice if TokuMX figured out how to get more debug symbols into their binary without expecting users to download/install a separate debug package.
This is the output from "perf -g" for TokuMX 1.4.1


    14.56%   mongod  mongod                 [.] free                             
             |
             --- free
                |          
                |--91.43%-- 0x7fe27c1c09d8
                |          (nil)
                |          
                |--7.66%-- (nil)
                 --0.91%-- [...]

    12.71%   mongod  mongod                 [.] malloc                                                                               
             |
             --- malloc
                |          
                |--76.17%-- mongo::ElementMatcher::~ElementMatcher()
                |          (nil)
                |          
                |--16.90%-- operator new(unsigned long)
                |          0x100000000
                |          0x3031317473657462
                |          
                |--4.22%-- (nil)
                |          |          
                |           --100.00%-- 0xffb
                |          
                |--1.37%-- 0xffffffff00000000
                |          0x3031317473657462
                |          
                 --1.34%-- mongo::BSONObj::init


And this is the non-hierarchical output from Linux perf during a 5 second interval of the test.

TokuMX 1.4.1
    15.83%   mongod  mongod                 [.] free
    11.64%   mongod  mongod                 [.] malloc
    11.23%   mongod  mongod                 [.] mongo::Projection::transform
     6.42%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_disposeERKSaIcE.part.13
     6.40%   mongod  mongod                 [.] std::_Rb_tree<>::find
     4.81%   mongod  libc-2.5.so            [.] strlen
     4.01%   mongod  libstdc++.so.6.0.17    [.] std::basic_string::basic_string
     3.49%   mongod  libstdc++.so.6.0.17    [.] char* std::string::_S_construct
     2.43%   mongod  libc-2.5.so            [.] memcmp
     2.38%   mongod  libc-2.5.so            [.] memcpy
     2.00%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     1.93%   mongod  mongod                 [.] mongo::Projection::append
     1.27%   mongod  libstdc++.so.6.0.17    [.] std::string::_Rep::_S_create

MongoDB 2.4.9
    13.78%   mongod  mongod               [.] mongo::Projection::transform
    11.75%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     7.95%   mongod  libc-2.13.so         [.] __strlen_sse42
     7.28%   mongod  mongod               [.] mongo::Projection::append
     4.21%   mongod  mongod               [.] (anonymous namespace)::cpp_alloc
     3.86%   mongod  mongod               [.] mongo::BSONElement::size()
     2.07%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback
     1.18%   mongod  mongod               [.] mongo::KeyV1::toBson()
     0.92%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     0.86%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     0.71%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.70%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     0.69%   mongod  mongod               [.] tc_malloc
     0.68%   mongod  libc-2.13.so         [.] __memcpy_ssse3

MongoDB 2.6.0
    14.28%   mongod  mongod               [.] mongo::ProjectionStage::transform
    10.03%   mongod  mongod               [.] MurmurHash3_x64_128
     4.45%   mongod  mongod               [.] mongo::BSONElement::size()
     3.93%   mongod  libc-2.13.so         [.] __strlen_sse42
     2.83%   mongod  mongod               [.] _ZNKSt3tr110_HashtableIN5mongo10StringDataES2_SaIS2_ESt9_IdentityIS2_ESt8equal_toIS2_ENS2_6HasherENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE12_M_find_nodeEPNS9_10_Hash_nodeIS2_Lb0EEERKS2_m.isra.162.constprop.255
     2.65%   mongod  mongod               [.] operator new(unsigned long)
     1.98%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.74%   mongod  mongod               [.] mongo::StringData::Hasher::operator()
     1.62%   mongod  mongod               [.] operator delete(void*)
     1.46%   mongod  mongod               [.] mongo::KeyV1::toBson()
     1.20%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.06%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     0.83%   mongod  mongod               [.] tc_malloc
     0.82%   mongod  mongod               [.] mongo::ps::Rolling::access
     0.78%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock

Overhead for 10 attributes

This lists the CPU profile from Linux perf during 5 seconds of the test. Unlike the 1000 attribute result above, here malloc/free or their equivalent are the top two sources for TokuMX and MongoDB.


TokuMX 1.4.1
     4.53%   mongod  mongod                 [.] malloc
     4.12%   mongod  mongod                 [.] free
     2.89%   mongod  libtokufractaltree.so  [.] _ZNK4toku3omtIP13klpair_structS2_Lb0EE24find_internal_plus_arrayIR9ft_searchXa
dL_Z15wrappy_fun_findIS6_XadL_ZL23heaviside_from_search_tRK10__toku_db
     1.80%   mongod  libc-2.5.so            [.] strlen
     1.53%   mongod  mongod                 [.] mongo::storage::KeyV1::woCompare
     1.14%   mongod  libpthread-2.5.so      [.] pthread_mutex_lock
     1.02%   mongod  libc-2.5.so            [.] memcpy
     0.99%   mongod  mongod                 [.] mongo::Projection::transform
     0.96%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     0.86%   mongod  mongod                 [.] mongo::Projection::init
     0.85%   mongod  libtokufractaltree.so  [.] _Z26toku_ft_search_which_childP17__toku_descriptorPFiP9__toku_dbPK10__toku_dbt
S5_EP6ftnodeP9ft_search
     0.85%   mongod  libc-2.5.so            [.] memcmp

MongoDB 2.4.9
     7.47%   mongod  mongod               [.] cpp_alloc
     3.94%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback    
     2.24%   mongod  mongod               [.] mongo::KeyV1::toBson()
     2.21%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.58%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     1.51%   mongod  mongod               [.] mongo::BSONElement::size()
     1.48%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.40%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.39%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_disposeERKSaIcE
     1.22%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     1.22%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.19%   mongod  mongod               [.] tc_malloc
     0.99%   mongod  mongod               [.] operator new(unsigned long)
     0.95%   mongod  mongod               [.] boost::intrusive_ptr::~intrusive_ptr()
     0.94%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.89%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.77%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.69%   mongod  mongod               [.] mongo::BtreeBucket::customBSONCmp
     0.68%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     0.64%   mongod  [ip_tables]          [k] ipt_do_table
     0.64%   mongod  mongod               [.] CoveredIndexMatcher::matches

MongoDB 2.6.0
     4.26%   mongod  mongod               [.] operator new(unsigned long)
     2.86%   mongod  mongod               [.] operator delete(void*)
     2.70%   mongod  mongod               [.] mongo::KeyV1::toBson() const
     1.62%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.57%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.56%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.14%   mongod  libc-2.13.so         [.] vfprintf
     1.13%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     1.13%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.06%   mongod  mongod               [.] mongo::BSONElement::size()
     0.92%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  mongod               [.] tc_malloc
     0.84%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()  

Friday, May 2, 2014

The impact of read ahead and read size on MongoDB, TokuMX and MySQL

This continues my work on using a very simple workload (read-only, fetch 1 document by PK, database larger than RAM, fast storage) to understand how to get more QPS from TokuMX and MongoDB. I need to get more random read IOPs from storage to get more QPS from the DBMS. Beyond getting more throughput I ran tests to understand the impact of filesystem readahead on MongoDB and TokuMX and the impact of the read page size on TokuMX and InnoDB. My results are based on fast flash storage that can do more than 100,000 4kb reads/second. Be careful when trying to use these results for slower storage devices, especially disks.

In my previous post I wrote that I was able to get much more QPS from InnoDB but was likely to report improvements for TokuMX/MongoDB in the next post. This is the next post and some of the results I report here are better for a few reasons. First, some tests limited queries to the first half of each collection/table so the cache hit rate is much better especially for non-leaf index nodes. TokuMX and MongoDB are more sensitive to that and might require more cache/RAM for indexes than InnoDB. Second, I use better values for read_ahead_kb and readPageSize. My summary for this post is that:
  • Using a smaller value for readPageSize with TokuMX can help
  • Using the right value for read_ahead_kb can help. But I think that depending on filesystem readahead is a kludge that ends up wasting IO or hurting QPS
  • TokuMX and MongoDB require more cache/RAM for indexes than InnoDB even when TokuMX uses compression and InnoDB does not.

Goals

I describe the impact of read_ahead_kb for TokuMX and MongoDB, the impact of readPageSize for TokuMX and the impact of innodb_page_size for MySQL. My test server uses fast flash storage and what I report here might not be relevant for disk storage. From past experience there isn't much difference in peak IOPs from disks using 8kb versus 32kb reads as the device is bound by seek and rotational latency rather than data transfer. But flash devices usually do many more random reads per second at 8kb compared to 32kb.

I have a lot of experience with systems that use direct IO, less for systems that do reads & writes via buffered IO and much less for buffered IO + mmap. The epic rant about O_DIRECT is amusing because there wasn't a good alternative at the time for any DBMS that cared about IO performance. I suspect the DBMS community hasn't done enough to make things better (reach out to Linux developers, pay for improvements, volunteer time to run performance tests, donate expensive storage HW to people who do the systems work). But I think things are getting better and better.

mmap reads

MongoDB uses buffered IO and mmap. To get peak QPS for an IO-bound workload it must read the right amount of data per request -- not too much and not too little. With pread you tell the OS the right amount of data to read by specifying the read size. With mmap it is harder to communicate this. When a page fault occurs the OS knows that it must read the 4k page that contains the faulting address from storage. The OS doesn't know whether it should read adjacent 4k pages by using a larger read request to avoid a seek per 4kb of data. Perhaps the workaround is to suggest the read size by calling madvise with the MADV_WILLNEED flag prior to referencing memory and hope the suggestion isn't ignored.
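
Here is a minimal sketch of that madvise hint. The file name, page offset and 8kb read size are hypothetical, and the kernel is free to ignore the advice.

    #include <cassert>
    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
      int fd = open("/data/example.0", O_RDONLY);  // hypothetical database file
      assert(fd >= 0);
      struct stat st;
      assert(fstat(fd, &st) == 0);

      void* mem = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      assert(mem != MAP_FAILED);
      char* base = static_cast<char*>(mem);

      // Before touching an 8kb index page, suggest that the kernel read the whole
      // page in one request instead of faulting in two 4kb pieces.
      const off_t page_offset = 0;    // hypothetical page location, must be 4k aligned
      const size_t page_size = 8192;
      madvise(base + page_offset, page_size, MADV_WILLNEED);

      char first_byte = base[page_offset];  // the reference that would trigger the fault
      (void)first_byte;

      munmap(base, st.st_size);
      close(fd);
      return 0;
    }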

Reading 4kb at a time isn't good for MongoDB index pages which are 8kb, so the OS might do two disk reads and two disk seeks in the worst case. A 4kb read is also too small for documents that are larger than 4kb and the worst case again is extra disk reads. A partial workaround is to set read_ahead_kb to suggest to Linux that more data should be read. But one value for read_ahead_kb won't cover all use cases -- 8kb index pages, large documents and small documents. Note that this is also a hint. If you want high performance and high efficiency for IO-bound OLTP and you rely on read_ahead_kb then you aren't going to have a good time.

There isn't much information on the impact of read_ahead_kb on MongoDB and TokuMX. A simple way to understand the impact from it is to look at the disk read rate and disk sector read rate as measured by iostat. The number of disk sectors per disk read is also interesting. I report that data below from my tests. There is an open question about this on the MongoDB users email list and a reference to a blog post.
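
A minimal sketch of that arithmetic: given r/s and rsec/s from iostat (sectors are 512 bytes), the bytes per disk read is rsec/s * 512 / r/s. The sample values below are hypothetical.

    #include <cstdio>

    int main() {
      // Hypothetical values taken from an 'iostat -x' sample during a test.
      const double reads_per_sec = 25000.0;     // r/s
      const double sectors_per_sec = 450000.0;  // rsec/s, 512-byte sectors

      const double sectors_per_read = sectors_per_sec / reads_per_sec;
      const double bytes_per_read = sectors_per_read * 512.0;
      std::printf("%.1f sectors/read, %.0f bytes/read\n", sectors_per_read, bytes_per_read);
      return 0;
    }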

readPageSize

InnoDB uses 16kb pages by default and the page size can be changed via the innodb_page_size parameter when installing MySQL. A fast flash device can do more random IOPs for 4kb requests than for 16kb requests because data transfer can limit throughput. TokuMX has a similar option that can be set per table via readPageSize. The TokuMX default is 64kb which might be good for disk-based servers but I suspect is too large for flash. Even on disk-based servers there are benefits from using a smaller page as less space will be wasted in RAM when only a few documents/rows per page are useful. For TokuMX the readPageSize is the size of the page prior to compression.

Setup for 1.6B documents

I used the same test as described previously except I limited queries to the first half of each table/collection and there were 1.6B documents/rows that could be fetched in the 8 collections/tables. I tested the configurations listed below. This lists the total database size but only half of the database was accessed.
  • fb56.handler - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER
  • fb56.sql - 740G database, MySQL 5.6.12 with the Facebook patch, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • orig57.handler - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via HANDLER. 
  • orig57.sql - 740G database, official MySQL 5.7.4, InnoDB, page_size=4k, 8k, 16k, data fetched via SELECT
  • tokumx8 - 456G database, TokuMX 1.4.1, quicklz, readPageSize=8k
  • tokumx64 - 582G database, TokuMX 1.4.1, quicklz, readPageSize=64K
  • mongo249 - 834G database, MongoDB 2.4.9, powerOf2Sizes=0
  • mongo260 - 874G database, MongoDB 2.6.0, powerOf2Sizes=1

Results for 1.6B documents

The graph below shows the best QPS for each of the servers tested: 8kb page_size for MySQL, 8kb readahead for tokumx8 and tokumx64, 4kb readahead for MongoDB. I exclude the results for MySQL 5.7 because they are similar to 5.6. As in my previous blog post MySQL with InnoDB does better than the others but the difference is less significant for two reasons. First I repeated tests for many different values of read_ahead_kb and display the best one below. Second the test database is smaller. I assume this helps MongoDB and TokuMX by improving the cache hit rate for index data -- MongoDB because the index isn't clustered so we need all of the index in RAM to avoid doing an extra disk read per query, TokuMX because the non-leaf levels of the index are larger than for a b-tree. MongoDB 2.6.0 does worse than 2.4.9 because of a CPU performance regression for simple PK queries (JIRA 13663 and 13685). TokuMX with an 8kb readPageSize is much better than a 64kb readPageSize because there is less data to decompress per disk read, there is a better cache hit rate and fast storage does more IOPs for smaller read requests. Note that the database was also much smaller on disk with an 8kb readPageSize. It would be good to explain that.

This has QPS results for all of the configurations tested. The peak QPS for each of the configurations below was used in the graph above except for MySQL 5.7.

queries per second
    8    16     32     40  concurrent clients
44078 71929 108999 110640  fb56.handler, 16k page
36739 63622 100654 107970  fb56.sql, 16k page
44743 74183 117871 131160  fb56.handler, 8k page
37064 64602 102672 119555  fb56.sql, 8k page
43440 71328 113778 128366  fb56.handler, 4k page
35629 61437  97452 113916  fb56.sql, 4k page
-
44120 73919 119059 130071  orig57.handler, 8k page
36589 64339 102399 119260  orig57.sql, 8k page
42772 70368 113781 128088  orig57.handler, 4k page
35287 60928  96736 112950  orig57.sql, 4k page
-
24502 39332  62389  68967  tokumx8, 0k readahead
27762 45074  73102  79999  tokumx8, 4k readahead
29256 49093  81222  91508  tokumx8, 8k readahead
27835 45406  76164  82695  tokumx8, 16k readahead
-
 6347  9287  13512  14396  tokumx64, 0k readahead
 7221 12835  20233  21477  tokumx64, 4k readahead
 9263 16088  26595  27943  tokumx64, 8k readahead
10272 18602  22645  22015  tokumx64, 16k readahead
11191 20349  24090  24086  tokumx64, 32k readahead
10384 16492  17093  16600  tokumx64, 64k readahead
-
38154 62257  96033 103770  mongo249, 0k readahead
38274 62321  96131 106017  mongo249, 4k readahead
33088 51609  72699  76311  mongo249, 8k readahead
16533 22871  24019  25076  mongo249, 16k readahead
17572 23332  24319  24324  mongo249, 32k readahead
-
29179 49731  77114  84779  mongo260, 0k readahead
28979 49521  76569  86985  mongo260, 4k readahead
26321 42967  65662  71112  mongo260, 8k readahead
15338 23131  24448  25131  mongo260, 16k readahead
16277 23428  24443  24566  mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Disk reads/query is much higher for TokuMX with a 64k readPageSize than for all other servers. The rate changes for MongoDB 2.4.9 and 2.6.0 between 8k and 16k readahead.

reads/query  server
0.649        fb56.handler, 16kb page
0.672        fb56.handler, 8kb page
0.748        fb56.handler, 4kb page
-
1.363        tokumx8, 0k readahead
1.275        tokumx8, 4k readahead
0.927        tokumx8, 8k readahead
0.987        tokumx8, 16k readahead
-
7.161        tokumx64, 0k readahead
6.547        tokumx64, 4k readahead
3.833        tokumx64, 8k readahead
2.659        tokumx64, 16k readahead
2.432        tokumx64, 32k readahead
2.909        tokumx64, 64k readahead
-
0.806        mongo249, 0k readahead
0.807        mongo249, 4k readahead
0.824        mongo249, 8k readahead
1.117        mongo249, 16k readahead
1.147        mongo249, 32k readahead
-
0.830        mongo260, 0k readahead
0.832        mongo260, 4k readahead
0.847        mongo260, 8k readahead
1.115        mongo260, 16k readahead
1.145        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
-
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9011        tokumx8, 8k readahead
10752        tokumx8, 16k readahead
-
 4096        tokumx64, 0k readahead
 4557        tokumx64, 4k readahead
 9011        tokumx64, 8k readahead
23347        tokumx64, 16k readahead
23398        tokumx64, 32k readahead
27802        tokumx64, 64k readahead
-
 4096        mongo249, 0k readahead
 4198        mongo249, 4k readahead
 7322        mongo249, 8k readahead
19098        mongo249, 16k readahead
18534        mongo249, 32k readahead
-
 4096        mongo260, 0k readahead
 4250        mongo260, 4k readahead
 7424        mongo260, 8k readahead
19712        mongo260, 16k readahead
19149        mongo260, 32k readahead

Setup for 3.2B documents

The same test was repeated except the clients were able to query all 3.2B documents/rows in the test collections/tables. I exclude results for TokuMX with 64k readPageSize.

Results for 3.2B documents

The graph has the best configuration for each server: 8kb page_size for MySQL, 16kb readahead for tokumx8, 4kb readahead for MongoDB. TokuMX matches MongoDB 2.4.9 here while it did worse in the 1.6B document test.

This has QPS results for all of the configurations tested. It should be possible for MySQL with InnoDB to get more QPS with 4kb pages. I don't know why that didn't happen and suspect that mutex contention was a problem. TokuMX with 8k readPageSize matched MongoDB 2.4.9 here, otherwise the results are similar to the 1.6B document/row test.

queries per second
    8    16     32     40  concurrent clients
38767 60366  86895  87517  fb56.handler, 16k page
33062 54847  84480  85764  fb56.sql, 16k page
39940 63628 102261 107312  fb56.handler, 8k page
33599 56819  91128 102378  fb56.sql, 8k page
39165 62409 101283 111496  fb56.handler, 4k page
32593 54810  87967 100644  fb56.sql, 4k page
-
38829 60419  86223  86655  orig57.handler, 16k page
33298 55034  84282  84734  orig57.sql, 16k page
39898 63673 102641 106424  orig57.handler, 8k page
33801 57097  91397 101718  orig57.sql, 8k page
39015 62159 101203 107398  orig57.handler, 4k page
32433 54067  84663  90872  orig57.sql, 4k page
-
21062 32292  49979  54177  tokumx8, 0k readahead
24379 39353  58571  61245  tokumx8, 4k readahead
27871 45992  73472  80748  tokumx8, 8k readahead
27396 45214  74040  81592  tokumx8, 16k readahead
-
29529 45976  69002  72772  mongo249, 0k readahead
29482 45868  71590  73942  mongo249, 4k readahead
23608 35676  48662  51503  mongo249, 8k readahead
18606 27554  31865  32637  mongo249, 16k readahead
12485 16190  16668  16662  mongo249, 32k readahead
-
24296 40154  61795  66992  mongo260, 0k readahead
24245 39781  61343  68450  mongo260, 4k readahead
20559 32111  46572  49358  mongo260, 8k readahead
17115 26220  32825  33184  mongo260, 16k readahead
12309 17050  17542  17595  mongo260, 32k readahead

The next section displays the number of disk reads per query for some of the configurations to understand whether the server is efficient for IO. The result is from the 40 concurrent client test. Note that TokuMX QPS gets much better as the disk reads per query decrease with a larger readahead.

reads/query  server
0.823        fb56.handler, 16kb page
0.842        fb56.handler, 8kb page
0.897        fb56.handler, 4kb page
-
1.794        tokumx8, 0k readahead
1.723        tokumx8, 4k readahead
1.074        tokumx8, 8k readahead
0.999        tokumx8, 16k readahead
-
1.213        mongo249, 0k readahead
1.213        mongo249, 4k readahead
1.270        mongo249, 8k readahead
1.232        mongo249, 16k readahead
1.478        mongo249, 32k readahead
-
1.225        mongo260, 0k readahead
1.226        mongo260, 4k readahead
1.285        mongo260, 8k readahead
1.225        mongo260, 16k readahead
1.458        mongo260, 32k readahead

The final table has the number of bytes read per disk read. This was measured from the 40 concurrent client test. MySQL used direct IO for InnoDB so storage reads the requested data and no more. The larger values for MongoDB are expected when readahead is set too large but this also demonstrates the difficulty of trying to be efficient when setting read_ahead_kb.

bytes/read   server
16384        fb56.handler, 16kb page
 8192        fb56.handler, 8kb page
 4096        fb56.handler, 4kb page
-
 4096        tokumx8, 0k readahead
 5939        tokumx8, 4k readahead
 9574        tokumx8, 8k readahead
11469        tokumx8, 16k readahead
-
 4096        mongo249, 0k readahead
 4250        mongo249, 4k readahead
 7373        mongo249, 8k readahead
13107        mongo249, 16k readahead
21197        mongo249, 32k readahead
-
 4096        mongo260, 0k readahead
 4301        mongo260, 4k readahead
 7475        mongo260, 8k readahead
13210        mongo260, 16k readahead
21658        mongo260, 32k readahead
