Friday, April 18, 2014

Biebermarks

Yet another microbenchmark result. This one is based on behavior that has caused problems in the past for a variety of reasons, which led to a few interesting discoveries. The first was that using a short lock-wait timeout was better than the InnoDB deadlock detection code. The second was that stored procedures could overcome network latency.

The workload is a large database where all updates are done to a small number of rows. I think it is important to use a large database to include the overhead from searching multiple levels of a b-tree. The inspiration for this is maintaining counts for popular entities like Justin Bieber and One Direction. This comes from serving the social graph. For more on that read about TAO and LinkBench.

The most popular benchmark for MySQL is sysbench and it is usually run with a uniform distribution so that all rows are equally likely to be queried or modified. But real workloads have skew, which can cause stress in unexpected places, and this microbenchmark exposes one such place within InnoDB. YCSB and LinkBench are benchmarks that have skew and can be run for MySQL. I hope that more of the MySQL benchmark results in the future include skew.

Configuration

See a previous post for more details. Eight collections/tables with 400M documents/rows per collection/table were created. All collections/tables are in one database so MongoDB suffers from the per-database RW-lock. But MySQL and TokuMX also suffer from a similar issue when all clients are trying to update the same row. Tests were run for 1, 2, 4 and 8 tables where one row per table was updated. So when the test used 4 tables there were 4 rows getting updates. For each number of tables tests were run for up to 64 concurrent clients/threads. The result tables listed in the next section should make that clear.

The workload is updating the non-indexed column of one document/row by PK per transaction. There are no secondary indexes on the table. In this case the document/row with ID=1 is chosen for every update. For MySQL and TokuMX an auto-commit transaction is used. The journal (redo log) is used but the update does not wait for the journal/log to be forced to disk. The updates should not require disk reads as all relevant index and data blocks remain in cache. TokuMX might do reads in the background to maintain fractal trees but I don't understand their algorithm well enough to be certain.
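
A minimal sketch of the shape of this workload, not the actual test client. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a server. The table name t1, the column names and the choice of an increment as the update are all assumptions made for illustration; the post only says that a non-indexed column of the row with ID=1 is updated by PK in an auto-commit transaction.

```python
import sqlite3

def run_updates(num_updates, row_id=1):
    """Apply num_updates single-row updates by PK, one auto-commit transaction each."""
    # isolation_level=None puts sqlite3 in autocommit mode: every statement
    # is its own transaction, like the auto-commit updates in the test.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    # Hypothetical schema: a PK and non-indexed columns, no secondary indexes.
    conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, k INTEGER, c INTEGER)")
    conn.executemany(
        "INSERT INTO t1 (id, k, c) VALUES (?, 0, 0)",
        [(i,) for i in range(1, 101)],
    )
    for _ in range(num_updates):
        # Every update targets the same row, which is what creates the hot spot.
        conn.execute("UPDATE t1 SET c = c + 1 WHERE id = ?", (row_id,))
    (value,) = conn.execute("SELECT c FROM t1 WHERE id = ?", (row_id,)).fetchone()
    conn.close()
    return value
```

With many concurrent clients running this loop against one server, every client contends for the lock on the same row, which is the behavior the rest of the post measures.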

The database was loaded in PK order and about 8 hours of concurrent & random updates were done to warm up the database prior to this test. The warmup was the same workload as described in a previous post.

The MySQL test client limits clients to one table. So when there are 64 clients and 8 tables then there are 8 clients updating the 1 row per table. The MongoDB/TokuMX client does not do that. It lets all clients update all tables so in this case there are at most 64 clients updating the row per table and on average there would be 8.
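
The difference between the two clients can be sketched as follows. This is an illustration only: the round-robin pinning and the uniform random table choice are assumptions, since the post does not say how either client picks a table, and the function names are hypothetical.

```python
import random
from collections import Counter

def pinned_assignment(num_clients, num_tables):
    """MySQL-style client: each client is limited to one table (assumed round-robin)."""
    return Counter(i % num_tables for i in range(num_clients))

def free_assignment(num_clients, num_tables, rng):
    """MongoDB/TokuMX-style client: each client picks any table for its next update
    (assumed uniform random). Returns one snapshot of clients per table."""
    return Counter(rng.randrange(num_tables) for _ in range(num_clients))

pinned = pinned_assignment(64, 8)
free = free_assignment(64, 8, random.Random(1))
average_per_table = sum(free.values()) / 8
```

With 64 clients and 8 tables the pinned client always has exactly 8 clients per hot row. The free client has 8 on average, but any single snapshot can be unbalanced, up to all 64 clients on one row.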

The test server has 40 CPU cores with HT enabled, fast flash storage and 144G of RAM. The benchmark client and database servers shared the host. Tests were run for several configurations:
  • mongo26 - MongoDB 2.6.0rc2, powerOf2Sizes=1, journalCommitInterval=300, w:1,j:0
  • mongo24 - MongoDB 2.4.9, powerOf2Sizes=0, journalCommitInterval=300, w:1,j:0
  • mysql - MySQL 5.6.12, InnoDB, no compression, flush_log_at_trx_commit=2, buffer_pool_size=120G, flush_method=O_DIRECT, page_size=8k, doublewrite=0, io_capacity=16000, lru_scan_depth=2000, buffer_pool_instances=8, write_io_threads=32, flush_neighbors=0
  • toku-32 - TokuMX 1.4.1, readPageSize=32k, quicklz compression, logFlushPeriod=300, w:1,j:0. I don't have results for toku-32 yet.
  • toku-64 - TokuMX 1.4.1, readPageSize=64k, quicklz compression, logFlushPeriod=300, w:1,j:0

Results per DBMS

I first list the results by DBMS to show the impact from spreading the workload over more rows/tables. The numbers below are the updates per second rate. I use "DOP=X" to indicate the number of concurrent clients and "DOP" stands for Degree Of Parallelism (it is an Oracle thing). A few conclusions from the results below:

  • MySQL/InnoDB does much better with more tables for two reasons. The first is that it allows for more concurrency. The second is that it avoids some of the overhead in the code that maintains row locks and threads waiting for row locks. I describe that in more detail at the end of this post.
  • MongoDB 2.4.9 is slightly faster than 2.6.0rc2. I think the problem is that mongod requires more CPU per update in 2.6 versus 2.4 and this looks like a performance regression in 2.6 (at least in 2.6.0rc2). I am still profiling to figure out where. More details on this are at the end of the post. I filed JIRA 13663 for this.
  • MongoDB doesn't benefit from spreading the load over more collections when all collections are in the same database. This is expected given the per-database RW-lock.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mysql         2      X  16527  30824  49999   41045   40506   38357
mysql         4      X      X  32351  51791   67423   62116   59137
mysql         8      X      X      X  54826   80409   73782   68128

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo24       8      X      X      X  10379   17702   30920   33758

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mongo26       1   9187  16131  27648  28506   27784   27437   27021
mongo26       2      X   9367  16035  27490   28326   27746   27354
mongo26       4      X      X   9179  16028   27666   28330   27647
mongo26       8      X      X      X   9125   16038   27275   27858

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
toku-64       1   7327  12804  16179  12154   11021    9990    8344
toku-64       2      X   7173  12690  20483   23064   22354   20349
toku-64       4      X      X   7191  12943   21399   33485   40124
toku-64       8      X      X      X   7121   12727   22096   38207
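
One way to check the conclusions against the tables is to compare the peak rate with 8 tables to the peak rate with 1 table for each configuration. The sketch below copies the 1-table and 8-table rows from the tables above; the function name and the choice of peak rate as the metric are assumptions made for illustration.

```python
# Updates per second, copied from the tables above (X entries omitted).
RESULTS = {
    ("mysql", 1): [8360, 15992, 30182, 24932, 23924, 23191, 21048],
    ("mysql", 8): [54826, 80409, 73782, 68128],
    ("mongo24", 1): [10212, 17844, 30204, 34003, 33895, 33564, 33451],
    ("mongo24", 8): [10379, 17702, 30920, 33758],
    ("toku-64", 1): [7327, 12804, 16179, 12154, 11021, 9990, 8344],
    ("toku-64", 8): [7121, 12727, 22096, 38207],
}

def peak_gain(config):
    """Ratio of the best rate with 8 tables to the best rate with 1 table."""
    return max(RESULTS[(config, 8)]) / max(RESULTS[(config, 1)])

for config in ("mysql", "mongo24", "toku-64"):
    print(config, round(peak_gain(config), 2))
```

By this measure MySQL and TokuMX more than double their peak rate when the load is spread over 8 rows, while MongoDB 2.4.9 stays flat, which matches the per-database RW-lock explanation.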

Results per number of tables

This reorders the results from above to show them for all configurations at the same number of tables. You are welcome to draw conclusions about which is faster.

Updates per second
config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         1   8360  15992  30182  24932   23924   23191   21048
mongo24       1  10212  17844  30204  34003   33895   33564   33451
mongo26       1   9187  16131  27648  28506   27784   27437   27021
toku-64       1   7327  12804  16179  12154   11021    9990    8344

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         2      X  16527  30824  49999   41045   40506   38357
mongo24       2      X  10256  17698  30547   34125   33717   33573
mongo26       2      X   9367  16035  27490   28326   27746   27354
toku-64       2      X   7173  12690  20483   23064   22354   20349

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         4      X      X  32351  51791   67423   62116   59137
mongo24       4      X      X  10670  17690   30903   34027   33586
mongo26       4      X      X   9179  16028   27666   28330   27647
toku-64       4      X      X   7191  12943   21399   33485   40124

config  #tables  DOP=1  DOP=2  DOP=4  DOP=8  DOP=16  DOP=32  DOP=64
mysql         8      X      X      X  54826   80409   73782   68128
mongo24       8      X      X      X  10379   17702   30920   33758
mongo26       8      X      X      X   9125   16038   27275   27858
toku-64       8      X      X      X   7121   12727   22096   38207

Row locks for InnoDB

I used PMP to understand MySQL/InnoDB on this workload. I frequently saw all user threads blocked on a condition variable with this stack trace. It seems odd that all threads are sleeping. I think the problem is that one thread can run but has yet to be scheduled by Linux. My memory of the row lock code is that it wakes threads in FIFO order and when N threads wait for a lock on the same row then each thread waits on a separate condition variable. I am not sure if this code has been improved in MySQL 5.7. A quick reading of some of the 5.6.12 row lock code showed many mutex operations. Problems in this code have escaped scrutiny in the past because much of our public benchmark activity has used workloads with uniform distributions.
pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,os_event_wait_low2,lock_wait_suspend_thread,row_mysql_handle_errors,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection
This was a less frequent stack trace from the test ...
lock_get_mode,lock_table_other_has_incompatible,lock_table,row_search_for_mysql,ha_innobase::index_read,handler::read_range_first,handler::multi_range_read_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection
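
The wake-up pattern described above (FIFO order, one wait object per waiting thread) can be sketched with a toy lock. This is not InnoDB code. The class and function names are hypothetical and the direct hand-off on release is an assumption about how such a scheme might work.

```python
import threading
from collections import deque

class FifoRowLock:
    """Toy exclusive lock: waiters queue in FIFO order, each on its own event."""

    def __init__(self):
        self._mutex = threading.Lock()  # protects _held and _waiters
        self._held = False
        self._waiters = deque()

    def acquire(self):
        with self._mutex:
            if not self._held:
                self._held = True
                return
            event = threading.Event()  # separate wait object per waiter
            self._waiters.append(event)
        event.wait()  # ownership is handed over by release()

    def release(self):
        with self._mutex:
            if self._waiters:
                # Hand the lock to the oldest waiter; _held stays True.
                self._waiters.popleft().set()
            else:
                self._held = False

def run_demo(num_threads=8, updates_per_thread=1000):
    """All threads update one shared counter, like clients updating one hot row."""
    lock = FifoRowLock()
    counter = [0]

    def worker():
        for _ in range(updates_per_thread):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

In this scheme there is a window after release() signals the next waiter and before that waiter is scheduled. During that window the lock is owned but no thread is running, so a sampled stack dump could show every thread asleep.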

Row locks for TokuMX

TokuMX has a similar point at which all threads wait. It isn't a big surprise given that both provide fine-grained concurrency control but there is no granularity finer than a row lock.
pthread_cond_timedwait@@GLIBC_2.3.2,toku_cond_timedwait,toku::lock_request::wait,toku_db_wait_range_lock,toku_c_getf_set(__toku_dbc*,,db_getf_set,autotxn_db_getf_set(__toku_db*,,mongo::CollectionBase::findByPK(mongo::BSONObj,mongo::queryByPKHack(mongo::Collection*,,mongo::updateObjects(char,mongo::lockedReceivedUpdate(char,mongo::receivedUpdate(mongo::Message&,,mongo::assembleResponse(mongo::Message&,,mongo::MyMessageHandler::process(mongo::Message&,,mongo::PortMessageServer::handleIncomingMsg(void*)

MongoDB 2.4 versus 2.6

I get about 1.2X more updates/second with MongoDB 2.4.9 compared to 2.6.0rc2. I think the problem is that 2.6 uses more CPU per update. I filed JIRA 13663 for this but am still trying to profile the code. So far I know the following, all of which indicates that the 2.4.9 test is running 1.2X faster than 2.6.0rc2 with 32 client threads and 1 table:
  • I get ~1.2X more updates/second with 2.4.9
  • the Java sysbench client uses ~1.2X more CPU per "top" with 2.4.9
  • the context switch rate is ~1.2X higher with 2.4.9
The interesting point is that mongod for 2.4.9 only uses ~1.03X more CPU than 2.6.0rc2 per "top" during this test even though it is doing 1.2X more updates/second. So 2.6.0rc2 uses more CPU per update. I will look at "perf" output. I can repeat this with the GA version of 2.6.
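
The CPU-per-update claim follows from dividing the two ratios above. A short worked calculation, using only the approximate 1.03X and 1.2X figures from this section (the function name is hypothetical):

```python
def relative_cpu_per_update(cpu_ratio, throughput_ratio):
    """CPU per update of A relative to B, given A/B ratios for total CPU and updates/second."""
    return cpu_ratio / throughput_ratio

# 2.4.9 relative to 2.6.0rc2: ~1.03X the mongod CPU for ~1.2X the updates/second.
ratio_24_vs_26 = relative_cpu_per_update(1.03, 1.2)
# Inverting gives 2.6.0rc2 relative to 2.4.9.
ratio_26_vs_24 = 1 / ratio_24_vs_26

print(round(ratio_24_vs_26, 3), round(ratio_26_vs_24, 3))
```

So with these rounded inputs mongod in 2.6.0rc2 uses roughly 1.17X the CPU per update of 2.4.9. The inputs are approximate readings from "top", so the result is only a rough estimate.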
