Wednesday, July 11, 2018

CPU overheads for RocksDB queries

An LSM like RocksDB has much better write and space efficiency than a B-Tree. That means with RocksDB you will use less SSD than with a B-Tree and the SSD will either last longer or you can use a lower endurance SSD. But this efficiency comes at a cost. In the worst case RocksDB might use 2X more CPU/query than a B-Tree, which means that in the worst case QPS with RocksDB might be half of what it is with a B-Tree. In the examples below I show that RocksDB can use ~2X more comparisons per query than a B-Tree, and it is nice when results in practice can be explained by theory.

But first I want to explain the context where this matters and where it doesn't matter. It matters when the CPU overhead from RocksDB is a significant fraction of the query response time -- so the workload needs to be CPU bound (cached working set). It isn't a problem for workloads that are IO-bound. Many performance results have been published on my blog and more are coming later this year. With fast storage devices available I recommend that database workloads strive to be IO-bound to avoid using too much memory+power.

Basic Operations

Where does the CPU overhead come from? I started to explain this in my review of the original LSM paper. My focus in this post is on SELECT statements in SQL, but note that writes in SQL also do reads from the storage engine (updates have a where clause, searches are done to find matching rows and appropriate leaf pages, queries might be done to determine whether constraints will be violated, etc).

With RocksDB data can be found in the memtable and in sorted runs from the L0 to Lmax (Lmax is the max/largest level in the LSM tree). There are 3 basic operations that are used to search for data - get, range seek and range next. Get searches for an exact match. Range seek positions iterators at the start of a range scan. Range next gets the next row from a range scan.

First a few comments before I explain the CPU overhead:
  1. I will try to be clear when I refer to the keys for a SQL table (SQL key) and the key as used by RocksDB (RocksDB key). Many SQL indexes can be stored in the same LSM tree (column family) and they are distinguished by prepending the index ID as the first bytes in the RocksDB key. The prepended bytes aren't visible to the user.
  2. Get is used for exact match on a PK index but not used for exact match on a secondary index because with MyRocks the RocksDB key for a secondary index entry is made unique by appending the PK to it. A SQL SELECT that does an exact match on the secondary key searches for a prefix of the RocksDB key and that requires a range scan.
  3. With RocksDB a bloom filter can be on a prefix of the key or the entire key (prefix vs whole key). A prefix bloom filter can be used for range and point queries. A whole key bloom filter will only be used for point queries on a PK. A whole key bloom filter won't be used for range queries. Nor will it be used for point queries on a secondary index (because secondary index exact match uses a range scan internally).
  4. Today bloom filters are configured per column family. A good use for column families is to put secondary indexes into separate ones that are configured with prefix bloom filters. Eventually we expect to make this easier in RocksDB.
Get

A point query is evaluated for RocksDB by searching sorted runs in order: first the memtable, then runs in the L0, then the L1 and so on until the Lmax is reached. The search stops as soon as the key is found, whether from a tombstone or a key-value pair. The work done to search each sorted run varies. I use comparisons and bloom filter probes as the proxy for work. I ignore memory system behavior (cache misses) but assume that the skiplist isn't great in that regard:
  • memtable - this is a skiplist and I assume the search cost is log2(N) when it has N entries. 
  • L0 - for each sorted run first check the bloom filter and if the key might exist then do binary search on the block index and then do binary search within the data block. The cost is a bloom filter probe and possibly log2(M) comparisons when an L0 sorted run has M entries.
  • L1 to Ln - each of these levels is range partitioned into many SST files so the first step is to do a binary search to find the SST that might contain the key, then check the bloom filter for that SST and if the key might exist then do binary search on the block index for that SST and then do binary search within the data block. The binary search cost to find the SST gets larger for the larger levels and that cost doesn't go away when bloom filters are used. For RocksDB with a 1T database, per-level fanout=8, SST size=32M then the number of SSTs per level is 8 in L1, 64 in L2, 512 in L3, 4096 in L4 and 32768 in L5. The number of comparisons before the bloom filter check are 3 for L1, 6 for L2, 9 for L3 and 12 for L4. I assume there is no bloom filter on L5 because it is the max level.
A bloom filter check isn't free. Fortunately, RocksDB makes the cost less than I expected by limiting the bits that are set for a key, and must be checked on a probe, to one cache line. See the code in AddHash. For now I assume that the cost of a bloom filter probe is equivalent to a few (< 5) comparisons given the cache line optimization.
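The idea of keeping all of the probe bits for a key within one cache line is sometimes called a blocked bloom filter. Below is a toy sketch of that idea in Python. The hashing scheme and constants are my own assumptions for illustration, not RocksDB's AddHash code: the first hash picks one 64-byte block and the remaining hashes pick bits within that block, so a probe touches at most one cache line.

import hashlib

BLOCK_BITS = 512              # one 64-byte cache line
NUM_BLOCKS = 1024
K = 6                         # probe bits per key

def _key_hashes(key):
    # key is bytes; derive one block number and K bit positions within that block
    h = int.from_bytes(hashlib.sha256(key).digest()[:16], "little")
    block = h % NUM_BLOCKS
    bits = [(h >> (16 + 9 * i)) % BLOCK_BITS for i in range(K)]
    return block, bits

class BlockedBloom:
    def __init__(self):
        self.blocks = [0] * NUM_BLOCKS                  # each int holds 512 bits

    def add(self, key):
        block, bits = _key_hashes(key)
        for b in bits:
            self.blocks[block] |= 1 << b

    def may_contain(self, key):
        block, bits = _key_hashes(key)
        return all((self.blocks[block] >> b) & 1 for b in bits)

f = BlockedBloom()
f.add(b"some key")
print(f.may_contain(b"some key"), f.may_contain(b"other key"))   # True, (almost always) False

A negative probe is then a hash, one cache line read and a few bit tests, which is consistent with counting it as a few comparisons.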

A Get operation on a B-Tree with 8B rows needs ~33 comparisons. With RocksDB it might need 20 comparisons for the memtable, bloom filter probes for 4 SSTs in L0, 3+6+9+12 SST search comparisons for L1 to L4, bloom filter probes for L1 to L4, and then ~33 comparisons to search the max level (L5). So it is easy to see that the search cost might be double the cost for a B-Tree.
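To make the arithmetic above concrete, here is a back-of-the-envelope counter in Python. The sizes are the assumptions from this post (8B rows, per-level fanout=8, 4 sorted runs in the L0, a memtable with ~1M entries) and bloom filter probes are counted separately from comparisons.

from math import ceil, log2

ROWS = 8e9
MEMTABLE_ENTRIES = 1e6
L0_RUNS = 4
SSTS_PER_LEVEL = {1: 8, 2: 64, 3: 512, 4: 4096}      # L5 is the max level, no bloom filter

def btree_get_comparisons():
    return ceil(log2(ROWS))                          # ~33

def rocksdb_get_cost():
    comps = ceil(log2(MEMTABLE_ENTRIES))             # memtable skiplist, ~20
    probes = L0_RUNS                                 # one bloom filter probe per L0 run
    for ssts in SSTS_PER_LEVEL.values():
        comps += ceil(log2(ssts))                    # binary search to find the SST
        probes += 1                                  # bloom filter probe for that SST
    comps += ceil(log2(ROWS))                        # search the max level, ~33
    return comps, probes

print(btree_get_comparisons(), rocksdb_get_cost())   # 33 vs (83 comparisons, 8 probes)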

The search cost for Get with RocksDB can be reduced by configuring the LSM tree to use fewer sorted runs although we are still figuring out how much that can be reduced. With this example about 1/3 of the comparisons are from the memtable, another 1/3 are from the max level and the remaining 1/3 are from L0 to L4.

Range Seek and Next

The cost for a range scan has two components: range seek to initialize the scan and range next to produce each row. For range seek an iterator is positioned within each sorted run in the LSM tree. For range next the merging iterator combines the per-run iterators. For RocksDB the cost of range seek depends on the number of sorted runs while the cost of range next does not (for leveled compaction, assuming uniform distribution).

Range seek does binary search per sorted run. This is similar to a point query without a bloom filter. The cost across all sorted runs depends on the number of sorted runs. In the example above, where RocksDB has a memtable, 4 SSTs in the L0, L1 to L5 and 8B rows, that requires 20 comparisons for the memtable, 4 x 20 comparisons for the L0 SSTs, and 21 + 24 + 27 + 30 + 33 comparisons for L1 to L5. The total is 235 comparisons. There isn't a way to avoid this and for short range queries the cost of range seek dominates the cost of range next. While this overhead is significant for short range queries with embedded RocksDB and an in-memory workload, it is harder to notice with MyRocks because there is a lot of CPU overhead above MyRocks from optimize, parse and client RPC for a short range query. It is easier to notice the difference in CPU overhead between MyRocks and a B-Tree with longer range scans.
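A small Python check of the range seek arithmetic, using the same assumed sizes (memtable and each L0 run have ~1M entries, per-level fanout=8, 8B rows in the max level):

from math import ceil, log2

def range_seek_comparisons(memtable=1e6, l0_runs=4,
                           level_entries=(2e6, 16e6, 128e6, 1e9, 8e9)):
    comps = ceil(log2(memtable))                     # memtable, ~20
    comps += l0_runs * ceil(log2(memtable))          # each L0 run is ~memtable sized
    comps += sum(ceil(log2(n)) for n in level_entries)   # L1 to Lmax: 21+24+27+30+33
    return comps

print(range_seek_comparisons())                      # 235, vs ~33 for a B-Tree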

The cost for range next is interesting. LevelDB has a comment that suggests using a heap but the merging iterator code uses N-1 comparisons per row produced which means the overhead is dependent on the number of sorted runs. The cost of range next in RocksDB is much less dependent on the number of sorted runs because it uses an optimized binary heap and the number of comparisons to produce a row depends on the node that produces the winner. Only one comparison is done if the root produces the winner and the root produced the previous winner. Two comparisons are done if the root produces the winner but did not produce the previous winner. More comparisons are needed in other cases. The code is optimized for long runs of winners from one iterator and that is likely with leveled compaction because Lmax is ~10X larger than the next largest level and usually produces most winners. Using a simulation with uniform distribution the expected number of comparisons per row is <= 1.55 regardless of the number of sorted runs for leveled compaction.

The optimization in the binary heap used by the merging iterator is limited to remembering the comparison result between the two children of the root node. Were that extended to remembering comparison results between any two sibling nodes in the heap then the expected number of comparisons would be reduced from ~1.55 to ~1.38.
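Below is a toy Python simulation of merging sorted runs with a binary heap while counting key comparisons. It uses a plain sift-down with an early exit and is not RocksDB's merging iterator: RocksDB also remembers the result of the comparison between the root's children, so this toy does a bit more than 2 comparisons per row when one run produces most winners, and the remembered-comparison trick is what pushes that toward the ~1.55 mentioned above.

import random

comparisons = 0

def less(a, b):
    global comparisons
    comparisons += 1
    return a < b

def sift_down(heap, pos):
    # stop as soon as the entry at pos is <= its smaller child
    n = len(heap)
    while True:
        child = 2 * pos + 1
        if child >= n:
            return
        right = child + 1
        if right < n and less(heap[right][0], heap[child][0]):
            child = right
        if less(heap[child][0], heap[pos][0]):
            heap[pos], heap[child] = heap[child], heap[pos]
            pos = child
        else:
            return

def merge(runs):
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    for pos in range(len(heap) // 2 - 1, -1, -1):
        sift_down(heap, pos)
    rows = 0
    while heap:
        key, i = heap[0]                  # the current winner is the row produced
        rows += 1
        nxt = next(iters[i], None)
        if nxt is None:                   # run i is exhausted
            heap[0] = heap[-1]
            heap.pop()
            if heap:
                sift_down(heap, 0)
        else:
            heap[0] = (nxt, i)            # cheap when the same run keeps winning
            sift_down(heap, 0)
    return rows

# six small runs plus one run that produces ~90% of the rows, roughly like
# Lmax with leveled compaction
runs = [sorted(random.random() for _ in range(n))
        for n in (1000,) * 6 + (60000,)]
rows = merge(runs)
print(comparisons / rows)                 # a bit more than 2 comparisons per row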

For a B-Tree:
  • The overhead for range seek is similar to a point query -- ~33 comparisons when there are 8B rows.
  • There are no merging iterators. The overhead for range next is low -- usually move to the next row in the current leaf page.


Wednesday, June 6, 2018

The original LSM paper

I reread the LSM Tree paper by O'Neil et al. It is a great paper and I wanted to share my notes. I also make comparisons to leveled compaction in RocksDB.

The design goal was for disks and insert/update/delete intensive tables that get more writes than reads. Transactional logging (history tables) is one example. The LSM was shown to use less IO capacity than a B-Tree for making changes durable, which was a big deal when the cost of IOPs was a significant fraction of the system cost. The LSM uses less disk IO capacity than a B-Tree for two reasons:
  1. It does multi-page writes rather than single page writes so the seek latency is amortized over N pages (maybe N=64). Assume a disk can do either 32 1MB transfers/s or 128 4kb transfers/s. Then multi-page block IO throughput is 64x faster.
  2. When sized appropriately, it does a page write per M modified rows while a B-Tree in the worst case (for their example) does a page write per modified row.
While OLTP has since moved from disks to SSD, the LSM is still a great idea because it provides better write and space efficiency than a B-Tree -- it needs less SSD capacity and the devices will last longer.
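A quick check of the transfer-rate arithmetic from point 1, with the rates above as the assumptions:

single_page_mb_per_s = 128 * 4 / 1024.0              # 128 4kb transfers/s = 0.5 MB/s
multi_page_mb_per_s = 32 * 1.0                       # 32 1MB transfers/s = 32 MB/s
print(multi_page_mb_per_s / single_page_mb_per_s)    # 64x more throughput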

Details


The paper explains two-component and multi-component LSM Trees. A component is similar to a level in RocksDB: component 0 (C0) is the in-memory tree and component 1 (C1) is the smallest on-disk tree. For multi-component trees there can be C2 and C3. C0 is similar to the RocksDB memtable and Cn is similar to Ln in RocksDB. The original LSM didn't have something equivalent to the L0 in RocksDB. While it might have been good for LevelDB, I think the L0 is a performance bug in RocksDB as it makes reads less efficient.

The paper has math to explain that the best fanout between components is a constant, and the paper used r for that value. This is similar to RocksDB for levels >= 1. In the paper r is the fanout between all components including C0 to C1, while in RocksDB the fanout for memtable:L0 and L0:L1 is usually smaller than r.

The paper doesn't mention the use of bloom filters for the LSM Tree, but does mention them as used by another index structure. They aren't needed for a two component LSM Tree, but can be useful when the LSM Tree has more than 2 components. I assume they could be implemented via a bloom filter per multi-page block.

The paper mentions a few features that have yet to arrive in RocksDB. Predicate deletion is similar to range deletion in RocksDB and is a way to quickly perform a bulk delete. A find note is a special indicator written to the LSM Tree that requests data. When the find note moves from C0 to Cmax then the data will be provided as the find note encounters the keys that it requests.

C0 can use a variety of in-memory tree indexes. C1 and larger use a special B-Tree:
  • Leaf nodes are packed (no free space)
  • Key-order adjacent nodes are written in multi-page blocks. Assume a leaf node is 4kb and a multi-page block is 256kb, then 64 leaf nodes are written per multi-page block. This amortizes the seek latency when writing leaf nodes and when reading during long range scans.
  • Point queries do single-page reads, long range scans do multi-page block reads.
  • Leaf nodes are write once.
  • Non-leaf nodes are also write once and might use multi-page blocks.

The paper uses a rolling merge to move data from C0 to C1. RocksDB calls this compaction. A rolling merge is a sequence of merge steps. Each merge step reads some data from C0 and C1, merges it and then writes it out into a new multi-page block in C1. The rolling merge is in progress all of the time. The rolling merge is similar to compaction for levels L1 to Lmax in RocksDB, but the implementation is different.
  • Compaction in RocksDB is some-to-some and atomic. By some-to-some I mean that RocksDB chooses 1 SST from level N to compact into the ~10 overlapping SSTs in level N+1. By atomic I mean that the output from a compaction step isn't visible until all of the compaction input has been consumed (and this used to be a big problem for universal/tiered compaction in RocksDB with large SSTs). 
  • The rolling merge is neither some-to-some nor atomic. I will call it all-to-all and incremental. By all-to-all I mean that the rolling merge isn't limited to data from one range in each component (although each merge step is). By incremental I mean that the merge output is made visible as soon as possible, alas that is complicated.
  • During a merge step component N is the inner component and component N+1 is the outer component.
  • The merge step input from components N and N+1 is called the emptying nodes, and the output written into component N+1 is called the filling nodes.
  • As soon as possible, the outcome from a merge step is made visible. This part is complicated. I assume this means that the B-Tree for component N is changed to remove that key range and the B-Tree for component N+1 is changed to reference the filling nodes in place of the emptying nodes. The non-leaf nodes of the B-Trees are also written sequentially, just like the leaf nodes. But I wasn't clear on some of the details on when these updates can be done.

A merge step consumes a range of data from the inner component, a range of data from the outer component and replaces that with new data in the outer component. While the paper has a section on concurrency (supporting reads concurrent with the rolling merge, supporting rolling merge in trees with more than 2 components) I don't think all of the details were provided. Some of the details are:
  • for now a node is a b-tree leaf node
  • nodes are locked in read mode for queries
  • the nodes from the outer component, and maybe the inner component, are locked in write mode when data from the node(s) is being merged. I didn't understand all of the details about when the node(s) could be unlocked.

When a merge step fills a multi-page block (for now assume 64 leaf nodes) then that block will be written to the database file. It doesn't overwrite live multi-page blocks. Eventually the B-Tree will be updated to point to the new leaf nodes in this multi-page block. Eventually the space from the multi-page blocks for the emptying nodes can be reclaimed. This maintains one sequential write stream per component.

Performance


The paper has a nice example of the 5 Minute Rule to estimate what data should be in memory based on the cost of RAM and IOPs. Assuming RAM is $10/GB and IOPs are $200 per 10k, then one IO/second costs 2 cents, a 4kb page of RAM costs 0.0038 cents, and ~500 4kb pages of RAM have the same cost as 1 IO/second (a 4kb page of RAM costs the same as ~1/500 of an IO/second). When that is true, data that is referenced more frequently than once per 500 seconds should be in RAM.
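The same arithmetic in Python, using the prices above:

ram_cost_per_gb = 10.0                                       # dollars
io_cost_per_iops = 200.0 / 10000                             # $0.02 per IO/second
page_bytes = 4096
ram_cost_per_page = ram_cost_per_gb / (2**30 / page_bytes)   # ~$0.000038, i.e. 0.0038 cents
pages_per_io = io_cost_per_iops / ram_cost_per_page
print(round(pages_per_io))                                   # ~524: keep data in RAM if it is
                                                             # referenced more than once per ~500s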

Next is the performance model. I renamed a few of the variables. The variables include:
  • IO-S - cost of a single page IO
  • IO-M - cost of a page IO when done as part of a multi-page block IO
  • Se - size of index entry in bytes
  • Sp - size of page in bytes
  • S0 - size in MB of C0
  • S1 - size in MB of C1
  • De - number of B-Tree levels not in cache

The IO cost of a B-Tree insert is: IO-S * (De + 1). There are De IO reads to get the leaf node and then 1 IO to write back the dirty leaf node. This is the worst-case for a B-Tree. Each of the IO operations are single-page with cost IO-S.

M is used for the LSM insert cost. M is the average number of C0 entries merged into a C1 page. One formula for it is (Sp/Se) * (S0 / (S0 + S1)) because:
  • Sp/Se is the number of entries per page
  • S0 / (S0 + S1) is the fraction of database entries in C0

Then the IO cost of an LSM insert with a two component LSM Tree is: 2 * IO-M / M.
  • 2 because a component is re-read when it is re-written
  • IO-M because the IO operations are multi-page block reads and writes
  • Divided by M because the cost (2 * IO-M) is amortized over M entries from C0

The ratio of the costs determines when LSM inserts are cheaper:
    Cost(LSM) / Cost(B-Tree) = K1 * (IO-M / IO-S) * (1/M)
  • K1 is 2 / (De + 1), which =1 when De is 1 and =2/3 when De is 2
  • IO-S / IO-M is the speedup from multi-page block IO, which is >> 10 for disks, so IO-M / IO-S is << 1

Assuming disks are used, IO-M/IO-S is 1/64 and K1 = 1, then LSM inserts are cheaper when M > 1/64 -- and M is usually much larger than that.
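Here is the model in Python. The numbers are the example assumptions used above (disk with a 64x multi-page speedup, 4kb pages, 16-byte entries, a C0 that holds 1/64 of the data), not measurements:

def btree_insert_cost(io_s, de):
    return io_s * (de + 1)                # De reads to find the leaf + 1 write

def lsm_insert_cost(io_m, m):
    return 2 * io_m / m                   # read + rewrite a C1 page, amortized over M entries

def m_value(sp, se, s0, s1):
    # average number of C0 entries merged into a C1 page
    return (sp / se) * (s0 / (s0 + s1))

io_s = 1.0                                # cost of a single-page IO
io_m = io_s / 64                          # per-page cost within a multi-page block
m = m_value(sp=4096, se=16, s0=1, s1=63)  # 256 entries/page, C0 holds 1/64 of the data
ratio = lsm_insert_cost(io_m, m) / btree_insert_cost(io_s, de=1)
print(m, ratio)                           # M=4, ratio ~0.004 -- the LSM insert is far cheaper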



Monday, May 14, 2018

Geek code for database algorithms

I like to read academic papers on database systems but I usually don't have time to do more than browse. If only there were a geek code for this. Part of the geek code would explain the performance vs efficiency tradeoff. While it helps to know that something new is faster, I want to know the cost of faster. Does it require more storage (tiered vs leveled compaction)? Does it hurt SSD endurance (update-in-place vs write-optimized)? Read, write, space and cache amplification are a framework for explaining the tradeoffs.

The next part of the geek code is to group algorithms into one of page-based, LSM, index+log or something else. I suspect that few will go into the something else group. These groups can be used for both tree-based and hash-based algorithms, so I am redefining LSM to mean log structured merge rather than log structured merge tree.

Saturday, May 5, 2018

Where to ask questions about MyRocks and RocksDB

MyRocks


Best places to discuss MyRocks:
  • the MariaDB and Percona groups/forums, since both distribute MyRocks
Other places to discuss MyRocks:
  • MyRocks group for FB MySQL - this existed before MyRocks made it into Percona and MariaDB. You are better off using the MariaDB and Percona groups.
  • Bugs for MyRocks in FB MySQL are here. Again, if using Percona or MariaDB please use their forums for bugs.

RocksDB



Sunday, April 22, 2018

MyRocks, malloc and fragmentation -- a strong case for jemalloc

While trying to reproduce a MyRocks performance problem I ran a test using a 4gb block cache and tried both jemalloc and glibc malloc. The test server uses Ubuntu 16.04 which has glibc 2.23 today. The table below lists the VSZ and RSS values for the mysqld process after a test table has been loaded. RSS with glibc malloc is 2.6x larger than with jemalloc. MyRocks and RocksDB are much harder on an allocator than InnoDB and this test shows the value of jemalloc.

VSZ(gb) RSS(gb) malloc  
 7.9     4.8    jemalloc-3.6.0
13.6    12.4    glibc-2.23

I am not sure that it is possible to use a large RocksDB block cache with glibc malloc, where large means that it gets about 80% of RAM.

I previously shared results for MySQL and for MongoDB. There have been improvements over the past few years to make glibc malloc perform better on many-core servers. I don't know whether that work also made it better at avoiding fragmentation.

Friday, April 20, 2018

Fun with caching_sha2_password in MySQL 8.0.11


I want to get benchmark numbers with MySQL 8.0.11. This is my first impression. The default auth method was changed to caching_sha2_password. See this post for more details. There will be some confusion with this change. By confusion I mean the difference between "error" and "OK because cached" below. I am not alone. See the experience that an expert had with replication.

Fun with caching_sha2_password occurs even with clients compiled as part of 8.0.11:

  1. install MySQL 8.0.11, disable SSL but use mostly default my.cnf
  2. bin/mysql -u... -p... -h127.0.0.1 -e ... -> error
  3. bin/mysql -u... -p... -e ... -> OK
  4. bin/mysql -u... -p... -h127.0.0.1 -e ... -> OK because cached

The error in step 2 is: ERROR 2061 (HY000): Authentication plugin 'caching_sha2_password' reported error: Authentication requires secure connection.

From show global variables I see the default is caching_sha2_password:

default_authentication_plugin   caching_sha2_password
Setting the following in my.cnf after I created the user doesn't fix the problem. Setting it before creating the user is one fix. I did not test whether changing the value of user.plugin to "mysql_native_password" is another workaround.

default_authentication_plugin=mysql_native_password

The error when using an old mysql client will also be a source of confusion:
$ ~/b/orig5635/bin/mysql -u... -p.. -h127.0.0.1
ERROR 2059 (HY000): Authentication plugin 'caching_sha2_password' cannot be loaded: /home/mdcallag/b/orig5635/lib/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory
 

Thursday, April 5, 2018

Index Structures, Access Methods, whatever

I prefer to use Index Structures when writing about algorithms for indexing data stored on disk and SSD but Access Methods is another popular name. Recently I have been working on a high-level comparison (lots of hand waving) of index structures in terms of read, write, space and cache amplification.

I started by dividing the index structures into tree-based and hash-based. Tree-based support range queries while hash-based do not. Most of my experience is with tree-based approaches. There was a lot of research on hash-based approaches in the 80s (extendible, dynamic and linear hashing); they are in production today but not as prominent as b-trees, and there is recent research on them. This paper by Seltzer and Yigit is a great overview of hash-based index structures.

The next classification is by algorithm - page-based, index+log and LSM. I use this for both tree-based and hash-based so LSM means Log Structured Merge rather than Log Structured Merge Tree. Most DBMS implement one or two of these. Tarantool has more algorithmic diversity, but I don't have much experience with it.

For tree-based:
  • page-based - I used to call this update-in-place but think that page-based is a better name because it includes update-in-place (UiP) and copy-on-write-random (CoW-R) b-trees. Great implementations include InnoDB for UiP and WiredTiger for CoW-R.
  • LSM - leveled and tiered compaction are examples (RocksDB, LevelDB, Cassandra). Note that the original LSM design by O'Neil et al didn't have an L0 and I suspect that the L0 might be a mistake, but that is for another post. I don't want to insult the LevelDB authors: the L0 makes sense when you want to limit memory consumption for a database embedded in client apps, but I am not sure it makes sense for serious OLTP with RocksDB. O'Neil also did fundamental work on bitmap indexes. I worked on both so he made my career possible (thank you).
  • index+log - use a tree-based index with data in log segments. The index points into the log segments. GC is required. It scans the log segments and probes the index to determine whether a record is live or can be deleted. There are fewer examples of this approach but interesting systems include WiscKey and ForestDB. Range queries will suffer unless the index is covering (this needs more discussion, but not here).

For hash-based:
  • page-based - see dynamic, extendible and linear hashing. TIL that ZFS implements extendible hashing. BerkeleyDB supports linear hashing. The dbm implementation on many Linux/Unix systems also implements some variant of update-in-place persistent hash tables.
  • LSM - today I only know of one example of this - SILT. That is a great paper (read it). I include it as an example of hash-based even though one of the levels is ordered.
  • index+log - BitCask is one example but the index wasn't durable and it took a long time (scan all log segments) to rebuild it on process start. Faster might be another example, but I am still reading the paper. I hope someone names a future system Efficienter or Greener.

Finally, I have a few comments on performance. I will be brief, maybe there will be another post with a lot more detail on my opinions:
  • b-tree - provides better read efficiency at the cost of write efficiency. The worst case for write efficiency is writing back a page for every modified row. Cache efficiency is better for a clustered index than a non-clustered index -- for a clustered index you should cache at least one key/pointer per leaf block but for a non-clustered index you need the entire index in RAM or there will be extra storage reads. Space efficiency for a b-tree is good (not great, not bad) -- the problem is fragmentation.
  • LSM - provides better write efficiency at the cost of read efficiency. Leveled compaction provides amazing space efficiency. Tiered compaction gets better write efficiency at the cost of space efficiency. Compaction does many key comparisons and this should be considered as part of the CPU overhead for an insert (maybe 4X more comparisons/insert than a b-tree).
  • index+log - provides better write efficiency. Depending on the choice of index structure this doesn't sacrifice read efficiency like an LSM. But the entire index must remain in RAM (just like a non-clustered b-tree) or GC will fall behind and/or do too many storage reads. GC does many index probes and the cost of this is greatly reduced by using a hash-based solution. These comparisons should be considered as part of the CPU overhead of an insert. There is also a write vs space efficiency tradeoff: by tolerating more dead data in log segments, GC runs less often and write-amplification improves, but space-amplification suffers. There are variants that don't need GC, but they are not general purpose. A sketch of the GC loop is below.
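This is a minimal sketch of that GC loop, assuming the index is an in-memory dict from key to (segment id, offset) and each log segment is a list of (key, value) records. The names and structure are illustrative, not from any particular system:

def gc_segment(index, segments, victim_id, active_id):
    # copy live records out of the victim segment, then drop it
    active = segments[active_id]
    for offset, (key, value) in enumerate(segments[victim_id]):
        # probe the index: the record is live only if the index still points here
        if index.get(key) == (victim_id, offset):
            index[key] = (active_id, len(active))
            active.append((key, value))
    del segments[victim_id]               # reclaim the space

Picking the victim segment with the most dead data is what creates the write vs space tradeoff mentioned above: waiting longer means fewer live records to copy but more dead space on disk.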


Wednesday, March 28, 2018

Missing documentation

In my time with Linux there are some things that would benefit from better documentation.

  1. The use of per-inode mutexes by some members of the ext family for files opened with O_DIRECT
  2. PTHREAD_MUTEX_ADAPTIVE_NP - this enables some busy-waiting when trying to lock a mutex. It isn't mentioned in man pages.

Friday, March 9, 2018

Cache amplification

How much of the database must be in cache so that a point-query does at most one read from storage? I call this cache-amplification or cache amplification. The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture now we have CRUM which is close to crummy. I briefly wrote about this in a previous post.

To do at most 1 storage read for a point query:
  • clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
  • non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row, which is much more memory than the cache-amplification for a clustered b-tree. Non-covering secondary indexes with InnoDB are an example, although in that case you must also consider the cache-amplification for the PK index.
  • LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
  • something else - there have a been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log.
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.
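A rough model of the rules above in Python. The row count matches the earlier posts and the per-entry sizes are assumptions, so treat the output as orders of magnitude for the RAM needed to do at most one storage read per point query:

rows = 8e9
key_ptr_bytes = 24                        # assumed key + pointer size
rows_per_block = 100                      # assumed rows per leaf/data block
bloom_bits_per_key = 10

clustered_btree = rows / rows_per_block * key_ptr_bytes    # one key/pointer per leaf block
nonclustered_btree = rows * key_ptr_bytes                  # one key/pointer per row
bloom = (rows / 10) * bloom_bits_per_key / 8               # bloom filters for all levels but the max
block_index = rows / rows_per_block * key_ptr_bytes        # block indexes for all levels
lsm = bloom + block_index

def gb(n_bytes):
    return round(n_bytes / 2**30, 1)

print(gb(clustered_btree), gb(lsm), gb(nonclustered_btree))   # ~1.8, ~2.7, ~178.8 GB

index+log behaves like the non-clustered b-tree here because the entire index must be cached.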

Updated to correct what should be in cache for the LSM.

Friday, February 16, 2018

Sharded replica sets - MySQL and MongoDB

MongoDB used to have a great story for sharded replica sets. But the storage engine, sharding and replica management code had significant room for improvement. Over the last few releases they made remarkable progress on that and the code is starting to match the story. I continue to be impressed by the rate at which they paid off their tech debt and transactions coming to MongoDB 4.0 is one more example.

It is time for us to do the same in the MySQL community.

I used to be skeptical about the market for sharded replica sets with MySQL. This is popular with the web-scale crowd but that is a small market. Today I am less skeptical and assume the market extends far beyond web-scale. This can be true even though the market for replica sets, without sharding, is so much larger.


The market for replica sets is huge. For most users, if you need one instance of MySQL then you also need HA and disaster recovery. So you must manage failover and for a long time (before crash-proof slaves and GTID) that was a lousy experience. It is better today thanks to cloud providers and DIY solutions even if some assembly is required. Upstream is finally putting a solution together with MySQL Group Replication and other pieces.


But sharded replica sets are much harder, and even more so if you want to do cross-shard queries and transactions. While there have been many attempts at sharding solutions for the MySQL community, it is difficult to provide something that works across customers. Fortunately Vitess has shown this can be done and already has many customers in production.

ProxySQL and Orchestrator might also be vital pieces of this stack. I am curious to see how the traditional vendors (MySQL, MariaDB, Percona) respond to this progress.

Updates:

I think binlog server should be part of the solution. But for that to happen we need a GPLv2 binlog server and that has yet to be published.

Wednesday, January 17, 2018

Meltdown vs storage

tl;dr - sysbench fileio throughput for ext4 drops by more than 20% from Linux 4.8 to 4.13

I shared results from sysbench with a cached database to show a small impact from the Meltdown patch in Ubuntu 16.04. Then I repeated the test for an IO-bound configuration using a 200mb buffer pool for InnoDB and a database that is ~1.5gb.

The results for read-only tests looked similar to what I saw previously so I won't share them. The results for write-heavy tests were odd as QPS for the kernel without the patch (4.8.0-36) were much better than for the kernel with the patch (4.13.0-26).

The next step was to use sysbench fileio to determine whether storage performance was OK and it was similar for 4.8 and 4.13 with read-only and write-only tests. But throughput with 4.8 was better than 4.13 for a mixed test that does reads and writes.

Configuration


I used a NUC7i5bnh server with a Samsung 960 EVO SSD that uses NVMe. The OS is Ubuntu 16.04 with the HWE kernels -- either 4.13.0-26 that has the Meltdown fix or 4.8.0-36 that does not. For the 4.13 kernel I repeat the test with PTI enabled and disabled. The test uses sysbench with one 2gb file, O_DIRECT and 4 client threads. The server has 2 cores and 4 HW threads. The filesystem is ext4.

I used these command lines for sysbench:
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 prepare
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 run

And I see this:
cat /sys/block/nvme0n1/queue/write_cache
write back

Results

The next step was to understand the impact of the filesystem mount options. I used ext4 for these tests and don't have much experience with it. The table has the throughput in MB/s from sysbench fileio that does reads and writes. I noticed a few things:
  1. Throughput is much worse with the nobarrier mount option. I don't know whether this is expected.
  2. There is a small difference in performance from enabling the Meltdown fix - about 3%
  3. There is a big difference in performance between the 4.8 and 4.13 kernels, whether or not PTI is enabled for the 4.13 kernel. I get about 25% more throughput with the 4.8 kernel.

4.13    4.13    4.8    mount options
pti=on  pti=off no-pti
100     104     137     nobarrier,data=ordered,discard,noauto,dioread_nolock
 93     119     128     nobarrier,data=ordered,discard,noauto
226     235     275     data=ordered,discard,noauto
233     239     299     data=ordered,discard,noauto,dioread_nolock

Is it the kernel?

I am curious about what happened between 4.8 and 4.13 to explain the 25% loss of IO throughput.

I have another set of Intel NUC servers that use Ubuntu 16.04 without the HWE kernels -- 4.4.0-109 with the Meltdown fix and 4.4.0-38 without the Meltdown fix. These servers still use XFS. I get ~2% more throughput with the 4.4.0-38 kernel than the 4.4.0-109 kernel (whether or not PTI is enabled).

The loss in sysbench fileio throughput does not reproduce for XFS. The filesystem mount options are "noatime,nodiratime,discard,noauto" and tests were run with /sys/block/nvme0n1/queue/write_cache set to write back and write through. The table below has MB/s of IO throughput.

4.13    4.13    4.8
pti=on  pti=off no-pti
225     229     232     write_cache="write back"
125     168     138     write_cache="write through"

More debugging

This is vmstat output from the sysbench test. For the 4.13 kernel the wa values are ~51 and the id values are ~43, while for the 4.8 kernel wa is ~83 and id is ~10, even though the 4.8 kernel is doing about twice as much IO. The ratio of cs per IO operation is similar for 4.13 and 4.8.

# vmstat from 4.13 with pti=off

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 15065620 299600 830564    0    0 64768 43940 7071 21629  1  6 42 51  0
 0  4      0 15065000 300168 830512    0    0 67728 45972 7312 22816  1  3 44 52  0
 2  2      0 15064380 300752 830564    0    0 69856 47516 7584 23657  1  5 43 51  0
 0  2      0 15063884 301288 830524    0    0 64688 43924 7003 21745  0  4 43 52  0

# vmstat from 4.8 without pti

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 14998364 384536 818532    0    0 142080 96484 15538 38791  1  6  9 84  0
 0  4      0 14997868 385132 818248    0    0 144096 97788 15828 39576  1  7 10 83  0
 1  4      0 14997248 385704 818488    0    0 151360 102796 16533 41417  2  9  9 81  0
 0  4      0 14997124 385704 818660    0    0 140240 95140 15301 38219  1  7 11 82  0

Output from Linux perf for 4.8 and for 4.13.

Friday, January 12, 2018

Meltdown vs MySQL part 2: in-memory sysbench and a core i5 NUC

This is my second performance report for the Meltdown patch using in-memory sysbench and a small server. In this test I used a core i5 NUC with the 4.13 and 4.8 kernels. In the previous test I used a core i3 NUC with the 4.4 kernel.
  • results for 4.13 are mixed -- sometimes there is more QPS with the fix enabled, sometimes there is more with the fix disabled. The typical difference is small, about 2%.
  • QPS for 4.8, which doesn't have the Meltdown fix, is usually better than with 4.13. The largest difference is ~10% and the differences tend to be larger at 1 client than at 2 or 8.

Configuration

My usage of sysbench is described here. The servers are described here. For this test I used the core i5 NUC (NUC7i5bnh) with Ubuntu 16.04. I have 3 such servers and ran tests with the fix enabled (kernel 4.13.0-26), the fix disabled via pti=off (kernel 4.13.0-26) and the old kernel (4.8.0-36) that doesn't have the fix. From cat /proc/cpuinfo I see pcid. This server uses the HWE kernels to make wireless work. I repeated tests after learning that 4.13 doesn't support the nobarrier mount option for XFS. My workaround was to switch to ext4 and the results here are from ext4.

The servers have 2 cores and 4 HW threads. I normally use them for low-concurrency benchmarks with 1 or 2 concurrent database clients. For this test I used 1, 2 and 8 concurrent clients to determine whether more concurrency and more mutex contention would cause more of a performance loss.

The sysbench test was configured to use 1 table with 4M rows and InnoDB. The InnoDB buffer pool was large enough to cache the table. The sysbench client runs on the same host as mysqld.

I just noticed that all servers had the doublewrite buffer and binlog disabled. This was leftover from debugging the XFS nobarrier change.

Results

My usage of sysbench is described here which explains the tests that I list below. Each test has QPS for 1, 2 and 8 concurrent clients. Results are provided for
  • pti enabled - kernel 4.13.0-26 with the Meltdown fix enabled
  • pti disabled - kernel 4.13.0-26 with the Meltdown fix disabled via pti=off
  • old kernel, no pti - kernel 4.8.0-36 which doesn't have the Meltdown fix
After each of the QPS sections, there are two lines for QPS ratios. The first line compares the QPS for the kernel with the Meltdown fix enabled vs disabled. The second line compares the QPS for the kernel with the Meltdown fix vs the old kernel. A value less than one means that MySQL gets less QPS with the Meltdown fix.

update-inlist
1       2       8       concurrency
5603    7546    8212    pti enabled
5618    7483    8076    pti disabled
5847    7613    8149    old kernel, no pti
-----   -----   -----
0.997   1.008   1.016   qps ratio: pti on/off
0.958   0.991   1.007   qps ratio: pti on / old kernel

update-one
1       2       8       concurrency
11764   18880   16699   pti enabled
12074   19475   17132   pti disabled
12931   19573   16559   old kernel, no pti
-----   -----   -----
0.974   0.969   0.974   qps ratio: pti on/off
0.909   0.964   1.008   qps ratio: pti on / old kernel

update-index
1       2       8       concurrency
7202    12688   16738   pti enabled
7197    12581   17466   pti disabled
7443    12926   17720   old kernel, no pti
-----   -----   -----
1.000   1.000   0.958   qps ratio: pti on/off
0.967   0.981   0.944   qps ratio: pti on / old kernel

update-nonindex
1       2       8       concurrency
11103   18062   22964   pti enabled
11414   18208   23076   pti disabled
12395   18529   22168   old kernel, no pti
-----   -----   -----
0.972   0.991   0.995   qps ratio: pti on/off
0.895   0.974   1.035   qps ratio: pti on / old kernel

delete
1       2       8       concurrency
19197   30830   43605   pti enabled
19720   31437   44935   pti disabled
21584   32109   43660   old kernel, no pti
-----   -----   -----
0.973   0.980   0.970   qps ratio: pti on/off
0.889   0.960   0.998   qps ratio: pti on / old kernel

read-write range=100
1       2       8       concurrency
11956   20047   29336   pti enabled
12475   20021   29726   pti disabled
13098   19627   30030   old kernel, no pti
-----   -----   -----
0.958   1.001   0.986   qps ratio: pti on/off
0.912   1.021   0.976   qps ratio: pti on / old kernel

read-write range=10000
1       2       8       concurrency
488     815     1080    pti enabled
480     768     1073    pti disabled
504     848     1083    old kernel, no pti
-----   -----   -----
1.016   1.061   1.006   qps ratio: pti on/off
0.968   0.961   0.997   qps ratio: pti on / old kernel

read-only range=100
1       2       8       concurrency
12089   21529   33487   pti enabled
12170   21595   33604   pti disabled
11948   22479   33876   old kernel, no pti
-----   -----   -----
0.993   0.996   0.996   qps ratio: pti on/off
1.011   0.957   0.988   qps ratio: pti on / old kernel

read-only.pre range=10000
1       2       8       concurrency
392     709     876     pti enabled
397     707     872     pti disabled
403     726     877     old kernel, no pti
-----   -----   -----
0.987   1.002   1.004   qps ratio: pti on/off
0.972   0.976   0.998   qps ratio: pti on / old kernel

read-only range=10000
1       2       8       concurrency
394     701     874     pti enabled
389     698     871     pti disabled
402     725     877     old kernel, no pti
-----   -----   -----
1.012   1.004   1.003   qps ratio: pti on/off
0.980   0.966   0.996   qps ratio: pti on / old kernel

point-query.pre
1       2       8       concurrency
18490   31914   56337   pti enabled
19107   32201   58331   pti disabled
18095   32978   55590   old kernel, no pti
-----   -----   -----
0.967   0.991   0.965   qps ratio: pti on/off
1.021   0.967   1.013   qps ratio: pti on / old kernel

point-query
1       2       8       concurrency
18212   31855   56116   pti enabled
18913   32123   58320   pti disabled
17907   32941   55430   old kernel, no pti
-----   -----   -----
0.962   0.991   0.962   qps ratio: pti on/off
1.017   0.967   1.012   qps ratio: pti on / old kernel

random-points.pre
1       2       8       concurrency
3043    5940    8131    pti enabled
2944    5681    7984    pti disabled
3030    6015    8098    old kernel, no pti
-----   -----   -----
1.033   1.045   1.018   qps ratio: pti on/off
1.004   0.987   1.004   qps ratio: pti on / old kernel

random-points
1       2       8       concurrency
3053    5930    8128    pti enabled
2949    5756    7981    pti disabled
3058    6011    8116    old kernel, no pti
-----   -----   -----
1.035   1.030   1.018   qps ratio: pti on/off
0.998   0.986   1.001   qps ratio: pti on / old kernel

hot-points
1       2       8       concurrency
3931    7522    9500    pti enabled
3894    7535    9214    pti disabled
3914    7692    9448    old kernel, no pti
-----   -----   -----
1.009   0.998   1.031   qps ratio: pti on/off
1.004   0.977   1.005   qps ratio: pti on / old kernel

insert
1       2       8       concurrency
12469   21418   25158   pti enabled
12561   21327   25094   pti disabled
13045   21768   21258   old kernel, no pti
-----   -----   -----
0.992   1.004   1.002   qps ratio: pti on/off
0.955   0.983   1.183   qps ratio: pti on / old kernel

XFS, nobarrier and the 4.13 Linux kernel

tl;dr

My day
  • nobarrier isn't supported as a mount option for XFS in kernel 4.13.0-26 with Ubuntu 16.04. I assume this isn't limited to Ubuntu. Read this for more detail on the change.
  • write throughput is much worse on my SSD without nobarrier
  • there is no error on the command line when mounting a device that uses the nobarrier option
  • there is an error message in dmesg output for this

There might be two workarounds:
  • switch from XFS to ext4
  • echo "write through" > /sys/block/$device/queue/write_cache

The Story

I have a NUC cluster at home for performance tests with 3 NUC5i3ryh and 3 NUC7i5bnh. I recently replaced the SSD devices in all of them because previous testing wore them out. I use Ubuntu 16.04 LTS and recently upgraded the kernel on some of them to get the fix for Meltdown.

The NUC7i5bnh server has a Samsung 960 EVO SSD that uses NVMe. I use the HWE kernel to make wireless work. The old kernel without the Meltdown fix is 4.8.0-36 and the kernel with the Meltdown fix is 4.13.0-26. Note that with the old kernel I used XFS with the nobarrier option. With the new kernel I assumed I was still getting nobarrier, but I was not. I have since switched from XFS to ext4.

The NUC5i3ryh server has a Samsung 850 EVO SSD that uses SATA. The old kernel without the Meltdown fix is 4.4.0-38 and the kernel with the Meltdown fix is 4.4.0-109. I continue to use XFS on these.

Results from sysbench for the NUC5i3ryh show not much regression from the Meltdown fix. Results for the NUC7i5bnh show a lot of regression for the write-heavy tests and not much for the read-heavy tests.
  • I started to debug the odd 7i5bnh results and noticed that write IO throughput was much lower for servers with the Meltdown fix using 4.13.0-26. 
  • Then I used sysbench fileio to run IO tests without MySQL and noticed that read IO was fine, but write IO throughput was much worse with the 4.13.0-26 kernel.
  • Then I consulted my local experts, Domas Mituzas and Jens Axboe.
  • Then I noticed the error message in dmesg output.

Meltdown vs MySQL part 1: in-memory sysbench and a core i3 NUC

This is my first performance report for the Meltdown patch using in-memory sysbench and a small server.
  • the worst case overhead was ~5.5%
  • a typical overhead was ~2%
  • QPS was similar between the kernel with the Meltdown fix disabled and the old kernel
  • the overhead with too much concurrency (8 clients) wasn't worse than the overhead without too much concurrency (1 or 2 clients)

Configuration

My usage of sysbench is described here. The servers are described here. For this test I used the core i3 NUC (NUC5i3ryh) with Ubuntu 16.04. I have 3 such servers and ran tests with the fix enabled (kernel 4.4.0-109), the fix disabled via pti=off (kernel 4.4.0-109) and the old kernel (4.4.0-38) that doesn't have the fix. From cat /proc/cpuinfo I see pcid.

The servers have 2 cores and 4 HW threads. I normally use them for low-concurrency benchmarks with 1 or 2 concurrent database clients. For this test I used 1, 2 and 8 concurrent clients to determine whether more concurrency and more mutex contention would cause more of a performance loss.

The sysbench test was configured to use 1 table with 4M rows and InnoDB. The InnoDB buffer pool was large enough to cache the table. The sysbench client runs on the same host as mysqld.

Results

My usage of sysbench is described here which explains the tests that I list below. Each test has QPS for 1, 2 and 8 concurrent clients. Results are provided for
  • pti enabled - kernel 4.4.0-109 with the Meltdown fix enabled
  • pti disabled - kernel 4.4.0-109 with the Meltdown fix disabled via pti=off
  • old kernel, no pti - kernel 4.4.0-38 which doesn't have the Meltdown fix
After each of the QPS sections, there are two lines for QPS ratios. The first line compares the QPS for the kernel with the Meltdown fix enabled vs disabled. The second line compares the QPS for the kernel with the Meltdown fix vs the old kernel. A value less than one means that MySQL gets less QPS with the Meltdown fix.

update-inlist
1       2       8       concurrency
2039    2238    2388    pti enabled
2049    2449    2369    pti disabled
2059    2199    2397    old kernel, no pti
-----   -----   -----
0.995   0.913   1.008   qps ratio: pti on/off
0.990   1.017   0.996   qps ratio: pti on / old kernel

update-one
1       2       8       concurrency
8086    11407   9498    pti enabled
8234    11683   9748    pti disabled
8215    11708   9755    old kernel, no pti
-----   -----   -----
0.982   0.976   0.974   qps ratio: pti on/off
0.984   0.974   0.973   qps ratio: pti on / old kernel

update-index
1       2       8       concurrency
2944    4528    7330    pti enabled
3022    4664    7504    pti disabled
3020    4784    7555    old kernel, no pti
-----   -----   -----
0.974   0.970   0.976   qps ratio: pti on/off
0.974   0.946   0.970   qps ratio: pti on / old kernel

update-nonindex
1       2       8       concurrency
6310    8688    12600   pti enabled
6103    8482    11900   pti disabled
6374    8723    12142   old kernel, no pti
-----   -----   -----
1.033   1.024   1.058   qps ratio: pti on/off
0.989   0.995   1.037   qps ratio: pti on / old kernel

delete
1       2       8       concurrency
12348   17087   23670   pti enabled
12568   17342   24448   pti disabled
12665   17749   24499   old kernel, no pti
-----   -----   -----
0.982   0.985   0.968   qps ratio: pti on/off
0.974   0.962   0.966   qps ratio: pti on / old kernel

read-write range=100
1       2       8       concurrency
 9999   14973   21618   pti enabled
10177   15239   22088   pti disabled
10209   15249   22153   old kernel, no pti
-----   -----   -----
0.982   0.982   0.978   qps ratio: pti on/off
0.979   0.981   0.975   qps ratio: pti on / old kernel

read-write range=10000
1       2       8       concurrency
430     762     865     pti enabled
438     777     881     pti disabled
439     777     882     old kernel, no pti
-----   -----   -----
0.981   0.980   0.981   qps ratio: pti on/off
0.979   0.980   0.980   qps ratio: pti on / old kernel

read-only range=100
1       2       8       concurrency
10472   19016   26631   pti enabled
10588   20124   27587   pti disabled
11290   20153   27796   old kernel, no pti
-----   -----   -----
0.989   0.944   0.965   qps ratio: pti on/off
0.927   0.943   0.958   qps ratio: pti on / old kernel

read-only.pre range=10000
1       2       8       concurrency
346     622     704     pti enabled
359     640     714     pti disabled
356     631     715     old kernel, no pti
-----   -----   -----
0.963   0.971   0.985   qps ratio: pti on/off
0.971   0.985   0.984   qps ratio: pti on / old kernel

read-only range=10000
1       2       8       concurrency
347     621     703     pti enabled
354     633     716     pti disabled
354     638     716     old kernel, no pti
-----   -----   -----
0.980   0.981   0.988   qps ratio: pti on/off
0.980   0.973   0.981   qps ratio: pti on / old kernel

point-query.pre
1       2       8       concurrency
16104   29540   46863   pti enabled
16716   30052   49404   pti disabled
16605   30392   49872   old kernel, no pti
-----   -----   -----
0.963   0.982   0.948   qps ratio: pti on/off
0.969   0.971   0.939   qps ratio: pti on / old kernel

point-query
1       2       8       concurrency
16240   29359   47141   pti enabled
16640   29785   49015   pti disabled
16369   30226   49530   old kernel, no pti
-----   -----   -----
0.975   0.985   0.961   qps ratio: pti on/off
0.992   0.971   0.951   qps ratio: pti on / old kernel

random-points.pre
1       2       8       concurrency
2756    5202    6211    pti enabled
2764    5216    6245    pti disabled
2679    5130    6188    old kernel, no pti
-----   -----   -----
0.997   0.997   0.994   qps ratio: pti on/off
1.028   1.014   1.003   qps ratio: pti on / old kernel

random-points
1       2       8       concurrency
2763    5177    6191    pti enabled
2768    5188    6238    pti disabled
2701    5076    6182    old kernel, no pti
-----   -----   -----
0.998   0.997   0.992   qps ratio: pti on/off
1.022   1.019   1.001   qps ratio: pti on / old kernel

hot-points
1       2       8       concurrency
3414    6533    7285    pti enabled
3466    6623    7287    pti disabled
3288    6312    6998    old kernel, no pti
-----   -----   -----
0.984   0.986   0.999   qps ratio: pti on/off
1.038   1.035   1.041   qps ratio: pti on / old kernel

insert
1       2       8       concurrency
7612    10051   11943   pti enabled
7713    10150   12322   pti disabled
7834    10243   12514   old kernel, no pti
-----   -----   -----
0.986   0.990   0.969   qps ratio: pti on/off
0.971   0.981   0.954   qps ratio: pti on / old kernel