Saturday, February 14, 2015

Storage efficiency

Storage efficiency is a big deal. I have learned a lot about read, write and space amplification over the past few years. Tiered storage has been a part of my work life via flashcache, but tiered storage for an LSM doesn't require flashcache. This is going to get interesting. We have more choices for SSD at a variety of price points based on write endurance and performance.  We have write-optimized database engines (RocksDB, Tokutek, WiredTiger) arriving for OLTP workloads. We can use this with commodity hardware and open-source DBMS solutions like MySQL and MongoDB. A lot of interesting work remains to put the pieces together while matching the quality of service we get from MySQL+InnoDB and I hope to use RocksDB as part of MySQL and MongoDB.

We are beginning to engage with the community. Yoshinori Matsunobu has a talk at the MySQL UC. I have a talk at the CloudDM workshop at ICDE. I expect more talks, hopefully a few conference papers and if you are local then there are RocksDB meetups.

One benefit from RocksDB compared to a b-tree is much better compression and much less write-amplification. I have observed this via linkbench and real workloads. There are many other ways that an LSM can reduce the IO demand from an application. But I don't want to give away too many details until Yoshi and I have done our talks.

Tuesday, February 3, 2015

I don't get marketing

MongoDB 3.0 is almost here. It includes the WiredTiger storage engine which is very impressive. Alas, now we are told that the original storage engine has performance problems. I am shocked. Who knew? Not anyone reading their documentation. I don't understand why companies play this game. Be honest with your users as this goes a long way to building a solid and educated community. I hold MySQL to the same standard. Lets us not forget that awesome manual section on atomic operations.

Don't let the marketing team have editorial control over the documentation team.

Edit: an era has ended, the section on atomic operations has been removed from the MySQL manual.

Tuesday, December 30, 2014

malloc and MongoDB performance

I used iibench to understand the impact of malloc on MongoDB performance to compare malloc from glibc 2.17, jemalloc 3.6.0 and tcmalloc/gperftools 2.2 for an insert-only workload with 10 concurrent clients. The test server has 40 cores with hyperthread enabled. Alas, I lost a few configuration details, but think I used MongoDB 2.8.0rc4 and the WiredTiger b-tree.

The summary is that tcmalloc and jemalloc provide a similar benefit and are much better than glibc malloc. I made no attempt to tune tcmalloc or jemalloc. I ran iibench 3 times for each configuration and chose the median result. There are four metrics for each configuration:

  1. Test duration in seconds
  2. Address space size per the VSZ column in ps
  3. Normalized context switch rate - the number of context switches per N documents inserted. I will leave N as undefined for now so the absolute value isn't interesting. The value can be compared between malloc implementations.
  4. Normalized CPU utilization - CPU utilization per N documents inserted. See #3.

Results

          seconds  VSZ(GB)   context-switch(relative)   CPU(relative)
tcmalloc   1943     65.1            6.6                   26510
jemalloc   1877     64.5            7.1                   27165
glibc      2251     72.3           23.0                   32120

For this test glibc uses 1.19X more time, 1.12X more address space, 3.23X more context switches and 1.18X more CPU than jemalloc and tcmalloc is similar to jemalloc. As tcmalloc is bundled with the source distribution for MongoDB I assume it is used in the binaries they distribute. This appears to be a good thing.

Friday, December 26, 2014

Storage overhead for attribute names in MongoDB

The flexibility of dynamic schema support in MongoDB comes with a few costs. One of the costs is that extra space is required to store the database. The problem is that attribute names are repeated in every document. If sizeof(attribute names) is significant relative to sizeof(attribute values) then a lot more space will be used for MongoDB with the mmapv1 engine compared to a DBMS that doesn't support dynamic schemas. Note that dynamic schemas are valuable because they support less schema, not no schema. Indexes, required attributes and assumptions about attribute values are all examples where some schema is used.

How much extra space is used for attribute names? Does page level compression in the WiredTiger and TokuMX engines make this a non-issue? Repeated values in every document from long attribute names seem like something that is easy to compress. And the alternative of using extra short attribute names will cause pain for anyone trying to use the database so page level compression might be the preferred solution.

While page level compression can remove the bloat from long attribute names for compressed copies of pages it doesn't solve the problem for uncompressed copies of pages. So there is still a cost from dynamic schemas. Perhaps one day we will get an engine that encodes long attribute names efficiently even for uncompressed pages. The impact from this is that fewer uncompressed pages can fit in cache because of the space overhead. Note that when page level compression is used there are some database pages that are in cache in both compressed and uncompressed forms. I assume that the WiredTiger and TokuMX block caches only cache uncompressed pages, but I am not an expert in either engine, and the OS filesystem cache has copies of compressed pages. I am not sure what happens when direct IO is used with WiredTiger because that prevents use of the OS filesystem cache.

Results

I used iibench for MongoDB and loaded 2B documents for a few configurations: mmapv1 and WiredTiger without compression (wt-none), with snappy compression (wt-snappy) and with zlib compression (wt-zlib). To keep the documents small I edited the iibench test to use a 1-byte character field per document and disabled creation of secondary indexes. I used two versions of iibench. The first used the attribute names as-is (long) and the second used shorter versions (short) for a few of the attribute names:
  • long attribute names: price, customerid, cashregisterid, dateandtime
  • short attribute names: price, cuid, crid, ts
There are two graphs below. The first graph has the ratio of the database size with long attributes versus the size with short attributes. A larger ratio means there is more overhead from long attributes and the mmapv1 engine has the worst ratio (1.53). The ratio for WiredTiger with the zlib engine is close to one (1.01). WiredTiger with snappy wasn't as good as zlib and I think that part of the reason is that a larger page size is used by WiredTiger when zlib compression is enabled. While that improves the compression ratio it can cause other performance problems but I am waiting for documentation to catch up to the code to understand this. One problem from a larger page size is that more memory is wasted in the block cache when a page is read to get one hot document. Another problem from a larger page size is that flash devices are usually much faster at 4kb reads than at 64kb reads, while there isn't much difference with disk for 4kb versus 64kb reads. For an IO-bound workload, there is also more CPU overhead from decompressing a 64kb page versus a 4kb page.
This shows the database size in GB for each of the engine configurations. Note that WiredTiger with zlib uses about 1/8th the space compared to mmapv1 and even the uncompressed WiredTiger engine does a lot better than mmapv1. I suspect that most of the benefit for wt-none versus mmapv1 is the overhead from using power of 2 allocations in mmapv1. As a side note, I am not sure we will be able to turn off power of 2 allocation for mmapv1 in future releases.


Tuesday, December 23, 2014

Read-modify-write optimized

A log structured merge tree (LSM) has optimizations that reduce the amount of reads from and writes to storage for a write-heavy workload. These optimizations aren't unique to an LSM as the fractal tree algorithm used by TokuMX and TokuDB has similar benefits.

There are different kinds of writes so I will explain this workload. I used iibench for MongoDB with 3 secondary indexes. This loads a collection in primary key order. Random values are used for attributes with secondary indexes. I repeated the test for different database cache sizes to show where an LSM is faster than a b-tree. While I wouldn't call the WiredTiger btree a write-optimized database algorithm, it is much more IO efficient than the mmapv1 engine (uses less space on disk, reads less from and writes less to storage for the same workload).

Results

This displays the document insert rate from iibench with 10 loader threads inserting 200M documents each. The collection goes from 0 to 2B documents during the benchmark. The test was repeated for the B-tree and LSM file structures in WiredTiger and the mmapv1 engine. The test used 3 cache sizes - 144G, 32G and 8G. Buffered IO was used for 144G and direct IO was used for 32G and 8G. The server has fast flash storage, 144G RAM and 40 hyperthread cores. The test used MongoDB 2.8.0rc3. At test end the database was 310G for the b-tree, 290G for the LSM and 815G for mmapv1. The WiredTiger tests used snappy compression. All tests used the SAFE write concern with 1000 documents inserted per request.

The insert rate for the LSM is insensitive to cache size which is expected for this workload. The insert rate for the b-tree is sensitive to cache size and goes from being much better than the LSM when the secondary index data stays in cache to much worse than the LSM when it does not. Results for mmap1 explain why it is great that MongoDB 2.8 includes WiredTiger.
I prefer benchmark results that include enough detail to understand the results. The results below are from a test with 2.8.0rc2 and are metrics from vmstat and iostat normalized per 200k inserted documents. As each insert request was for 1000 documents then the results are also per 200 requests. They are best used to compare between the WiredTiger btree (wt btree), WiredTiger LSM (wt lsm) and mmapv1 engine. The columns are:
  • vm.cs - context switches per 200 requests. Note the spike for the WiredTiger btree as the cache size is decreased. 
  • vm.us+sy - user & system CPU time (sum of the us and sy columns from vmstat multiplied by 1M). The relative value is useful. Again there is a spike for the WiredTiger btree and none for the LSM but the LSM uses much more CPU when the cache isn't small.
  • io.kBw - KB written per iostat per 200 requests. The WiredTiger LSM is always more efficient than the btree and the rate doesn't grow as the cache shrinks. The IO rate grows a lot for the WiredTiger btree as the cache shrink. All of that is expected. The WiredTiger btree has a much lower rate than mmapv1 for the same cache size.
  • io.kBr - KB read per iostat per 200 requests. The WiredTiger btree rate grows fast as the cache shrinks while the LSM is rate does not grow as the cache is reduced from 32G to 8G. The rate for mmapv1 at 144G is much worse than for WiredTiger.
  • io.r - read requests per iostat per 200 requests. The results match io.kBr.

per 200k inserts
vm.cs  vm.us+sy  io.kBw   io.kBr    io.r     engine, cache
  29     42445     417      0.49     0.02    wt btree, 144G
 306     67025     791    179.0     17.66    wt btree, 32G
1744    140560    2524   1889.0    215.0     wt btree, 8G

  76    152235     294      0.31     0.02    wt lsm, 144G
 134    148545     359     65.0      4.02    wt lsm, 32G
 137    146215     355     65.0      4.11    wt lsm, 8G 

 350     79500    2414      3.17     0.16    mmapv1, 144G 

B-tree

For a b-tree the IO pattern for the primary key index is efficient -- it is append only to grow the index to the right. If the user provides the value for the PK then the database must confirm that value is unique by reading from the database. That will not require extra reads from storage here because the database pages with that value should remain in cache.

The IO patterns for the secondary indexes are not efficient as inserted docs use random values for the secondary index attributes. Secondary index maintenance for a b-tree is a read-modify-write operation.  The page that has or will have the key must be fetched before it is written. For WiredTiger, and most other popular b-tree implementations, this is a random storage read eventually followed by a random storage write. When compression is used then a page must be decompressed as part of the read and compressed as part of the write.

InnoDB is able to avoid some of the random storage reads from secondary index maintenance via the change buffer. That avoids reads when the change buffer gets multiple changes to a secondary index page before doing the read and it can be very useful in production and on benchmarks.

A b-tree backed by a redo log defers writing back dirty pages until checkpoint or buffer pool pressure forces some pages to be written back. This reduces the IO overhead when one write to storage includes multiple changes to the page. This is less likely as the database:cache ratio increases. This occurs more frequently for workloads with skew (some keys are written much more frequently).

The size of the secondary indexes relative to cache are critical. While the database might be huge relative to cache (100X larger for example) if the secondary index fits in cache then all storage reads and many storage writes for the secondary index pages will be avoided and that makes a huge difference on the amount of IO for this workload.

For now I will ignore a b-tree that uses log-structured writes. That might reduce the cost for writing back pages but it will require storage reads and writes for garbage collection to copy-out live pages from previously written extents. It will also suffer from write-amplification for both the copy-out done during GC and from writing back a dirty page when only a fraction of the page has changed.

LSM

The popular compaction algorithms for an LSM are size-tiered and leveled. The WiredTiger LSM uses size-tiered. Without looking at the WiredTiger source, blind writes should be sufficient for secondary index maintenance because there are no unique constraints. A storage read is not required before the write.

A read before the write might be required for primary index maintenance to confirm that the PK for the inserted doc is unique. However the page with that data should almost always be in cache. An LSM can also use a bloom filter for workloads that don't load in PK order. It might be hard to define a bloom filter that is great for multiple indexes, in which case a column family is needed per index. I am not sure if WiredTiger used. I did not find references to bloom in the WiredTiger engine for MongoDB.

An LSM will do writes to storage during compaction and memtable flushes. It can be difficult to compare the overhead from these operations as they use larger IO requests (many pages at a time) compared to a b-tree. A big difference between an LSM and a b-tree is that most of the requests from the LSM are larger (many pages at a time) and amortized over many user operations. A busy LSM still does random IO:

  • multi-threaded compaction means that each thread has one stream of writes and one or more stream of reads when the database is larger than cache. The request size for these is large (many pages at a time)
  • another stream of writes is in progress for the redo log (WAL, write ahead log)
  • user operations can require single-page reads from storage

Thursday, November 20, 2014

Aurora for MySQL is coming

I am excited about Aurora for MySQL. While there aren't many details, I have two conclusions from the information that is available. First, many talented people did great work on this. Second, many customers want the features it provides and some of these features are otherwise unavailable unless you are web-scale and can afford a team of automation experts. This is a big deal and good for the MySQL community. I am not just writing this to boost my priority on the Aurora preview signup list.

I don't think it matters whether Aurora has better performance. The big story is much better availability and manageability without having to hire an expert MySQL operations team. The limiting factor might be cost but that depends on whether you come from the land where everything is free or from a commercial DBMS. And even in the land of free, the cost for an operations team is large.

Soapbox

Before describing what I learned about it I have a short editorial.
  • Is Amazon the reason we need AGPL? It isn't clear to me that they improve upstream MySQL. They do benefit the community by making it easier to run MySQL in the cloud. But I don't see them in the community. Which means I also don't see their developers in the community and who wants to disappear while working on mostly open-source?
  • Their marketing leads with up to 5X faster than MySQL which is translated by the tech press into 5X faster than MySQL. When I signed up for an Aurora preview the response email from Amazon also used the 5X faster than MySQL claim. I prefer up to 5X faster than MySQL.
  • In the video from AWS re:Invent 2014 James Hamilton makes some interesting claims.
    • I am not sure he heard about InnoDB based on this statement -- Just doing a state of the art storage engine would have been worth doing. Take Jim Gray's black book on transactions, implement it, I would be happy. Someone can tell him to be happy.
    • In big print -- 3X write performance, 5X read performance. In small print -- sysbench. It will take time to determine whether similar performance & availability can be had elsewhere at a similar cost.
    • In big print -- 400X less lag, 2 seconds vs 5 milliseconds. I doubt this. The slide also stated -- synchronous multi-DC replication. I don't see how that is possible except within a datacenter given network latency. Other content from Amazon claims this is async replication. But then Mr. Hamilton stated that everything is there, transactions are not lost were two data centers to instantly disappear. Again, that requires sync replication. This is confusing and I am surprised his claims don't match the other documentation from Amazon.
    • It can fix torn pages. InnoDB does this for writes in progress during a crash. I assume Aurora can also do this for pages written a long time ago and that would be a big deal.

Information

What is Aurora? I don't know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB. I encourage people to read the details page, FAQ, slides and pricing guide. The response from Clustrix is also useful. At least one answer on Quora is excellent and this overview of Amazon datacenters will convince you that they network needed to make this work with low latency.

The big deal is that the database is replicated 6X using 3 availability zones (AZs). I assume this means it uses 2 storage clusters per AZ. Documentation from Amazon states this is done using async replication and (committed) writes are available within a few milliseconds or 10s of milliseconds. I assume this is only for Aurora read replicas in the same AZ as the master. Network latency will make that impossible for some remote AZs.  In the presentation at AWS re:Invent there are claims that the replication is synchronous. That is confusing.

For now I will assume that Aurora does async replication to the two remote AZs and the non-primary storage cluster in the primary AZ -- which means that 5 of the 6 copies are updated via async replication. If this is true, then an outage at the primary storage cluster can mean that recent commits are lost. It would be great if someone were to make this clear. My preferred solution would be sync replication within the primary AZ to maintain copies in 2 storage clusters, and then async replication to the 4 copies in the 2 remote AZs. We will soon have multiple solutions in the MySQL community that can do sync replication within a datacenter and async replication elsewhere -- Galera, upstream and lossless semisync. But Aurora is much easier to deploy than the alternatives. My standard question is what commit rate can be sustained when all updates are to the same row? Sync replication with cross-country network round trips makes that slow.

The presentation also claimed this was mostly drop-in compatible. I am interested in compatibility with InnoDB. 
  • What are the semantics for cursor isolation? PostgreSQL, InnoDB and Oracle all do snapshot isolation with different semantics for writes. PostgreSQL has excellent documentation to describe the behavior.
  • Does Aurora support clustered indexes? 
  • What is the max size of an index key?
  • How are large columns supported? Is data always inline?
  • How is multi-versioning implemented? InnoDB usually does updates in place so there isn't much work for purge to do except for deletes and updates to secondary index columns. 
  • Does this use pessimistic or optimistic concurrency control?
  • Does this support partitioning?
  • What is the block size for reads?
  • Does this use compression?

Features

The brief description of features is very interesting. I will summarize that here. The features sound great and I expect them to get a lot of customers.
  • Unplanned failover to an Aurora read replica takes a few minutes. Unplanned failover to a new instance can take up to 15 minutes. This happens for customers who aren't spending money on Aurora read replicas. This is another feature that will make Aurora very popular. While it will be nice to make failover faster, the big deal is that they provide this. The usual MySQL deployment required some do-it-yourself effort to get something similar.
  • Storage uses SSD. I assume this is based on EBS. They do background scrubbing to detect and correct corrupt pages. Storage grows in 10G increments. You don't have to provision for a fixed amount of GB or TB, they will grow as needed up to 64T (or 64T/table). I am not sure I would want 64T in one instance, but it can make it easy to archive data in place. Also automatic growth to such large database sizes will make it much easier for deployments to avoid sharding especially when an instance has many TB of cold data.
  • There are interesting features for point-in-time recovery, incremental backup and snapshots. This is integrated with S3. Incremental backups make it affordable to archive data in place as you don't do full backups for data that doesn't change. But I don't understand all of their backup options. 
  • Database is replicated 2 times within 3 AZs so there are 6 copies. Up to 2 copies can be lost and writes are still possible. Up to 3 copies can be lost and reads are still possible. I assume that by copies can be lost they mean storage clusters can be lost. Automatic recovery here is another big deal.
  • The buffer pool survives mysqld process restart. I wonder if that is only true for planned restart. Regardless, this is a very useful feature for IO-bound workloads when it is important to have a warm cache. Percona used to have a patch for this with InnoDB.
  • Replication is supported in two ways -- via Aurora (at storage level) and MySQL (binlog). My bet is that Aurora replication will be most popular but some people will use MySQL replication to replicas with locally attached storage to save money. They claim much less lag with Aurora and I agree there will be less but I am not sure there will be 400X less. However, they avoid some overhead on commit by not using the binlog and they avoid a lot of complexity by not requiring MySQL replication on the master or slave. I assume that Aurora replication ships deltas rather than page images to be efficient on the network.

Cost

Cost comparisons will be interesting. I am very uncertain about my estimates here as I don't have much experience with prices on AWS or for normal sized datacenter deployments limited to a few servers. I estimate a 3 year cost of $568,814 for a high-end Aurora deployment: largest servers, 1 master, 1 replica, backup and storage. It will be interesting to compare this to non-AWS hardware because you need to account for features that Aurora provides and for the extra availability of the Aurora storage solution. I used 3TB and 10k IOPs from storage because that can be provided easily by a high-end PCIe flash device, but that also assumes a very busy server.
  • Reserved db.r3.8xlarge (32 vCPUs, 244GiB RAM) for 2 servers is $115,682 over 3 years
  • 50TB of backup at 10 cents/GB/month is $53,100 over 3 years
  • 3TB of storage and 10,000 IOPs per server is $400,032 for 2 servers over 3 years. But I am not sure if the storage cost includes IO done to all AZs to maintain the 6 database copies. In that case, the storage cost might be higher.

Wednesday, October 15, 2014

Updates with secondary index maintenance: 5.7 vs previous releases

My previous results for an update-only workload to a cached and IO-bound database used a table that did not require secondary index maintenance from the update. Prior to that I ran tests using mysqlslap and might have found a regression in InnoDB when the update requires secondary index maintenance (bug 74235). The mysqlslap test did all updates to the same row. The test I describe here chooses the row to update at random and this workload does not reproduce the worst case from bug 74235.

The results are that newer releases tend to do worse at 1 thread, 5.7.5 does best at 32 threads and 5.7.5 does slightly worse than 5.6.21 at less than 32 threads.  We need to fix the regressions in 5.7.5 at less-than-high concurrency workloads.

setup

The test here was the same as described in my previous posts with one exception -- the test tables have a secondary index on (k,c) and note that the column c is set to 0 at test start. The updates are of the form: UPDATE $table SET c = c + 1 WHERE id = $X. The test tables look like:
CREATE TABLE sbtest1 (
  id int(10) unsigned NOT NULL auto_increment,
  k int(10) unsigned NOT NULL default '0',
  c char(120) NOT NULL default '',
  pad char(60) NOT NULL default '',
  PRIMARY KEY  (id),
  KEY k (k,c)
) ENGINE=InnoDB AUTO_INCREMENT=8000001 DEFAULT CHARSET=latin1

There were 8 tables in the database with 8M rows per table. Tests were run for 1 thread with 1 table and 8, 16, 32 threads with 8 tables. The test at each concurrency level was run for 10 minutes. Tests were run for a 1G (IO-bound) and 32G (cached) InnoDB buffer pool. For the cached test the tables were read into the buffer pool at test start. The updates/second rate from these tests had more variance than previous tests. I ran tests many times and report rates here that were typical. For the 1-thread tests that use 1 table, the database (1 table) is 2G. For the 8, 16 and 32 thread tests that use 8 tables the database (8 tables) is 16G.

results at 1 thread

The rates for a cached database and 1 thread match most previous results -- newer releases are much slower than older releases and MySQL 5.7.5 is about 1.5X worse than MySQL 5.0.85. The results for the IO-bound setup don't follow that trend. Newer releases do better. I regret that I did not have the time to explain the difference between 5.6.21 and 5.7.5.


results at many threads

MySQL 5.7.5 and 5.6.21 are much better than older releases. 5.6.21 does better than 5.7.5 at 8 and 16 threads while 5.7.5 is the best at 32 threads.



configuration

The insert benchmark on a small server, IO-bound workload : Postgres 19 beta1

This has results for Postgres versions 19 beta1, 18.4 and 17.10 with the Insert Benchmark on a small server using a cached and CPU-bound wo...