
Showing posts from 2015

IO-bound linkbench for MongoDB 3.2

I previously shared Linkbench results for MongoDB 3.2.0 with a cached database. Here I provide results for a database larger than cache, using an SSD and a disk array, to compare RocksDB with the WiredTiger B-Tree. The performance summary is:
- the peak load rate is 2X better with WiredTiger in 3.2 vs 3.0
- the load rate for WiredTiger is much better than for RocksDB
- the load rate for WiredTiger and RocksDB does not get slower with disk vs SSD or with a cached vs an uncached database. For RocksDB this occurs because secondary index maintenance doesn't require page reads. This might be true for WiredTiger only because the secondary index pages fit in cache.
- the peak query rates were between 2X and 3X better for RocksDB vs WiredTiger
Configuration
The previous post explains the benchmark and test hardware. The test was repeated for 1, 4, 8, 16 and 24 concurrent clients for the disk array test and 1, 4, 8, 12, 16, 20 and 24 concurrent clients for the SSD test. Load pe

Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and an LSM for read, write and space amplification. The comparison is done in theory and in practice, so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload. For the LSM I consider leveled compaction rather than size-tiered compaction. For the B-Tree I consider a clustered index like InnoDB. The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values.
B-Tree vs LSM in theory
Read Amplification
Most comparisons should be done for a specific context including the hardware and workload. For now I am only specific about the cache hit rate. For the B-Tree I assume that all non-leaf levels are in cache. For the LSM I assume that everything but the data blocks of the largest LSM level is in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters) it also benef
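For the comparison in practice, a minimal sketch of how write- and read-amplification can be derived from device counters sampled around a Linkbench phase. The device name and logical byte totals are assumptions for illustration; the post derives its real numbers from iostat and vmstat.

# Minimal sketch: estimate read- and write-amplification for a benchmark run
# by sampling /proc/diskstats before and after it. Device name and logical
# byte counts are illustrative assumptions.
def disk_bytes(device):
    """Return (bytes_read, bytes_written) for a block device so far."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                # fields 5 and 9 are sectors read/written; diskstats sectors are 512 bytes
                return int(fields[5]) * 512, int(fields[9]) * 512
    raise ValueError("device not found: " + device)

def amplification(device, run_workload, logical_bytes_read, logical_bytes_written):
    r0, w0 = disk_bytes(device)
    run_workload()                      # e.g. the Linkbench load or query phase
    r1, w1 = disk_bytes(device)
    read_amp = (r1 - r0) / max(logical_bytes_read, 1)
    write_amp = (w1 - w0) / max(logical_bytes_written, 1)
    return read_amp, write_amp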

Read, write & space amplification - pick 2

Good things come in threes, then reality bites and you must choose at most two. This choice is well known in distributed systems with CAP, PACELC and FIT. There is a similar choice for database engines. An algorithm can optimize for at most two of read, write and space amplification. These are metrics for efficiency and performance. This means one algorithm is unlikely to be better than another at all three. For example, a B-Tree has less read amplification than an LSM, while an LSM has less write amplification than a B-Tree. I abbreviate the metrics as read-amp, write-amp and space-amp. I also abbreviate this as the framework. The framework assumes a database workload that consists of point-queries, range-queries of length N and writes. Were I to add a delete operation then this would match the RocksDB and LevelDB API. The write is a blind-write as it doesn't imply a read prior to the write. This is part one of a topic that requires several blog posts. The s
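To make the framework concrete, a minimal sketch of back-of-envelope estimates for a clustered B-Tree and a leveled LSM. The constants (page size, row size, per-level fanout, level count) are illustrative assumptions, not values from the post.

# Minimal sketch of the framework: rough read-amp, write-amp and space-amp
# for a clustered B-Tree and a leveled-compaction LSM. All constants below
# are illustrative assumptions.

def btree(page_size=16 * 1024, row_size=128):
    return {
        # non-leaf levels cached, so a point query reads one leaf page
        "read_amp": 1,
        # a dirty leaf page is eventually written back for as little as one changed row
        "write_amp": page_size / row_size,
        # leaf pages tend to be ~2/3 full after random inserts
        "space_amp": 1.5,
    }

def leveled_lsm(fanout=10, levels=4):
    return {
        # bloom filters and upper levels cached, so a point query reads
        # roughly one data block from the largest level
        "read_amp": 1,
        # a row is rewritten about `fanout` times per level during compaction
        "write_amp": fanout * levels,
        # a small fraction of data is stale pending compaction into the largest level
        "space_amp": 1.1,
    }

print(btree(), leveled_lsm())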

How does MongoDB do on Linkbench with concurrency?

I started to study performance math with a focus on queueing theory and the USL. To get data for the USL I ran Linkbench for MongoDB with different levels of concurrency. I tested WiredTiger in MongoDB 3.2.0rc0 and 3.0.7 and then RocksDB in MongoDB 3.2.0. The performance summary is:
- WiredTiger load scales much better in 3.2 compared to 3.0
- RocksDB throughput is stable at high concurrency
- WiredTiger 3.2 throughput is stable at high concurrency. For WiredTiger 3.0 the load rate drops significantly at high concurrency and the query rate also has an odd drop.
Configuration
The test server has 2 sockets with 6 CPU cores/socket and hyperthreading enabled to get 24 HW threads. The server also has 6 SAS disks with HW RAID 0 and 1 SSD. I intended to use the disk array for all tests but ended up using the SSD for the WiredTiger 3.0.7 test. The server also has 144G of RAM and the test database was cached by mongod for all tests. The oplog was enabled but sync-on-commit was not done. F
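For reference, a minimal sketch of the USL throughput model mentioned at the top of this post. The coefficient values below are placeholders, not values fitted to these Linkbench results.

# Minimal sketch of the Universal Scalability Law (USL): predicted throughput
# at concurrency n given a single-client rate (lam), a contention coefficient
# (sigma) and a coherency coefficient (kappa). The coefficients are
# placeholders, not fitted values.

def usl_throughput(n, lam, sigma, kappa):
    return lam * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))

# Example: print the modeled curve for the client counts used in the tests.
for n in (1, 4, 8, 12, 16, 20, 24):
    print(n, round(usl_throughput(n, lam=1000.0, sigma=0.02, kappa=0.0005), 1))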

Define better for a small-data DBMS

There are many dimensions by which a DBMS can be better for small data workloads: performance, efficiency, manageability, usability and availability. By small data I mean OLTP. Performance gets too much attention from both industry and academia while the other dimensions are at least as important to the success of a product. Note that this discussion is about which DBMS is likely to get the majority of new workloads. The decision to migrate a solved problem is much more complex.
- Performance - makes marketing happy
- Efficiency - makes management happy
- Manageability - makes operations happy
- Usability - makes database-backed application developers happy
- Availability - makes users happy
Performance makes marketing happy because they can publish a whitepaper to show their product is faster than the competition and hope that the message morphs from "X is faster than Y in this context" into "X is faster than Y". This is the art of benchmarketing. It can be risky to use avera

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk, and too big to cache with data on SSD. My summary:
Performance:
- load rates are similar for disk and SSD with RocksDB and WiredTiger
- load rate for WiredTiger is ~2X better in 3.2 versus 3.0
- load rate for WiredTiger is more than 2X better than RocksDB
- query rate for WiredTiger is ~1.3X better than RocksDB for a cached database
- query rate for RocksDB is ~1.5X better than WiredTiger for a not cached database
Efficiency:
- disk space used is ~1.33X higher for WiredTiger vs RocksDB
- disk bytes written per document during the load is ~5X higher for RocksDB
- disk bytes written per query is ~3.5X higher for WiredTiger
- RocksDB uses ~1.8X more CPU during the load
- WiredTiger uses ~1.4X more CPU during the query

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy). MongoDB is ahead of MySQL in features that make scale-out easier to manage and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-s

Problems not worth fixing

I worked on a closed-source DBMS years ago and the development cycle was 1) code for 6 months 2) debug for 18 months. Part 2 was longer than part 1 because the customers had high expectations for quality and the product delivered on that. But it was also longer because some co-workers might not have been code complete after 6 months and were quietly extending part 1 into part 2. I fixed many bugs during part 2. One that I remember was in the optimizer. Someone added code to detect and remove duplicate expressions (A and A and B --> A and B). Alas, the algorithm was O(N*N). Analytics queries can have complex WHERE clauses and all queries were forced to pay the O(N*N) cost whether or not they had any duplicate expressions. The overhead for this feature was reasonable after I fixed the code but it raises an interesting question. Are some problems not worth the cost of prevention? Several times a year I see feature requests that make me think of this. Forgot to add the bug th
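A minimal sketch of the kind of rewrite involved, assuming conjuncts can be compared for equality and hashed; the naive version compares every term against all previously kept terms, while a set-based version pays roughly linear cost.

# Minimal sketch: removing duplicate conjuncts from a WHERE clause, e.g.
# (A AND A AND B) --> (A AND B). Expressions are modeled as hashable values,
# which is an assumption for illustration.

def dedup_quadratic(conjuncts):
    """O(N*N): each term is compared against every previously kept term."""
    kept = []
    for expr in conjuncts:
        if all(expr != seen for seen in kept):
            kept.append(expr)
    return kept

def dedup_linear(conjuncts):
    """~O(N): a set remembers what has been kept, preserving original order."""
    seen = set()
    kept = []
    for expr in conjuncts:
        if expr not in seen:
            seen.add(expr)
            kept.append(expr)
    return kept

print(dedup_quadratic(["A", "A", "B"]))  # ['A', 'B']
print(dedup_linear(["A", "A", "B"]))     # ['A', 'B']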

MyRocks versus allocators: glibc, tcmalloc, jemalloc

I used a host with Fedora 19 and Linkbench to determine the ability of memory allocators to avoid fragmentation with MyRocks (MySQL+RocksDB). RocksDB can create pressure on an allocator because allocations for the block cache can have a short lifetime. Memory is allocated on page read and released when a page is evicted from the cache. The allocation is the size of the decompressed page and sometimes the size of the compressed page. RocksDB puts more pressure on the allocator than InnoDB because with InnoDB the buffer pool allocation is done once at process start. After the Linkbench load finished I ran the query test for 24 hours and report the values of VSZ and RSS from ps at the end. From the results below jemalloc & tcmalloc do much better at avoiding fragmentation (value of RSS) compared to glibc 2.17. I am still adjusting to larger values of VSZ with jemalloc. It isn't a problem, but it takes time to accept.
VSZ       RSS       allocator
22246784  10062100  jemalloc
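A minimal sketch of the measurement, assuming the mysqld pid is known; it reports VSZ and RSS in KB the same way the ps columns above do, by reading /proc.

# Minimal sketch: report VSZ and RSS (in KB) for a process, similar to the
# ps columns quoted above. The pid is an assumption; the post reads the
# values from ps at the end of a 24 hour query run.
import os

def vsz_rss_kb(pid):
    with open("/proc/%d/statm" % pid) as f:
        size_pages, resident_pages = map(int, f.read().split()[:2])
    page_kb = os.sysconf("SC_PAGE_SIZE") // 1024
    return size_pages * page_kb, resident_pages * page_kb

# Usage: print VSZ and RSS for a given mysqld pid.
# vsz, rss = vsz_rss_kb(12345)
# print(vsz, rss)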

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, which is a MySQL storage engine with RocksDB. This is an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that despite the problems I have reported, transparent page compression works. I was just testing the wrong systems. So I asked core developers from Btrfs whether it was OK to do a punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing a punch hole per write but it is not ready for production. Someone from MySQL suggested I try ext4, so I set up two servers with Ubuntu 14.04, which is on the list of supported systems. I used XFS on one and ext4 on the other. XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch hole writes. By ridiculous
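For reference, a minimal sketch of the hole-punch-per-write pattern being discussed, calling fallocate(2) via ctypes and then timing unlink(). The file path, file size and hole count are arbitrary assumptions for illustration; the post reports the behavior for multi-GB files.

# Minimal sketch: punch holes into a file the way a hole-punch-per-write
# scheme would, then time unlink(). Path, size and hole count are arbitrary
# assumptions.
import ctypes, ctypes.util, os, time

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(fd, offset, length):
    ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         ctypes.c_long(offset), ctypes.c_long(length))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

path = "/data/holey-file"          # assumed test file location
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
os.ftruncate(fd, 1 << 30)          # 1 GB file for illustration
for off in range(0, 1 << 30, 1 << 20):
    punch_hole(fd, off, 4096)      # one 4 KB hole per 1 MB region
os.close(fd)

start = time.time()
os.unlink(path)                    # the call that became very slow after many punched holes
print("unlink seconds:", time.time() - start)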

Third day with InnoDB transparent page compression

My first two days with InnoDB transparent page compression didn't turn out well. Transparent page compression can make the InnoDB source code simpler and InnoDB more performant on insert heavy workloads. Unfortunately the versions of XFS that I use are not happy after doing a hole-punch on write. The performance summary is that with transparent compression:
- Database load is slightly faster
- Transaction processing is slightly slower
- DROP TABLE is 43X slower
MySQL 5.6 vs 5.7
I used a host with 24 HW threads, 144G of RAM and a 400G Intel s3700 SSD. The server uses Fedora 19, XFS and Linux kernel 3.14.27-100.fc19. The benchmark application is linkbench and was run with maxid1=100M, loaders=10 and requesters=20 (10 clients for the load, 20 for queries). I compared the Facebook patch for MySQL 5.6 with upstream MySQL 5.7. For 5.6 I used old-style compression for linktable and counttable. For 5.7 I used transparent compression for all tables. I also used 32 partitions for 5.6 and

Linkbench for MySQL 5.7.8 with an IO-bound database

I wanted to try InnoDB transparent page compression, which is new in the MySQL 5.7.8 RC. That didn't work out, so I limited my tests to old-style compression. I compared MyRocks with InnoDB from the Facebook patch for 5.6, upstream 5.6.26 and upstream 5.7.8. My performance summary is:
- MyRocks loads data faster than InnoDB. This isn't a new result. Non-unique secondary index maintenance doesn't require a read before the write (unlike a B-Tree; see the sketch after this list). This is also helped by less random IO on writes and better compression.
- MyRocks compression is much better than compressed InnoDB. After 24 hours it used between 56% and 64% of the space compared to the compressed InnoDB configurations.
- MyRocks QPS degrades over time. This will be fixed real soon.
- Partitioning improves InnoDB load performance in MySQL 5.6 for compressed and non-compressed tables. This reduces stalls from the per-index mutex used by InnoDB when inserts cause or might cause a page split (pessimistic code path) be
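As a rough illustration of the read-free secondary index maintenance mentioned above, a minimal toy sketch contrasting a blind LSM-style index put with a B-Tree that must read the leaf page before inserting. The data structures are stand-ins for illustration, not MyRocks or InnoDB internals.

# Minimal toy sketch: why non-unique secondary index maintenance is read-free
# for an LSM but not for a B-Tree. These structures are illustrative stand-ins.

class LsmIndex:
    def __init__(self):
        self.memtable = []          # new entries, merged later by compaction

    def add_entry(self, sec_key, pk):
        # blind write: append the (secondary key, primary key) entry without
        # reading any existing index data
        self.memtable.append((sec_key, pk))

class BtreeIndex:
    def __init__(self):
        self.leaf_pages = {}        # page id -> sorted list of entries

    def add_entry(self, sec_key, pk):
        page_id = hash(sec_key) % 16
        # read-modify-write: the leaf page must be read (possibly from disk)
        # before the new entry can be inserted in place
        page = self.leaf_pages.setdefault(page_id, [])
        page.append((sec_key, pk))
        page.sort()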