Posts

Showing posts from November, 2015

Read, write & space amplification - B-Tree vs LSM

This post compares a B-Tree and LSM for read, write and space amplification. The comparison is done in theory and practice, so expect some handwaving mixed with data from iostat and vmstat collected while running the Linkbench workload. For the LSM I consider leveled compaction rather than size-tiered compaction. For the B-Tree I consider a clustered index like InnoDB. The comparison in practice provides values for read, write and space amplification on real workloads. The comparison in theory attempts to explain those values.

B-Tree vs LSM in theory

Read Amplification

Most comparisons should be done for a specific context, including the hardware and workload. For now I am only specific about the cache hit rate. For the B-Tree I assume that all non-leaf levels are in cache. For the LSM I assume that everything but the data blocks of the largest LSM level are in cache. While an LSM with leveled compaction has more things to keep in the cache (bloom filters) it also benef…
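The back-of-envelope theory in this excerpt can be sketched as a small calculator. Everything below is a hedged illustration: the fanout, level count, page size, row size and fill factor are assumed values chosen for the example, not measurements from the post.

```python
# Rough amplification estimates for a clustered B-Tree vs a
# leveled-compaction LSM. All parameters are illustrative assumptions.

def btree_amps(page_size=8192, row_size=128, page_fill=0.67):
    # Read-amp: with all non-leaf levels cached, a point read touches
    # one leaf page on storage.
    read_amp = 1
    # Write-amp: a whole page is eventually rewritten for a small row
    # change, so the worst case is page_size / row_size.
    write_amp = page_size / row_size
    # Space-amp: leaf pages average ~2/3 full after random updates.
    space_amp = 1 / page_fill
    return read_amp, write_amp, space_amp

def lsm_amps(levels=4, fanout=10):
    # Read-amp: bloom filters skip the smaller levels, leaving roughly
    # one data-block read from the largest level when all else is cached.
    read_amp = 1
    # Write-amp: leveled compaction rewrites a key roughly fanout/2
    # times per level on average, plus the WAL and memtable flush.
    write_amp = 2 + levels * fanout / 2
    # Space-amp: leveled compaction keeps dead versions to roughly 10%.
    space_amp = 1.1
    return read_amp, write_amp, space_amp
```

With these assumed parameters the sketch reproduces the qualitative claim: the B-Tree wins on read-amp and the LSM wins on write-amp, while the LSM also wins on space-amp.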

Read, write & space amplification - pick 2

Good things come in threes, then reality bites and you must choose at most two. This choice is well known in distributed systems with CAP, PACELC and FIT. There is a similar choice for database engines. An algorithm can optimize for at most two of read, write and space amplification. These are metrics for efficiency and performance. This means one algorithm is unlikely to be better than another at all three. For example, a B-Tree has less read amplification than an LSM, while an LSM has less write amplification than a B-Tree. I abbreviate the metrics as read-amp, write-amp and space-amp. I also abbreviate this as the framework. The framework assumes a database workload that consists of point queries, range queries of length N and writes. Were I to add a delete operation then this would match the RocksDB and LevelDB API. The write is a blind write as it doesn't imply a read prior to the write. This is part one of a topic that requires several blog posts. The s…
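The workload model assumed by the framework (point queries, range queries of length N, blind writes, and optionally deletes) can be sketched as a minimal key-value API in the style of LevelDB/RocksDB. The class and method names here are illustrative, not an actual client library:

```python
# Minimal sketch of the framework's workload model. A blind write
# stores a value without reading the old one first.

class KVStore:
    def __init__(self):
        self.data = {}

    def get(self, key):
        """Point query."""
        return self.data.get(key)

    def scan(self, start, n):
        """Range query of length n, starting at key >= start."""
        keys = sorted(k for k in self.data if k >= start)
        return [(k, self.data[k]) for k in keys[:n]]

    def put(self, key, value):
        """Blind write: no read prior to the write."""
        self.data[key] = value

    def delete(self, key):
        """Blind delete, matching the RocksDB/LevelDB-style API."""
        self.data.pop(key, None)
```

The point is the interface shape, not the dict-backed implementation: read-amp, write-amp and space-amp are defined against exactly these operations.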

How does MongoDB do on Linkbench with concurrency?

I started to study performance math with a focus on queueing theory and the USL. To get data for the USL I ran Linkbench for MongoDB with different levels of concurrency. I tested WiredTiger in MongoDB 3.2.0rc0 and 3.0.7 and then RocksDB in MongoDB 3.2.0. The performance summary is:

- WiredTiger load scales much better in 3.2 compared to 3.0
- RocksDB throughput is stable at high concurrency
- WiredTiger 3.2 throughput is stable at high concurrency
- For WiredTiger 3.0 the load rate drops significantly at high concurrency and the query rate also has an odd drop

Configuration

The test server has 2 sockets with 6 CPU cores/socket and hyperthreading enabled to get 24 HW threads. The server also has 6 SAS disks with HW RAID 0 and 1 SSD. I intended to use the disk array for all tests but ended up using the SSD for the WiredTiger 3.0.7 test. The server also has 144G of RAM and the test database was cached by mongod for all tests. The oplog was enabled but sync-on-commit was not done. F…
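For context, the USL mentioned above models throughput as a function of concurrency with two penalty terms: contention (alpha) and coherency (beta). A minimal sketch, with made-up alpha and beta values rather than ones fitted to these Linkbench runs:

```python
# Universal Scalability Law: predicted throughput at concurrency n.
# lam, alpha and beta below are illustrative, not fitted parameters.

def usl_throughput(n, lam=1000.0, alpha=0.05, beta=0.001):
    """lam   - throughput of one client (ops/sec)
    alpha - contention: the serialized fraction of work
    beta  - coherency: the cost of crosstalk between clients
    """
    return lam * n / (1 + alpha * (n - 1) + beta * n * (n - 1))
```

With beta > 0 the curve has a peak near n = sqrt((1 - alpha) / beta) and throughput declines beyond it, which is the shape one looks for in the "drops at high concurrency" results above.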

Define better for a small-data DBMS

There are many dimensions by which a DBMS can be better for small-data workloads: performance, efficiency, manageability, usability and availability. By small data I mean OLTP. Performance gets too much attention from both industry and academia, while the other dimensions are at least as important in the success of a product. Note that this discussion is about which DBMS is likely to get the majority of new workloads. The decision to migrate a solved problem is much more complex.

- Performance - makes marketing happy
- Efficiency - makes management happy
- Manageability - makes operations happy
- Usability - makes database-backed application developers happy
- Availability - makes users happy

Performance makes marketing happy because they can publish a whitepaper to show their product is faster than the competition and hope that the message morphs from "X is faster than Y in this context" into "X is faster than Y". This is the art of benchmarketing. It can be risky to use avera…

MongoDB 3.2 vs Linkbench

I used LinkbenchX to compare performance and efficiency for MongoDB 3.2.0rc0 vs 3.0.7 with the RocksDB and WiredTiger engines. The Linkbench test has two phases: load and query. The test was run in three configurations: cached with data on disk, too big to cache with data on disk, and too big to cache with data on SSD. My summary:

Performance:
- load rates are similar for disk and SSD with RocksDB and WiredTiger
- load rate for WiredTiger is ~2X better in 3.2 versus 3.0
- load rate for WiredTiger is more than 2X better than RocksDB
- query rate for WiredTiger is ~1.3X better than RocksDB for cached database
- query rate for RocksDB is ~1.5X better than WiredTiger for not cached database

Efficiency:
- disk space used is ~1.33X higher for WiredTiger vs RocksDB
- disk bytes written per document during the load is ~5X higher for RocksDB
- disk bytes written per query is ~3.5X higher for WiredTiger
- RocksDB uses ~1.8X more CPU during the load
- WiredTiger uses ~1.4X more CPU during the query
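Per-operation efficiency numbers like these come from normalizing device-level counters (iostat bytes written, vmstat CPU time) by the operation count from the benchmark. A minimal sketch of that normalization, using placeholder inputs rather than the measured values from these runs:

```python
# Normalize iostat/vmstat-style counters into per-operation metrics.
# The inputs in the test below are placeholders, not measurements.

def per_op_metrics(disk_bytes_written, cpu_seconds, operations):
    """Return efficiency metrics per operation (insert or query)."""
    return {
        "bytes_written_per_op": disk_bytes_written / operations,
        "cpu_us_per_op": cpu_seconds * 1_000_000 / operations,
    }

def relative(a, b):
    """How many times larger a is than b, e.g. the '~5X higher' ratios."""
    return a / b
```

Comparing two engines then reduces to computing `relative()` on the same per-operation metric from each engine's run.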