Friday, January 31, 2020

Durability vs Availability

I don't consider myself an expert on distributed systems but sometimes I try to write on the topic. I hope any mistakes here aren't too serious. I like to describe things, so here is my attempt.


Durability is about the probability of losing data, whether that data was written to the DBMS long ago or just recently. I describe this as a probability rather than claiming that systems can or cannot lose data. I have spent too much time with web-scale deployments to claim that transactions will never be lost.

Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately most of us can make use of Raft and Paxos implementations written by experts.

While my team at Google was the first to implement semisync replication, someone else gets credit for the idea to make it lossless. And that turned out to be a great idea. Another interesting idea is the way that MongoDB has separated the read and write paths to provide strongly consistent reads and highly durable writes without implementing synchronous replication on the commit path (Raft is used elsewhere). This is clever and flexible. Users get the semantics they need while only paying for the stronger semantics when desired.


Availability is about avoiding big and small downtime. An obvious source of big downtime is the time required to restore from a backup and replay archived commit logs after a server fails. The cost of better availability is more hardware because it is frequently provided via replica sets: something that looks like synchronous replication plus one of:
  • Single-primary elected via consensus with fast & automated failover and failure detection
  • Multi-master - I assume this means no failover, which might be the fastest kind of failover

Small Downtime

Small downtime is another way to reduce availability and deserves more PR. Sources of small downtime include:
  • Too slow commit - not all clients can wait, so too slow might also imply data loss.
  • Java GC stalls - I am happy to have little experience with servers written in Java
  • Flash GC stalls - NAND flash response time isn't great once you add enough 9s to the response time percentile
  • Flash TRIM stalls - we rarely discuss this in public
  • Other stalls - more things that don't get enough discussion
  • Overloaded server - who is at fault for this? Users seem more willing to blame the DBMS for being too slow under too much concurrent load than they are to blame a storage device for lousy response times under ridiculous numbers of concurrent requests
Implementation Details
I once worked on a self-service MySQL project that was ahead of its time. Unfortunately I left the company before it was deployed, but I believe it was successful for many years. Since I was a part-time product manager on the effort, I think of the cross-product of durability and availability via two levels of durability (more lossy, less lossy) and three levels of availability (low, medium, high).

For durability more lossy is provided via async replication or async log shipping. There are weak bounds on the number of commits that might be lost when the primary disappears. Less lossy durability usually requires synchronous replication.

The availability levels are implemented by:
  • low - on failure of the primary restore from backup and replay the archived commit log. This can take hours.
  • medium - use HA storage (EBS, Google Persistent Disk, S3, etc). On failure of the primary, set up a new compute node, run crash recovery and continue. This takes less time than the low availability solution. You still need backups and commit log archives as the HA storage will occasionally have problems.
  • high - use replication to maintain a failover target. Hopefully the failover target can handle queries to get more value from the extra hardware. 
Most of the combinations below are useful. However, resources are finite and it might not be a good idea to provide all of them (jack of all trades vs master of none):
   (more, less lossy durability) X (low, medium, high availability)
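The cross-product can be sketched as a small enumeration; the level descriptions here are my shorthand from the post above, not product terminology:

```python
from itertools import product

durability = ["more-lossy (async replication)", "less-lossy (sync replication)"]
availability = ["low (restore from backup)",
                "medium (HA storage + new compute node)",
                "high (replica failover)"]

# the six service classes from (durability) X (availability)
classes = [f"{d} / {a}" for d, a in product(durability, availability)]
for c in classes:
    print(c)
```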

Copyleft vs the DeWitt Clause

There is recent benchmarketing drama between AWS and Microsoft.

Section 1.8 of the AWS service terms includes:
(ii) agree that we may perform and disclose the results of Benchmarks of your products or services, irrespective of any restrictions on Benchmarks in the terms governing your products or services.
Some software includes a DeWitt Clause to prevent users and competitors from publishing benchmark results. I am not a lawyer but wonder if section 1.8 of the AWS service terms allows Amazon to counter with its own benchmark results when competitors' software and services use a DeWitt Clause. This would be similar to the effect of copyleft.

I hope David DeWitt doesn't mind the attention that the DeWitt Clause receives. He has done remarkable database research that has generated so much -- great PhD topics, better DBMS products, a larger CS department at UW-Madison and many jobs. But he is also famous for the DeWitt Clause.

Tuesday, January 28, 2020

Aggregate PMP call stacks for one thread

Putting this in a blog post because I want to find it again. I had N files. Each file had stack traces from one point in time. I wanted to know what one thread was doing so I extracted the stack for that thread from each of the N files and then aggregated the result. The thread was LWP 23839.

Step 1: extract traces for LWP 23839

for d in f.*; do
  echo $d; awk '/\(LWP 23839/,/^$/' $d
done > all.f

Step 2: aggregate stacks for LWP 23839. This is a slight variant of standard PMP.

cat all.f | awk 'BEGIN { s = ""; }  /^Thread/ { print s; s = ""; } /^\#/ { x=index($2, "0x"); if (x == 1) { n=$4 } else { n=$2 }; if (s != "" ) { s = s "," n} else { s = n } } END { print s }' -  | sort | uniq -c | sort -r -n -k 1,1
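For anyone who prefers Python to awk, here is a sketch of the same aggregation. The frame parsing mirrors the awk fields: take $4 when $2 is an address, else take $2.

```python
from collections import Counter

def aggregate_stacks(lines):
    """Count identical frame chains across samples, like the awk pipeline
    above: flush the current stack on a "Thread" line, collect function
    names from "#N" frame lines."""
    counts = Counter()
    frames = []
    for line in lines:
        f = line.split()
        if line.startswith("Thread"):
            if frames:
                counts[",".join(frames)] += 1
            frames = []
        elif line.startswith("#") and len(f) >= 2:
            # resolved frames look like "#0  0xADDR in func (...)";
            # unresolved addresses are skipped in favor of the name
            name = f[3] if f[1].startswith("0x") and len(f) >= 4 else f[1]
            frames.append(name)
    if frames:
        counts[",".join(frames)] += 1
    return counts.most_common()
```

Feed it the contents of all.f and the most frequent stack prints first.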

Monday, January 27, 2020

Deletes are fast and slow in an LSM

In an LSM deletes are fast for the deleter but can make queries that follow slower. The problem is that too many tombstones can get in the way of a query especially a range query.

A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3 then that tombstone cannot be removed. It is hard to make the check (does live key exist below me) efficient.

I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard enough to find such details. Maybe this is something that LSM engine developers should explain more in public.

Confirming whether a tombstone can be dropped

This is based on code that I read ~2 years ago. Maybe RocksDB has changed today.

Tombstones are dropped during compaction. The question is how much work (CPU and IO) you are willing to spend to determine whether a tombstone can be dropped. LevelDB and RocksDB favor performance over exactness. By this I mean they spend less CPU and no IO on the "can I drop this tombstone" check. The check is simple today. If there is an LSM level (or sorted run) below (older) that has a key range which overlaps the tombstone key then the tombstone won't be dropped. And in most workloads this means that tombstones aren't dropped until they reach the max LSM level -- because there usually is overlap.

An LSM could spend a bit more CPU and check bloom filters for the levels that overlap with the tombstone. That might allow tombstones to be dropped earlier. An even more expensive check would be to use more CPU and possibly IO to confirm whether the level that overlaps with the tombstone really has that key. This can make compaction much slower.
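A minimal sketch of the cheap overlap check, with the level metadata shape invented for illustration:

```python
def can_drop_tombstone(key, output_level, level_ranges):
    """level_ranges maps level number -> (min_key, max_key) for levels
    that contain data. The cheap LevelDB/RocksDB-style check: keep the
    tombstone if any older (larger-numbered) level's key range covers
    the key, without reading that level."""
    for level, (lo, hi) in level_ranges.items():
        if level > output_level and lo <= key <= hi:
            return False  # a live value for the key might exist below
    return True
```

With most workloads the max level's key range covers nearly every key, which is why tombstones usually survive until they reach it.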

Fast path

The SingleDelete API in RocksDB makes it easier to drop tombstones. If you respect what the API requires then tombstones can be dropped quickly -- without spending IO or CPU. SingleDelete drops a tombstone when it meets the key it deletes during compaction. It doesn't help with the case above where the tombstone is on one level and the live key might be on a lower level.
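A toy model of the fast path, with entries newest-first as a compaction iterator would see them (the op names here are mine, not the RocksDB API):

```python
def compact_singledelete(entries):
    """entries: newest-first (key, op) pairs from one compaction input.
    A SingleDelete that meets the Put it deletes removes both, with no
    check of lower levels. Unmatched SingleDeletes must be kept because
    the key may still live below."""
    out, pending = [], set()
    for key, op in entries:
        if op == "SDEL":
            pending.add(key)
        elif key in pending:
            pending.discard(key)  # SDEL and Put annihilate each other
        else:
            out.append((key, op))
    out.extend((k, "SDEL") for k in sorted(pending))
    return out
```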

MyRocks magic

MyRocks has some clever code that does extra work to remove tombstones from an SST when the SST has too many tombstones. Configuration options for this are mentioned in this post. Percona has some docs on rocksdb_compaction_sequential_deletes. And I am sure there is a slide deck from my favorite MyRocks expert. Maybe he will share that with me.

Friday, January 24, 2020

Comparative benchmarks and a question about describing data

I enjoy working on database performance benchmarks. I also enjoy writing about benchmarketing. Some of my focus is on comparative benchmarks rather than competitive benchmarks. Let me try to distinguish them. Both try to do the right thing, as in get the best result for each DBMS and try to explain differences. I will try to use these definitions going forward.
  • The goal for a competitive benchmark is to show that your system is faster than their system and when it is that result will be published. 
  • The goal for a comparative benchmark is to determine where your system is slower than their system and file feature requests for things that can be improved.
I haven't done many competitive benchmarks because I haven't been on a product team for a long time, although MyRocks was kind of a product that I was happy to promote. I have been doing comparative benchmarks for a long time and that will continue on my new job at MongoDB. My product for 10 years was making MySQL better and marketing bugs and feature requests was my job. I did Ok at that.

This post is a follow up to the slow perf movement manifesto.

A question

What is a good way to describe N metrics from a benchmark that used two configurations? I still struggle with this and seek something that is concise and consistent. By concise I want this to be easy to read and use less text. By consistent I want the comparison directions to not change -- use Oracle in the numerator in all cases.

First is the table of results that I want to describe. These numbers are made up in case any lawyers for Microsoft or Oracle read my blog. I didn't violate the DeWitt Clause but I am curious about results for Linkbench and the insert benchmark on your products.

Server QPS CPU/query Cost Bogomips
Oracle 110 75 180 5
Microsoft 100 100 100 100

I can describe this using percentages or ratios. My preference is ratios v2.
  1. Percentages - Oracle gets 10% more QPS, Oracle uses 25% less CPU, Oracle costs 80% more bitcoins, Oracle gets 95% fewer bogomips
  2. Ratios v1 - Oracle gets 1.10X more QPS, Oracle uses 0.75X ... CPU, Oracle costs 1.80X more bitcoins, Oracle gets 0.05X ... bogomips
  3. Ratios v2 - QPS ratio is 1.10, CPU ratio is 0.75, Cost ratio is 1.80, bogomips ratio is 0.05
I prefer ratios v2 because it is concise -- in the example above the less concise approaches use more than one line, which hurts readability. There are other challenges with the other approaches:
  • In the percentages approach when more and less are used there is the burden of less vs fewer
  • In the percentages approach I slow down when I read X% more or Y% less. I stop and think too much about percentage change even though the math is easy.
  • The percentages approach obscures big differences. In the example above there is a huge difference in bogomips as Microsoft gets 20X more. But this is described as 95% less for Oracle, and 95 is close to 100. Focus on the 95 rather than on "95% less" and you miss that the difference is 20X.
  • In the ratios v1 approach the CPU difference description is awkward. Writing "uses 0.75X more CPU" is confusing because Oracle doesn't use more CPU. Writing "uses 0.25X less CPU" isn't clear. Writing "uses 0.75X the CPU" doesn't fit the more/less/fewer pattern.
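The ratios v2 line can be produced mechanically from the table above, always keeping the same system (Oracle here) in the numerator:

```python
# made-up numbers from the table above; Oracle is always the numerator
oracle    = {"QPS": 110, "CPU": 75, "Cost": 180, "Bogomips": 5}
microsoft = {"QPS": 100, "CPU": 100, "Cost": 100, "Bogomips": 100}

summary = ", ".join(f"{k} ratio is {oracle[k] / microsoft[k]:.2f}"
                    for k in oracle)
print(summary)
# QPS ratio is 1.10, CPU ratio is 0.75, Cost ratio is 1.80, Bogomips ratio is 0.05
```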

Friday, January 17, 2020

PMP for on-cpu profiling?

PMP has been great for off-CPU profiling, as long as you remember to strip the binaries. Percona shared a way to make flame graphs from PMP output. Maybe the next improvement can be a tool to make PMP useful for on-CPU profiling.


Remove all stacks that appear to be off-CPU (blocked on a mutex or IO). This won't be exact. I wonder if it will be useful. It won't remove threads that are ready to run but not running. Whether that is an issue might depend on whether a workload runs with more threads than cores.
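A heuristic filter along those lines might look like the following; the list of blocking frames is my guess, not a vetted set:

```python
# frames that usually mean the thread is off-CPU (blocked), not running
OFF_CPU_FRAMES = ("pthread_cond_wait", "pthread_cond_timedwait",
                  "__lll_lock_wait", "poll", "epoll_wait", "select",
                  "pread", "pwrite", "fsync", "nanosleep", "sigwait")

def looks_on_cpu(stack):
    """Keep a stack only if no frame matches a known blocking call.
    Inexact: a runnable-but-not-running thread still passes."""
    return not any(b in frame for frame in stack for b in OFF_CPU_FRAMES)
```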


Assuming you already run PMP for off-CPU profiling then you have the thread stacks. Perhaps this makes them more useful.

Thursday, January 16, 2020

The slow perf movement

I struggle with the presentation of performance data. This isn't about benchmarketing -- I try to explain performance and inform rather than create hype and FUD. This is about how I present results. One day I should take the time to learn from the experts on this topic.

Two frequent problems for me are:
  1. Balancing impact vs content
  2. Describing performance differences
Balancing impact vs content

I am a member of the slow perf (report) movement. I don't want to make it too easy to read a performance report. Credit for the name goes to Henrik Ingo.

It isn't easy to write a good performance report. The challenge is the right balance of content and conclusions. The conclusions create impact -- X is 3X faster than Y! -- while the content helps the conclusions look truthy. Write the content, then add the conclusions (executive summary). You don't want the reader to suffer through pages of detail trying to learn something.

Performance reports have a context. Too often the conclusion is remembered while the context is lost. It is great that X is faster than Y but I want to know the conditions for which it is true. Alas a perf report with context takes longer to read -- thus the slow perf movement.

A DBMS has a variety of bottlenecks. Whether one or another limits performance depends on the context (workload & hardware). The conclusion, X is faster than Y, doesn't always survive changes in the context.

One example of this is the use of bar charts. I add at least one to most perf blog posts but most of the data I present is in tables. Bar charts are great for impact, tables are better for information density. Too many bar charts is bad for users on mobile devices waiting for a post to download.

Describing performance differences

I am frequently confused when I read that X is 30% or 300% faster or slower than Y. My confusion is about the math used. I prefer to use ratios as they are less ambiguous to me and more concise. Assume I have 8 performance counters measured while running the same workload for MySQL and Postgres -- counter-1 to counter-8 (CPU/query, IO/query, etc).

Then I can report these as ratios where MySQL (or Postgres) is always the numerator:
  • counter-1 ratio is 0.8
  • counter-2 ratio is 2.3
Or I can report these as percentage differences, and hope we all agree on how to compute percentage change.
  • counter-1 is reduced by X%
  • counter-2 is increased by Y%
Sometimes I use a hybrid approach:
  • counter-1 is reduced by X%
  • counter-2 is increased by 2.3X
I prefer the first approach. I frequently start with the second approach but then switch to the hybrid approach when there are huge differences for some counters. Then I realize that I am adding to the confusion with the hybrid approach.

Another problem with consistency when comparing two things (servers, configurations, etc) is using the value for thing A in the numerator for some comparisons but in the denominator for others. This is another mistake that I make. One reason for my mistake might be a desire to make all results have a ratio > 1 or a percentage increase when such a result implies an improvement.

How I think about perf reports

I still struggle with the presentation of performance data. Performance reports need to be efficient to write because I am short on time and likely to rewrite a report each time I make a mistake and have to repeat my tests. It needs to be easy to read because it isn't worth writing if nobody reads it. 

I am a fan of geek codes (for LSM and other index structures) as they are an efficient way to convey information -- pay the cost of learning the geek code once and save time when you use it. This is great for people who read many of your perf reports and blog posts. It isn't always great for a first-time reader.

I have layers of scripts to extract and condense performance data. My goal is one line per test with N columns of the most useful data for showing results and explaining differences. With one line per test it is easy to compare results across tests (where each test might be a different configuration or DBMS). But I risk using a too large value for N leading to too long lines.

It is easy for me to paste such data into a report and the result looks OK with a fixed width font. I usually format such tables twice in my scripts -- once as CSV, once as TSV. The CSV can be imported into a spreadsheet. The TSV can be pasted into reports. Pasting the TSV is easy. Using an editor (Google Docs, Blogger, Word, etc) to reformat into a proper table takes time. So I am reluctant to do that but formatting HTML tables could be added to my scripts.
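Emitting both formats from one set of rows is cheap; the column names below are invented for the example:

```python
import csv, io

rows = [("config", "qps", "cpu_per_query"),
        ("my8018", 95, 210),
        ("pg13", 100, 180)]

def render(rows, delim):
    """Render the same per-test rows with a chosen delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delim, lineterminator="\n").writerows(rows)
    return buf.getvalue()

csv_text = render(rows, ",")   # import into a spreadsheet
tsv_text = render(rows, "\t")  # paste into a report
```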

I am reluctant to use bar charts for such data. First, because a bar chart takes up more space than a table of text. Second, because a bar chart takes more time to format. In the past I have had scripts that generate HTML or URLs for use with Google charts. I have yet to script the creation of bar charts.

Setting up Ubuntu 18.04 server on an Intel NUC

I was finally able to buy a new NUC8i7beh. It was on backorder for a few months. Now I get to set it up and expand my home test cluster to 3 servers. This explains setting up Ubuntu 18.04.3 server. It is easier today than it used to be -- the HWE-enabled kernel isn't needed and wifi is easy to set up.

The first problem was installing Ubuntu 18.04.3 server. The default download is the live server ISO. It didn't work for me. I am not alone and the workaround is to use the non-live ISO. My vague memory is that the default download in the past was the non-live ISO as I didn't have this problem in the past.

Wifi step 1

Next up is wifi. The Ubuntu desktop UI makes that easy to setup but I am not using desktop. Fortunately this was easy:
  1. apt install wireless-tools
  2. apt install network-manager
  3. reboot (this started network manager)
  4. nmcli d (to confirm my wifi device was listed)
  5. nmcli d wifi list (to confirm my wifi network is available)
  6. nmcli d wifi connect $networkName (to connect)
Wifi step 2

I want wifi to start on reboot. The steps above don't make that happen so I reread my old post which has steps using /etc/network/interfaces. When I cat that file it states that ifupdown has been replaced by netplan. OK, man netplan starts with:
netplan - YAML network configuration abstraction for various backends
Oh no, YAML! But it turned out OK. With some web searches I figured out how to configure wired and wireless, and made both optional to avoid hangs waiting for network configuration on startup. I use this for /etc/netplan/01-netcfg.yaml with an open wifi network.

But the first step is: apt install wireless-tools wpasupplicant

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:                  # example wired device name; check "ip link"
      dhcp4: yes
      optional: true
  wifis:
    wlan0:                 # example wifi device name; check "nmcli d"
      dhcp4: yes
      optional: true
      access-points:
        "$networkName": {}   # open network, so no password entry

Fun with Python and scons

scons --clean
-> Python 3.5 or greater required, but you have Python 2.7.17

# Edit /usr/bin/scons to use Python3
scons --clean
-> SCons 3.0.4 or greater required, but you have SCons 3.0.1

python3 -m pip install scons
-> /usr/bin/python3: No module named pip

Check out docs on pip
"pip is already installed if you are using Python 2 >=2.7.9 or Python 3 >=3.4"

python3 --version

-> Python 3.6.9

Maybe docs are wrong? The fix was:
  • sudo apt install python3-pip
  • /usr/bin/env python3 $( which scons )

Sunday, January 12, 2020

480p is my friend - video streaming when you don't have fiber

I live in a rural area and doubt I will ever get fiber. I have fixed wireless broadband. Maybe low earth orbit satellites will be an option in the future whether that is via Starlink, Amazon or OneWeb.

I get 16Mbps which is shared by 4 people. One user plays online video games and cares about latency. Video streaming can be a challenge as it frequently consumes too much download bandwidth. Upload is usually not an issue except for FaceTime and other video chat apps.

I put video streaming apps into one of 3 classes -- polite, knows better and rude.
  • Polite apps have video quality settings that are sticky. Set the video quality once to 480p and things are good. 
  • Knows better apps know better. Set video quality to 480p today and it resets to auto the next time you use the app. Why? Because the app knows that HD video makes you happy even if you don't have the network bandwidth to support it.
  • Rude apps have no video quality settings. They use as much bandwidth as they can get.
To make up for the lack of throttling in many of the video streaming apps I have tried traffic shaping with a Gargoyle router. That is a lot of work (you must get MAC addresses for all devices) and the results weren't great. I might have to try it again.

I am not a fan of Auto video quality. Maybe that works great when there are not multiple apps coming and going on the network. But I am skeptical about its ability to deal with congestion and I doubt it has any concern for the impact on latency sensitive apps (like online video games).

Polite apps are great but you still have to select lower quality for each app × user on the web and each app × device on phone/tablet. Assume you need to revisit this once per year because apps change. Another thing to do is disable autoplay for as many apps as possible. I won't explain how to do that here.

The Apps

  • Facebook - Video Settings has a Video Default Quality option. AFAIK this setting has no effect for videos in Feed and Watch.  Some might call that a bug. I set the option to SD Only but Watch videos start in 720p or 1080p and Feed videos use a variety, some are HD.
  • YouTube on a browser and iOS
  • YouTube TV on a browser
  • Apple TV - in Settings -> Apps -> iTunes Movies and TV Shows see Limit Purchases and Rentals to SD
  • iTunes - purchase movies and TV shows in SD
  • Netflix on iOS - go to More -> App Settings and set Video Quality to Standard
  • Netflix on web - go to Account -> Playback settings
  • Hulu on iOS - go to Account -> Settings -> Cellular Data Usage and choose Data Saver
  • Hulu on web - start a video, click on the gear -> Quality -> Data Saver
  • Amazon Prime on iOS - select Settings -> Streaming & Downloading
  • Amazon Prime on web - start a video, click the gear
  • Zoom - defaults to non-HD on a Mac (hope others follow this trend)
  • Twitter on web - select Settings -> Data Usage -> Data Saver. The setting doesn't mention whether it changes video quality. While there disable autoplay as well.
  • Twitter on iOS - select Settings -> Data usage to disable autoplay and select lower quality videos and images
Knows Better
  • YouTube TV on iOS
  • FaceTime
  • Chromecast - the only choice is Auto. If you are watching something in 480p on a browser and then start casting the device will try for higher quality.
  • Snapchat, Instagram, TikTok - I will figure this out when my kids get home
  • Facebook Video Chat

Tuesday, January 7, 2020

It is all about the constant factors

Assuming I put performance bugs I worked on into one of three buckets -- too much big-O, too much constant factor and other -- then too much big-O would have the fewest entries. Maybe we worry too much about big-O and not enough about everything else? This post is inspired by a post and tweet by Dan Luu.

By too much big-O I mean using an algorithm with more computational complexity when one with less is available and the difference causes problems in production. A common problem is using an O(n^2) algorithm when O(nlgn) or O(n) are possible. By too much constant factor I mean an inefficient implementation. Everything else goes into the other bucket.

"It is all about the constant factors" was a reply to someone who dismissed performance concerns with "it is just a constant factor". That still amuses me. I assume that too much big-O is the least likely cause of performance problems given my classification scheme. This is not a rigorous study but I wasn't able to find many too much big-O bugs in my bug history. The search is complicated because it isn't easy to search for O(n*n). What search terms do you use -- O(n*n), O(n^2), squared?
  1. Pagination is O(n*n) - this happened several times in production, so it counts as many bugs. Be careful with pagination queries.
  2. O(n*n) parsing in MySQL client - this was MySQL bug 34578
  3. O(n*n) allocation on array resize - Dan Luu explained an example of this. I have also encountered this but my memory is vague.
  4. O(n*n) common subexpression elimination - while working on a closed-source DBMS I fixed code that did CSE using an O(n*n) algorithm. It fell over on machine generated SQL that had thousands of terms in the WHERE clause.
  5. The InnoDB rw-lock and mutex so many years ago had O(n*n) overhead for n threads trying to get through them (lock/unlock) because of the way that wake-all on unlock was implemented. See MySQL bug 52806.
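The pagination case is easy to model: OFFSET-style pagination re-scans from the start for every page, so reading all pages is O(n^2) rows scanned, while keyset (seek) pagination is O(n). A sketch of the cost model:

```python
def scan_cost_offset(n_rows, page_size):
    """Total rows the server touches to read every page with
    LIMIT page_size OFFSET k: each page pays for its offset."""
    return sum(start + page_size for start in range(0, n_rows, page_size))

def scan_cost_keyset(n_rows, page_size):
    """Keyset pagination (WHERE key > last_seen ORDER BY key LIMIT n)
    resumes where the previous page ended, so each row is touched once."""
    return n_rows

# 100k rows in 100-row pages: ~50M rows scanned vs 100k
```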
Perhaps shared_ptr deserves its own bucket. I wrote about finding shared_ptr perf bugs in MyRocks and MongoDB. I found at least one more in MyRocks, someone else reported one for RocksDB and I have seen a few more problems from this in production.

Wednesday, January 1, 2020

From disk to flashcache to flash

The past decade in database storage was interesting whether you stayed with local attach storage, used block & object storage from cloud or on-prem vendors or moved to OSS scale-out storage like Ceph, GlusterFS and MinIO. I am writing about my experience and will focus on local attach.

Over the past decade the DBMS deployments I cared for went from disk to flashcache to flash on the HW side and then from MySQL+InnoDB to MySQL+MyRocks on the SW side. I assume that HW changes faster than DBMS software. DBMS algorithms that can adapt to such changes will do better in the next decade.

One comment I have heard a few too many times is that storage performance doesn't matter much because you can fit the database in RAM. More recently I hear the same except change RAM to Optane. I agree that this can be done for many workloads. I am less certain that it should be done for many workloads. That (all data in RAM/Optane) costs a lot in money, power and even space in the data center. Let's make web-scale DBMS green. Use enough RAM/Optane for cache to meet the performance SLA and then use SSD or disk arrays. At some point there is no return from cutting the DBMS query response time in half but cutting the DBMS HW cost in half is usually a big deal.


With disk and flashcache I worried a lot about the IOPs demand because the supply was limited, usually less than 2000 operations/second. On moving to flash I stopped worrying about that and began worrying about write and space amplification (efficiency).

The context for this is small data (OLTP) workloads and deployments where reducing HW cost matters.  Overcoming the IOPs shortage was interesting at times and good for my career as there were always new problems that had to be fixed right now. Moving to flash made life easier for everyone. There was an opportunity cost from using disks -- time spent solving the latest IOPs demand crisis was time not spent on longer term projects. Moving to flash gave us time to build and deploy MyRocks.

MyRocks has better space and write efficiency than a b-tree. The cost of better space and write efficiency with an LSM is more CPU overhead for point and range queries. Sometimes that is a great trade. Better space and write efficiency means you buy less SSD and it lasts longer. Better write efficiency is a big deal with lower endurance (TLC and QLC) NAND flash. I wonder how this changes in the cloud. Cloud vendors might improve their profit margins with better space and write efficiency but they also have the ability to pass on some of the inefficiency costs to the user. A cloud user doesn't have to worry as much about write efficiency because they are renting the SSD.


This is my history with storage for web-scale MySQL. The NUC servers I use today have similar RAM/CPU as the DBMS servers I started with in 2005 but the NUC servers have much more IO capacity.

First there were disk arrays with HW RAID and SW RAID. This was RAID 10 which was better for durability than availability. Data isn't lost on a single-disk failure but the server performance is unacceptable when a HW RAID cache battery fails (fsync is too slow) or a rebuild is in progress after a disk gets replaced.

Then there was flashcache and performance was wonderful when the read working set fits in the flash cache, but there is an abrupt change in performance when it does not. Those were exciting years. Some of the performance critical parts of flashcache were in the Linux kernel. I lack kernel skills and it took us (really, Domas) a while to identify perf problems that were eventually fixed.

Then there was flash and the abundance of IOPs was wonderful. I look forward to the next decade.


If you use disk arrays at scale then you will see corruption at scale. You are likely using multiple storage devices with multiple firmware revisions. It is interesting when 99% of corruption occurs on 1% of the deployment -- all on the same, new firmware revision. That result makes it easy to focus on the HW as the probable problem and stop blaming MySQL. I can't imagine doing web-scale DBMS without per-page checksums.

Performance variance with NAND flash is to be expected. I hope that more is done to explain and document it for OLTP workloads. The common problem is that NAND flash GC can stall reads on the device. I wish it were easier to characterize device performance for enthusiasts like myself. I am sure there is an interesting paper on this topic. How much concurrency does the device provide? How are writes done? How is GC done? What is the stall distribution? What can be done to reduce stalls (see multistream and LightNVM)?

Using TRIM (mount FS with discard) at scale is exciting. RocksDB and MyRocks do a lot of TRIM while InnoDB does not. How many GB/s and unlinks/s of TRIM does the device support? TRIM performance varies greatly by vendor. I hope more is done to document these differences. Perhaps we need trimbench. People at web-scale companies have stories that never get shared because they don't want to throw their SSD vendors under the bus. I was spoiled by FusionIO. My memory is that FusionIO TRIM was a noop from a perf perspective.

Innosim is an InnoDB IO simulator that I wrote to help device vendors reproduce performance stalls we encountered with web-scale MySQL. It is easier to run than MySQL while able to generate similar IO patterns. I wrote it because InnoDB has a pattern of coordinated IO that fio wasn't able to reproduce. The pattern occurs during page write back -- first write the double write buffer (1MB or 2MB sequential write) and then do 64 or 128 random 16kb writes. Innosim also takes much less time to reach steady state -- just sequentially write out X GB of database pages versus load InnoDB and then run (for days) an update workload to fragment the indexes. Fragmentation takes time. I wish more DBMS benchmarks ran long enough to get sufficient fragmentation but that can be expensive.
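The coordinated pattern is simple to state in code. This sketch of one write-back round is not innosim itself, just the IO shape described above:

```python
import os, random

PAGE = 16 * 1024
DBLWR = 1 * 1024 * 1024   # 1MB doublewrite buffer

def write_back_round(fd, n_pages, batch=64):
    """One InnoDB-style page write-back round: a sequential 1MB
    doublewrite-buffer write, then `batch` random 16kb page writes,
    then fsync. Layout (doublewrite at offset 0, pages after it)
    is simplified for the sketch."""
    os.pwrite(fd, b"\0" * DBLWR, 0)            # sequential doublewrite
    for _ in range(batch):
        page_no = random.randrange(n_pages)    # a random dirty page
        os.pwrite(fd, b"\0" * PAGE, DBLWR + page_no * PAGE)
    os.fsync(fd)
```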

Perhaps one day I will write WTsim, the WiredTiger IO simulator. I wrote ldbsim, the LevelDB IO simulator, but it was rarely used because the RocksDB benchmark client, db_bench, was easy to use even if fragmenting the LSM tree still took a long time. I am not sure that fio would be able to reproduce the special IO patterns created by RocksDB compaction. I love fio but I am not sure it should try to solve this problem for me.