Monday, January 27, 2020

Deletes are fast and slow in an LSM

In an LSM, deletes are fast for the deleter but can make the queries that follow slower. The problem is that too many tombstones can get in the way of a query, especially a range query.

A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3, then that tombstone cannot be removed. It is hard to make the check (does a live key exist below me?) efficient.

I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard enough to find such details. Maybe this is something that LSM engine developers should explain more in public.

Confirming whether a tombstone can be dropped

This is based on code that I read ~2 years ago. Maybe RocksDB has changed since then.

Tombstones are dropped during compaction. The question is how much work (CPU and IO) you are willing to spend to determine whether a tombstone can be dropped. LevelDB and RocksDB favor performance over exactness. By this I mean they spend less CPU and no IO on the "can I drop this tombstone" check. The check is simple today. If there is an LSM level (or sorted run) below (older) with a key range that overlaps the tombstone key, then the tombstone won't be dropped. And in most workloads this means that tombstones aren't dropped until they reach the max LSM level -- because there usually is overlap.

An LSM could spend a bit more CPU and check bloom filters for the levels that overlap with the tombstone. That might allow tombstones to be dropped earlier. An even more expensive check would be to use more CPU and possibly IO to confirm whether the level that overlaps with the tombstone really has that key. This can make compaction much slower.
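
Here is a minimal sketch of that check, not the RocksDB code. The level representation with min_key/max_key and the may_contain method on a bloom filter are made up for illustration:

# Sketch of the "can I drop this tombstone" compaction check. Each older
# level (or sorted run) is described by its key range and, optionally, a
# bloom filter. These names are invented for the example.

def key_range_overlaps(level, key):
    return level["min_key"] <= key <= level["max_key"]

def can_drop_tombstone(key, older_levels, check_bloom=False):
    for level in older_levels:
        if not key_range_overlaps(level, key):
            continue  # no overlap, this level cannot hide a live value for key
        if check_bloom and not level["bloom"].may_contain(key):
            continue  # bloom says the key is definitely not in this level
        return False  # the key might exist below, so keep the tombstone
    return True

The basic check is the loop without check_bloom. Passing check_bloom=True is the cheap refinement, and an exact check would also read the overlapping level to confirm whether the key is really there.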

Fast path

The SingleDelete API in RocksDB makes it easier to drop tombstones. If you respect what the API requires, then tombstones can be dropped quickly -- without spending IO or CPU. SingleDelete makes it easy to drop the tombstone for a given key as soon as the tombstone and the put it deletes meet during compaction. It doesn't help with the case above where the tombstone is on one level and the live key might be on a lower level -- the tombstone still has to wait until a compaction brings them together.
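
Here is a toy sketch of the idea, not the RocksDB implementation. As I understand the contract, SingleDelete requires that the key was put at most once (and not overwritten) since the previous tombstone, so the SingleDelete tombstone and that one put can annihilate as soon as a compaction sees them together:

# Toy model of merging the entries for one key during a compaction, newest
# first. This shows the idea, not the RocksDB implementation.

def merge_key(entries, at_max_level=False):
    out = []
    i = 0
    while i < len(entries):
        op, val = entries[i]
        if op == "SINGLE_DELETE" and i + 1 < len(entries) and entries[i + 1][0] == "PUT":
            i += 2  # the tombstone and the one put it covers annihilate
            continue
        if op == "DELETE":
            if not at_max_level:
                out.append((op, val))  # keep it, a live key may exist on a lower level
            i = len(entries)  # older entries seen here are covered by the tombstone
            continue
        out.append((op, val))
        i += 1
    return out

print(merge_key([("SINGLE_DELETE", None), ("PUT", "v1")]))               # []
print(merge_key([("DELETE", None), ("PUT", "v1")]))                      # tombstone kept
print(merge_key([("DELETE", None), ("PUT", "v1")], at_max_level=True))   # []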

MyRocks magic

MyRocks has some clever code that does extra work to remove tombstones from an SST when the SST has too many tombstones. Configuration options for this are mentioned in this post. Percona has some docs on rocksdb_compaction_sequential_deletes. And I am sure there is a slide deck from my favorite MyRocks expert. Maybe he will share that with me.

Friday, January 24, 2020

Comparative benchmarks and a question about describing data

I enjoy working on database performance benchmarks. I also enjoy writing about benchmarketing. Some of my focus is on comparative benchmarks rather than competitive benchmarks. Let me try to distinguish them. Both try to do the right thing, as in get the best result for each DBMS and try to explain differences. I will try to use these definitions going forward.
  • The goal for a competitive benchmark is to show that your system is faster than their system and when it is that result will be published. 
  • The goal for a comparative benchmark is to determine where your system is slower than their system and file feature requests for things that can be improved.
I haven't done many competitive benchmarks because I haven't been on a product team for a long time, although MyRocks was kind of a product that I was happy to promote. I have been doing comparative benchmarks for a long time and that will continue in my new job at MongoDB. My product for 10 years was making MySQL better, and marketing bugs and feature requests was my job. I did OK at that.

This post is a follow-up to the slow perf movement manifesto.

A question

What is a good way to describe N metrics from a benchmark that used two configurations? I still struggle with this and seek something that is concise and consistent. By concise I mean easy to read and using less text. By consistent I mean the comparison direction does not change -- for example, Oracle is in the numerator in all cases.

First is the table of results that I want to describe. These numbers are made up in case any lawyers for Microsoft or Oracle read my blog. I didn't violate the DeWitt Clause but I am curious about results for Linkbench and the insert benchmark on your products.

Server     QPS  CPU/query  Cost  Bogomips
Oracle     110  75         180   5
Microsoft  100  100        100   100

I can describe this using percentages or ratios. My preference is ratios v2.
  1. Percentages - Oracle gets 10% more QPS, Oracle uses 25% less CPU, Oracle costs 80% more bitcoins, Oracle gets 95% fewer bogomips
  2. Ratios v1 - Oracle gets 1.10X more QPS, Oracle uses 0.75X ... CPU, Oracle costs 1.80X more bitcoins, Oracle gets 0.05X ... bogomips
  3. Ratios v2 - QPS ratio is 1.10, CPU ratio is 0.75, Cost ratio is 1.80, bogomips ratio is 0.05
I prefer ratios v2 because it is concise -- in the example above the less concise approaches use more than one line which hurts readability. There are other challenges with the other approaches:
  • In the percentages approach when more and less are used there is the burden of less vs fewer
  • In the percentages approach I slow down when I read X% more or Y% less. I stop and think too much about percentage change even though the math is easy.
  • The percentages approach obscures big differences. In the example above there is a huge difference in bogomips as Microsoft gets 20X more. But this is described as 95% fewer for Oracle, and 95 is close to 100. Focus on the 95 rather than on what 95% fewer means and you miss that the difference is 20X (the sketch after this list shows the arithmetic).
  • In the ratios v1 approach the CPU difference description is awkward. Writing uses 0.75X more CPU is confusing because Oracle doesn't use more CPU. Writing uses 0.25X less CPU isn't clear. Writing uses 0.75X the CPU doesn't fit the pattern of more/less/fewer.
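
A small sketch of the arithmetic behind these options, using the made-up numbers from the table above with Oracle always in the numerator:

# Made-up numbers from the table above; Oracle is always the numerator.
oracle    = {"QPS": 110, "CPU/query": 75, "Cost": 180, "Bogomips": 5}
microsoft = {"QPS": 100, "CPU/query": 100, "Cost": 100, "Bogomips": 100}

for metric in oracle:
    ratio = oracle[metric] / microsoft[metric]
    pct = 100.0 * (oracle[metric] - microsoft[metric]) / microsoft[metric]
    print(f"{metric}: ratio {ratio:.2f}, percentage change {pct:+.0f}%")

# QPS: ratio 1.10, percentage change +10%
# CPU/query: ratio 0.75, percentage change -25%
# Cost: ratio 1.80, percentage change +80%
# Bogomips: ratio 0.05, percentage change -95%
# The -95% hides that Microsoft gets 20X more bogomips; the 0.05 ratio does not.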

Friday, January 17, 2020

PMP for on-cpu profiling?

PMP has been great for off-CPU profiling, as long as you remember not to strip the binaries. Percona shared a way to make flame graphs from PMP output. Maybe the next improvement can be a tool to make PMP useful for on-CPU profiling.

How?

Remove all stacks that appear to be off-CPU (blocked on a mutex or IO). This won't be exact. I wonder if it will be useful. It won't remove threads that are ready to run but not running. Whether that is an issue might depend on whether a workload runs with more threads than cores.
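
A rough sketch of that filter, assuming aggregated PMP output with one line per unique stack ("count frame1,frame2,...") and a made-up list of wait functions that would need tuning for a real workload:

import re
import sys

# Top-of-stack frames that usually mean the thread was off-CPU (blocked on a
# mutex, condition variable, IO or sleep). This list is a guess.
WAIT_FRAMES = ("pthread_cond_wait", "pthread_cond_timedwait", "__lll_lock_wait",
               "poll", "epoll_wait", "select", "pread", "pwrite", "fsync",
               "nanosleep", "futex_wait")

def looks_off_cpu(stack):
    top = stack.split(",")[0]
    return any(name in top for name in WAIT_FRAMES)

for line in sys.stdin:
    m = re.match(r"\s*(\d+)\s+(\S.*)", line)
    if not m:
        continue
    count, stack = m.groups()
    if not looks_off_cpu(stack):
        print(count, stack)  # keep stacks that look like they were on-CPU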

Why?

Assuming you already run PMP for off-CPU profiling then you have the thread stacks. Perhaps this makes them more useful.

Thursday, January 16, 2020

The slow perf movement

I struggle with the presentation of performance data. This isn't about benchmarketing -- I try to explain performance and inform rather than create hype and FUD. This is about how I present results. One day I should take the time to learn from the experts on this topic.

Two frequent problems for me are:
  1. Balancing impact vs content
  2. Describing performance differences
Balancing impact vs content

I am a member of the slow perf (report) movement. I don't want to make it too easy to read a performance report. Credit for the name goes to Henrik Ingo.

It isn't easy to write a good performance report. The challenge is the right balance of content and conclusions. The conclusions create impact -- X is 3X faster than Y! -- while the content helps the conclusions look truthy. Write the content, then add the conclusions (executive summary). You don't want the reader to suffer through pages of detail trying to learn something.

Performance reports have a context. Too often the conclusion is remembered while the context is lost. It is great that X is faster than Y but I want to know the conditions for which it is true. Alas a perf report with context takes longer to read -- thus the slow perf movement.

A DBMS has a variety of bottlenecks. Whether one or another limits performance depends on the context (workload & hardware). The conclusion, X is faster than Y, doesn't always survive changes in the context.

One example of balancing impact vs content is the use of bar charts. I add at least one to most perf blog posts but most of the data I present is in tables. Bar charts are great for impact, tables are better for information density. Too many bar charts are also bad for users on mobile devices waiting for a post to download.

Describing performance differences

I am frequently confused when I read that X is 30% or 300% faster or slower than Y. My confusion is about the math used. I prefer to use ratios as they are less ambiguous to me and more concise. Assume I have 8 performance counters measured while running the same workload for MySQL and Postgres -- counter-1 to counter-8 (CPU/query, IO/query, etc).

Then I can report these as ratios where MySQL (or Postgres) is always the numerator:
  • counter-1 ratio is 0.8
  • counter-2 ratio is 2.3
Or I can report these as percentage differences, and hope we all agree on how to compute percentage change.
  • counter-1 is reduced by X%
  • counter-2 is increased by Y%
Sometimes I use a hybrid approach:
  • counter-1 is reduced by X%
  • counter-2 is increased by 2.3X
I prefer the first approach. I frequently start with the second approach but then switch to the hybrid approach when there are huge differences for some counters. Then I realize that I am adding to the confusion with the hybrid approach.

Another problem with consistency when comparing two things (servers, configurations, etc) is using the value for thing A in the numerator for some comparisons but in the denominator for others. This is another mistake that I make. One reason for my mistake might be a desire to make all results have a ratio > 1 or a percentage increase when such a result implies an improvement.

How I think about perf reports

I still struggle with the presentation of performance data. Performance reports need to be efficient to write because I am short on time and likely to rewrite a report each time I make a mistake and have to repeat my tests. They need to be easy to read because they aren't worth writing if nobody reads them.

I am a fan of geek codes (for LSM and other index structures) as they are an efficient way to convey information -- pay the cost of learning the geek code once and save time when you use it. This is great for people who read many of your perf reports and blog posts. This isn't always great for a first-time reader.

I have layers of scripts to extract and condense performance data. My goal is one line per test with N columns of the most useful data for showing results and explaining differences. With one line per test it is easy to compare results across tests (where each test might be a different configuration or DBMS). But I risk using too large a value for N, which leads to lines that are too long.

It is easy for me to paste such data into a report and the result looks OK with a fixed width font. I usually format such tables twice in my scripts -- once as CSV, once as TSV. The CSV can be imported into a spreadsheet. The TSV can be pasted into reports. Pasting the TSV is easy. Using an editor (Google Docs, Blogger, Word, etc) to reformat into a proper table takes time. So I am reluctant to do that but formatting HTML tables could be added to my scripts.
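
A sketch of that double-formatting step with made-up column names and values -- the same one-line-per-test summary written as CSV for a spreadsheet and as TSV for pasting into a report:

import csv

# One line per test with the most useful columns. Names and values are made up.
header = ["test", "qps", "cpu_per_query", "p99_ms"]
rows = [
    ["myrocks.cfg1", 12345, 0.8, 2.1],
    ["innodb.cfg1", 23456, 1.1, 1.7],
]

with open("summary.csv", "w", newline="") as f:
    out = csv.writer(f)
    out.writerow(header)
    out.writerows(rows)

with open("summary.tsv", "w", newline="") as f:
    out = csv.writer(f, delimiter="\t")
    out.writerow(header)
    out.writerows(rows)

Emitting an HTML table from the same rows would be a small addition.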

I am reluctant to use bar charts for such data. First, because a bar chart takes up more space than a table of text. Second, because a bar chart takes more time to format. In the past I have had scripts that generate HTML or URLs for use with Google charts. I have yet to script the creation of bar charts.

Setting up Ubuntu 18.04 server on an Intel NUC

I was finally able to buy a new NUC8i7beh. It was on backorder for a few months. Now I get to set it up and expand my home test cluster to 3 servers. This post explains setting up Ubuntu 18.04.3 server. It is easier today than it used to be -- I don't need the HWE-enabled kernel and wifi is easy to set up.

The first problem was installing Ubuntu 18.04.3 server. The default download is the live server ISO. It didn't work for me. I am not alone and the workaround is to use the non-live ISO. My vague memory is that the default download in the past was the non-live ISO as I didn't have this problem in the past.

Wifi step 1

Next up is wifi. The Ubuntu desktop UI makes that easy to set up but I am not using the desktop. Fortunately this was easy:
  1. apt install wireless-tools
  2. apt install network-manager
  3. reboot (this started network manager)
  4. nmcli d (to confirm my wifi device was listed)
  5. nmcli d wifi list (to confirm my wifi network is available)
  6. nmcli d wifi connect $networkName (to connect)
Wifi step 2

I want wifi to start on reboot. The steps above don't make that happen so I reread my old post which has steps using /etc/network/interfaces. When I cat that file it states that ifupdown has been replaced by netplan. OK, man netplan starts with:
netplan - YAML network configuration abstraction for various backends
Oh no, YAML! But it turned out OK. With some web searches I figured out how to configure wired and wireless, and made both optional to avoid hangs waiting for network configuration on startup. I use this for /etc/netplan/01-netcfg.yaml with an open wifi network.

But the first step is: apt install wireless-tools wpasupplicant

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      dhcp4: yes
      optional: true
  wifis:
    wlp0s20f3:
      dhcp4: yes
      optional: true
      access-points:
        horses: {}
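
After editing the file, sudo netplan apply should activate the config without a reboot, and netplan try should revert after a timeout if a bad config breaks networking.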

Fun with Python and scons

scons --clean
-> Python 3.5 or greater required, but you have Python 2.7.17

# Edit /usr/bin/scons to use Python3
scons --clean
-> SCons 3.0.4 or greater required, but you have SCons 3.0.1

python3 -m pip install scons
-> /usr/bin/python3: No module named pip

Check out the docs on pip:
"pip is already installed if you are using Python 2 >=2.7.9 or Python 3 >=3.4"

python3 --version
-> Python 3.6.9

Maybe the docs are wrong? The fix was:
  • sudo apt install python3-pip
  • /usr/bin/env python3 $( which scons )

Sunday, January 12, 2020

480p is my friend - video streaming when you don't have fiber

I live in a rural area and doubt I will ever get fiber. I have fixed wireless broadband. Maybe low earth orbit satellites will be an option in the future whether that is via Starlink, Amazon or OneWeb.

I get 16Mbps which is shared by 4 people. One user plays online video games and cares about latency. Video streaming can be a challenge as it frequently consumes too much download bandwidth. Upload is usually not an issue except for FaceTime and other video chat apps.

I put video streaming apps into one of 3 classes -- polite, knows better and rude.
  • Polite apps have video quality settings that are sticky. Set the video quality once to 480p and things are good. 
  • Knows better apps know better. Set video quality to 480p today and it resets to auto the next time you use the app. Why? Because the app knows that HD video makes you happy even if you don't have the network bandwidth to support that.
  • Rude apps have no video quality settings. They use as much bandwidth as they can get.
To make up for the lack of throttling in many of the video streaming apps I have tried traffic shaping with a Gargoyle router. That is a lot of work (I must get the MAC addresses for all devices) and the results weren't great. I might have to try it again.

I am not a fan of Auto video quality. Maybe that works great when there are not multiple apps coming and going on the network. But I am skeptical about its ability to deal with congestion and I doubt it has any concern for the impact on latency-sensitive apps (like online video games).

Polite apps are great but you still have to select lower quality for (apps x users) on the web and then (apps x devices) on phones and tablets. Assume you need to revisit this once per year because apps change. Another thing to do is disable autoplay for as many apps as possible. I won't explain how to do that below.

The Apps

Buggy
  • Facebook - Video Settings has a Video Default Quality option. AFAIK this setting has no effect for videos in Feed and Watch.  Some might call that a bug. I set the option to SD Only but Watch videos start in 720p or 1080p and Feed videos use a variety, some are HD.
Polite
  • YouTube on a browser and iOS
  • YouTube TV on a browser
  • Apple TV - in Settings -> Apps -> iTunes Movies and TV Shows see Limit Purchases and Rentals to SD
  • iTunes - purchase movies and TV shows in SD
  • Netflix on iOS - go to More -> App Settings and set Video Quality to Standard
  • Netflix on web - go to Account -> Playback settings
  • Hulu on iOS - go to Account -> Settings -> Cellular Data Usage and choose Data Saver
  • Hulu on web - start a video, click on the gear -> Quality -> Data Saver
  • Amazon Prime on iOS - select Settings -> Streaming & Downloading
  • Amazon Prime on web - start a video, click the gear
  • Zoom - defaults to non-HD on a Mac (hope others follow this trend)
  • Twitter on web - select Settings -> Data Usage -> Data Saver. The setting doesn't mention whether it changes video quality. While there disable autoplay as well.
  • Twitter on iOS - select Settings -> Data usage to disable autoplay and select lower quality videos and images
Knows Better
  • YouTube TV on iOS
Rude
  • FaceTime
  • Chromecast - the only choice is Auto. If you are watching something in 480p on a browser and then start casting the device will try for higher quality.
Unknown
  • Snapchat, Instagram, TikTok - I will figure this out when my kids get home
  • Facebook Video Chat

Tuesday, January 7, 2020

It is all about the constant factors

Assuming I put performance bugs I worked on into one of three buckets -- too much big-O, too much constant factor and other -- then too much big-O would have the fewest entries. Maybe we worry too much about big-O and not enough about everything else? This post is inspired by a post and tweet by Dan Luu.

By too much big-O I mean using an algorithm with more computational complexity when one with less is available and the difference causes problems in production. A common problem is using an O(n^2) algorithm when O(n lg n) or O(n) is possible. By too much constant factor I mean an inefficient implementation. Everything else goes into the other bucket.

"It is all about the constant factors" was a reply to someone who dismissed performance concerns with "it is just a constant factor". That still amuses me. I assume that too much big-O is the least likely cause of performance problems given my classification scheme. This is not a rigorous study but I wasn't able to find many too much big-O bugs in my bug history. The search is complicated because it isn't easy to search for O(n*n). What search terms do you use -- O(n*n), O(n^2), squared?
  1. Pagination is O(n*n) - this happened several times in production, so it counts as many bugs. Be careful with pagination queries.
  2. O(n*n) parsing in MySQL client - this was MySQL bug 34578
  3. O(n*n) allocation on array resize - Dan Luu explained an example of this. I have also encountered this but my memory is vague (see the sketch after this list).
  4. O(n*n) common subexpression elimination - while working on a closed-source DBMS I fixed code that did CSE using an O(n*n) algorithm. It fell over on machine generated SQL that had thousands of terms in the WHERE clause.
  5. The InnoDB rw-lock and mutex so many years ago had O(n*n) overhead for n threads trying to get through them (lock/unlock) because of the way that wake-all on unlock was implemented. See MySQL bug 52806.
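
For item 3, a small sketch of why growing a buffer by a fixed chunk causes O(n*n) element copies while doubling the capacity keeps it O(n); the chunk size and counts are only illustrative:

# Count element copies for n appends under two growth policies.
def copies(n, grow):
    cap = size = moved = 0
    for _ in range(n):
        if size == cap:
            cap = grow(cap)
            moved += size  # existing elements are copied into the new buffer
        size += 1
    return moved

n = 1000000
print(copies(n, lambda c: c + 4096))       # fixed chunk: roughly n*n/(2*4096) copies
print(copies(n, lambda c: max(1, 2 * c)))  # doubling: fewer than 2*n copies
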
Perhaps shared_ptr deserves its own bucket. I wrote about finding shared_ptr perf bugs in MyRocks and MongoDB. I found at least one more in MyRocks, someone else reported one for RocksDB and I have seen a few more problems from this in production.