
Showing posts from January, 2020

Durability vs Availability

I don't consider myself an expert on distributed systems but sometimes I try to write on the topic. I hope any mistakes here aren't too serious. I like to describe things, so here is my attempt.

Durability

Durability is about the probability of losing data, whether that data was written to the DBMS long ago or just recently. I describe this as a probability rather than dividing systems into those that can and cannot lose data. I have spent too much time with web-scale deployments to claim that transactions will never be lost.

Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately, most of us can make use of Raft and Paxos implementations written by experts. While my team at Google was the first to implement semisync replication, some ...
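The core idea (a write is committed only after its log entry reaches a quorum of servers) fits in a few lines. Below is a minimal sketch, not Raft or Paxos: the Replica class, send_log_entry method and the ack counting are hypothetical stand-ins for real RPCs, retries and failure handling.

  class Replica:
      """Stand-in for a remote server; a real one would be an RPC client."""
      def __init__(self):
          self.log = []

      def send_log_entry(self, entry):
          self.log.append(entry)  # pretend the remote write and fsync worked
          return True             # ack

  def commit(entry, leader_log, replicas):
      """Return True once a majority of {leader + replicas} holds the entry."""
      leader_log.append(entry)
      acks = 1  # the leader's own copy counts
      needed = (1 + len(replicas)) // 2 + 1  # majority of all servers
      for r in replicas:
          if r.send_log_entry(entry):
              acks += 1
          if acks >= needed:
              return True  # now safe to tell the client "committed"
      return False  # not enough acks: the entry is not durable

  # With 5 servers (leader + 4 replicas), 3 copies make a majority.
  assert commit("tx42", [], [Replica() for _ in range(4)])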

Copyleft vs the DeWitt Clause

There is recent benchmarketing drama between AWS and Microsoft. Section 1.8 of the AWS service terms includes: (ii) agree that we may perform and disclose the results of Benchmarks of your products or services, irrespective of any restrictions on Benchmarks in the terms governing your products or services.

Some software includes a DeWitt Clause to prevent users and competitors from publishing benchmark results. I am not a lawyer, but I wonder if section 1.8 of the AWS service terms allows Amazon to counter with their own benchmark results when their competitors' software and services use a DeWitt Clause. This would be similar to the effect of copyleft.

I hope David DeWitt doesn't mind the attention that the DeWitt Clause receives. He has done remarkable database research that has generated so much -- great PhD topics, better DBMS products, a larger CS department at UW-Madison and many jobs. But he is also famous for the DeWitt Clause.

Aggregate PMP call stacks for one thread

Putting this in a blog post because I want to find it again. I had N files. Each file had stack traces from one point in time. I wanted to know what one thread was doing, so I extracted the stack for that thread from each of the N files and then aggregated the result. The thread was LWP 23839.

Step 1: extract traces for LWP 23839

  for d in f.*; do echo $d; awk '/\(LWP 23839/,/^$/' $d; done > all.f

Step 2: aggregate stacks for LWP 23839. This is a slight variant of standard PMP.

  cat all.f | awk '
    BEGIN { s = "" }
    /^Thread/ { print s; s = "" }    # a new trace begins: emit the stack so far
    /^#/ {                           # a gdb frame line
      x = index($2, "0x")
      if (x == 1) { n = $4 } else { n = $2 }    # pick the function-name field
      if (s != "") { s = s "," n } else { s = n }
    }
    END { print s }' - | sort | uniq -c | sort -r -n -k 1,1

Deletes are fast and slow in an LSM

In an LSM, deletes are fast for the deleter but can make the queries that follow slower. The problem is that too many tombstones can get in the way of a query, especially a range query.

A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3, then that tombstone cannot be removed. It is hard to make the check (does a live key exist below me?) efficient.

I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard to find such details. Maybe this is something that LSM engine developers should explain more in public.

Confirming whether a tombstone can be dropped

This is based on code that I read ~2 years ago. Maybe RocksDB has changed today. Tombstones are dropped during compaction. The question is how much work (CPU and IO) you are willing to spend to determine whether ...
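A minimal sketch of that check, assuming a toy leveled LSM where each level is just a dict and key_may_exist is a hypothetical call standing in for the key-range and bloom filter lookups a real engine uses (RocksDB's actual compaction logic differs in detail):

  class LsmTree:
      """Toy model: each level maps key -> value, with None as a tombstone."""
      def __init__(self, levels):
          self.levels = levels  # list of dicts, level 0 first

      def key_may_exist(self, level, key):
          # A real engine consults level key ranges and bloom filters here
          # instead of reading data; this toy just checks membership.
          return key in self.levels[level]

  def can_drop_tombstone(tree, key, output_level):
      """True if no older version of key can exist below output_level.

      The check must be conservative: keeping a droppable tombstone only
      wastes space, but dropping one too early would resurrect deleted data.
      """
      for level in range(output_level + 1, len(tree.levels)):
          if tree.key_may_exist(level, key):
              return False  # an older version may live here; keep the tombstone
      return True  # nothing below can hold the key; the tombstone can go

  # The example from the post: a tombstone for "ABC" on level 1 cannot be
  # dropped while a live value for "ABC" remains on level 3.
  tree = LsmTree([{}, {"ABC": None}, {}, {"ABC": "v1"}])
  assert not can_drop_tombstone(tree, "ABC", 1)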

Comparative benchmarks and a question about describing data

I enjoy working on database performance benchmarks. I also enjoy writing about benchmarketing. Some of my focus is on comparative benchmarks rather than competitive benchmarks. Let me try to distinguish them. Both try to do the right thing, as in get the best result for each DBMS and try to explain differences. I will try to use these definitions going forward.

The goal for a competitive benchmark is to show that your system is faster than their system and, when it is, that result will be published.

The goal for a comparative benchmark is to determine where your system is slower than their system and file feature requests for things that can be improved.

I haven't done many competitive benchmarks because I haven't been on a product team for a long time, although MyRocks was kind of a product that I was happy to promote. I have been doing comparative benchmarks for a long time and that will continue in my new job at MongoDB. My product for 10 years was making MySQL better ...

PMP for on-cpu profiling?

PMP has been great for off-CPU profiling, as long as you remember not to strip the binaries. Percona shared a way to make flame graphs from PMP output. Maybe the next improvement can be a tool to make PMP useful for on-CPU profiling.

How? Remove all stacks that appear to be off-CPU (blocked on a mutex or IO). This won't be exact. I wonder if it will be useful. It won't remove threads that are ready to run but not running. Whether that is an issue might depend on whether a workload runs with more threads than cores.

Why? Assuming you already run PMP for off-CPU profiling, you have the thread stacks. Perhaps this makes them more useful.
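A rough filter along those lines, assuming aggregated PMP output with one "count stack" pair per line; the list of wait functions is a hypothetical starting point, not an exhaustive one:

  import sys

  # Frames that suggest a thread was off-CPU (waiting). These names are
  # common guesses; extend the list for your libc, kernel and workload.
  OFF_CPU = ("pthread_cond_wait", "pthread_cond_timedwait", "epoll_wait",
             "poll", "select", "nanosleep", "futex", "read", "pread",
             "fsync", "io_getevents")

  def looks_off_cpu(stack):
      """True if any frame in the comma-separated stack is a wait call."""
      return any(w in frame for frame in stack.split(",") for w in OFF_CPU)

  # Keep only stacks that look on-CPU; feed this the aggregated PMP output.
  for line in sys.stdin:
      count, _, stack = line.strip().partition(" ")
      if stack and not looks_off_cpu(stack):
          print(count, stack)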

The slow perf movement

I struggle with the presentation of performance data. This isn't about benchmarketing -- I try to explain performance and inform rather than create hype and FUD. This is about how I present results. One day I should take the time to learn from the experts on this topic. Two frequent problems for me are:

  balancing impact vs content
  describing performance differences

Balancing impact vs content

I am a member of the slow perf (report) movement. I don't want to make it too easy to read a performance report. Credit for the name goes to Henrik Ingo. It isn't easy to write a good performance report. The challenge is the right balance of content and conclusions. The conclusions create impact -- X is 3X faster than Y! -- while the content helps the conclusions look truthy. Write the content, then add the conclusions (executive summary). You don't want the reader to suffer through pages of detail trying to learn something. Performance reports have a context. ...

Setting up Ubuntu 18.04 server on an Intel NUC

I was finally able to buy a new NUC8i7beh. It was on backorder for a few months. Now I get to set it up and expand my home test cluster to 3 servers. This explains setting up Ubuntu 18.04.3 server. It is easier today than it used to be -- there is no need for an HWE-enabled kernel and wifi is easy to set up.

The first problem was installing Ubuntu 18.04.3 server. The default download is the live server ISO. It didn't work for me. I am not alone, and the workaround is to use the non-live ISO. My vague memory is that the default download in the past was the non-live ISO, as I didn't have this problem in the past.

Wifi step 1

Next up is wifi. The Ubuntu desktop UI makes that easy to set up but I am not using desktop. Fortunately this was easy:

  apt install wireless-tools
  apt install network-manager
  reboot (this started network manager)
  nmcli d (to confirm my wifi device was listed)
  nmcli d wifi list (to confirm my wifi network is available)
  nmcli d wifi connect $networkName (to connect ...

480p is my friend - video streaming when you don't have fiber

I live in a rural area and doubt I will ever get fiber. I have fixed wireless broadband. Maybe low earth orbit satellites will be an option in the future, whether that is via Starlink, Amazon or OneWeb. I get 16Mbps which is shared by 4 people. One user plays online video games and cares about latency. Video streaming can be a challenge as it frequently consumes too much download bandwidth. Upload is usually not an issue except for FaceTime and other video chat apps.

I put video streaming apps into one of 3 classes -- polite, knows better and rude.

Polite apps have video quality settings that are sticky. Set the video quality once to 480p and things are good.

Knows better apps know better. Set video quality to 480p today and it resets to auto the next time you use the app. Why? Because the app knows that HD video makes you happy even if you don't have the network bandwidth to support that.

Rude apps have no video quality settings. They use as much bandwidth as they ...

It is all about the constant factors

Assuming I put the performance bugs I worked on into one of three buckets -- too much big-O, too much constant factor and other -- then too much big-O would have the fewest entries. Maybe we worry too much about big-O and not enough about everything else? This post is inspired by a post and tweet by Dan Luu.

By too much big-O I mean using an algorithm with more computational complexity when one with less is available, and the difference causes problems in production. A common problem is using an O(n^2) algorithm when O(n log n) or O(n) is possible. By too much constant factor I mean an inefficient implementation. Everything else goes into the other bucket.

"It is all about the constant factors" was a reply to someone who dismissed performance concerns with "it is just a constant factor". That still amuses me. I assume that too much big-O is the least likely cause of performance problems given my classification scheme. This is not a rigorous study, but I wasn't able to find many too ...
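A toy illustration of the tension, with made-up cost constants: an O(n log n) algorithm that pays a large per-operation constant (cache misses, allocation, dispatch) loses to a tight O(n^2) loop until n grows past the crossover. The 1x and 50x constants below are invented to show the effect, not measured from any real implementation.

  import math

  def cost_quadratic(n):
      return 1 * n * n              # O(n^2) with a tiny constant factor

  def cost_nlogn(n):
      return 50 * n * math.log2(n)  # O(n log n) with a big constant factor

  for n in (10, 100, 1000, 10000):
      winner = "n^2" if cost_quadratic(n) < cost_nlogn(n) else "n log n"
      print(f"n={n:>6}: n^2={cost_quadratic(n):>12.0f}  "
            f"n log n={cost_nlogn(n):>12.0f}  -> {winner} wins")

With these constants the quadratic loop wins through n=100 and the asymptotics take over around n=1000, which is why "it is just a constant factor" can cut both ways.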

From disk to flashcache to flash

The past decade in database storage was interesting whether you stayed with local attach storage, used block & object storage from cloud or on-prem vendors, or moved to OSS scale-out storage like Ceph, GlusterFS and MinIO. I am writing about my experience and will focus on local attach.

Over the past decade the DBMS deployments I cared for went from disk to flashcache to flash on the HW side, and from MySQL+InnoDB to MySQL+MyRocks on the SW side. I assume that HW changes faster than DBMS software. DBMS algorithms that can adapt to such changes will do better in the next decade.

One comment I have heard a few too many times is that storage performance doesn't matter much because you can fit the database in RAM. More recently I hear the same except with RAM changed to Optane. I agree that this can be done for many workloads. I am less certain that it should be done for many workloads. That (all data in RAM/Optane) costs a lot in money, power and even space in the data center.