Posts

Showing posts from January, 2019

Less "mark" in MySQL benchmarking

My goal for the year is more time learning math and less time running MySQL benchmarks. I haven't done serious benchmarks for more than 12 months. It was a great experience but I want to learn new things. MySQL 8.0.14 has been released with fixes for a serious bug I found via the insert benchmark. I won't confirm whether it has been fixed. I hope someone else does. My tests and methodology are described in posts for sysbench, linkbench and the insert benchmark. I hope the upstream distros (MySQL, MariaDB, Percona) repeat my tests and methodology, and I am happy to answer questions about that. I even have inscrutable shell scripts that make it easy to run the tests. Despite being a lousy example of how to use Bash, they are portable enough to run on my home and work hardware.

Optimal configurations for an LSM and more

I have been trying to solve the problem of finding an optimal LSM configuration for a given workload. The real problem is larger than that: finding the right index structure and the right configuration for a given workload. But my focus is RocksDB, so I will start by solving for an LSM. This link is to slides that summarize my effort. I have expressed the problem using differentiable functions for the cost that is to be minimized. The cost functions have a mix of real- and integer-valued parameters whose values must be determined to minimize the cost. I have yet to solve the functions, but I am making progress and learning more math. This might be a constrained optimization problem and Lagrange multipliers might be useful. The slides are from a talk I am about to present at the MongoDB office in Sydney where several WiredTiger developers are based. I appreciate that Henrik Ingo set this up. My work has things in common with the excellent work by H
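As a rough illustration of how such a minimization could be set up (this is my sketch, not the model from the slides), a generic constrained solver such as scipy.optimize.minimize can search over per-level growth factors subject to a total-fanout constraint; the cost function, constants and variable names below are assumptions for the sketch.

```python
# Hypothetical sketch: minimize an LSM write-amp style cost over per-level
# growth factors, subject to a total-fanout constraint. The cost model and
# constants here are illustrative assumptions, not the model from the slides.
import numpy as np
from scipy.optimize import minimize

T = 1024.0   # assumed total fanout: size(database) / size(memtable)
n = 5        # assumed number of levels

def cost(f):
    # Toy per-level write-amp estimate: sum of the per-level growth factors.
    return np.sum(f)

constraints = [
    # Product of per-level growth factors must equal the total fanout T
    # (expressed in logs for numerical stability).
    {"type": "eq", "fun": lambda f: np.sum(np.log(f)) - np.log(T)},
]
bounds = [(1.0 + 1e-6, None)] * n  # each growth factor must exceed 1

result = minimize(cost, x0=np.full(n, T ** (1.0 / n)),
                  bounds=bounds, constraints=constraints)
print(result.x)  # with this toy cost, all levels converge to T**(1/n)
```

With this toy cost the solver just rediscovers the equal-growth-factor answer; the interesting cases in the slides are the ones where read, write and space costs pull the per-level choices in different directions.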

Bugs in Windows 10 parental controls

I use Windows 10 parental controls with my two children. Sometimes I am surprised at the bugs I encounter, but I can't rant too much because of glass houses and stones. My old favorite was that a hard reset before the time limit reached zero allowed my clever child to get more time. Apparently Microsoft takes storage efficiency very seriously and didn't want to waste a disk write and/or fsync on persisting the usage counter every few minutes. I haven't tried to reproduce this recently but never heard back after filing a bug report. Now I have a new favorite bug. I am 5 hours behind their timezone and granted another hour to my daughter. It is 4pm here and 9pm there. The landing page after granting the time tells me my child can use the computer until 5pm (my timezone). My child tries to log in and immediately encounters the timeout dialog. Apparently timezones are a hard problem. But less screen time is a good thing.

Geek code for LSM trees

This is a link to slides from my 5-minute talk at the CIDR 2019 Gong Show. The slides are a brief overview of the geek code for LSM trees. If you click on the settings icon in the slide show you can view the speaker notes, which have links to blog posts with more details. I also pasted the links below. Given time I might add to this post, but most of the content is in my past blog posts. Regardless, I think there is more to be discovered about performant, efficient and manageable LSM trees. The key points are that there are more compaction algorithms to discover, that we need to make it easier to describe them, and that compaction is a property of a level, not of the LSM tree (a small sketch of that last point is below). Links to posts with more details:
Describing tiered and leveled compaction
Number of levels that minimized write amplification
Combining tiered and leveled compaction
Tiered vs leveled, why not both
Name that compaction algorithm
Original LSM paper that got this started
Review of SlimDB with references to the f
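One way to read "compaction is a property of a level" is to let each level carry its own compaction style and fanout, so tiered and leveled can be mixed within one tree. The structure and field names below are hypothetical, not the notation from the slides.

```python
# Hypothetical sketch: each level of the LSM tree describes its own
# compaction style and fanout, rather than the tree having one global style.
from dataclasses import dataclass
from typing import List

@dataclass
class LevelConfig:
    compaction: str     # "tiered" or "leveled"
    fanout: int         # per-level growth factor
    runs_allowed: int   # max sorted runs before compaction into the next level

# Example: tiered for the smaller levels (less write-amp),
# leveled for the largest level (less space- and read-amp).
lsm_tree: List[LevelConfig] = [
    LevelConfig(compaction="tiered",  fanout=8, runs_allowed=8),
    LevelConfig(compaction="tiered",  fanout=8, runs_allowed=8),
    LevelConfig(compaction="leveled", fanout=8, runs_allowed=1),
]
```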

LSM math: fixing mistakes in my last post

My last post explained the number of levels in an LSM that minimizes write amplification, using 3 different estimates for the per-level write-amp. Assuming the per-level growth factor is w, the 3 estimates were approximately w, w+1 and w-1, named LWA-1, LWA-2 and LWA-3 in that post. I realized there was a mistake in that post in the analysis of LWA-3. The problem is that the per-level write-amp must be >= 1 (and really should be > 1), but the value of w-1 is <= 1 when the per-level growth factor is <= 2. By allowing the per-level write-amp to be < 1 it is easy to incorrectly show that a huge number of levels reduces write-amp, as I do for curve #3 in this graph. While I don't claim that (w-1) or (w-1)/2 can't be a useful estimate for per-level write-amp in some cases, it must be used with care. Explaining LWA-3: the next challenge is to explain how LWA-3 is derived. That comes from equation 12 on page 9 of the Dostoevsky paper. Start with the (T-1)/(K+1
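For reference, here is my reading of the total write-amp under the three estimates, with T the total fanout, n the number of levels and w = T^(1/n) the per-level growth factor; the exact forms are a reconstruction, not a quote from the post.

```latex
% Total write-amp with n levels, total fanout T, per-level growth w = T^{1/n},
% under the three per-level write-amp estimates named in the post.
\begin{align*}
  \text{LWA-1:}\quad \mathit{WA}_{\mathrm{total}} &= n \, w       = n \, T^{1/n} \\
  \text{LWA-2:}\quad \mathit{WA}_{\mathrm{total}} &= n \, (w + 1) = n \, (T^{1/n} + 1) \\
  \text{LWA-3:}\quad \mathit{WA}_{\mathrm{total}} &= n \, (w - 1) = n \, (T^{1/n} - 1)
\end{align*}
% For LWA-3 the per-level term T^{1/n} - 1 drops below 1 once n > log2(T),
% and n (T^{1/n} - 1) decreases toward ln(T) as n grows, so that curve has
% no interior minimum -- which is the mistake discussed above.
```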

LSM math: revisiting the number of levels that minimizes write amplification

I previously used math to explain the number of levels that minimizes write amplification for an LSM tree with leveled compaction. My answer was one of ceil(ln(T)) or floor(ln(T)), assuming the LSM tree has total fanout = T where T is size(database) / size(memtable). Then I heard from a coworker that the real answer is less than floor(ln(T)). Then I heard from Niv Dayan, first author of the Dostoevsky paper, that the real answer is larger than ceil(ln(T)) and the optimal per-level growth factor is ~2 rather than ~e. All of our answers are correct. We have different answers because we use different functions to estimate the per-level write-amp. The graph of total write-amp using the different cost functions is here and you can see that the knee in the curve occurs at a different x value for two of the curves, while the third curve doesn't appear to have a minimum. While working on this I learned to love the Lambert W function. But I wonder whether I made the
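For the simplest estimate, where the per-level write-amp equals the per-level growth factor w, the calculus behind the ln(T) answer goes roughly as follows; this is my reconstruction under that assumption, not a quote from the earlier post.

```latex
% Per-level write-amp estimated as the growth factor w = T^{1/n}:
% total write-amp is f(n) = n * T^{1/n}.
\begin{align*}
  f(n)  &= n \, T^{1/n} = n \, e^{\ln T / n} \\
  f'(n) &= e^{\ln T / n} \left( 1 - \frac{\ln T}{n} \right) = 0
           \;\Rightarrow\; n = \ln T \\
  w     &= T^{1/n} = T^{1/\ln T} = e
\end{align*}
% Since n must be an integer the minimum is at floor(ln T) or ceil(ln T),
% and the optimal per-level growth factor is ~e. Estimating the per-level
% write-amp as w+1 or w-1 instead shifts the minimum, which is why the
% answers above differ.
```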

Define "better"

Welcome to my first rant of 2019, although I have written about this before. While I enjoy benchmarketing from a distance, it is not much fun to be in the middle of it. The RocksDB project has been successful and thus becomes the base case for products and research claiming that something else is better. While I have no doubt that other things can be better, I am wary about the definition of better. There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency; the last two do not. I'd rather not ignore efficiency. The marginal return of X more QPS eventually becomes zero, while the benefit of using less hardware is usually greater than zero.
1. Optimize for throughput and ignore efficiency (faster is better)
2. Get good enough performance and then optimize for efficiency
3. Get good enough efficiency and then optimize for throughput
Call to action: I forgot to include this before publishing. Whether #1, #2 or #3

Review of LSM-based Storage Techniques: A Survey

Chen Luo and Mike Carey published a wonderful survey of research on LSM algorithms. They know about LSM because the AsterixDB project includes an LSM. They did a great job explaining the LSM space, telling a coherent story and summarizing relevant papers. Reading this paper was a good use of my time and I found a few more papers to read in their references. I have read a few papers, including TRIAD, with ideas on reducing write-amp for the smaller levels of the LSM tree. I think this could be done for RocksDB by merging and remerging immutable memtables -- this is similar in spirit to subcompactions for the L0. With a large immutable memtable there would be one less level in the LSM tree. This is an alternative to having an L0, and maybe an L1, that are not made durable. In all cases the cost is a longer MTTR because WAL replay must be done. In all cases there is an assumption that the non-durable levels (large immutable memtables or L0/L1) are in memory. This is a small complaint f
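A toy sketch of the merge-and-remerge idea for immutable memtables, in illustrative Python rather than RocksDB code; the function name and the tombstone convention (None marks a delete) are assumptions for the sketch.

```python
# Toy sketch of the "merge and re-merge immutable memtables" idea: combine
# several in-memory runs into one larger sorted run before anything is
# written to the smallest on-disk level. Illustrative Python, not RocksDB
# code; None is used as a tombstone for deletes.
def merge_memtables(memtables):
    """memtables: list of dicts ordered from oldest to newest."""
    merged = {}
    for mt in memtables:          # newer memtables overwrite older entries
        merged.update(mt)
    # Tombstones are kept because older versions of a key may still exist
    # in the on-disk levels of the LSM tree.
    return sorted(merged.items())

runs = [{"a": 1, "c": 3}, {"b": 2, "c": None}]
print(merge_memtables(runs))      # [('a', 1), ('b', 2), ('c', None)]
```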