Friday, October 23, 2020

LSM local secondary indexes (LSM LSI)

This expands on my previous post about RID-lists for RocksDB. RocksDB doesn't provide secondary indexes, nor does it know about schemas but applications that use RocksDB can add both and MyRocks is an example of that.

Many DBMS use the concept of local vs global secondary indexes and partitioned tables are one place where this matters. Assuming a B-Tree is used for the index, then a local secondary index with a partitioned table has a B-Tree index per-partition. The benefit of this is that DROP PARTITION is fast -- delete the table and index files for that partition. The cost from this is that a secondary index query might have to probe indexes per-partition and this can use more CPU and IO.

Without defining a local secondary index for an LSM I assert that secondary indexes with MyRocks are global. Global secondary index entries for an LSM use the PK value to reference the base row. Side effects of this include larger secondary index entries and CPU overhead when an LSM point read must be done to find the base row. Any DBMS with a clustered PK index, like InnoDB, also has these costs.

For the following I haven't done an extensive literature search to understand whether this idea has been proposed. Regardless, I hope that LSM LSI (see below) is the name to be be used.

Local secondary indexes for an LSM

The concept of local secondary indexes also applies to a Log Structured Merge tree. An LSM global secondary index (LSM GSI) uses the PK value because the base row is in the primary key index and is likely to be relocated over time by compaction. Therefore, either secondary index entries must be updated with a new location (rowid) during compaction of the PK index or the secondary index entries must use the PK value. There are more complex solutions that I will ignore.

There are other costs when secondary index entries uses the PK value rather than a rowid to find the base row -- it is harder to implement RID-lists and bitmap indexes. But I think this can be fixed by an LSM local secondary index (LSM LSI).

I define an LSM LSI to be local to a sorted run. When compaction merges sorted runs into a larger sorted run, then a new LSM LSI will be created for that new sorted run. The cost from this is fanout for queries that have a predicate on the secondary index and cannot be pruned to a few sorted runs. Pruning to a few sorted runs is most likely to be enabled via predicates on the PK.

The lifetime of the LSM LSI is the same as the lifetime of the data it indexes. Therefore, LSM LSI entries can use something like a rowid rather than the PK to reference the base row. This makes it much easier to implement RID-lists and bitmap indexes for an LSM. The rowid has two parts -- (block#, offset) where block# is a value from 0 to N-1 when an SST has N blocks and the offset is the position of the row in the block. The rowid usually needs <= 4 bytes.

Fanout is less of a concern for a bitmap index because many analytic queries expect to scan a large part of the index. But there will be more merging of bitmap index entries when LSM LSI is used, just as there would be with local bitmap indexes on a partitioned table with a famous SQL DBMS.

References


Wednesday, October 21, 2020

Self-aware DBMS, RID-lists for LSM trees

This is a short writeup of two ideas I might never have the time to implement -- self-aware DBMS and RID-lists for LSM tree.

Self-aware DBMS

Tuning a B-Tree index structure is hard and tuning an LSM tree is usually harder. Fortunately it will get easier over time. We tune a DBMS today via knobs and there is clever work in progress to automate changing of those knobs. But we can do even better. I hope one day for self-aware DBMS algorithms that understand the impact of their knobs, accept declarative goals & priorities from a user and are able to improve performance with respect to an objective function. Lets move the cleverness into the DBMS.

RID-lists for LSM trees

tl;dr - can I use the merge operator to implement RID-lists for an LSM?

I worked on bitmap indexes at Oracle. It would be fair to call them compressed RID-lists. Many smart people worked on that code in Oracle prior to me and the inventor, Gennady Antoshenkov, has a strong record of innovation. IBM DB2 has RID-lists, but I don't know whether they can be compressed. All of these are useful for analytics whether this is a bitmap index, compressed RID-list or uncompressed RID-list.

Postgres 13 has index deduplication which looks like an uncompressed RID-list and I look forward to new query execution features to do the equivalent of bitmap-AND and bitmap-OR.

But this is about a RID-list for an LSM tree. A RID-list is useful for an index that can have many duplicates for a key -- meaning some secondary indexes and not for a primary index. The LSMs I care about don't know much about schema and secondary index maintenance requires knowledge of schema but I will ignore that for now. MyRocks uses RocksDB for primary and secondary indexes, so this is something that can be solved by an application. WiredTiger supports schemas and secondary indexes without also providing a query language and someone could add the same to RocksDB.

I am curious whether the merge operator can be used to implement RID-lists? Disclaimer - I am far from an expert on the merge operator.

  • the rowid is the value of the primary key. Secondary index entries for an LSM and for Oracle IOT use the primary key value as the rowid because the base row can be relocated (during compaction with an LSM, during B-Tree leaf node splits and merges with IOT). Bitmap indexes for Oracle IOT solves this via a mapping table. If that is not done then the problem is that the primary key value might be large and a list of them might not be compression friendly.
  • the delta that is written by the merge operator specifies whether a rowid is added or removed for a given secondary key value. I wonder if something like SingleDelete is needed.
  • the callback function that combines merge operator entries during compaction will merge entries to create larger RID-lists

Tuesday, October 20, 2020

MyRocks in MySQL 5.6 and 8.0

I am in the process of upgrading my home NUC cluster from Ubuntu 18.04 to 20.04 and that means I get to use newer versions of gcc. It also means I lose easy access to older versions of gcc. I assumed that I wouldn't be able build MyRocks and MySQL 5.6 from source in Ubuntu 20.04 but I never confirmed it and I might be wrong. I have been using gcc 7 on Ubuntu 18.04 which is still available in 20.04, if I am willing to manage multiple versions of gcc. Regardless it is time for me to begin using MyRocks with MySQL 8.

I have results for the insert benchmark to compare MySQL 5.6 and 8.0 where both use MyRocks and the results are what I expected -- I lose between 10% and 20% of throughput (queries or inserts /second) because 8.0 has more CPU overhead. Note that this is for a workload with low-concurrency and my NUC servers have 4 cores with HT disabled. This is similar to what I see when comparing InnoDB between MySQL 5.6 and 8.0. I prefer no new overhead, but new overhead is inevitable and the amounts here seem reasonable given these are two versions apart -- 5.6 to 5.7 to 8.0.

Update - the numbers above include the CPU from the benchmark client and mysqld that share the same host. When I limit the CPU measurement to mysqld then the difference is 1.4X during load without secondary indexes, that is mysqld in MySQL 8.0 uses 1.4X more CPU to do the same amount of work during the load. While I am OK with 1.2X more CPU (~20% mentioned above) I think that 1.4X more is too much.

Background reading that has more detail:


Configuration

Three workloads were used -- in-memory, IO-bound and extra IO-bound. The in-memory workload loads 20m rows without secondary indexes, creates 3 secondary indexes, loads 20m more rows and then does 12 30-minute read+write tests where the writer is rated limited to 100, 200, 400, 600, 800 and then 1000 inserts/s. The IO-bound workload is similar but loads 100m rows without secondary indexes and then 10m with secondary indexes. The extra IO-bound workload is similar but loads 500m rows without secondary indexes and then 10m with secondary indexes. For my sake, here are the command lines for my helper scripts.

The load and create index test steps are single-threaded. The read+write test steps use one query thread and one rate-limited writer thread.

MyRocks was used with MySQL 5.6.35 and 8.0.17 from the FB branch. The 5.6 version was compiled and run using Ubuntu 18.04 while the 8.0 version was compiled and run with Ubuntu 20.04. The my.cnf contents are here for 5.6 and for 8.0.

Performance summaries that can be interpreted after reading this are here for the in-memory, IO-bound and extra IO-bound workloads. The most important columns are qps and ips for throughput (queries/s, inserts/s). The throughput differences are usually explained by cpupq (cpu/query or cpu/insert depending on the test step).

The table below lists the throughput ratio for each test step as (8.0 value / 5.6 value) and a number less than 1 means that 8.0 is slower than 5.6. I didn't list values for all of the test steps to avoid long lines. The test steps are:
  • l.i0 - load without secondary indexes
  • l.x - create secondary indexes
  • l.i1 - load with secondary indexes
  • qN.2 - read+write where N is the number of inserts/s

                l.i0    l.x     l.i1    q100.2  q200.2  q800.2  q1000.2
in-memory       .79     .99     .87     .89     .89     .88     .88
io-bound        .82     .99     .87     .99     .89     .90     .91
extra io-bound  .80     1.03    .87     .92     .90     .91     .92

Thursday, October 15, 2020

Better vs optimal

I read many systems papers and appreciate how research makes my career more interesting. Some interesting papers use optimal when they mean better in describing performance and I wonder if I am an outlier in thinking that optimal is occasionally misused. Optimal is formal and requires one of:

  • A proof
  • An objective function and math to show where it is minimized or maximized
  • Exhaustive search
In the context of tuning a DBMS, exhaustive search is rarely done because the search space is large. Assume there are N tuning options that each use a B-bit integer, then the search space has N*B bits and N*B quickly becomes a too-big number for all of the systems on which I work.

Objective functions and math are wonderful but they work on performance models and it is a challenge to accurately model a complex system. Proofs share that challenge.

Fortunately, there is clever work in progress to get better results without doing exhaustive search. Alas those results are better, not optimal. And now an unsponsored ad -- I appreciate that OtterTune is not only doing interesting work but they have been careful about the distinction between better and optimal.

Wednesday, October 14, 2020

Comments on: The Unwritten Contract of Solid State Drives

The Unwritten Contract of Solid State Drives by Jun He, Sudarsun Kannan, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau was published in EuroSys 2017. The paper is interesting and worth reading. I hope for more papers on this topic and recently shared a review on another paper from UW-Madison on this topic. This tries to be a brief review of the paper. I only review papers that I think are useful to me and others.

I didn't notice this paper until recently. While I wish I didn't miss interesting papers like that, it is hard for me to keep up with so many great papers getting published. Regardless, I assume that the RocksDB team is open to discussions with people doing research in this space. I am too and this post was half-serious.

5 rules

The paper presents five rules that are frequently true for an SSD and explains that all rules aren't true for all SSDs. Following the rules often leads to better SSD performance. 

The objective functions in the paper are throughput and write amplification. Quality of service (QoS) wasn't included and maximizing throughput can hurt QoS. Many applications want maximum throughput that respects QoS where QoS can be defined by p99 and worst-case response times. So I hope that a future paper includes QoS. The paper explains that it focuses on internal states of the SSD rather than end-to-end performance, but QoS can also be measured via internal states because that is where many stalls happen (read delayed because slower program or erase in progress).

There is also a difference between using all device throughput that a workload needs vs using all of the device throughput. Saturating device throughput isn't a goal. Getting as much throughput as needed for a workload is a goal. A device offers several things -- capacity, throughput and low-latency operations. It is hard to saturate both capacity and throughput while not hurting QoS. 

The rules are:

  • Request scale - issue large requests or concurrent small requests to get more throughput and benefit from internal parallelism of the SSD
  • Locality - access with locality (temporal and/or spatial). Possible benefits include reduced translation cache misses, reduced data cache misses and reduced write-amp from flash GC
  • Aligned sequentiality - write files sequentially using large write requests. For an FTL that can do hybrid block mapping this requires fewer entries in the translation table.
  • Grouping by death time - do writes so that all data in an erase block is discarded at the same time. 
  • Uniform lifetime - structure data so that all writes have a similar lifetime. Note that this is stricter version of grouping by death time and unlikely to be followed by a DBMS except for workloads that are write-once and then read-only.
The paper states that an SSD might not require following all of the rules. Alas, it can be difficult to determine which rules should be followed and for a given device, or even a given firmware version.

A summary of the Linux block IO layer is useful background reading.

5 Rules and an LSM

Can an LSM follow these rules?
  • Request scale
    • Compaction reads with RocksDB are a (compressed) block at a time and the uncompressed block size is likely ~4k, ~8k or ~16k. The block sizes are approximate, usually < the target size and not aligned. The compaction_readahead_size option in RocksDB can be set to get large read requests otherwise you rely on filesystem readahead and it is risky to make read_ahead_kb large because that is per-device.
    • Compaction writes with RocksDB are a page at a time. Each SST is written sequentially and fsync is called when the SST is full (usually ~64M). Async write-behind can be enabled via the bytes_per_sync to avoid a storm of writes on fsync and to get large write requests of size bytes_per_sync.
    • WAL writes can use the wal_bytes_per_sync option
    • For user reads, many concurrent users are the easy way to get concurrent small reads. I am less certain about options for doing prefetch or getting parallel IO requests for a user query. Back when I did small data at FB getting concurrent IO for one user request wasn't a priority because the focus was sharing a device across many users.
  • Locality
    • Compaction reads & writes respect this as they read & write files sequentially. 
    • Whether user reads respect this depends on the workload.
  • Aligned sequentiality - compaction writes respect this
  • Grouping by death time
    • I know this via the multistream effort and even collaborated on an evaluation. The results in my eval weren't significant so I should read papers by others to understand where it was a big deal.
    • Grouping by space requires an API like multistream. With that it is easy to group for an LSM. The upper/smaller levels of the LSM tree have shorter lifetimes than the lower/larger levels.
    • It is usually impossible to do this without an API. Because I don't know how many erase blocks are concurrently open. Nor do I know the size of an erase block. I risk using the wrong name as I am not an expert on this but I mean the logical erase block that is managed by the FTL not the physical erase block that is per NAND chip. The logical erase block is striped over many NAND chips for a few reasons, including to prevent the loss of data when a NAND chip fails. This structure is rarely explained by vendors.
  • Uniform lifetime - I doubt any DBMS can follow this except for data that is write once.
Workloads

The paper then runs workloads for SQLite, LevelDB and RocksDB using SSD simulators to determine whether the rules are followed. The workloads used db_bench for RocksDB however there weren't enough details on the command line options or RocksDB configuration. One of the comments mentions that most of the activity was between the L0 and L1 in the LSM tree and I wonder if the databases were too small and didn't have enough LSM tree levels.

Observations

The paper then has many interesting observations on whether the rules were respected and explains cases where they weren't. Things that prevent the rules from being followed include the DBMS, the filesystem and the workload. I have comments on some of the observations
  • Observation 1 - Application log structure increases the scale of write size. This is expected with an LSM. I appreciate that this paper distinguishes between IO patterns for the application (LSM writes SST files sequentially) and for the device (concurrent compaction means that a device interleaves large write requests from multiple files).
  • Observation 2 - The scale of read requests is low. It is hard to know whether this is a function of the system software (bottlenecks & missing features) or the workload (working set cached, not many concurrent users).
  • Observation 3 - RocksDB doesn't consume enough device bandwidth. This is hard to appreciate without more details on the benchmark setup. Note that I don't care about saturating device throughput unless QoS is also a goal.
  • Observation 4 - Frequent data barriers limit request scale. The obvious one is fsync/fdatasync although concurrency usually overcomes that. For reads, prefetching and readahead can overcome but this wasn't a priority for the workloads I cared about because they had many concurrent users. There are things that can be done, and might already be in RocksDB, to get more concurrent IO reads to a device when apps have low concurrency.
  • Observation 5 - Linux buffered I/O limits request scale. Fortunately, RocksDB can also run with O_DIRECT. I wasn't clear on the claim that read_ahead_kb limits the size of a read. Perhaps this is true for reads smaller than read_ahead_kb. Perhaps I can ask the block IO expert. For compaction, the compaction_readahead_size option can get larger reads for compaction. For scans I am not sure if RocksDB has a way to get readahead (multi-page reads) when needed.
  • Observation 8 - be wary of deferring discard. The variety of discard performance across devices is large and not well documented. I hope more research is done on this.
  • Observation 9 - SSDs demand accurate and aggressive prefetching. I am wary of inaccurate and aggressive prefetching. An LSM can benefit from prefetching for compaction reads (RocksDB has an option for that). Prefetching for an LSM during user reads is interesting. For a point query prefetching the lower levels of the LSM tree can be wasted work when the data is found in the upper levels of the tree. Prefetching for a scan is easier because data from all levels must be read. However, the amount of readahead to use during a scan is more complicated because the LSM usually isn't told how large the scan will be (first 10 rows vs all rows). There might be options in RocksDB to get parallel reads on behalf of a single user query.
  • Observation 17 - application log structuring does not reduce garbage collection. I am skeptical about this. Also, the paper shows that it doesn't prevent GC. It does not show that it doesn't reduce it. I have results from 2016 to show that flash GC write-amp from MyRocks was less than from InnoDB. Of course, that is specific to one device. Regardless I appreciate that the paper explains some of the issues including that a device interleaves large write requests from many files with concurrent compaction. I am wary of the misleading sequential writes message that is repeated in too many papers.

Thursday, October 8, 2020

MySQL 8 is greater with MyRocks

I compiled MyRocks for MySQL 8 on Ubuntu 20.04 and will soon use it for performance tests. I did this because I am upgrading my NUC cluster to Ubuntu 20 and am not willing to get an older gcc toolchain on it to build MyRocks in MySQL 5.6 from source. It took a few hours to figure this out, but that is a one time cost.

This explains how to build from the Facebook MySQL fork in Github. I suggest that everyone else use MariaDB or Percona Server if they want to try MyRocks. I write posts like this because I am likely to need this information in the future.

The steps are:

  1. Install dependencies listed here. I might be missing a few as I did this on an install I have been using for a while and might have installed others in the past.
  2. Get the source from github
  3. Checkout the 8.0.17 branch via git checkout remotes/origin/fb-mysql-8.0.17 -b fb8017
  4. cd fb8017
  5. Get RocksDB source via git submodule init; git submodule update
  6. Get Boost 1.69. The previous step installed it as a submodule but I wasn't able to use it and just copied it from elsewhere.
  7. mkdir build; cd build
  8. Run cmake via this script. The script is good for me. It might not be good for you. But it is important to set extra compile flags for RocksDB, otherwise you risk having a non-performant build. See problem 4 below. The script expects Boost 1.69 to be copied to $BUILD/../boost_1_69_0
Problems:
  1. This is harder than it should be but the FB MySQL tree isn't meant for the community to build from. It is there to share diffs with the community.
  2. Some RocksDB #defines don't have ROCKSDB_ as a prefix so there is a greater chance of conflict with code from elsewhere
  3. I wasn't able to figure out how to get the Boost submodule working so I copied Boost 1.69 from elsewhere
  4. Some RocksDB compilation flags that are used for a pure-RocksDB build are not used in the MyRocks build and this can hurt performance. So I build RocksDB directly and copy some of the flags from there into my cmake script. This is a hassle and is error prone, but again, this source tree isn't mean for community builds
One hint that you have a non-performant RocksDB build is this string in the LOG file
Fast CRC32 supported: Not supported on x86
While this string is OK:
Fast CRC32 supported: Supported on x86

Comments on: Optimizing Databases by Learning Hidden Parameters of Solid State Drives

I just read an interesting paper: Optimizing Databases by Learning Hidden Parameters of Solid State Drives by Aarati Kakaraparthy, Jignesh Patel, Kwanghyun Park and Brian Kroth. Reading the paper was a good use of my time. I hope there are more papers on this topic. This isn't a review of the paper but a list of questions and comments on the paper. The paper is well written and easy to read, so a review by me wouldn't add much.

Storage Performance

The reason for the paper is that the performance profile of an SSD is complicated and complicated in different ways than the performance profile of a disk. It might be easy to predict throughput from a disk given a transfer rate, rotational and seek latency. But that gets a more complicated when using the outer vs inner tracks of the disk. And it gets even more complicated with concurrency and the impact of the IO scheduler.

Predictions for a disk are easy but complicated (that might sound odd) but predictions for an SSD are less easy and more complicated and many of the details aren't explained by the SSD vendors. The paper explains how to derive some of the design decisions via testing and then uses that to improve performance for MariaDB and SQLite.

If you care about performance an SSD is a black-box. SSD vendors might feel the same way about the workloads that run on SSDs. I wrote innosim (InnoDB IO simulator) a long time ago to share a workload that I care about with SSD vendors. It is fun to see results from it on the web.

The paper is about direct attach SSD. I wonder if there is an opportunity for interesting research on cloud SSD. With EBS requests <= 256KB count as one IO so you will get more throughput if you can figure out how to use larger requests.

Comments on the paper

It is possible that I missed a few things in the paper, and I am happy to be corrected.

The desirable write request size is determined by creating files using writes of size X KB (X <= 512) and then computing the read latency for reads of size 1MB per the pseudo-code for Experiment 1. The value of X that provides the minimum read latency is the desirable write request size.

  • The desirable size is likely to be much larger than 16kb but if you are using an update-in-place B-Tree for OLTP then you are unlikely to do that as leaf pages will be small (<= 16kb). Write requests can be larger with an LSM (SST files are > 1MB) or a copy-on-write B-Tree.
  • I am not sure about using min latency for 1MB reads as the objective function to determine the best write request size. It is unlikely that 1MB reads will be done from a B-Tree. With an LSM that will occur during compaction, although I am not sure that optimizing for compaction reads is a goal. Large reads are common with a heap-organized table (Postgres), especially when doing analytics. So I think more nuance is needed and that could be included in another paper.
  • What is the impact from concurrent writes? For InnoDB either page writeback or doublewrite IO will be concurrent with binlog or redo IO. For MyRocks compaction writes will be concurrent with binlog or WAL IO. Assuming it were possible to use the desirable write request size, how is this impacted by concurrent writes because requests can be interleaved by the block IO layer and by the SSD. Note that if the SSD erase block size is <= 64MB then it might be nice to let an LSM not share erase blocks between SST files as not sharing means there is nothing to copy-out of the erase block during GC (all or none of the erase block is in use). I probably should re-read this document once a year as Jens Axboe does a great job explaining what happens in the block IO layer for large requests.
  • What is the impact from GC? Writes done today can be relocated tomorrow by flash GC.
I hope more is written to explain the difference between desirable write request size and stripe size and now I will try to read other papers to better understand stripe size. From table 1 they are equal for 2 of the 4 devices tested. They aren't equal for SSD-S and I think that was a judgement call based on the graphs in figure 3.

The paper didn't try to discover the desirable concurrency level (number of channels) as that has been addressed by previous work. But I hope they include that in future work.

IO patterns and optimization goals vary by file structure. For MySQL with InnoDB the important IO is listed below. Given that reading redo and binlogs from storage aren't (or shouldn't be) the common case the desirable write request size for them is one that minimizes write-amp:
  • InnoDB database files - writes of size 16 KB (sometimes 8KB), uses either buffered IO or O_DIRECT, single-page reads and writes are frequent
  • InnoDB undo - I will let someone with more expertise summarize this
  • InnoDB redo - writes of size 512*x, frequent fsync, buffered IO, file is pre-allocated, rarely read. We suffered from redo read IO back in the day for InnoDB redo when it didn't stay in cache (large redo isn't free). The problem was that writing the first 512 bytes of a file system page not in cache turns into read-modify-write. I don't know the status of this today.
  • InnoDB doublewrite buffer - large writes, probably a small multiple of 512 KB to the same location in the system tablespace, uses either buffered IO or O_DIRECT. This might have changed recently in MySQL 8.
  • binlog - small writes of any size, frequent fsync, file grows with each write, reads are usually done from the OS page cache, not from storage
TRIM is another black box and a topic for future work. TRIM performance varies across devices. I was spoiled by early Fusion IO devices where TRIM was fast. Modern devices have differing capacities for the number of files and number of MB that can be trimmed per second. There will be stalls when those rates are exceeded. Alas, these capacities are not explained. As a workaround, R.ocksDB has an option to rate limit file deletion. Domas Mituzas has written about and shared code to manage file deletion done by value-added processes that run on DBMS HW to keep them from hurting DBMS QoS.