Friday, March 27, 2020

kernel lockdown vs blktrace

I am trying to use blktrace to determine which files are the source of disk reads for a database that uses buffered IO. The server runs Ubuntu 18.04 on bare metal and boots with UEFI, not legacy BIOS.

blktrace doesn't work, time to debug. Things that don't fix it include upgrading from 4.15 to 5.3 HWE kernel and disabling apparmor. Next up is disabling kernel lockdown via mokutil --disable-validation. Alas, blktrace still fails at startup.

After running mokutil and rebooting there are still a few messages about lockdown in the dmesg output, so I wonder whether it was fully disabled.
Lockdown: Hibernation is restricted; see man kernel_lockdown.7
Lockdown: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
OK, let's read the man page. Great, it doesn't exist -- not for Ubuntu nor for other distros. There is a draft but I am starting to get the impression that lockdown wasn't ready for prime time. And Linus had a strong opinion about it in 2018.

Next up is a strong opinion from Brendan Gregg.
Many distros are enabling lockdown, breaking BPF. This is the worst OS change I've ever seen.
OK, maybe my problem is lockdown and mokutil wasn't sufficient. Time to try:
echo 1 > /proc/sys/kernel/sysrq; echo x > /proc/sysrq-trigger 
And now blktrace works. Well, until I reboot. I already have a script to run after reboot to reduce security so that PMP can run. That script just got larger:
# allow PMP to use perf and gdb
sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"
sudo sh -c "echo 0 > /proc/sys/kernel/yama/ptrace_scope"
sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"
# lift kernel lockdown so blktrace and BPF work
sudo sh -c "echo 1 > /proc/sys/kernel/sysrq"
sudo sh -c "echo x > /proc/sysrq-trigger"

Monday, March 16, 2020

Insert Benchmark v3

I expect to replace the insert benchmark later this year. The insert benchmark I have been using might be ibench v2 as ibench v1 came from Tokutek. So the replacement I write would be ibench v3.

The reasons for a replacement include:
  • Switch from Python to something faster (Java or Go) because it is difficult to sustain high QPS rates (100k/second or more) with Python, and a Java or Go client will consume less CPU. I assume the DBMS I care about (MongoDB, MySQL, Postgres) all have good golang clients.
  • Make the benchmark less synthetic. Early ibench was insert-only for a table with a PK and 3 secondary indexes. Eventually I added support for short range queries. It has always used uniform distribution. The plan for ibench v3 is to model a monitoring workload -- many inserts, some updates, many queries. It will use a PK and 1 or 2 secondary indexes.
  • Continue to support multiple DBMS -- MongoDB, MySQL and Postgres.
  • As a bonus, don't ignore coordinated omission
I still need to confirm whether this is worth implementing or whether I have a case of NIH and should be using TSBS. I browsed TSBS and the indexes I describe here might match the indexes that TSBS would use. However, TSBS has more query diversity. My focus is on the storage engine and a smaller set of queries is sufficient. Regardless, I need to spend more time reading about TSBS.

Logical Schema

The logical schema describes the data in one insert operation. The physical schema describes how that is stored in the DBMS. I focus on the logical schema here.

The logical schema is: timestamp, deviceID, [metricID, metricValue]+ (a generator sketch follows this list)
  • This is the data from one insert for a given device. It has data for 1+ metrics.
  • timestamp is a 64-bit integer that will be mostly increasing on insert. Clocks are not synchronized across devices, some devices will have bad/stuck clocks and some devices will have clock drift.
  • deviceID is a 64-bit integer. The number of devices will be used to scale the workload. There can be millions of devices for large scale factors.
  • Each (metricID, metricValue) pair represents the value for a metricID from a given device at a given point in time. metricID is a 32-bit integer. The number of metrics will also be used to scale the workload. There will be fewer metrics than devices. metricValue is a 64-bit integer.
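To make that concrete, here is a minimal Python sketch of a generator for one logical insert. The constants and the skew_ms knob are illustrative choices of mine, not part of the spec:

import random
import time

NUM_DEVICES = 1_000_000   # |d|, used to scale the workload
NUM_METRICS = 100         # |m|, fewer metrics than devices
PAIRS_PER_INSERT = 10     # (metricID, metricValue) pairs per insert

def next_insert(device_id, skew_ms=0):
    # skew_ms simulates a device clock that is ahead of or behind real time
    t = int(time.time() * 1000) + skew_ms
    pairs = [(m, random.getrandbits(63))  # metricValue is a 64-bit integer
             for m in sorted(random.sample(range(NUM_METRICS), PAIRS_PER_INSERT))]
    return (t, device_id, pairs)

print(next_insert(device_id=42, skew_ms=-500))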


Fields are abbreviated when describing the workload: t=timestamp, d=deviceID, m=metricID, v=metricValue. Operations are described in SQL, except for insert.

The workload is a combination of:
  • Get: select v, t from C where m=M and d=D and t between T1 and T2 order by t
    • Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
  • Max, Min: select v, t, d from C where m=M and t between T1 and T2 order by v asc/desc limit N
  • Compare: call Get for m in (...) and d=D and t between T1 and T2
  • Insert: insert into C values (t, d, [(m1, v1),(m2, v2),...])
    • SQL above is not standard and assumes a schema that might not be used.
    • (t,d,m) is unique. I am still deciding whether the benchmark client can generate intermittent duplicate inserts and how that should be handled.
  • Rollup: Uses Compare to get all data for d in (...) in time range, then replaces that data with values aggregated over coarser granularity. Assume that base data is rolled up into 5 minute, then 1 hour and then 1 day granularity.
  • Backfill: inserts missing data that might be hours or days old

The value of N in Get, Max and Min above will vary. For Max and Min it is likely to be <= 100 and frequently <= 10. For Get in some cases the user wants all data in a time range, but in other cases they will set N to be <= 10,000 and not get all data in the time range.
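For reference, here is a sketch of these operations as parameterized SQL strings, assuming the one-row-per-metric schema described in the physical schema sections below (table C with columns t, d, m, v). The placeholder style is DB-API and nothing here is tuned for a specific DBMS:

# Get and its two limit variants
GET        = ("SELECT v, t FROM C WHERE m = %s AND d = %s "
              "AND t BETWEEN %s AND %s ORDER BY t")
GET_NEXT_N = ("SELECT v, t FROM C WHERE m = %s AND d = %s "
              "AND t >= %s ORDER BY t LIMIT %s")
GET_PREV_N = ("SELECT v, t FROM C WHERE m = %s AND d = %s "
              "AND t <= %s ORDER BY t DESC LIMIT %s")
# Max and Min differ only in the sort order on v
MAX_N      = ("SELECT v, t, d FROM C WHERE m = %s "
              "AND t BETWEEN %s AND %s ORDER BY v DESC LIMIT %s")
MIN_N      = ("SELECT v, t, d FROM C WHERE m = %s "
              "AND t BETWEEN %s AND %s ORDER BY v ASC LIMIT %s")

Compare is then a loop that runs GET once per metricID in the list.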

I decided to not have a Get query variant, called GetD, that uses where d=D and t between … (it lacks a predicate on m) because this data is provided via the Compare query assuming that the metricIDs are known. If that assumption isn’t true and GetD were needed then an extra secondary index on (d,t) would also be required to make that query efficient unless the DBMS supported index skip scan via an index on (d,m,t,...).

Open Issues
  • Spammy data - is it worth simulating devices that misbehave? This includes sending data too frequently, sending incorrect values and timestamps and not sending data for some time periods.
  • Is the aggregated data stored inline with the base data or in a separate table? If stored inline do entries need a tag to indicate granularity like base, hourly, daily, etc?
  • Should the benchmark include queries and writes to correct bad data?
  • Can there be a way for the benchmark to run faster than wall clock time so that per-hour and per-day rollup is done more frequently than per real hour and per real day? This avoids the need to run the benchmark for too long.
  • How does a scale factor affect the size of the database, the write load and the query load?

Physical Schema - Write Optimized

The goal here is to make writes fast. That is done by using a unique index on (t,d). Inserts go to the right end of the index assuming they usually arrive in t order. While this makes inserts fast, it makes queries too slow. The (t,d) index over-fetches because only the range predicate on t can be used for the index access path:
  • For the Get and Compare queries it over-fetches by a factor of |m| * |d|. If there are 1M devices and 10 metrics per insert it fetches 10M times more index entries than needed.
  • For the Max/Min queries it over-fetches by a factor of |m|.
The large over-fetching for Get and Compare makes it infeasible to run them with the write optimized schema. The Max and Min queries could still be run -- it would be painful but not infeasible given that |m| is likely to be <= 100.

Implementation notes:
  • MongoDB - There are 3 choices for the unique index. The first choice is to use the subdocument {t, d} as the value for _id and make sure that all clients encode it in the same way. The second choice is to concatenate t and d into a string for the value of _id. This avoids the need to worry about client encoding as in the first choice, at the cost of storing t and d twice per doc. The third choice is to use ObjectId as the value of _id and then have a unique compound index on {t:1, d:1}. The cost for this is an extra index, but inserts are in index order for both. A sketch of the third choice follows this list.
  • SQL with document datatypes - Use one row per insert and document datatype support in MySQL or Postgres to store all of the (m,v) pairs in that row.
  • SQL without document datatypes - Use one row per metric per insert. The unique index is defined on (t,d,m) rather than (t,d). The columns in the row are t, d, m, v.
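Here is a sketch of the third MongoDB choice using pymongo; the database, collection and field names are mine:

from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client.ibench.metrics

# Choice 3: MongoDB assigns an ObjectId for _id and a separate compound
# index enforces uniqueness of (t, d). When t is mostly increasing,
# inserts land at the right end of both indexes.
coll.create_index([("t", ASCENDING), ("d", ASCENDING)], unique=True)

coll.insert_one({"t": 1585267200000, "d": 42,
                 "metrics": [{"m": 1, "v": 100}, {"m": 2, "v": 200}]})

A second insert_one with the same (t, d) raises DuplicateKeyError, which is one way to surface the intermittent duplicate inserts mentioned earlier.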

Physical Schema - Read Optimized

These indexes are important to make queries efficient.
  • Get and Compare need one of (d,m,t) or (m,d,t).
  • Max/Min need one of (m,t,v,d) or (m,v,t,d)
From the perspective of one Get query there isn’t a difference between the index choices because there are equality predicates on d and m. But from a cache perspective one might be better than the other as they cluster on the first indexed field -- by d or m. If some metrics are less likely to be queried then clustering by m can be a better choice.

The unique constraint is on (d,m,t) or (m,d,t). If v is also in that index then uniqueness cannot be enforced. If v is not in the index then it might not be covering. With Postgres I can use the INCLUDE clause to get v in the index. With InnoDB and MyRocks the table is clustered on the PK, so it is sufficient to make (d,m,t) or (m,d,t) the PK. With MongoDB if I want a covering index then I have to create two indexes -- unique on (m,d,t) and then non-unique but covering with (m,d,t,v) or I can skip the covering index to reduce the write cost but make queries slower. Finally, by unique for MongoDB I mean that this is the value for _id, and then we are back to the question of using a subdocument for _id.
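A sketch of that DDL; table and column names are hypothetical and the INCLUDE clause needs Postgres 11 or later:

# Postgres: unique index on (m,d,t) that also covers v via INCLUDE,
# so uniqueness can be enforced and the Get query is still covered.
PG_DDL = """
CREATE TABLE c (t bigint, d bigint, m int, v bigint);
CREATE UNIQUE INDEX c_mdt ON c (m, d, t) INCLUDE (v);
"""

# InnoDB/MyRocks: the table is clustered on the PK, so making (m,d,t)
# the PK is sufficient -- v comes along for free.
MYSQL_DDL = """
CREATE TABLE c (t bigint, d bigint, m int, v bigint,
                PRIMARY KEY (m, d, t)) ENGINE=InnoDB;
"""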

For Max and Min:
  • The (m,t,v,d) index uses index predicates for m and t and then does a sort. So this has an extra cost from the sort and can't benefit from stopping the index scan once the limit N has been satisfied. Some DBMS have optimizations to make top-N sort fast, but there is no way to avoid the cost of scanning the full time range in the index.
  • The (m,v,t,d) index uses index predicates only for m but avoids the sort and can stop the index scan once the limit N has been satisfied. The range predicate on t would be a filter predicate - evaluated for each entry read from the index.
There are cases where (m,t,v,d) is a better choice, but I think that (m,v,t,d) will be better more of the time. The index on (m,t,v,d) would be better when the selectivity from using t as an index predicate offsets the cost of the sort and the data is skewed so that limit N isn’t satisfied until most of the index has been scanned.

Write efficiency in the read optimized schema

Below I use streams of writes to mean the number of leaf pages that will be subject to read-modify-write from index maintenance during inserts at any point in time. If all inserts are to the right end of the index then there is 1 stream. If the index is on (client-ID, time), there are 10 clients active concurrently and each client inserts in time order then there are 10 streams of writes. When there is one leaf page in cache for each stream then IO efficiency is better. Otherwise each insert is more likely to force a page write back followed by a page read.

For the indexes used by Get and Compare -- (d,m,t,v) or (m,d,t,v) -- there will be |d|*|m| streams of writes. The cache demand from this is |d|*|m| pages. A small workload might use |d|=1M and |m|=10, and 10M 8kb pages need 80G of RAM. Both indexes have a similar amount of cache amplification.
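The arithmetic for that estimate, as a quick check:

devices = 1_000_000     # |d|
metrics = 10            # |m| pairs per insert
page_bytes = 8 * 1024   # 8kb leaf pages

streams = devices * metrics             # one write stream per (d, m) pair
cache_gb = streams * page_bytes / 1e9   # one cached leaf page per stream
print(f"{streams:,} streams, ~{cache_gb:.0f}GB")  # 10,000,000 streams, ~80G in round numbers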

For the indexes used by Max and Min, the (m,t,v,d) index has |m| streams of writes vs |m|*|v| for the (m,v,t,d) index. Thus (m,v,t,d) provides better query performance at the cost of more random IO and much more cache amplification.

For the non-unique secondary index used for Max and Min, MyRocks can use read-free index maintenance and the number of write streams isn't an issue for it. But it will be an issue for other DBMS.


Partitioning the tables and collections by time can help with write and read performance as well as making other tasks easier -- removing and rolling up old data. For example there could be a partition per hour for the last N hours. However if a DBMS doesn’t support partitions then there is a burden on the application developer to support cross-partition queries. As a user I expect to run queries across hourly boundaries.
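As a sketch, partition-per-hour with Postgres declarative partitioning might look like the following; the names and epoch-millisecond bounds are illustrative:

# Postgres 11+: the PK on a partitioned table must include the
# partition key, which (d, m, t) does.
PARTITIONED_DDL = """
CREATE TABLE c (t bigint, d bigint, m int, v bigint,
                PRIMARY KEY (d, m, t))
PARTITION BY RANGE (t);

-- partition for the hour starting 2020-03-27 00:00 UTC
CREATE TABLE c_2020032700 PARTITION OF c
    FOR VALUES FROM (1585267200000) TO (1585270800000);
"""

Dropping an old hour is then a DROP TABLE on that partition instead of a large delete.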

Thursday, March 12, 2020

Tuning space and write amplification to minimize cost

This uses math to show how to tune space and write amplification to minimize storage costs for an index structure that uses index+log. The result, minimal cost, is true assuming my model is true. But at this point I will only claim that the model is truthy. I look forward to more results in this area from the smart people at DASlab and elsewhere. I wonder if database economics is a good name for this topic.

I explained the index+log index structure here and here and my truthy model assumes that write and space amplification are functions of PctUsed - the percent of available storage that is used by the index structure. The model is:
  • Space amplification = 100 / PctUsed
  • Write amplification = 100 / (100 - PctUsed)
In what follows I use SpaceAmp and WriteAmp for space and write amplification. When PctUsed is X then the remaining space on the storage device (100-X) is free, not used by anything. The formulas mean that a deployment can trade between SpaceAmp and WriteAmp by adjusting the value of PctUsed. When PctUsed is 80 the values of SpaceAmp and WriteAmp are 1.25 and 5. When PctUsed is 20 the values of SpaceAmp and WriteAmp are 5 and 1.25.


Back in the day hardware was scarce and there was much math in systems papers. While there isn't as much math today the rush to ML means that more people are learning applied math (me included) which is a good thing. I regret not learning enough applied math while in college.

Here I derive a formula for the cost of storage in terms of the database size and the IO required to do compaction (GC, garbage collection) for an index structure that uses the index+log approach. The cost is a function of PctUsed.

P = PctUsed
S = index structure size in GB
G = Cost/GB
N = Write rate to index structure in MB/s
f = value between 1 and 2, 1= no storage reads by GC, 2= all GC writes do storage reads first
I = Cost/IOPs for 4KB operations

Cost = Cost_space + Cost_io

# Formulas for the cost of space and IOPs
# 256 converts MB/s into 4KB IOPs

Cost_space = S * G * 100 * P^-1
Cost_io = N * 256 * f * I * 100 * (100-P)^-1

# Determine where Cost' = 0 to find the minimal cost
Cost' = Cost_space' + Cost_io'

Cost_space' = -1 * S * G * 100 * P^-2
# the two -1 factors below come from the power rule and the chain rule on (100-P)
Cost_io' = N * 256 * f * I * 100 * (100-P)^-2 * -1 * -1
Cost' = (-1 * S * G * 100 * P^-2) + (N * 256 * f * I * 100 * (100-P)^-2)

# And Cost' = 0 when
S * G * 100 * P^-2 = N * 256 * f * I * 100 * (100-P)^-2

# Skipping a few steps this reduces to
P^2 * ((256NfI/SG) - 1) + 200P - 10,000 = 0

# This can be solved by the quadratic formula with a = ((256NfI/SG) - 1), b = 200, c = -10,000
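Here is a Python sketch of that solution via the quadratic formula. The S, G, f and I values below are placeholders I picked for illustration, not the prices from the spreadsheet:

import math

def min_cost_pct_used(S, G, N, f, I):
    # S = size in GB, G = cost/GB, N = write rate in MB/s,
    # f in [1, 2] for GC reads, I = cost/IOPs for 4KB operations.
    # Assumes a != 0, i.e. 256*N*f*I != S*G.
    a = (256 * N * f * I) / (S * G) - 1
    b, c = 200, -10_000
    disc = math.sqrt(b * b - 4 * a * c)
    # return the root that is a valid percentage
    for p in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)):
        if 0 < p < 100:
            return p

for n in (1, 10, 100):
    print(n, round(min_cost_pct_used(S=1024, G=0.125, N=n, f=2, I=0.065)))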


Now I solve the equation above to determine the value of PctUsed that minimizes cost with prices from EBS provisioned IOPs. A Google Sheets spreadsheet with the solution is here. For the spreadsheet:
  • The graph uses log-scale for the y-axis and the y-axis doesn't start at 0. This makes it easier to see the impact of changing PctUsed, but can also be misleading. 
  • The solutions from the quadratic equation are quad1 and quad2
  • Cost is computed for PctUsed in (5, 10, 15, ..., 85, 90, 95)
  • The minimal value for Cost (quad1) is likely to be between these values
I then solve for 3 cases: N=1, 10 and 100 where N is the write rate to the index structure in MB/s. The minimal cost occurs at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

For N=1, a low write rate, the minimal cost is at PctUsed=67

For N=10, a moderate write rate, the minimal cost is at PctUsed=39

For N=100, an extreme write rate, the minimal cost is at PctUsed=17

Review of Benchmarking RocksDB Key-Value Workloads at Facebook

At some level all database engines have a key-value API even if it isn't exposed to the user. One name for this in a SQL DBMS is the data layer. The data layer is RocksDB for MyRocks and MongoRocks, and WiredTiger for MongoDB.

FAST 20 has a paper on characterizing RocksDB key-value workloads at Facebook. One of the workloads is UDB and UDB uses MyRocks, a SQL DBMS.

Benchmarks are useful to me but there is always room for improvement. Two common problems are the lack of load variance and the lack of complexity in access patterns. By the lack of load variance I mean that the workload doesn't change over time which fails to capture the daily spikes that occur for web-scale workloads as different parts of the world wake and sleep. They also fail to include daily and intermittent operational tasks like backup, schema change, data migration and archiving old data.

This paper explains the complexity in access patterns for three RocksDB use cases and shows how to reproduce them in benchmarks. UDB was one of the workloads so this work includes an update of the analysis shared in the Linkbench paper from 2013.

The primary contributions of the paper are the analysis of production workloads and the discovery that hot key ranges, rather than hot keys, are required to model production workloads. I appreciate that some of the work behind this paper has already been shared via RocksDB and that more might be shared as enhancements to YCSB.

Things I liked about the paper:
  • When generating skew in a benchmark use the concept of hot key ranges rather than hot keys. Linkbench and YCSB use hot keys distributed across the key space.
  • Explains the UDB workload which was last explained in the Linkbench paper
  • Explains two other production workloads
  • Provides some detail on the diurnal workload spikes in UDB
  • Explains the distribution of key sizes. While it states that UDB keys are small, small means a typical key is 16 to 30 bytes. Of course, the covering secondary index used by UDB is an outlier at ~64 bytes.
  • Explains that values are usually small, <= 128 bytes
  • Documents temporal access patterns
  • Demonstrates that using hot key ranges rather than hot keys can more accurately model production workloads

Wednesday, March 11, 2020

Building MySQL 5.6 from source on Ubuntu 18.04

This explains the steps I used to build MySQL 5.6.35 from source on Ubuntu 18.04. I build with the perf schema disabled. If you do that, don't use Connector/J 5.1.48 (5.1.47 is OK) or you won't be able to connect, thanks to bug 98919. The new Connector/J dependency on perf schema variables can be a problem for a DBMS that implements the MySQL API -- see this bug report for MemSQL.
  1. Install an older version of gcc and g++: sudo apt install gcc-5 g++-5
  2. Unpack source. I used 5.6.35
  3. Remove the connection_control plugin because that does not compile when the perf schema is disabled at compile time -> rm -rf $SRC_ROOT/plugin/connection_control
  4. Run cmake. See below. Note that I disable the perf schema.
My CMake script:

# ${prefix} is the install target directory; set it before running this.
# The perf schema flag matches the note above about disabling it.
CC=gcc-5 CXX=g++-5 \
cmake .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DWITH_SSL="bundled" \
      -DWITH_ZLIB="bundled" \
      -DWITH_PERFSCHEMA_STORAGE_ENGINE=0 \
      -DCMAKE_INSTALL_PREFIX="${prefix}" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock"

Wednesday, March 4, 2020


We use RDBMS as another name for SQL DBMS but SQL isn't relational. That isn't news, see this web site and book. SQL allows for but doesn't require relational, and 1NF or 3NF are optional. JSON is in the SQL:2016 spec. What would Codd think?

Take Oracle as a SQL DBMS example. First there was support for collection data types, then there was XML and eventually JSON arrived. These let you violate 1NF. I won't argue whether these should be used. I only claim they can be used.

Have there been surveys to document how often the relational approach is used with a SQL DBMS? I assume it is better to think of a distribution of approaches (a value between 0 and 1 where 0 is SQL and 1 is relational) rather than a binary approach of relational vs SQL (not relational). I might call the SQL endpoint the pragmatic approach, but that introduces bias. While I have spent a long time working on SQL DBMS I am usually working under the hood and don't design applications.

Tuesday, February 18, 2020

New school vs old school DBMS

There are two approaches to DBMS deployment:
  1. Old school - pay whatever it takes to keep that one instance running
  2. New school - allow for failure and make failover reliable
At what point do you stop paying to increase availability and durability of one DBMS instance and instead spend that money and energy elsewhere? The old school approach was more popular in the days before cloud and web-scale. Of course early in the cloud and web-scale days we allowed for failure without making failover reliable. Those weren't fun times.

That you must design for failures doesn't mean you must tolerate lousy hardware. Just because failover should be fast, reliable and lossless doesn't mean you want it running too frequently. There is a difference between commodity and cheap hardware. On the other hand, asking for too much special hardware won't win you friends on the datacenter HW team. If the DBMS is the only thing asking for redundant power, redundant network, special cooling or HW RAID with battery-backed write cache then the DBMS is a problem for web-scale. I know this from experience.

One reason for the old school approach was the large cost of the DBMS licenses and the SMP server on which it ran. Users are motivated to not buy 3X the licenses when the cost is sufficiently large and instead invest in an environment that helps that DBMS keep running (more HA hardware). Any comparison between new school and old school should consider whether the goal is durability or availability because availability costs more than durability. You can try to provide one or both. In this post I am writing about something that needs both.

Storage is one place where you can spend a lot of money to keep that one instance running. That can be done via RAID-10 where the impact is buying twice more storage devices. Or it can be done via a HA storage solution that provides impressive levels of performance, availability and durability.

One thing I am curious about is whether SSD devices need RAID-10. All storage devices have some chance of failure so the question isn't whether RAID-10 is useful. The question is whether the usefulness outweighs the cost. I assume the answer will depend on the device. Samsung advertises fail in place (FIP) as a new feature. But chip fail protection and wear leveling have been here for a long time. For which devices is that sufficient so that RAID-10 isn't recommended? Do vendors make this clear in their docs?

New School

The new school approach must do two things: make durable commits fast, and make failover fast and lossless.

Fast durable commits are usually done via sync replication. I think of sync log shipping as a variant of sync replication where the end point is a log archive rather than a replica. But some sync replication solutions already support some of that via a witness.

MongoDB provides fast durable commits via async replication. I was impressed when I first learned of the implementation. The property needed for durable commits is to avoid making writes visible before they are durable. Sync replication is an implementation detail. With MongoDB and the majority read concern a snapshot on the primary is advanced to track the point-in-time at which commits are durable (applied on enough replicas).

Whether durable commits are fast depends on where you place the replicas. Speed of light matters. When replicas are far apart then there will be more commit latency. When replicas are not far apart there might be a larger HW bill. Witnesses (log only replicas) can help here.

Lossless failover assumes a solution for durable commit. Once durable commit has been provided then fast failover is a matter of detecting failures, electing a new primary, promoting the replica to be a primary and then directing traffic to the new primary. There are many details here and plenty of opportunities for mistakes. I know from having helped make some of them. Fortunately the rise of web-scale DBMS means that we get solutions that work.

Things that I ignore in this post:
  • Fast failure detection is a challenging problem, depending on how fast you want it to be.
  • Systems that allow multiple replicas to initiate writes might not expose failover to a client, but many of the problems described here are still solved under the covers by such systems.
  • Even for systems that don't have explicit failover there is still an impact on clients from failed in-progress transactions, although Comdb2 hides that from clients.