Monday, March 30, 2020

Compiling from source for Amazon Linux 2

Different is worse

I used to know struggle with  make & automake but I knew enough to get things done. Now I have fun with make, cmake, maven and gradle. I am sure this proliferation solves problems for people who create the build scripts that I use. But it isn't a good experience for the intermittent user -- different is worse because my skill level per tool is lower.

What is Amazon Linux 2? It has a landing page. I call it AL2 below. It claims to be like CentOS, RHEL, Fedora and it uses yum. I know how to use Google to get answers for those distros when trying to figure out how to build something from source.

But it isn't clear to me that packages that exist for CentOS also exist for AL2. One example of the challenge is that I ended up installing jemalloc from source because I couldn't find the package without advice from an expert.

I hope the output from cat /etc/os-release on AL2 is improved. Below are examples for AL2 and Ubuntu 18.04. Ubuntu specific that this is Ubuntu 18.04. From AL2 I learn it is like CentOS. But which CentOS?

NAME="Amazon Linux" 
VERSION="2" ID="amzn"
ID_LIKE="centos rhel fedora"
PRETTY_NAME="Amazon Linux 2"

Lets compare that with Ubuntu:

VERSION="18.04.4 LTS (Bionic Beaver)"
PRETTY_NAME="Ubuntu 18.04.4 LTS"

Dependencies for AL2

This explains how to satisfy the dependencies before running cmake. I struggled figuring out the names for packages that had to be installed. Some of this is because I don't have much experience with CentOS, some of this is on AL2. Eventually I found this page which might have helped.

Add libraries for the insert benchmark, some of this is needed for the build and all is needed when I run benchmarks so I will just install before I try to build:

sudo yum install -y python3
sudo python3 -m pip install pymongo
sudo yum install -y mysql-devel python3-devel
sudo pip3 install mysqlclient
sudo pip3 install psycopg2-binary
sudo yum install -y openssl11-libs
sudo yum install -y libzstd-1.3.3
sudo yum install -y ncurses-compat-libs
# The host arrives comes with /etc/my.cnf from MariaDB

sudo rm -f /etc/my.cnf /etc/mysql/my.cnf

Install jemalloc from source because I couldn't figure out how to install the package:

cd /media/ephemeral1
rm -rf jemalloc-5.2.1*
bunzip2 jemalloc-5.2.1.tar.bz2
tar xvf jemalloc-5.2.1.tar
cd jemalloc-5.2.1
./configure --prefix=/usr > 2>
make -j4 > o.m 2> e.m
sudo make install

Install things needed to compile MySQL8 and MyRocks from FB MySQL 5.6. I didn't confirm that all of these packages exist. But I was able to build after doing this:
sudo yum install -y cmake3
sudo yum install -y
sudo yum install -y numactl-devel
sudo yum install -y libedit-devel
sudo yum install -y cmake gcc-c++ bzip2-devel libaio-devel bison 
sudo yum install -y zlib-devel snappy-devel
sudo yum install -y gflags-devel readline-devel ncurses-devel openssl-devel
sudo yum install -y lz4-devel gdb git libzstd-devel

Building MySQL8

Download, compile and install MySQL 8. It is installed at /media/ephemeral1/my8018. The upstream docs are useful.

cd /media/ephemeral1
rm -rf mysql-8.0.18*
tar xzvf mysql-boost-8.0.18.tar.gz
cd mysql-8.0.18
mkdir build
cd build
bash /media/ephemeral1/cmk80 /media/ephemeral1/my8018 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral/cmk80.

cmake3 .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DWITH_SSL="system" \
      -DWITH_ZLIB="system" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DWITH_BOOST=$PWD/../boost \

Building MyRocks

Download, compile and install MyRocks from FB MySQL 5.6. It is installed at /media/ephemeral1/fbmy56. The upstream docs are useful.

First install boost 1.65 from source:

cd /media/ephemeral1
bunzip2 boost_1_65_1.tar.bz2
tar xvf boost_1_65_1.tar
cd boost_1_65_1
./ --prefix=/usr

./b2 > o.b2 2> e.b2
sudo ./b2 install

Then build MySQL 5.6

cd /media/ephemeral1
git clone fbmy56-src
cd fbmy56-src
# This is the old version that I used for a while
# git checkout 20aaaf8d

git submodule init
git submodule update
mkdir build
cd build
bash /media/ephemeral1/cmkfbmy56 /media/ephemeral1/fbmy56 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral1/cmkfbmy56

if [ -z $1 ]; then
echo Requires prefix as arg1
exit -1


# Look at /proc/cpuinfo to determine whether these (sse, avx) are supported


# -march=native gets avx and sse if they are supported
CXXF="-DNDEBUG -march=native"
CXXF+=" -faligned-new"

# extra flags to avoid warnings with gcc on ubuntu 18
CF="-Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

CXXF+=" -Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

cmake3 .. \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
-DWITH_LZ4=system \

Not done yet

There is a crash in Boost code that was added in the FB MySQL branch. Boost 1.53 is installed by yum install boost-devel. I will rebuild MyRocks with Boost 1.65 as that is what Ubuntu 18.04 uses, and MyRocks works great for me there.


stack_log: _ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7clas

Friday, March 27, 2020

kernel lockdown vs blktrace

I am trying to use blktrace to determine which files are the source of disk reads for a database that uses buffered IO. The server runs Ubuntu 18.04 on bare metal and the boot uses UEFI not legacy.

blktrace doesn't work, time to debug. Things that don't fix it include upgrading from 4.15 to 5.3 HWE kernel and disabling apparmor. Next up is disabling kernel lockdown via mokutil --disable-validation. Alas, blktrace still fails at startup.

After running mokutil and then rebooting there are still a few messages in dmesg output about lockdown so I wonder whether it was fully disabled.
Lockdown: Hibernation is restricted; see man kernel_lockdown.7
Lockdown: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
OK, lets read the man page. Great, it doesn't exist -- not for Ubuntu nor for other distros. There is a draft but I am starting to get the impression that lockdown wasn't ready for prime time. And Linus had a strong opinion about it in 2018.

Next up is a strong opinion from Brendan Gregg.
Many distros are enabling lockdown, breaking BPF. This is the worst OS change I've ever seen.
OK, maybe my problem is lockdown and mokutil wasn't sufficient. Time to try:
echo 1 > /proc/sys/kernel/sysrq; echo x > /proc/sysrq-trigger 
And now blktrace works. Well, until I reboot. I already have a script to run after reboot to reduce security so that PMP can run. That script just got larger:
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger 

Monday, March 16, 2020

Insert Benchmark v3

I expect to replace the insert benchmark later this year. The insert benchmark I have been using might be ibench v2 as ibench v1 came from Tokutek. So the replacement I write would be ibench v3.

The reasons for a replacement include:
  • Switch from Python to something more performant (Java or Go) as it is difficult to sustain high QPS rates (100k / second or more) with Python and a Java/Go client will consume less CPU. I assume the DBMS I care about all have good golang clients (MongoDB, MySQL, Postgres).
  • Make the benchmark less synthetic. Early ibench was insert-only for a table with a PK and 3 secondary indexes. Eventually I added support for short range queries. It has always used uniform distribution. The plan for ibench v3 is to model a monitoring workload -- many inserts, some updates, many queries. It will use a PK and 1 or 2 secondary indexes.
  • Continue to support multiple DBMS -- MongoDB, MySQL and Postgres.
  • As a bonus, don't ignore coordinated omission
I still need to confirm whether this is worth implementing or whether I have a case of NIH and should be using TSBS. I browsed TSBS and the indexes I describe here might match the indexes that TSBS would use. However, TSBS has more query diversity. My focus is on the storage engine and a smaller set of queries is sufficient. Regardless, I need to spend more time reading about TSBS.

Logical Schema

The logical schema describes the data in one insert operation. The physical schema describes how that is stored in the DBMS. I focus on the logical schema here.

The logical schema is: timestamp, deviceID, [metricID, metricValue]+
  • This is the data from one insert for a given device. It has data for 1+ metrics.
  • timestamp is a 64-bit integer that will be mostly increasing on insert. Clocks are not synchronized across clocks, some devices will have bad/stuck clocks and some devices will have clock drift.
  • deviceID is a 64-bit integer. The number of devices will be used to scale the workload. There can be millions of devices for large scale factors.
  • Each (metricID, metricValue) pair represents the value for a metricID from a given device at a given point in time. metricID is a 32-bit integer. The number of metrics will also be used to scale the workload. There will be fewer metrics than devices. metricValue is a 64-bit integer


Fields are abbreviated when describing the workload: t=timestamp, d=deviceID, m=metricID, v=metricValue. Operations are described in SQL, except for insert.

The workload is a combination of:
  • GetOne: select v, t from C where m=M and d=D and t between T1 and T2 order by t
    • Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
  • GetAll: select v, t from C where d=D and t between T1 and T2 order by t
    • Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
    • Like GetOne but missing the equality predicate on m
    • If m isn't known then an extra secondary index on (d,t) might be needed unless the DBMS supports index skip scan via an index on (d,m,t,...).
  • Max, Min: select v, t, d from C where m=M and t between T1 and T2 order by v asc/desc limit N
  • Compare: call GetOne for m in (...) and d=D and t between T1 and T2
  • Insert: insert into C values (t, d, [(m1, v1),(m2, v2),...])
    • SQL above is not standard and assumes a schema that might not be used.
    • (t,d,m) is unique. I am still deciding whether the benchmark client can generate intermittent duplicate inserts and how that should be handled.
  • Rollup: Uses Compare to get all data for d in (...) in time range, then replaces that data with values aggregated over coarser granularity. Assume that base data is rolled up into 5 minute, then 1 hour and then 1 day granularity.
  • Backfill: inserts missing data that might be hours or days old
The value of N in GetOne, GetAll, Max and Min above will vary. For Max and Min it is likely to be <= 100 and frequently <= 10. For GetOne/GetAll in some cases the user wants all data in a time range, but in other cases they will set N to be <= 10,000 and not get all data in the time range.

Open Issues
  • Spammy data - is it worth simulating devices that misbehave? This includes sending data too frequently, sending incorrect values and timestamps and not sending data for some time periods.
  • Is the aggregated data stored inline with the base data or in a separate table? If stored inline do entries need a tag to indicate granularity like base, hourly, daily, etc?
  • Should the benchmark include queries and writes to correct bad data?
  • Can there be a way for the benchmark to run faster than wall clock time so that per-hour and per-day rollup is done more frequently than per real hour and per real day? This avoids the need to run the benchmark for too long.
  • How does a scale factor affect the size of the database, the write load and the query load?

Physical Schema - Write Optimized
The goal here is to make writes fast. That is done by using a unique index for (t,d). Inserts are to the right end of the index assuming they usually arrive in t order. While this makes inserts fast it makes queries too slow. The (t,d) index over-fetches because only the range predicate on t can be used for the index access path:
  • For the GetOne and Compare queries it over-fetches by a factor of |m| * |d|. If there are 1M devices and 10 metrics per insert it fetches 10M times more index entries than needed.
  • For the Max/Min queries it over-fetches by a factor of |m|.
The large over-fetching for GetOne and Compare makes it infeasible to run them with the write optimized schema. The Max and Min queries could still be run -- it would be painful but not infeasible given that |m| is likely to be <= 100.

Implementation notes:
  • MongoDB - There are 3 choices for the unique index. The first choice is to use the subdocument {t, d} as the value for _id and make sure that all clients encode it in the same way. The second choice is to concatenate t and d into a string for the value of _id. This avoids the need to worry about clients encoding as in the first choice at the cost of storing t and d twice per doc. The third choice is to use ObjectId as the value of _id and then have a unique compound index on {t:1, d:1}. The cost for this is an extra index but inserts are in index order for both.
  • SQL with document datatypes - Use one row per insert and document datatype support in MySQL or Postgres to store all of the (m,v) pairs in that row.
  • SQL without document datatypes - Use one row per metric per insert. The unique index is defined on (t,d,m) rather than (t,d). The columns in the row are t, d, m, v.

Physical Schema - Read Optimized

These indexes are important to make queries efficient.
  • GetOne and Compare need one of (d,m,t) or (m,d,t).
  • Max/Min need one of (m,t,v,d) or (m,v,t,d)
From the perspective of one GetOne query there isn’t a difference between the index choices because there are equality predicates on d and m. But from a cache perspective one might be better than the other as they cluster on the first indexed field -- by d or m. If some metrics are less likely to be queried then clustering by m can be a better choice.

The unique constraint is on (d,m,t) or (m,d,t). If v is also in that index then uniqueness cannot be enforced. If v is not in the index then it might not be covering. With Postgres I can use the INCLUDE clause to get v in the index. With InnoDB and MyRocks the table is clustered on the PK, so it is sufficient to make (d,m,t) or (m,d,t) the PK. With MongoDB if I want a covering index then I have to create two indexes -- unique on (m,d,t) and then non-unique but covering with (m,d,t,v) or I can skip the covering index to reduce the write cost but make queries slower. Finally, by unique for MongoDB I mean that this is the value for _id, and then we are back to the question of using a subdocument for _id.

For Max and Min:
  • The (m,t,v,d) index uses index predicates for m and t and then does a sort. So this has an extra cost from the sort and can’t benefit from stopping the index scan once the limit N has been satisfied. Some DBMS have optimizations to make top-N sort fast, but there is no way to avoid the cost of not stopping the index scan early.
  • The (m,v,t,d) index uses index predicates only for m but avoids the sort and can stop the index scan once the limit N has been satisfied. The range predicate on t would be a filter predicate - evaluated for each entry read from the index.
There are cases where (m,t,v,d) is a better choice, but I think that (m,v,t,d) will be better more of the time. The index on (m,t,v,d) would be better when the selectivity from using t as an index predicate offsets the cost of the sort and the data is skewed so that limit N isn’t satisfied until most of the index has been scanned.

Write efficiency in the read optimized schema

Below I use streams of writes to mean the number of leaf pages that will be subject to read-modify-write from index maintenance during inserts at any point in time. If all inserts are to the right end of the index then there is 1 stream. If the index is on (client-ID, time), there are 10 clients active concurrently and each client inserts in time order then there are 10 streams of writes. When there is one leaf page in cache for each stream then IO efficiency is better. Otherwise each insert is more likely to force a page write back followed by a page read.

For the indexes used by GetOne and Compare -- (d,m,t,v) or (m,d,t,v) -- there will be |d|*|m| streams of writes. The cache demand from this is |d|*|m| pages. A small workload might use |d|=1M and |m|=10 and 10M 8kb pages needs 80G of RAM. Both indexes have a similar amount of cache amplification.

For the indexes used by Max and Min, the (m,t,v,d) index has |m| streams of writes vs |m|*|v| for the (m,v,t,d) index. Thus the (m,v,t,d) provides better query performance at the cost of more random IO and more cache amplification. One index has much less cache amplification.

For the non-unique secondary index used for Max and Min, MyRocks can use read-free index maintenance and the number of write streams isn't an issue for it. But it will be an issue for other DBMS.


Partitioning the tables and collections by time can help with write and read performance as well as making other tasks easier -- removing and rolling up old data. For example there could be a partition per hour for the last N hours. However if a DBMS doesn’t support partitions then there is a burden on the application developer to support cross-partition queries. As a user I expect to run queries across hourly boundaries.

Thursday, March 12, 2020

Tuning space and write amplification to minimize cost

This uses math to show how to tune space and write amplification to minimize storage costs for an index structure that uses index+log. The result, minimal cost, is true assuming my model is true. But at this point I will only claim that the model is truthy. I look forward to more results in this area from the smart people at DASlab and elsewhere. I wonder if database economics is a good name for this topic.

I explained the index+log index structure here and here and my truthy model assumes that write and space amplification are functions of PctUsed - the percent of available storage that is used by the index structure. The model is:
  • Space amplification = 100 / PctUsed
  • Write amplification = 100 / (100 - PctUsed)
In what follows I use SpaceAmp and WriteAmp for space and write amplification. When PctUsed is X then the remaining space on the storage device (100-X) is free, not used by anything. The formulas mean that a deployment can trade between SpaceAmp and WriteAmp by adjusting the value of PctUsed. When PctUsed is 80 the values of SpaceAmp and WriteAmp are 1.25 and 5. When PctUsed is 20 the values of SpaceAmp and WriteAmp are 5 and 1.25.


Back in the day hardware was scarce and there was much math in systems papers. While there isn't as much math today the rush to ML means that more people are learning applied math (me included) which is a good thing. I regret not learning enough applied math while in college.

Here I derive a formula for the cost of storage in terms of the database size and the IO required to do compaction (GC, garbage collection) for an index structure that uses the index+log approach. The cost is a function of PctUsed.

P = PctUsed
S = index structure size in GB
G = Cost/GB
N = Write rate to index structure in MB/s
f = value between 1 and 2, 1= no storage reads by GC, 2= all GC writes do storage reads first
I = Cost/IOPs for 4KB operations

Cost = Costspace + Costio

# Formulas for the cost of space and IOPs
# 256 converts MB/s into 4KB IOPs

Costspace = S * G * 100 * P-1
Costio = N * 256 * f * I * 100 * (100-P)-1

# Determine where Cost' = 0 to find the minimal cost
Cost' = Costspace+ Costio'

Costspace= -1 * S * G * 100 * P-2
Costio= N * 256 * f * I * 100 * (100-P)-2 * -1 * -1
Cost' = (-1 * S * G * 100 * P-2) + (N * 256 * f * I * 100 * (100-P)-2 )
# And Cost' = 0 when
S * G * 100 * P-2 = N * 256 * f * I * 100 * (100-P)-2 
# Skipping a few steps this reduces to
P* ((NfI/SG) - 1) + 200P - 10,000 = 0

# This can be solved by the quadratic equation with a=((NfI/SG) - 1), b=200, c=-10,000


Now I solve the equation above to determine the value of PctUsed that minimizes cost with prices from EBS provisioned IOPs. A Google Sheets spreadsheet with the solution is here. For the spreadsheet:
  • The graph uses log-scale for the y-axis and the y-axis doesn't start at 0. This makes it easier to see the impact of changing PctUsed, but can also be misleading. 
  • The solutions from the quadratic equation are quad1 and quad2
  • Cost is computed for PctUsed in (5, 10, 15, ..., 85, 90, 95)
  • The minimal value for Cost (quad1) is likely to be between these values
I then solve for 3 cases: N=1, 10 and 100 where N is the write rate to the index structure in MB/s. The minimal cost occurs at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

For N=1, a low write rate, the minimal cost is at PctUsed=67

For N=10, a moderate write rate, the minimal cost is at PctUsed=39

For N=100, an extreme write rate, the minimal cost is at PctUsed=17

Review of Benchmarking RocksDB Key-Value Workloads at Facebook

At some level all database engines have a key-value API even if they aren't exposed to the user. One name for this in a SQL DBMS is the data layer. The data layer is RocksDB for MyRocks & MongoRocks and then WiredTiger for MongoDB.

FAST 20 has a paper on characterizing RocksDB key-value workloads at Facebook. One of the workloads is UDB and UDB uses MyRocks, a SQL DBMS.

Benchmarks are useful to me but there is always room for improvement. Two common problems are the lack of load variance and the lack of complexity in access patterns. By the lack of load variance I mean that the workload doesn't change over time which fails to capture the daily spikes that occur for web-scale workloads as different parts of the world wake and sleep. They also fail to include daily and intermittent operational tasks like backup, schema change, data migration and archiving old data.

This paper explains the complexity in access patterns for three RocksDB use cases and shows how to reproduce them in benchmarks. UDB was one of the workloads so this work includes an update of the analysis shared in the Linkbench paper from 2013.

The primary contributions of the paper are the analysis of production workloads and the discovery that hot key ranges, rather than hot keys, is required to model production workloads. I appreciate that some of the work behind this paper has already been shared via RocksDB and that more might be shared as enhancements to YCSB.

Things I liked about the paper:
  • When generating skew in a benchmark use the concept of hot key ranges rather than hot keys. Linkbench and YCSB use hot keys distributed across the key space.
  • Explains the UDB workload which was last explained in the Linkbench paper
  • Explains two other production workloads
  • Provides some detail on the diurnal workload spikes in UDB
  • Explains the distribution of key sizes. While it states that UDB keys are small, by small it means a typical value is 16 to 30 bytes. Of course, the covering secondary index used by UDB is an outlier at ~64 bytes.
  • Explains that values are usually small, <= 128 bytes
  • Documents temporal access patterns
  • Demonstrates that using hot key ranges rather than hot keys can more accurately model production workloads

Wednesday, March 11, 2020

Building MySQL 5.6 from source on Ubuntu 18.04

This explains the steps I used to build MySQL 5.6.35 from source on Ubuntu 18.04. I build with the perf schema disabled and if you do that then don't use Connector/J 5.1.48 (5.1.47 is OK) or you won't be able to connect thanks to bug 98919. The new Connector/J dependency on perf schema variables can be a problem for a DBMS that implements the MySQL API -- see this bug report for MemSQL.
  1. Install an older version of gcc and g++: sudo apt install gcc-5 g++-5
  2. Unpack source. I used 5.6.35
  3. Remove the connection_control plugin because that does not compile when the perf schema is disabled at compile time -> rm -rf $SRC_ROOT/plugin/connection_control
  4. Run cmake. See below. Note that I disable the perf schema.
My Cmake script:

CC=gcc-5 CXX=g++-5 \
cmake .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DWITH_SSL="bundled" \
      -DWITH_ZLIB="bundled" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \

Wednesday, March 4, 2020


We use RDBMS as another name for SQL DBMS but SQL isn't relational. That isn't news, see this web site and book. SQL allows for but doesn't require relational and 1NF or 3NF are optional. JSON is in the SQL:2106 spec. What would Codd think?

Using Oracle as a SQL DBMS example. First there was support for collection data types, then there was XML and eventually JSON arrived. These let you violate 1NF. I won't argue whether these should be used. I only claim they can be used.

Have there been surveys to document how often the relational approach is used with a SQL DBMS? I assume it is better to think of a distribution of approaches (a value between 0 and 1 where 0 is SQL and 1 is relational) rather than a binary approach of relational vs SQL (not relational). I might call the SQL endpoint the pragmatic approach, but that introduces bias. While I have spent a long time working on SQL DBMS I am usually working under the hood and don't design applications.