Monday, May 18, 2020

Avoiding reads in write-optimized index structures

This is a summary of features that avoid storage read IO in OSS write-optimized index structures. My scope is limited to MySQL and Tarantool -- I know there are similar features in other DBMS and am happy for comments on that:
  • RocksDB merge operator logically does read-modify-write but physically does a Put into the memtable and the read is deferred until user queries or compaction. I hope that MyRocks eventually takes advantage of the merge operator.
  • Read free replication in TokuDB and MyRocks.
  • Fast Updates in TokuDB do something similar to the merge operator for update and insert
  • Non-unique secondary index maintenance is read free in MyRocks
  • Deferred secondary index update in Tarantool is a new approach to avoiding reads for write-heavy workloads. It is worth reading. I wish it had been published by VLDB.
  • MyRocks also avoids some storage reads when validating unique constraints (for PK and unique secondary indexes) on inserts and some updates because the Get() that it does can use a bloom filter. This benefit is larger when there is a bloom filter on the max level of the LSM tree, but that has other costs is is frequently not done.
  • MyRocks uses the SingleDelete feature in RocksDB to allow tombstones to be removed earlier which decreases the CPU overhead for reads after a batch of writes.
  • A b-tree can also reduce index maintenance related storage read IO. See the InnoDB change buffer.
Updates

An incomplete list of relevant research:
  • DiffIndex - this is more about distributed systems, but still interesting
  • Techniques for reducing index maintenance overhead in an LSM by Luo & Carey (VLDB 2019)
  • Deferred Lightweight Indexing (DELI)

Thursday, April 9, 2020

Reviving mstat

A long time ago I wrote mstat to collect iostat, vmstat and MySQL (show global status) performance counters. It samples everything at the same interval (every N seconds), computes rates for all numeric values and has some support for expressions so I can define a new counter as a function of other counters.

Eventually I stopped using it and then there were problems. Recently I fixed many of them. The tool now works with Python3 and supports old and new iostat output format. This week I started to add support for Postgres and MongoDB. It gets data from pg_stat_bgwriter for Postgres and from db.serverStatus() for MongoDB. So the tool is useful to me again but it won't be useful to others until I add docs.

The tool prints one line in CSV format per time interval. The line is likely to be long because there can be hundreds of counters in the line. The name and offset for each counter is printed at startup. I extract data using bash and awk as part of my benchmark workflow.

Because of this dependency to list the offset per counter at startup the tool is not able to handle dynamic data yet. An example of dynamic data are per-DBMS counters such as the rows in pg_stat_database. That view has one row per database and the database names (datname) are not fixed. Similar problems exist for the per-table and per-index monitoring tables and views provided by MySQL, other stats views provided by Postgres and other monitoring data provided by MongoDB. So I need more time to figure out how to support them.

I have more work to do on mstat, but I am happy that I can use it again:
  • I need to figure out whether there are global counters in Postgres for things like queries and rows read
  • For MongoDB mstat has a hack to replace space, dot and parentheses with _ in key names
  • More work remains to make sure that counters are ignored unless their type is numeric or a date

Monday, April 6, 2020

Creating performance reports

I am trying to automate some of the tasks that I must do to produce performance reports. By trying I mean that I am just getting started on this and will be making decisions I might regret later. This is also a request for advice, feedback and corrections.

Some requirements, not all of which will be satisfied:
  • interactive - I prefer that the format can be hosted in a way that allows for discussions. Google Docs does a great job at this by supporting access control and threaded discussions on comments.
  • graphics - support charts and graphs. Back in the day I was a happy user of Google Image Charts that let me define a chart in a URL. Alas, the service has been turned off and the new thing from Google requires Javascript.
  • self-contained - I want to share the report as one thing. With Google Docs that thing can be the URL for the doc, even when there are additional things, like spreadsheets, linked to it. With HTML the thing is the HTML document.
  • script-able - I want to create as much as possible of the document via scripting. I frequently repeat tests and don't want to waste time reformatting tables.
  • private - some reports are not yet public and I prefer a solution that isn't always public. However I also want a solution that works for company-private reports and also for reports that are shared with the public. 
  • other - supports wide lines (via a slider, wrapping wide lines is lousy for a reader), what else?

Choices:
  • Google docs
    • interactive - yes
    • graphics - yes
    • self-contained -yes
    • scriptable - no
    • privacy - yes
    • other - no support for wide lines, they are always wrapped
  • Github Pages
    • interactive - no, always public
    • graphics - maybe
    • self-contained - yes
    • scriptable - yes
    • privacy - no, always public
    • other - supports wide lines
  • Blogger
    • interactive - somewhat via comments but not as nice as Google docs, but always public
    • graphics - yes by embedding images
    • self-contained - yes
    • scriptable - maybe if I can paste an HTML doc as the content
    • privacy - no, always public
    • other - weak support for wide lines
  • Plain HTML
    • interactive - maybe*
    • graphics - maybe*
    • self-contained - maybe*
    • scriptable - yes
    • privacy - yes*
    • other - supports wide lines
From the above I am not aware of anything close to a perfect solution. I am likely to choose plain HTML. Privacy can be managed by sharing via company email and or hosting the HTML files on company-private servers.

Charts and graphs could be provided by a web-service that lets a chart be defined in a URL. Quickchart is an example of that. Of course, use of such a service means the data is no longer private. But there is a difference between posting a web-page for the world to read and sharing that data with the company that provides the URL chart service.

If I use plain HTML then I must find a way to share such pages with the public. That last sentence is amusing given that the web runs on HTML. But I have never done this. I have been a long-time, and mostly happy, user of Blogger. I have begun to use Github Pages and am a big fan of markup.

As a bonus, I prefer a solution that doesn't depend on the whims of a vendor. For example, if I commit to service X and the vendor for service X turns off that service, I don't want to spend weeks reformatting reports while I migrate to a new service.

It would be great if I could inline some of the charts in the HTML document. I need to figure out whether there are good tools for simple ascii bar charts and graphs that can be pasted into an HTML document.

I am a fan of gnuplot and need to learn more about how I can share gnuplot output in PNG format as part of an HTML document without posting all of that on the web.

Monday, March 30, 2020

Compiling from source for Amazon Linux 2

Different is worse

I used to know struggle with  make & automake but I knew enough to get things done. Now I have fun with make, cmake, maven and gradle. I am sure this proliferation solves problems for people who create the build scripts that I use. But it isn't a good experience for the intermittent user -- different is worse because my skill level per tool is lower.

What is Amazon Linux 2? It has a landing page. I call it AL2 below. It claims to be like CentOS, RHEL, Fedora and it uses yum. I know how to use Google to get answers for those distros when trying to figure out how to build something from source.

But it isn't clear to me that packages that exist for CentOS also exist for AL2. One example of the challenge is that I ended up installing jemalloc from source because I couldn't find the package without advice from an expert.

I hope the output from cat /etc/os-release on AL2 is improved. Below are examples for AL2 and Ubuntu 18.04. Ubuntu specific that this is Ubuntu 18.04. From AL2 I learn it is like CentOS. But which CentOS?

NAME="Amazon Linux" 
VERSION="2" ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/" 


Lets compare that with Ubuntu:

NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Dependencies for AL2

This explains how to satisfy the dependencies before running cmake. I struggled figuring out the names for packages that had to be installed. Some of this is because I don't have much experience with CentOS, some of this is on AL2. Eventually I found this page which might have helped.

Add libraries for the insert benchmark, some of this is needed for the build and all is needed when I run benchmarks so I will just install before I try to build:

sudo yum install -y python3
sudo python3 -m pip install pymongo
sudo yum install -y mysql-devel python3-devel
sudo pip3 install mysqlclient
sudo pip3 install psycopg2-binary
sudo yum install -y openssl11-libs
sudo yum install -y libzstd-1.3.3
sudo yum install -y ncurses-compat-libs
# The host arrives comes with /etc/my.cnf from MariaDB

sudo rm -f /etc/my.cnf /etc/mysql/my.cnf

Install jemalloc from source because I couldn't figure out how to install the package:

cd /media/ephemeral1
rm -rf jemalloc-5.2.1*
wget https://github.com/jemalloc/jemalloc/releases/download/5.2.1/jemalloc-5.2.1.tar.bz2
bunzip2 jemalloc-5.2.1.tar.bz2
tar xvf jemalloc-5.2.1.tar
cd jemalloc-5.2.1
./configure --prefix=/usr > o.cf 2> e.cf
make -j4 > o.m 2> e.m
sudo make install

Install things needed to compile MySQL8 and MyRocks from FB MySQL 5.6. I didn't confirm that all of these packages exist. But I was able to build after doing this:
sudo yum install -y cmake3
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y numactl-devel
sudo yum install -y libedit-devel
sudo yum install -y cmake gcc-c++ bzip2-devel libaio-devel bison 
sudo yum install -y zlib-devel snappy-devel
sudo yum install -y gflags-devel readline-devel ncurses-devel openssl-devel
sudo yum install -y lz4-devel gdb git libzstd-devel

Building MySQL8

Download, compile and install MySQL 8. It is installed at /media/ephemeral1/my8018. The upstream docs are useful.

cd /media/ephemeral1
rm -rf mysql-8.0.18*
wget https://downloads.mysql.com/archives/get/p/23/file/mysql-boost-8.0.18.tar.gz
tar xzvf mysql-boost-8.0.18.tar.gz
cd mysql-8.0.18
mkdir build
cd build
bash /media/ephemeral1/cmk80 /media/ephemeral1/my8018 > o.cm 2> e.cm
make -j8 V=1 VERBOSE=1 > o.mk 2> e.mk
make install

The contents of /media/ephemeral/cmk80.

prefix=$1
cmake3 .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_INSTALL_PREFIX:PATH=$prefix \
      -DWITH_SSL="system" \
      -DWITH_ZLIB="system" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DENABLED_LOCAL_INFILE=1 \
      -DMYSQL_MAINTAINER_MODE=0 \
      -DWITH_BOOST=$PWD/../boost \
      -DWITH_NUMA=ON 

Building MyRocks

Download, compile and install MyRocks from FB MySQL 5.6. It is installed at /media/ephemeral1/fbmy56. The upstream docs are useful.

First install boost 1.65 from source:

cd /media/ephemeral1
wget https://dl.bintray.com/boostorg/release/1.65.1/source/boost_1_65_1.tar.bz2
bunzip2 boost_1_65_1.tar.bz2
tar xvf boost_1_65_1.tar
cd boost_1_65_1
./bootstrap.sh --prefix=/usr

./b2 > o.b2 2> e.b2
sudo ./b2 install

Then build MySQL 5.6

cd /media/ephemeral1
git clone https://github.com/facebook/mysql-5.6.git fbmy56-src
cd fbmy56-src
git checkout 20aaaf8d
git submodule init
git submodule update
mkdir build
cd build
bash /media/ephemeral1/cmkfbmy56 /media/ephemeral1/fbmy56 > o.cm 2> e.cm
make -j8 V=1 VERBOSE=1 > o.mk 2> e.mk
make install

The contents of /media/ephemeral1/cmkfbmy56

if [ -z $1 ]; then
echo Requires prefix as arg1
exit -1
fi

RXF=" -DROCKSDB_DEFAULT_TO_ADAPTIVE_MUTEX -DROCKSDB_SUPPORT_THREAD_LOCAL -DROCKSDB_SCHED_GETCPU_PRESENT"
RXF+=" -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_FALLOCATE_PRESENT"
RXF+=" -DHAVE_ALIGNED_NEW"
RXF+=" -DROCKSDB_SUPPORT_THREAD_LOCAL"
RXF+=" -DHAVE_PCLMUL -DHAVE_SSE42 -DHAVE_AVX2 -DHAVE_UINT128_EXTENSION"
RXF+=" -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX"

CXXF="-DNDEBUG -march=native -msse -msse4.2 -mpclmul"
CXXF+=" -faligned-new"
CXXF+=" $RXF"

# extra flags to avoid warnings with gcc on ubuntu 18
CF="-Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

CXXF+=" -Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

cmake3 .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DENABLE_DTRACE=0 \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
-DWITH_LZ4=system \
-DWITH_ZSTD=system

Not done yet

There is a crash in Boost code that was added in the FB MySQL branch. Boost 1.53 is installed by yum install boost-devel. I will rebuild MyRocks with Boost 1.65 as that is what Ubuntu 18.04 uses, and MyRocks works great for me there.

./bin/mysqld(my_print_stacktrace+0x2e)[0xac581e]
./bin/mysqld(handle_fatal_signal+0x333)[0x7e8293]
/lib64/libpthread.so.0(+0x117e0)[0x7f22703f37e0]
./bin/mysqld(_ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7cl
assic7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcS8_EEENSG_16scanner_policiesINSG_28skip_parser_iteration_policyINSG_11alternativeINSQ_INSG_12space_parserENSG_13confix_pa
rserINSG_6strlitIPKcEENSG_11kleene_starINSG_14anychar_parserEEENSQ_INSG_10eol_parserENSG_10end_parserEEENSG_21unary_parser_categoryENSG_10non_nestedENSG_9is_lexemeEEEEENSS_ISW_SZ_
SW_S13_S14_S15_EEEENSG_16iteration_policyEEENSG_12match_policyENSG_13action_policyEEEEEED1Ev+0xc)[0x840eec]
./bin/mysqld(_ZN5boost6spirit7classic4impl14grammar_helperINS1_7grammarINS_13property_tree11json_parser12json_grammarINS5_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsI
cESaIcEEESE_St4lessISE_EEEEENS1_14parser_contextINS1_5nil_tEEEEESI_NS1_7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcSD_EEENS1_16scanner_policiesINS1_28skip_parser_iteratio
n_policyINS1_11alternativeINSW_INS1_12space_parserENS1_13confix_parserINS1_6strlitIPKcEENS1_11kleene_starINS1_14anychar_parserEEENSW_INS1_10eol_parserENS1_10end_parserEEENS1_21una
ry_parser_categoryENS1_10non_nestedENS1_9is_lexemeEEEEENSY_IS12_S15_S12_S19_S1A_S1B_EEEENS1_16iteration_policyEEENS1_12match_policyENS1_13action_policyEEEEEE8undefineEPSM_+0x43)[0
x840fe3]
./bin/mysqld(_ZN5boost6spirit7classic7grammarINS_13property_tree11json_parser12json_grammarINS3_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_St4lessISC_EEE
EENS1_14parser_contextINS1_5nil_tEEEED1Ev+0x34)[0x840cb4]
./bin/mysqld(_ZN5boost13property_tree11json_parser18read_json_internalINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEEEvRSt13basic_istream
INT_8key_type10value_typeES6_ISG_EERSE_RKS9_+0xc51)[0x844511]
./bin/mysqld[0x839e32]
./bin/mysqld(_ZN3THD12set_shard_idEv+0x1c)[0x83a26c]
./bin/mysqld(_ZN3THD15set_db_metadataEv+0x1e3)[0x852e73]
./bin/mysqld(_Z15mysql_change_dbP3THDPK19st_mysql_lex_stringb+0x5da)[0x84f6ca]
./bin/mysqld(_Z16acl_authenticateP3THDj+0x13a5)[0x80e685]
./bin/mysqld[0x847d26]
./bin/mysqld(_Z16login_connectionP3THD+0x4d)[0x849c2d]
./bin/mysqld(_Z24do_handle_one_connectionP3THD+0x199)[0x84aa29]
./bin/mysqld(handle_one_connection+0x9)[0x84adc9]
/lib64/libpthread.so.0(+0x740b)[0x7f22703e940b]
/lib64/libc.so.6(clone+0x3f)[0x7f226e2a7e7f]

stack_log: _ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7clas
sic7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcS8_EEENSG_16scanner_policiesINS


Friday, March 27, 2020

kernel lockdown vs blktrace

I am trying to use blktrace to determine which files are the source of disk reads for a database that uses buffered IO. The server runs Ubuntu 18.04 on bare metal and the boot uses UEFI not legacy.

blktrace doesn't work, time to debug. Things that don't fix it include upgrading from 4.15 to 5.3 HWE kernel and disabling apparmor. Next up is disabling kernel lockdown via mokutil --disable-validation. Alas, blktrace still fails at startup.

After running mokutil and then rebooting there are still a few messages in dmesg output about lockdown so I wonder whether it was fully disabled.
Lockdown: Hibernation is restricted; see man kernel_lockdown.7
Lockdown: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
OK, lets read the man page. Great, it doesn't exist -- not for Ubuntu nor for other distros. There is a draft but I am starting to get the impression that lockdown wasn't ready for prime time. And Linus had a strong opinion about it in 2018.

Next up is a strong opinion from Brendan Gregg.
Many distros are enabling lockdown, breaking BPF. This is the worst OS change I've ever seen.
OK, maybe my problem is lockdown and mokutil wasn't sufficient. Time to try:
echo 1 > /proc/sys/kernel/sysrq; echo x > /proc/sysrq-trigger 
And now blktrace works. Well, until I reboot. I already have a script to run after reboot to reduce security so that PMP can run. That script just got larger:
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger 


Monday, March 16, 2020

Insert Benchmark v3

I expect to replace the insert benchmark later this year. The insert benchmark I have been using might be ibench v2 as ibench v1 came from Tokutek. So the replacement I write would be ibench v3.

The reasons for a replacement include:
  • Switch from Python to something more performant (Java or Go) as it is difficult to sustain high QPS rates (100k / second or more) with Python and a Java/Go client will consume less CPU. I assume the DBMS I care about all have good golang clients (MongoDB, MySQL, Postgres).
  • Make the benchmark less synthetic. Early ibench was insert-only for a table with a PK and 3 secondary indexes. Eventually I added support for short range queries. It has always used uniform distribution. The plan for ibench v3 is to model a monitoring workload -- many inserts, some updates, many queries. It will use a PK and 1 or 2 secondary indexes.
  • Continue to support multiple DBMS -- MongoDB, MySQL and Postgres.
  • As a bonus, don't ignore coordinated omission
I still need to confirm whether this is worth implementing or whether I have a case of NIH and should be using TSBS. I browsed TSBS and the indexes I describe here might match the indexes that TSBS would use. However, TSBS has more query diversity. My focus is on the storage engine and a smaller set of queries is sufficient. Regardless, I need to spend more time reading about TSBS.

Logical Schema

The logical schema describes the data in one insert operation. The physical schema describes how that is stored in the DBMS. I focus on the logical schema here.

The logical schema is: timestamp, deviceID, [metricID, metricValue]+
  • This is the data from one insert for a given device. It has data for 1+ metrics.
  • timestamp is a 64-bit integer that will be mostly increasing on insert. Clocks are not synchronized across clocks, some devices will have bad/stuck clocks and some devices will have clock drift.
  • deviceID is a 64-bit integer. The number of devices will be used to scale the workload. There can be millions of devices for large scale factors.
  • Each (metricID, metricValue) pair represents the value for a metricID from a given device at a given point in time. metricID is a 32-bit integer. The number of metrics will also be used to scale the workload. There will be fewer metrics than devices. metricValue is a 64-bit integer

Workload

Fields are abbreviated when describing the workload: t=timestamp, d=deviceID, m=metricID, v=metricValue. Operations are described in SQL, except for insert.

The workload is a combination of:
  • Get: select v, t from C where m=M and d=D and t between T1 and T2 order by t
    • Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
  • Max, Min: select v, t, d from C where m=M and t between T1 and T2 order by v asc/desc limit N
  • Compare: call Get for m in (...) and d=D and t between T1 and T2
  • Insert: insert into C values (t, d, [(m1, v1),(m2, v2),...])
    • SQL above is not standard and assumes a schema that might not be used.
    • (t,d,m) is unique. I am still deciding whether the benchmark client can generate intermittent duplicate inserts and how that should be handled.
  • Rollup: Uses Compare to get all data for d in (...) in time range, then replaces that data with values aggregated over coarser granularity. Assume that base data is rolled up into 5 minute, then 1 hour and then 1 day granularity.
  • Backfill: inserts missing data that might be hours or days old

The value of N in Get, Max and Min above will vary. For Max and Min it is likely to be <= 100 and frequently <= 10. For Get in some cases the user wants all data in a time range, but in other cases they will set N to be <= 10,000 and not get all data in the time range.

I decided to not have a Get query variant, called GetD, that uses where d=D and t between … (it lacks a predicate on m) because this data is provided via the Compare query assuming that the metricIDs are known. If that assumption isn’t true and GetD were needed then an extra secondary index on (d,t) would also be required to make that query efficient unless the DBMS supported index skip scan via an index on (d,m,t,...).

Open Issues
  • Spammy data - is it worth simulating devices that misbehave? This includes sending data too frequently, sending incorrect values and timestamps and not sending data for some time periods.
  • Is the aggregated data stored inline with the base data or in a separate table? If stored inline do entries need a tag to indicate granularity like base, hourly, daily, etc?
  • Should the benchmark include queries and writes to correct bad data?
  • Can there be a way for the benchmark to run faster than wall clock time so that per-hour and per-day rollup is done more frequently than per real hour and per real day? This avoids the need to run the benchmark for too long.
  • How does a scale factor affect the size of the database, the write load and the query load?

Physical Schema - Write Optimized
The goal here is to make writes fast. That is done by using a unique index for (t,d). Inserts are to the right end of the index assuming they usually arrive in t order. While this makes inserts fast it makes queries too slow. The (t,d) index over-fetches because only the range predicate on t can be used for the index access path:
  • For the Get and Compare queries it over-fetches by a factor of |m| * |d|. If there are 1M devices and 10 metrics per insert it fetches 10M times more index entries than needed.
  • For the Max/Min queries it over-fetches by a factor of |m|.
The large over-fetching for Get and Compare makes it infeasible to run them with the write optimized schema. The Max and Min queries could still be run -- it would be painful but not infeasible given that |m| is likely to be <= 100.

Implementation notes:
  • MongoDB - There are 3 choices for the unique index. The first choice is to use the subdocument {t, d} as the value for _id and make sure that all clients encode it in the same way. The second choice is to concatenate t and d into a string for the value of _id. This avoids the need to worry about clients encoding as in the first choice at the cost of storing t and d twice per doc. The third choice is to use ObjectId as the value of _id and then have a unique compound index on {t:1, d:1}. The cost for this is an extra index but inserts are in index order for both.
  • SQL with document datatypes - Use one row per insert and document datatype support in MySQL or Postgres to store all of the (m,v) pairs in that row.
  • SQL without document datatypes - Use one row per metric per insert. The unique index is defined on (t,d,m) rather than (t,d). The columns in the row are t, d, m, v.

Physical Schema - Read Optimized

These indexes are important to make queries efficient.
  • Get and Compare need one of (d,m,t) or (m,d,t).
  • Max/Min need one of (m,t,v,d) or (m,v,t,d)
From the perspective of one Get query there isn’t a difference between the index choices because there are equality predicates on d and m. But from a cache perspective one might be better than the other as they cluster on the first indexed field -- by d or m. If some metrics are less likely to be queried then clustering by m can be a better choice.

The unique constraint is on (d,m,t) or (m,d,t). If v is also in that index then uniqueness cannot be enforced. If v is not in the index then it might not be covering. With Postgres I can use the INCLUDE clause to get v in the index. With InnoDB and MyRocks the table is clustered on the PK, so it is sufficient to make (d,m,t) or (m,d,t) the PK. With MongoDB if I want a covering index then I have to create two indexes -- unique on (m,d,t) and then non-unique but covering with (m,d,t,v) or I can skip the covering index to reduce the write cost but make queries slower. Finally, by unique for MongoDB I mean that this is the value for _id, and then we are back to the question of using a subdocument for _id.

For Max and Min:
  • The (m,t,v,d) index uses index predicates for m and t and then does a sort. So this has an extra cost from the sort and can’t benefit from stopping the index scan once the limit N has been satisfied. Some DBMS have optimizations to make top-N sort fast, but there is no way to avoid the cost of not stopping the index scan early.
  • The (m,v,t,d) index uses index predicates only for m but avoids the sort and can stop the index scan once the limit N has been satisfied. The range predicate on t would be a filter predicate - evaluated for each entry read from the index.
There are cases where (m,t,v,d) is a better choice, but I think that (m,v,t,d) will be better more of the time. The index on (m,t,v,d) would be better when the selectivity from using t as an index predicate offsets the cost of the sort and the data is skewed so that limit N isn’t satisfied until most of the index has been scanned.

Write efficiency in the read optimized schema

Below I use streams of writes to mean the number of leaf pages that will be subject to read-modify-write from index maintenance during inserts at any point in time. If all inserts are to the right end of the index then there is 1 stream. If the index is on (client-ID, time), there are 10 clients active concurrently and each client inserts in time order then there are 10 streams of writes. When there is one leaf page in cache for each stream then IO efficiency is better. Otherwise each insert is more likely to force a page write back followed by a page read.

For the indexes used by Get and Compare -- (d,m,t,v) or (m,d,t,v) -- there will be |d|*|m| streams of writes. The cache demand from this is |d|*|m| pages. A small workload might use |d|=1M and |m|=10 and 10M 8kb pages needs 80G of RAM. Both indexes have a similar amount of cache amplification.

For the indexes used by Max and Min, the (m,t,v,d) index has |m| streams of writes vs |m|*|v| for the (m,v,t,d) index. Thus the (m,v,t,d) provides better query performance at the cost of more random IO and more cache amplification. One index has much less cache amplification.

For the non-unique secondary index used for Max and Min, MyRocks can use read-free index maintenance and the number of write streams isn't an issue for it. But it will be an issue for other DBMS.

Partitioning

Partitioning the tables and collections by time can help with write and read performance as well as making other tasks easier -- removing and rolling up old data. For example there could be a partition per hour for the last N hours. However if a DBMS doesn’t support partitions then there is a burden on the application developer to support cross-partition queries. As a user I expect to run queries across hourly boundaries.

Thursday, March 12, 2020

Tuning space and write amplification to minimize cost

This uses math to show how to tune space and write amplification to minimize storage costs for an index structure that uses index+log. The result, minimal cost, is true assuming my model is true. But at this point I will only claim that the model is truthy. I look forward to more results in this area from the smart people at DASlab and elsewhere. I wonder if database economics is a good name for this topic.

I explained the index+log index structure here and here and my truthy model assumes that write and space amplification are functions of PctUsed - the percent of available storage that is used by the index structure. The model is:
  • Space amplification = 100 / PctUsed
  • Write amplification = 100 / (100 - PctUsed)
In what follows I use SpaceAmp and WriteAmp for space and write amplification. When PctUsed is X then the remaining space on the storage device (100-X) is free, not used by anything. The formulas mean that a deployment can trade between SpaceAmp and WriteAmp by adjusting the value of PctUsed. When PctUsed is 80 the values of SpaceAmp and WriteAmp are 1.25 and 5. When PctUsed is 20 the values of SpaceAmp and WriteAmp are 5 and 1.25.

Math

Back in the day hardware was scarce and there was much math in systems papers. While there isn't as much math today the rush to ML means that more people are learning applied math (me included) which is a good thing. I regret not learning enough applied math while in college.

Here I derive a formula for the cost of storage in terms of the database size and the IO required to do compaction (GC, garbage collection) for an index structure that uses the index+log approach. The cost is a function of PctUsed.

Assume:
P = PctUsed
S = index structure size in GB
G = Cost/GB
N = Write rate to index structure in MB/s
f = value between 1 and 2, 1= no storage reads by GC, 2= all GC writes do storage reads first
I = Cost/IOPs for 4KB operations

Cost = Costspace + Costio

# Formulas for the cost of space and IOPs
# 256 converts MB/s into 4KB IOPs

Costspace = S * G * 100 * P-1
Costio = N * 256 * f * I * 100 * (100-P)-1

# Determine where Cost' = 0 to find the minimal cost
Cost' = Costspace+ Costio'

Costspace= -1 * S * G * 100 * P-2
Costio= N * 256 * f * I * 100 * (100-P)-2 * -1 * -1
Cost' = (-1 * S * G * 100 * P-2) + (N * 256 * f * I * 100 * (100-P)-2 )
# And Cost' = 0 when
S * G * 100 * P-2 = N * 256 * f * I * 100 * (100-P)-2 
# Skipping a few steps this reduces to
P* ((NfI/SG) - 1) + 200P - 10,000 = 0

# This can be solved by the quadratic equation with a=((NfI/SG) - 1), b=200, c=-10,000

Graphs

Now I solve the equation above to determine the value of PctUsed that minimizes cost with prices from EBS provisioned IOPs. A Google Sheets spreadsheet with the solution is here. For the spreadsheet:
  • The graph uses log-scale for the y-axis and the y-axis doesn't start at 0. This makes it easier to see the impact of changing PctUsed, but can also be misleading. 
  • The solutions from the quadratic equation are quad1 and quad2
  • Cost is computed for PctUsed in (5, 10, 15, ..., 85, 90, 95)
  • The minimal value for Cost (quad1) is likely to be between these values
I then solve for 3 cases: N=1, 10 and 100 where N is the write rate to the index structure in MB/s. The minimal cost occurs at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

For N=1, a low write rate, the minimal cost is at PctUsed=67


For N=10, a moderate write rate, the minimal cost is at PctUsed=39


For N=100, an extreme write rate, the minimal cost is at PctUsed=17