Wednesday, July 1, 2020

Something changed for the better in create index between MySQL 8.0.18 and 8.0.20

I include MongoDB, Postgres, MySQL/InnoDB and MySQL/RocksDB (MyRocks) in the things I test via Linkbench and the insert benchmark. Hopefully I add another test later this year. I have been using MySQL 8.0.18 this year and recently started to use 8.0.20.

The new insert benchmark workflow is: create tables with PK indexes, load the tables, create 3 secondary indexes on each table, continue the load and then run several rounds of reads+inserts. For this test there were 8 clients with a table per client and 10M rows/table after the initial insert.

I think something changed for the better from MySQL 8.0.18 to 8.0.20 and I didn't spot the change in the release notes. I see:
  • Create index in 8.0.20 is 10% to 20% faster (nice, but not why I wrote this)
  • Create index in 8.0.18 uses ~7X more read IO with innodb_flush_method = O_DIRECT (14G vs 2G) and ~2X more with it set to O_DIRECT_NO_FSYNC (4G vs 2G). The table was cached at the start of create index and the indexed table should be less than 2G. Tests used innodb_sort_buffer_size = 64M. (A sketch for measuring the read IO done by a statement is below, after this list.)
    • Something changed with the sort used by create index. Maybe it does fewer passes. I am not sure there is a way to monitor that.
    • I don't understand why the choice of O_DIRECT vs O_DIRECT_NO_FSYNC matters for 8.0.18 while it has no impact for 8.0.20.
  • When the load was restarted after create index there was initially a lot of read IO with 8.0.20. Either the indexed table or the newly created indexes were not in cache and my bet is on the new indexes. This doesn't happen with 8.0.18. 
  • For 8.0.20, create index is faster with innodb_use_native_aio set to ON, but I am ignoring that topic for now.
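One way to measure the read IO done by create index is to diff Innodb_data_read (cumulative bytes read by InnoDB since startup) across the statement. A minimal sketch using the mysqlclient (MySQLdb) driver, where the connection, table and index names are placeholders:

import MySQLdb     # mysqlclient driver

def innodb_bytes_read(cur):
    # Innodb_data_read is the cumulative number of bytes read from storage by InnoDB
    cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_data_read'")
    return int(cur.fetchone()[1])

conn = MySQLdb.connect(host="127.0.0.1", user="root", db="ib")   # placeholder connection
cur = conn.cursor()
before = innodb_bytes_read(cur)
# placeholder table and index names
cur.execute("ALTER TABLE pi1 ADD INDEX x1 (price, customerid), "
            "ADD INDEX x2 (cashregisterid, price, customerid), "
            "ADD INDEX x3 (price, dateandtime, customerid)")
after = innodb_bytes_read(cur)
print("read GB during create index: %.1f" % ((after - before) / (1024.0 ** 3)))
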
Tests are run for 3 configurations (c10b40, c10b40a, c10b40b). The InnoDB options for those configs are here. Perhaps because of the way I compiled MySQL from source, innodb_use_native_aio is off by default for 8.0.18 but on for 8.0.20 -- so with the c10b40 config, 8.0.18 and 8.0.20 differ at runtime only in that option. For the c10b40a and c10b40b configs the options are the same at runtime because innodb_use_native_aio is set explicitly. The differences between the configs are:
  • c10b40 - use O_DIRECT
  • c10b40a - start with c10b40, add innodb_use_native_aio=FALSE
  • c10b40b - start with c10b40, change to O_DIRECT_NO_FSYNC
Performance metrics are here. The slightly out of date legend for the metrics is here. Similar to my favorite InnoDB performance expert, the results are compressed so I can compare many configurations on one screen. The metrics don't include it but the time to create the index for 8.0.18 is (275, 281, 233) seconds and for 8.0.20 is (157, 228, 209) seconds for the (c10b40, c10b40a, c10b40b) configs. The ips metric in the tables for create index is indexed rows per second computed as (number of rows in table / time to create all indexes).

Tuesday, June 30, 2020

Java GC and Linkbench

I filed a bug for the Sun JDK a long time ago, sometime around 1998. At the time growing the heap didn't work and the workaround was to set -Xms and -Xmx to the same value. That worked as long as you knew a reasonable value -- too small and your process dies, too large and you waste memory.

Now I get to revisit Java GC. I am running Linkbench and the bin/linkbench script that starts the benchmark client uses -Xmx=1000 to limit the heap to 1000MB of RAM. Someone else wrote that script and I didn't notice the limit until a test failed because the heap was too small.

I use Linkbench on large and small servers and need to be careful about memory usage on the small servers so I ran a few tests to understand the impact on performance and memory usage for Linkbench run with -Xmx=1000, -Xmx=2000 and -Xmx not set. I used a test with a ~10G database (maxid1=10M) and two levels of concurrency -- 16 clients, 64 clients.

For the tests I report metrics from the Linkbench client process: max value for VSZ, max value for RSS and the number of CPU seconds. This is collected via "ps aux" run at 30-second intervals; now I wish I had used 1-second intervals, but I don't want to repeat the tests. The tests used MySQL 8.0.18 and JDBC via Connector/J 8.0.20.
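
The collection loop is nothing fancy. A minimal sketch in Python, where the assumption is that the client process can be found by looking for LinkBenchDriver in the ps aux command column:

import subprocess, time

while True:
    out = subprocess.check_output(["ps", "aux"]).decode()
    for line in out.splitlines():
        if "LinkBenchDriver" in line:    # assumption: how to spot the Linkbench client
            cols = line.split()
            # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
            print(int(time.time()), "cpu_pct", cols[2], "vsz_kb", cols[4],
                  "rss_kb", cols[5], "cpu_time", cols[9])
    time.sleep(30)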

The tests ran on Amazon Linux 2 and the Java version is:
$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
Disclaimers -- I am far from a Java expert and the JDK I used might not be the latest and greatest. Feedback is welcome.

Results

I noticed two things:
  1. VSZ is 5X to 10X larger than RSS. I get scared when I see a large VSZ, even when RSS is small. But some good software (jemalloc, for example) does that, so eventually I learned to tolerate it.
  2. Unlimited heap size doesn't save much CPU time. I expected it to help more by reducing the GC overhead. It helped a bit with 64 clients, but hurt with 16 clients.
Based on these results I will update my test scripts to use -Xmx=1000 on small servers and -Xmx=2000 on large servers. Were I to use more than 64 concurrent clients then I might use a value larger than 2000.

The data

Legend:
  • hMax - value for -Xmx
  • vsz - Linkbench client max VSZ in GB
  • rss.l, rss.r - Linkbench client max RSS in GB during the load (rss.l) and run (rss.r)
  • cpuSec - number of CPU seconds used by the Linkbench client

--- 16 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     5.8    0.5     0.9     2795
2000     6.9    0.8     1.3     2856
none    20.8    0.8     4.0     2853

--- 64 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     9.0    0.9     1.3     8104
2000    10.1    1.2     1.8     7800
none    24.0    5.2     5.8     7773


Monday, May 18, 2020

Avoiding reads in write-optimized index structures

This is a summary of features that avoid storage read IO in OSS write-optimized index structures. My scope is limited to MySQL and Tarantool -- I know there are similar features in other DBMS and am happy for comments on that:
  • RocksDB merge operator logically does read-modify-write but physically does a Put into the memtable and the read is deferred until user queries or compaction. I hope that MyRocks eventually takes advantage of the merge operator.
  • Read free replication in TokuDB and MyRocks.
  • Fast Updates in TokuDB do something similar to the merge operator for update and insert.
  • Non-unique secondary index maintenance is read free in MyRocks.
  • Deferred secondary index update in Tarantool is a new approach to avoiding reads for write-heavy workloads. It is worth reading. I wish it had been published by VLDB.
  • MyRocks also avoids some storage reads when validating unique constraints (for PK and unique secondary indexes) on inserts and some updates because the Get() that it does can use a bloom filter. This benefit is larger when there is a bloom filter on the max level of the LSM tree, but that has other costs and is frequently not done.
  • MyRocks uses the SingleDelete feature in RocksDB to allow tombstones to be removed earlier which decreases the CPU overhead for reads after a batch of writes.
  • A b-tree can also reduce index maintenance related storage read IO. See the InnoDB change buffer.
Updates

An incomplete list of relevant research:
  • DiffIndex - this is more about distributed systems, but still interesting
  • Techniques for reducing index maintenance overhead in an LSM by Luo & Carey (VLDB 2019)
  • Deferred Lightweight Indexing (DELI)

Thursday, April 9, 2020

Reviving mstat

A long time ago I wrote mstat to collect iostat, vmstat and MySQL (show global status) performance counters. It samples everything at the same interval (every N seconds), computes rates for all numeric values and has some support for expressions so I can define a new counter as a function of other counters.

Eventually I stopped using it and problems accumulated. Recently I fixed many of them. The tool now works with Python3 and supports old and new iostat output formats. This week I started to add support for Postgres and MongoDB. It gets data from pg_stat_bgwriter for Postgres and from db.serverStatus() for MongoDB. So the tool is useful to me again, but it won't be useful to others until I add docs.
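
The Postgres side is one query per sample. A minimal sketch with psycopg2, assuming a local server and a placeholder connection string; pg_stat_bgwriter has one row, so the column names can double as counter names:

import psycopg2

conn = psycopg2.connect("dbname=postgres")  # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT * FROM pg_stat_bgwriter")
names = [d[0] for d in cur.description]
values = cur.fetchone()
for name, value in zip(names, values):
    print("pg.bgwriter." + name, value)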

The tool prints one line in CSV format per time interval. The line is likely to be long because there can be hundreds of counters in the line. The name and offset for each counter is printed at startup. I extract data using bash and awk as part of my benchmark workflow.

Because it must list the offset per counter at startup, the tool cannot handle dynamic data yet. An example of dynamic data is per-database counters such as the rows in pg_stat_database. That view has one row per database and the database names (datname) are not fixed. Similar problems exist for the per-table and per-index monitoring tables and views provided by MySQL, other stats views provided by Postgres and other monitoring data provided by MongoDB. So I need more time to figure out how to support them.

I have more work to do on mstat, but I am happy that I can use it again:
  • I need to figure out whether there are global counters in Postgres for things like queries and rows read
  • For MongoDB mstat has a hack to replace space, dot and parentheses with _ in key names (a sketch of the idea is below, after this list)
  • More work remains to make sure that counters are ignored unless their type is numeric or a date
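The key name hack is roughly this. A minimal sketch with pymongo that flattens db.serverStatus() output into counter names and ignores values that aren't numeric; the connection and the name prefix are assumptions:

import pymongo

def flatten(doc, prefix="mongo"):
    # replace space, dot and parentheses with _ so key names are safe for CSV headers
    out = {}
    for k, v in doc.items():
        name = prefix + "." + "".join("_" if c in " .()" else c for c in str(k))
        if isinstance(v, dict):
            out.update(flatten(v, name))
        else:
            out[name] = v
    return out

status = pymongo.MongoClient().admin.command("serverStatus")
for name, value in sorted(flatten(status).items()):
    if isinstance(value, (int, float)):   # skip counters that aren't numeric
        print(name, value)
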

Monday, April 6, 2020

Creating performance reports

I am trying to automate some of the tasks that I must do to produce performance reports. By trying I mean that I am just getting started on this and will be making decisions I might regret later. This is also a request for advice, feedback and corrections.

Some requirements, not all of which will be satisfied:
  • interactive - I prefer that the format can be hosted in a way that allows for discussions. Google Docs does a great job at this by supporting access control and threaded discussions on comments.
  • graphics - support charts and graphs. Back in the day I was a happy user of Google Image Charts that let me define a chart in a URL. Alas, the service has been turned off and the new thing from Google requires Javascript.
  • self-contained - I want to share the report as one thing. With Google Docs that thing can be the URL for the doc, even when there are additional things, like spreadsheets, linked to it. With HTML the thing is the HTML document.
  • script-able - I want to create as much as possible of the document via scripting. I frequently repeat tests and don't want to waste time reformatting tables.
  • private - some reports are not yet public and I prefer a solution that isn't always public. However I also want a solution that works for company-private reports and also for reports that are shared with the public. 
  • other - supports wide lines (via a slider, wrapping wide lines is lousy for a reader), what else?

Choices:
  • Google docs
    • interactive - yes
    • graphics - yes
    • self-contained - yes
    • scriptable - no
    • privacy - yes
    • other - no support for wide lines, they are always wrapped
  • Github Pages
    • interactive - no, always public
    • graphics - maybe
    • self-contained - yes
    • scriptable - yes
    • privacy - no, always public
    • other - supports wide lines
  • Blogger
    • interactive - somewhat, via comments, though not as nice as Google Docs, and always public
    • graphics - yes by embedding images
    • self-contained - yes
    • scriptable - maybe if I can paste an HTML doc as the content
    • privacy - no, always public
    • other - weak support for wide lines
  • Plain HTML
    • interactive - maybe*
    • graphics - maybe*
    • self-contained - maybe*
    • scriptable - yes
    • privacy - yes*
    • other - supports wide lines
From the above I am not aware of anything close to a perfect solution. I am likely to choose plain HTML. Privacy can be managed by sharing via company email and/or hosting the HTML files on company-private servers.

Charts and graphs could be provided by a web-service that lets a chart be defined in a URL. Quickchart is an example of that. Of course, use of such a service means the data is no longer private. But there is a difference between posting a web-page for the world to read and sharing that data with the company that provides the URL chart service.
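
For example, this is a sketch of a bar chart defined in a URL. It assumes QuickChart's Chart.js-style c parameter and the numbers are placeholders; check their docs for the exact format:

from urllib.parse import quote

config = '{"type":"bar","data":{"labels":["config1","config2"],"datasets":[{"label":"QPS","data":[100,125]}]}}'
print("https://quickchart.io/chart?c=" + quote(config))
# the printed URL can then be used in an <img src="..."> tag in the report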

If I use plain HTML then I must find a way to share such pages with the public. That last sentence is amusing given that the web runs on HTML. But I have never done this. I have been a long-time, and mostly happy, user of Blogger. I have begun to use Github Pages and am a big fan of Markdown.

As a bonus, I prefer a solution that doesn't depend on the whims of a vendor. For example, if I commit to service X and the vendor for service X turns off that service, I don't want to spend weeks reformatting reports while I migrate to a new service.

It would be great if I could inline some of the charts in the HTML document. I need to figure out whether there are good tools for simple ascii bar charts and graphs that can be pasted into an HTML document.

I am a fan of gnuplot and need to learn more about how I can share gnuplot output in PNG format as part of an HTML document without posting all of that on the web.

Monday, March 30, 2020

Compiling from source for Amazon Linux 2

Different is worse

I used to struggle with make & automake but I knew enough to get things done. Now I have fun with make, cmake, maven and gradle. I am sure this proliferation solves problems for people who create the build scripts that I use. But it isn't a good experience for the intermittent user -- different is worse because my skill level per tool is lower.

What is Amazon Linux 2? It has a landing page. I call it AL2 below. It claims to be like CentOS, RHEL, Fedora and it uses yum. I know how to use Google to get answers for those distros when trying to figure out how to build something from source.

But it isn't clear to me that packages that exist for CentOS also exist for AL2. One example of the challenge is that I ended up installing jemalloc from source because I couldn't find the package without advice from an expert.

I hope the output from cat /etc/os-release on AL2 is improved. Below are examples for AL2 and Ubuntu 18.04. Ubuntu is specific: it states that this is Ubuntu 18.04.4. From AL2 I learn only that it is like CentOS. But which CentOS?

NAME="Amazon Linux" 
VERSION="2" ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/" 


Let's compare that with Ubuntu:

NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Dependencies for AL2

This explains how to satisfy the dependencies before running cmake. I struggled to figure out the names of the packages that had to be installed. Some of that is because I don't have much experience with CentOS and some of it is on AL2. Eventually I found this page, which might have helped.

Add libraries for the insert benchmark. Some of this is needed for the build and all of it is needed when I run benchmarks, so I install everything before trying to build:

sudo yum install -y python3
sudo python3 -m pip install pymongo
sudo yum install -y mysql-devel python3-devel
sudo pip3 install mysqlclient
sudo pip3 install psycopg2-binary
sudo yum install -y openssl11-libs
sudo yum install -y libzstd-1.3.3
sudo yum install -y ncurses-compat-libs
# The host comes with /etc/my.cnf from MariaDB
sudo rm -f /etc/my.cnf /etc/mysql/my.cnf

Install jemalloc from source because I couldn't figure out how to install the package:

cd /media/ephemeral1
rm -rf jemalloc-5.2.1*
wget https://github.com/jemalloc/jemalloc/releases/download/5.2.1/jemalloc-5.2.1.tar.bz2
bunzip2 jemalloc-5.2.1.tar.bz2
tar xvf jemalloc-5.2.1.tar
cd jemalloc-5.2.1
./configure --prefix=/usr > o.cf 2> e.cf
make -j4 > o.m 2> e.m
sudo make install

Install things needed to compile MySQL8 and MyRocks from FB MySQL 5.6. I didn't confirm that all of these packages exist. But I was able to build after doing this:
sudo yum install -y cmake3
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y numactl-devel
sudo yum install -y libedit-devel
sudo yum install -y cmake gcc-c++ bzip2-devel libaio-devel bison 
sudo yum install -y zlib-devel snappy-devel
sudo yum install -y gflags-devel readline-devel ncurses-devel openssl-devel
sudo yum install -y lz4-devel gdb git libzstd-devel

Building MySQL8

Download, compile and install MySQL 8. It is installed at /media/ephemeral1/my8018. The upstream docs are useful.

cd /media/ephemeral1
rm -rf mysql-8.0.18*
wget https://downloads.mysql.com/archives/get/p/23/file/mysql-boost-8.0.18.tar.gz
tar xzvf mysql-boost-8.0.18.tar.gz
cd mysql-8.0.18
mkdir build
cd build
bash /media/ephemeral1/cmk80 /media/ephemeral1/my8018 > o.cm 2> e.cm
make -j8 V=1 VERBOSE=1 > o.mk 2> e.mk
make install

The contents of /media/ephemeral1/cmk80:

prefix=$1
cmake3 .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_INSTALL_PREFIX:PATH=$prefix \
      -DWITH_SSL="system" \
      -DWITH_ZLIB="system" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DENABLED_LOCAL_INFILE=1 \
      -DMYSQL_MAINTAINER_MODE=0 \
      -DWITH_BOOST=$PWD/../boost \
      -DWITH_NUMA=ON 

Building MyRocks

Download, compile and install MyRocks from FB MySQL 5.6. It is installed at /media/ephemeral1/fbmy56. The upstream docs are useful.

First install boost 1.65 from source:

cd /media/ephemeral1
wget https://dl.bintray.com/boostorg/release/1.65.1/source/boost_1_65_1.tar.bz2
bunzip2 boost_1_65_1.tar.bz2
tar xvf boost_1_65_1.tar
cd boost_1_65_1
./bootstrap.sh --prefix=/usr

./b2 > o.b2 2> e.b2
sudo ./b2 install

Then build MySQL 5.6

cd /media/ephemeral1
git clone https://github.com/facebook/mysql-5.6.git fbmy56-src
cd fbmy56-src
# This is the old version that I used for a while
# git checkout 20aaaf8d

git submodule init
git submodule update
mkdir build
cd build
bash /media/ephemeral1/cmkfbmy56 /media/ephemeral1/fbmy56 > o.cm 2> e.cm
make -j8 V=1 VERBOSE=1 > o.mk 2> e.mk
make install

The contents of /media/ephemeral1/cmkfbmy56

if [ -z "$1" ]; then
  echo "Requires prefix as arg1"
  exit 1
fi

RXF=" -DROCKSDB_DEFAULT_TO_ADAPTIVE_MUTEX -DROCKSDB_SUPPORT_THREAD_LOCAL -DROCKSDB_SCHED_GETCPU_PRESENT"
RXF+=" -DROCKSDB_RANGESYNC_PRESENT -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_FALLOCATE_PRESENT"
RXF+=" -DHAVE_ALIGNED_NEW"
RXF+=" -DROCKSDB_SUPPORT_THREAD_LOCAL"
RXF+=" -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -DOS_LINUX"

# Look at /proc/cpuinfo to determine whether these (sse, avx) are supported

#RXF+=" -DHAVE_PCLMUL -DHAVE_SSE42 -DHAVE_AVX2 -DHAVE_UINT128_EXTENSION"
RXF+=" -DHAVE_PCLMUL -DHAVE_SSE42 -DHAVE_UINT128_EXTENSION"


# -march=native gets avx and sse if they are supported
CXXF="-DNDEBUG -march=native"
CXXF+=" -faligned-new"
CXXF+=" $RXF"

# extra flags to avoid warnings with gcc on ubuntu 18
CF="-Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

CXXF+=" -Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

cmake3 .. \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DMYSQL_MAINTAINER_MODE=0 \
  -DENABLED_LOCAL_INFILE=1 \
  -DENABLE_DTRACE=0 \
  -DCMAKE_CXX_FLAGS="$CXXF" -DCMAKE_C_FLAGS="$CF" \
  -DCMAKE_INSTALL_PREFIX=$1 \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
  -DWITH_LZ4=system \
  -DWITH_ZSTD=system

Not done yet

There is a crash in Boost code that was added in the FB MySQL branch. The crash occurred with Boost 1.53, which is what yum install boost-devel provides. I will rebuild MyRocks with Boost 1.65, installed from source as shown above, as that is what Ubuntu 18.04 uses and MyRocks works great for me there.

./bin/mysqld(my_print_stacktrace+0x2e)[0xac581e]
./bin/mysqld(handle_fatal_signal+0x333)[0x7e8293]
/lib64/libpthread.so.0(+0x117e0)[0x7f22703f37e0]
./bin/mysqld(_ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7cl
assic7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcS8_EEENSG_16scanner_policiesINSG_28skip_parser_iteration_policyINSG_11alternativeINSQ_INSG_12space_parserENSG_13confix_pa
rserINSG_6strlitIPKcEENSG_11kleene_starINSG_14anychar_parserEEENSQ_INSG_10eol_parserENSG_10end_parserEEENSG_21unary_parser_categoryENSG_10non_nestedENSG_9is_lexemeEEEEENSS_ISW_SZ_
SW_S13_S14_S15_EEEENSG_16iteration_policyEEENSG_12match_policyENSG_13action_policyEEEEEED1Ev+0xc)[0x840eec]
./bin/mysqld(_ZN5boost6spirit7classic4impl14grammar_helperINS1_7grammarINS_13property_tree11json_parser12json_grammarINS5_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsI
cESaIcEEESE_St4lessISE_EEEEENS1_14parser_contextINS1_5nil_tEEEEESI_NS1_7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcSD_EEENS1_16scanner_policiesINS1_28skip_parser_iteratio
n_policyINS1_11alternativeINSW_INS1_12space_parserENS1_13confix_parserINS1_6strlitIPKcEENS1_11kleene_starINS1_14anychar_parserEEENSW_INS1_10eol_parserENS1_10end_parserEEENS1_21una
ry_parser_categoryENS1_10non_nestedENS1_9is_lexemeEEEEENSY_IS12_S15_S12_S19_S1A_S1B_EEEENS1_16iteration_policyEEENS1_12match_policyENS1_13action_policyEEEEEE8undefineEPSM_+0x43)[0
x840fe3]
./bin/mysqld(_ZN5boost6spirit7classic7grammarINS_13property_tree11json_parser12json_grammarINS3_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_St4lessISC_EEE
EENS1_14parser_contextINS1_5nil_tEEEED1Ev+0x34)[0x840cb4]
./bin/mysqld(_ZN5boost13property_tree11json_parser18read_json_internalINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEEEvRSt13basic_istream
INT_8key_type10value_typeES6_ISG_EERSE_RKS9_+0xc51)[0x844511]
./bin/mysqld[0x839e32]
./bin/mysqld(_ZN3THD12set_shard_idEv+0x1c)[0x83a26c]
./bin/mysqld(_ZN3THD15set_db_metadataEv+0x1e3)[0x852e73]
./bin/mysqld(_Z15mysql_change_dbP3THDPK19st_mysql_lex_stringb+0x5da)[0x84f6ca]
./bin/mysqld(_Z16acl_authenticateP3THDj+0x13a5)[0x80e685]
./bin/mysqld[0x847d26]
./bin/mysqld(_Z16login_connectionP3THD+0x4d)[0x849c2d]
./bin/mysqld(_Z24do_handle_one_connectionP3THD+0x199)[0x84aa29]
./bin/mysqld(handle_one_connection+0x9)[0x84adc9]
/lib64/libpthread.so.0(+0x740b)[0x7f22703e940b]
/lib64/libc.so.6(clone+0x3f)[0x7f226e2a7e7f]

stack_log: _ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7clas
sic7scannerIN9__gnu_cxx17__normal_iteratorIPcSt6vectorIcS8_EEENSG_16scanner_policiesINS


Friday, March 27, 2020

kernel lockdown vs blktrace

I am trying to use blktrace to determine which files are the source of disk reads for a database that uses buffered IO. The server runs Ubuntu 18.04 on bare metal and the boot uses UEFI not legacy.

blktrace doesn't work, time to debug. Things that don't fix it include upgrading from 4.15 to 5.3 HWE kernel and disabling apparmor. Next up is disabling kernel lockdown via mokutil --disable-validation. Alas, blktrace still fails at startup.

After running mokutil and then rebooting there are still a few messages in dmesg output about lockdown so I wonder whether it was fully disabled.
Lockdown: Hibernation is restricted; see man kernel_lockdown.7
Lockdown: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
OK, let's read the man page. Great, it doesn't exist -- not for Ubuntu nor for other distros. There is a draft but I am starting to get the impression that lockdown wasn't ready for prime time. And Linus had a strong opinion about it in 2018.

Next up is a strong opinion from Brendan Gregg.
Many distros are enabling lockdown, breaking BPF. This is the worst OS change I've ever seen.
OK, maybe my problem is lockdown and mokutil wasn't sufficient. Time to try:
echo 1 > /proc/sys/kernel/sysrq; echo x > /proc/sysrq-trigger 
And now blktrace works. Well, until I reboot. I already have a script to run after reboot to reduce security so that PMP can run. That script just got larger:
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger 


Monday, March 16, 2020

Insert Benchmark v3

I expect to replace the insert benchmark later this year. The insert benchmark I have been using might be ibench v2 as ibench v1 came from Tokutek. So the replacement I write would be ibench v3.

The reasons for a replacement include:
  • Switch from Python to something more performant (Java or Go) as it is difficult to sustain high QPS rates (100k/second or more) with Python, and a Java/Go client will consume less CPU. I assume all of the DBMS I care about (MongoDB, MySQL, Postgres) have good golang clients.
  • Make the benchmark less synthetic. Early ibench was insert-only for a table with a PK and 3 secondary indexes. Eventually I added support for short range queries. It has always used uniform distribution. The plan for ibench v3 is to model a monitoring workload -- many inserts, some updates, many queries. It will use a PK and 1 or 2 secondary indexes.
  • Continue to support multiple DBMS -- MongoDB, MySQL and Postgres.
  • As a bonus, don't ignore coordinated omission.
I still need to confirm whether this is worth implementing or whether I have a case of NIH and should be using TSBS. I browsed TSBS and the indexes I describe here might match the indexes that TSBS would use. However, TSBS has more query diversity. My focus is on the storage engine and a smaller set of queries is sufficient. Regardless, I need to spend more time reading about TSBS.

Logical Schema

The logical schema describes the data in one insert operation. The physical schema describes how that is stored in the DBMS. I focus on the logical schema here.

The logical schema is: timestamp, deviceID, [metricID, metricValue]+
  • This is the data from one insert for a given device. It has data for 1+ metrics.
  • timestamp is a 64-bit integer that will be mostly increasing on insert. Clocks are not synchronized across devices, some devices will have bad/stuck clocks and some devices will have clock drift.
  • deviceID is a 64-bit integer. The number of devices will be used to scale the workload. There can be millions of devices for large scale factors.
  • Each (metricID, metricValue) pair represents the value for a metricID from a given device at a given point in time. metricID is a 32-bit integer. The number of metrics will also be used to scale the workload. There will be fewer metrics than devices. metricValue is a 64-bit integer.
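A minimal sketch of one insert operation as a Python dict, with made-up values:

insert_op = {
    "t": 1584374400,          # timestamp, 64-bit int, mostly increasing
    "d": 1234567,             # deviceID, 64-bit int
    "metrics": [              # one or more (metricID, metricValue) pairs
        {"m": 11, "v": 900},
        {"m": 12, "v": -3},
        {"m": 13, "v": 0},
    ],
}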

Workload

Fields are abbreviated when describing the workload: t=timestamp, d=deviceID, m=metricID, v=metricValue. Operations are described in SQL, except for insert.

The workload is a combination of:
  • Get: select v, t from C where m=M and d=D and t between T1 and T2 order by t
    • Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
  • Max, Min: select v, t, d from C where m=M and t between T1 and T2 order by v asc/desc limit N
  • Compare: call Get for m in (...) and d=D and t between T1 and T2
  • Insert: insert into C values (t, d, [(m1, v1),(m2, v2),...])
    • SQL above is not standard and assumes a schema that might not be used.
    • (t,d,m) is unique. I am still deciding whether the benchmark client can generate intermittent duplicate inserts and how that should be handled.
  • Rollup: Uses Compare to get all data for d in (...) in time range, then replaces that data with values aggregated over coarser granularity. Assume that base data is rolled up into 5 minute, then 1 hour and then 1 day granularity.
  • Backfill: inserts missing data that might be hours or days old

The value of N in Get, Max and Min above will vary. For Max and Min it is likely to be <= 100 and frequently <= 10. For Get in some cases the user wants all data in a time range, but in other cases they will set N to be <= 10,000 and not get all data in the time range.

I decided to not have a Get query variant, called GetD, that uses where d=D and t between … (it lacks a predicate on m) because this data is provided via the Compare query assuming that the metricIDs are known. If that assumption isn’t true and GetD were needed then an extra secondary index on (d,t) would also be required to make that query efficient unless the DBMS supported index skip scan via an index on (d,m,t,...).

Open Issues
  • Spammy data - is it worth simulating devices that misbehave? This includes sending data too frequently, sending incorrect values and timestamps and not sending data for some time periods.
  • Is the aggregated data stored inline with the base data or in a separate table? If stored inline do entries need a tag to indicate granularity like base, hourly, daily, etc?
  • Should the benchmark include queries and writes to correct bad data?
  • Can there be a way for the benchmark to run faster than wall clock time so that per-hour and per-day rollup is done more frequently than per real hour and per real day? This avoids the need to run the benchmark for too long.
  • How does a scale factor affect the size of the database, the write load and the query load?

Physical Schema - Write Optimized
The goal here is to make writes fast. That is done by using a unique index for (t,d). Inserts are to the right end of the index assuming they usually arrive in t order. While this makes inserts fast it makes queries too slow. The (t,d) index over-fetches because only the range predicate on t can be used for the index access path:
  • For the Get and Compare queries it over-fetches by a factor of |m| * |d|. If there are 1M devices and 10 metrics per insert it fetches 10M times more index entries than needed.
  • For the Max/Min queries it over-fetches by a factor of |m|.
The large over-fetching for Get and Compare makes it infeasible to run them with the write optimized schema. The Max and Min queries could still be run -- it would be painful but not infeasible given that |m| is likely to be <= 100.

Implementation notes:
  • MongoDB - There are 3 choices for the unique index. The first choice is to use the subdocument {t, d} as the value for _id and make sure that all clients encode it in the same way. The second choice is to concatenate t and d into a string for the value of _id. This avoids the need to worry about client encoding as in the first choice at the cost of storing t and d twice per doc. The third choice is to use ObjectId as the value of _id and then have a unique compound index on {t:1, d:1}. The cost for this is an extra index but inserts are in index order for both. (A sketch of the three choices is after this list.)
  • SQL with document datatypes - Use one row per insert and document datatype support in MySQL or Postgres to store all of the (m,v) pairs in that row.
  • SQL without document datatypes - Use one row per metric per insert. The unique index is defined on (t,d,m) rather than (t,d). The columns in the row are t, d, m, v.
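A minimal sketch of the three MongoDB choices via pymongo; the database, collection and field names are placeholders and in practice only one choice would be used:

import pymongo
from pymongo import ASCENDING

coll = pymongo.MongoClient().ibench.metrics  # placeholder db/collection names

# choice 1: subdocument _id -- every client must build {t, d} with the same field order
coll.insert_one({"_id": {"t": 1584374400, "d": 7}, "m": [{"id": 11, "v": 900}]})

# choice 2: concatenate t and d into a string _id -- t and d are stored twice per doc
coll.insert_one({"_id": "1584374401:7", "t": 1584374401, "d": 7, "m": [{"id": 11, "v": 901}]})

# choice 3: default ObjectId _id plus a unique compound index on {t:1, d:1}
coll.create_index([("t", ASCENDING), ("d", ASCENDING)], unique=True)
coll.insert_one({"t": 1584374402, "d": 7, "m": [{"id": 11, "v": 902}]})
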

Physical Schema - Read Optimized

These indexes are important to make queries efficient.
  • Get and Compare need one of (d,m,t) or (m,d,t).
  • Max/Min need one of (m,t,v,d) or (m,v,t,d)
From the perspective of one Get query there isn’t a difference between the index choices because there are equality predicates on d and m. But from a cache perspective one might be better than the other as they cluster on the first indexed field -- by d or m. If some metrics are less likely to be queried then clustering by m can be a better choice.

The unique constraint is on (d,m,t) or (m,d,t). If v is also in that index then uniqueness cannot be enforced. If v is not in the index then it might not be covering. With Postgres I can use the INCLUDE clause to get v in the index. With InnoDB and MyRocks the table is clustered on the PK, so it is sufficient to make (d,m,t) or (m,d,t) the PK. With MongoDB if I want a covering index then I have to create two indexes -- unique on (m,d,t) and then non-unique but covering with (m,d,t,v) or I can skip the covering index to reduce the write cost but make queries slower. Finally, by unique for MongoDB I mean that this is the value for _id, and then we are back to the question of using a subdocument for _id.
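
A minimal sketch of the DDL, with placeholder table, column and connection names. The PK on (m,d,t) is covering for InnoDB and MyRocks because the table is clustered on the PK, and the Postgres INCLUDE clause adds v to the unique index:

import MySQLdb      # mysqlclient driver
import psycopg2

ddl_mysql = """CREATE TABLE metrics (
  d BIGINT NOT NULL, m INT NOT NULL, t BIGINT NOT NULL, v BIGINT NOT NULL,
  PRIMARY KEY (m, d, t)) ENGINE=InnoDB"""   # ENGINE=ROCKSDB for MyRocks

ddl_pg = """CREATE TABLE metrics (
  d BIGINT NOT NULL, m INT NOT NULL, t BIGINT NOT NULL, v BIGINT NOT NULL);
CREATE UNIQUE INDEX metrics_mdt ON metrics (m, d, t) INCLUDE (v)"""

MySQLdb.connect(host="127.0.0.1", user="root", db="ib").cursor().execute(ddl_mysql)
pg = psycopg2.connect("dbname=ib")   # placeholder connection
pg.cursor().execute(ddl_pg)
pg.commit()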

For Max and Min:
  • The (m,t,v,d) index uses index predicates for m and t and then does a sort. So this has an extra cost from the sort and can’t benefit from stopping the index scan once the limit N has been satisfied. Some DBMS have optimizations to make top-N sort fast, but there is no way to avoid the cost of not stopping the index scan early.
  • The (m,v,t,d) index uses index predicates only for m but avoids the sort and can stop the index scan once the limit N has been satisfied. The range predicate on t would be a filter predicate - evaluated for each entry read from the index.
There are cases where (m,t,v,d) is a better choice, but I think that (m,v,t,d) will be better more of the time. The index on (m,t,v,d) would be better when the selectivity from using t as an index predicate offsets the cost of the sort and the data is skewed so that limit N isn’t satisfied until most of the index has been scanned.

Write efficiency in the read optimized schema

Below I use streams of writes to mean the number of leaf pages that will be subject to read-modify-write from index maintenance during inserts at any point in time. If all inserts are to the right end of the index then there is 1 stream. If the index is on (client-ID, time), there are 10 clients active concurrently and each client inserts in time order then there are 10 streams of writes. When there is one leaf page in cache for each stream then IO efficiency is better. Otherwise each insert is more likely to force a page write back followed by a page read.

For the indexes used by Get and Compare -- (d,m,t,v) or (m,d,t,v) -- there will be |d|*|m| streams of writes. The cache demand from this is |d|*|m| pages. A small workload might use |d|=1M and |m|=10, and 10M 8KB pages need 80G of RAM. Both indexes have a similar amount of cache amplification.

For the indexes used by Max and Min, the (m,t,v,d) index has |m| streams of writes vs |m|*|v| for the (m,v,t,d) index. Thus (m,v,t,d) provides better query performance at the cost of more random IO and much more cache amplification.

For the non-unique secondary index used for Max and Min, MyRocks can use read-free index maintenance and the number of write streams isn't an issue for it. But it will be an issue for other DBMS.

Partitioning

Partitioning the tables and collections by time can help with write and read performance as well as making other tasks easier -- removing and rolling up old data. For example there could be a partition per hour for the last N hours. However if a DBMS doesn’t support partitions then there is a burden on the application developer to support cross-partition queries. As a user I expect to run queries across hourly boundaries.

Thursday, March 12, 2020

Tuning space and write amplification to minimize cost

This uses math to show how to tune space and write amplification to minimize storage costs for an index structure that uses index+log. The result, minimal cost, is true assuming my model is true. But at this point I will only claim that the model is truthy. I look forward to more results in this area from the smart people at DASlab and elsewhere. I wonder if database economics is a good name for this topic.

I explained the index+log index structure here and here and my truthy model assumes that write and space amplification are functions of PctUsed - the percent of available storage that is used by the index structure. The model is:
  • Space amplification = 100 / PctUsed
  • Write amplification = 100 / (100 - PctUsed)
In what follows I use SpaceAmp and WriteAmp for space and write amplification. When PctUsed is X then the remaining space on the storage device (100-X) is free, not used by anything. The formulas mean that a deployment can trade between SpaceAmp and WriteAmp by adjusting the value of PctUsed. When PctUsed is 80 the values of SpaceAmp and WriteAmp are 1.25 and 5. When PctUsed is 20 the values of SpaceAmp and WriteAmp are 5 and 1.25.

Math

Back in the day hardware was scarce and there was much math in systems papers. While there isn't as much math today the rush to ML means that more people are learning applied math (me included) which is a good thing. I regret not learning enough applied math while in college.

Here I derive a formula for the cost of storage in terms of the database size and the IO required to do compaction (GC, garbage collection) for an index structure that uses the index+log approach. The cost is a function of PctUsed.

Assume:
P = PctUsed
S = index structure size in GB
G = Cost/GB
N = Write rate to index structure in MB/s
f = value between 1 and 2, 1= no storage reads by GC, 2= all GC writes do storage reads first
I = Cost/IOPs for 4KB operations

Cost = Cost_space + Cost_io

# Formulas for the cost of space and IOPs
# 256 converts MB/s into 4KB IOPs

Cost_space = S * G * 100 / P
Cost_io = N * 256 * f * I * 100 / (100 - P)

# Determine where Cost' = 0 to find the minimal cost
Cost' = Cost_space' + Cost_io'

Cost_space' = -1 * S * G * 100 / P^2
Cost_io' = N * 256 * f * I * 100 / (100 - P)^2 * -1 * -1
Cost' = (-1 * S * G * 100 / P^2) + (N * 256 * f * I * 100 / (100 - P)^2)
# And Cost' = 0 when
S * G * 100 / P^2 = N * 256 * f * I * 100 / (100 - P)^2
# Skipping a few steps this reduces to
P^2 * ((256NfI/SG) - 1) + 200P - 10,000 = 0

# This can be solved with the quadratic formula where a=((256NfI/SG) - 1), b=200, c=-10,000
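
This is a minimal sketch that solves the quadratic. The values for S, G, I and f are placeholders chosen for illustration, not necessarily what the spreadsheet uses, but with them the minima land near the values reported below:

import math

# placeholder inputs: S in GB, G in $/GB-month, I in $/IOPs-month, f in [1,2]
S, G, I, f = 1000.0, 0.125, 0.065, 1.8

def best_pct_used(N):
    # a*P^2 + b*P + c = 0 with a = (256*N*f*I / (S*G)) - 1, b = 200, c = -10000
    a = (256.0 * N * f * I) / (S * G) - 1.0
    b, c = 200.0, -10000.0
    disc = math.sqrt(b * b - 4.0 * a * c)
    # return the root that is a valid percentage
    return next(p for p in ((-b + disc) / (2 * a), (-b - disc) / (2 * a)) if 0 < p < 100)

for N in (1, 10, 100):
    print("N=%d MB/s -> PctUsed=%.0f" % (N, best_pct_used(N)))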

Graphs

Now I solve the equation above to determine the value of PctUsed that minimizes cost with prices from EBS provisioned IOPs. A Google Sheets spreadsheet with the solution is here. For the spreadsheet:
  • The graph uses log-scale for the y-axis and the y-axis doesn't start at 0. This makes it easier to see the impact of changing PctUsed, but can also be misleading. 
  • The solutions from the quadratic equation are quad1 and quad2
  • Cost is computed for PctUsed in (5, 10, 15, ..., 85, 90, 95)
  • The minimal value for Cost (quad1) is likely to be between these values
I then solve for 3 cases: N=1, 10 and 100 where N is the write rate to the index structure in MB/s. The minimal cost occurs at PctUsed = 67, 39 and 17 for N = 1, 10 and 100.

For N=1, a low write rate, the minimal cost is at PctUsed=67


For N=10, a moderate write rate, the minimal cost is at PctUsed=39


For N=100, an extreme write rate, the minimal cost is at PctUsed=17


Review of Benchmarking RocksDB Key-Value Workloads at Facebook

At some level all database engines have a key-value API even if it isn't exposed to the user. One name for this in a SQL DBMS is the data layer. The data layer is RocksDB for MyRocks & MongoRocks and WiredTiger for MongoDB.

FAST 20 has a paper on characterizing RocksDB key-value workloads at Facebook. One of the workloads is UDB and UDB uses MyRocks, a SQL DBMS.

Benchmarks are useful to me but there is always room for improvement. Two common problems are the lack of load variance and the lack of complexity in access patterns. By the lack of load variance I mean that the workload doesn't change over time which fails to capture the daily spikes that occur for web-scale workloads as different parts of the world wake and sleep. They also fail to include daily and intermittent operational tasks like backup, schema change, data migration and archiving old data.

This paper explains the complexity in access patterns for three RocksDB use cases and shows how to reproduce them in benchmarks. UDB was one of the workloads so this work includes an update of the analysis shared in the Linkbench paper from 2013.

The primary contributions of the paper are the analysis of production workloads and the discovery that hot key ranges, rather than hot keys, are required to model production workloads. I appreciate that some of the work behind this paper has already been shared via RocksDB and that more might be shared as enhancements to YCSB.

Things I liked about the paper:
  • When generating skew in a benchmark use the concept of hot key ranges rather than hot keys. Linkbench and YCSB use hot keys distributed across the key space.
  • Explains the UDB workload which was last explained in the Linkbench paper
  • Explains two other production workloads
  • Provides some detail on the diurnal workload spikes in UDB
  • Explains the distribution of key sizes. While it states that UDB keys are small, by small it means a typical value is 16 to 30 bytes. Of course, the covering secondary index used by UDB is an outlier at ~64 bytes.
  • Explains that values are usually small, <= 128 bytes
  • Documents temporal access patterns
  • Demonstrates that using hot key ranges rather than hot keys can more accurately model production workloads

Wednesday, March 11, 2020

Building MySQL 5.6 from source on Ubuntu 18.04

This explains the steps I used to build MySQL 5.6.35 from source on Ubuntu 18.04. I build with the perf schema disabled and if you do that then don't use Connector/J 5.1.48 (5.1.47 is OK) or you won't be able to connect thanks to bug 98919. The new Connector/J dependency on perf schema variables can be a problem for a DBMS that implements the MySQL API -- see this bug report for MemSQL.
  1. Install an older version of gcc and g++: sudo apt install gcc-5 g++-5
  2. Unpack source. I used 5.6.35
  3. Remove the connection_control plugin because that does not compile when the perf schema is disabled at compile time -> rm -rf $SRC_ROOT/plugin/connection_control
  4. Run cmake. See below. Note that I disable the perf schema.
My Cmake script:

prefix=$1
CC=gcc-5 CXX=g++-5 \
cmake .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DCMAKE_INSTALL_PREFIX:PATH=$prefix \
      -DWITH_SSL="bundled" \
      -DWITH_ZLIB="bundled" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DENABLED_LOCAL_INFILE=1 \
      -DMYSQL_MAINTAINER_MODE=0 \
      -DWITH_PERFSCHEMA_STORAGE_ENGINE=0 > o.cm 2> e.cm

Wednesday, March 4, 2020

RDBMS != SQL DBMS

We use RDBMS as another name for SQL DBMS but SQL isn't relational. That isn't news, see this web site and book. SQL allows for but doesn't require the relational approach -- 1NF and 3NF are optional. JSON is in the SQL:2016 spec. What would Codd think?

Use Oracle as a SQL DBMS example: first there was support for collection data types, then there was XML and eventually JSON arrived. These let you violate 1NF. I won't argue whether they should be used. I only claim they can be used.

Have there been surveys to document how often the relational approach is used with a SQL DBMS? I assume it is better to think of a distribution of approaches (a value between 0 and 1 where 0 is SQL and 1 is relational) rather than a binary approach of relational vs SQL (not relational). I might call the SQL endpoint the pragmatic approach, but that introduces bias. While I have spent a long time working on SQL DBMS I am usually working under the hood and don't design applications.

Tuesday, February 18, 2020

New school vs old school DBMS

There are two approaches to DBMS deployment
  1. Old school - pay whatever it takes to keep that one instance running
  2. New school - allow for failure and make failover reliable
At what point do you stop paying to increase availability and durability of one DBMS instance and instead spend that money and energy elsewhere? The old school approach was more popular in the days before cloud and web-scale. Of course early in the cloud and web-scale days we allowed for failure without making failover reliable. Those weren't fun times.

That you must design for failures doesn't mean you must tolerate lousy hardware. Just because failover should be fast, reliable and lossless doesn't mean you want it running too frequently. There is a difference between commodity and cheap hardware. On the other hand, asking for too much special hardware won't win you friends on the datacenter HW team. If the DBMS is the only thing asking for redundant power, redundant network, special cooling or HW RAID with battery-backed write cache then the DBMS is a problem for web-scale. I know this from experience.

One reason for the old school approach was the large cost of the DBMS licenses and the SMP server on which it ran. Users are motivated to not buy 3X the licenses when the cost is sufficiently large and instead invest in an environment that helps that DBMS keep running (more HA hardware). Any comparison between new school and old school should consider whether the goal is durability or availability because availability costs more than durability. You can try to provide one or both. In this post I am writing about something that needs both.

Storage is one place where you can spend a lot of money to keep that one instance running. That can be done via RAID-10 where the impact is buying twice more storage devices. Or it can be done via a HA storage solution that provides impressive levels of performance, availability and durability.

One thing I am curious about is whether SSD devices need RAID-10. All storage devices have some chance of failure so the question isn't whether RAID-10 is useful. The question is whether the usefulness outweighs the cost. I assume the answer will depend on the device. Samsung advertises fail in place (FIP) as a new feature. But chip fail protection and wear leveling have been here for a long time. For which devices is that sufficient so that RAID-10 isn't recommended? Do vendors make this clear in their docs?

New School

The new school approach must do two things: make durable commits fast, make failover fast and lossless.

Fast durable commits are usually done via sync replication. I think of sync log shipping as a variant of sync replication where the end point is a log archive rather than a replica. But some sync replication solutions already support some of that via a witness.

MongoDB provides fast durable commits via async replication. I was impressed when I first learned of the implementation. The property needed for durable commits is to avoid making writes visible before they are durable. Sync replication is an implementation detail. With MongoDB and the majority read concern a snapshot on the primary is advanced to track the point-in-time at which commits are durable (applied on enough replicas).

Whether durable commits are fast depends on where you place the replicas. Speed of light matters. When replicas are far apart then there will be more commit latency. When replicas are not far apart there might be a larger HW bill. Witnesses (log only replicas) can help here.

Lossless failover assumes a solution for durable commit. Once durable commit has been provided then fast failover is a matter of detecting failures, electing a new primary, promoting the replica to be a primary and then directing traffic to the new primary. There are many details here and plenty of opportunities for mistakes. I know from having helped make some of them. Fortunately the rise of web-scale DBMS means that we get solutions that work.

Things that I ignore in this post:
  • Fast failure detection is a challenging problem, depending on how fast you want it to be.
  • Systems that allow multiple replicas to initiate writes might not expose failover to a client, but many of the problems described here are still solved under the covers by such systems.
  • Even for systems that don't have explicit failover there is still an impact on clients from failed in-progress transactions, although Comdb2 hides that from clients.

Friday, February 14, 2020

Describing replication

There is an opportunity for confusion when describing replication, including physical vs logical and synchronous vs asynchronous. I prefer describing replication as physical, statement or document/row rather than logical vs physical, and I briefly define those terms below.

In physical replication the log records the changes to the database pages (change bytes starting at offset X on page Y from this sequence to that sequence). A benefit of this approach is that the database pages will be the same between primary and replica. This approach also avoids the overhead (more CPU than IO) of evaluating the queries that lead to the page changes. A problem from this approach is that the pages need to be the same between the primary and replica. Changes to pages must be deterministic, long running reporting queries on a replica might block replication apply, MVCC GC requires coordination, etc.

In statement based replication (SBR) the log records the statements that modified the database. The benefit from this approach is that it is easier to implement. When one statement changes many rows then this also reduces the size of the log. But for OLTP statements are less likely to change many rows. Unfortunately long running write statements on the primary will be repeated on the slave and that can waste CPU and cause replication lag. It is also difficult to detect dependencies between transactions which makes it harder to replay in parallel on a replica. I did web-scale MySQL for many years with SBR. It worked, but the move to RBR is a good thing.

In document/row replication (aka RBR or DBR) the log has one entry per changed doc/row. This avoids the need for the replica to reevaluate the statement that generated the changes -- it can just fetch the docs/rows by ID and apply changes to them. When the doc/row entries in the log also include PK (_id) values then it is possible to determine which transactions can be replayed in parallel. In contrast to SBR, it is possible to make these changes idempotent, which reduces the burden of making replay crash safe.

MongoDB oplog

This is an example of the oplog contents for MongoDB. This is document-based replication per the terminology above:

db.bar.insertMany([{a:1}, {a:2}, {a:3}])

use local
db.oplog.rs.find({"o.msg": {$ne: "periodic noop"}},
                 {op:1, o:1}).sort({$natural:-1}).limit(3)

{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 3 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 2 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 1 } }

Benchmarks vs real workloads

How do you test a DBMS that you are building for internal usage or the world? This is a good question whether you are the upstream DBMS or maintaining a downstream fork. The choices include real workloads, synthetic benchmarks and throw it over the wall.

There is a great paper from Microsoft on doing live capture & replay at scale for real customers. I hope to add a link for a talk on the live capture & replay that was done for testing at Facebook.

Real workloads

Real workloads are great. Alas real workloads are hard to get (code IP, data privacy). Even if these are resolved a bigger problem is that real workloads are complex and buggy (all software is buggy). If you get a real workload without a team to support it then you will spend a lot of time debugging it and otherwise trying to figure it out. This is probably more true for OLTP than analytics.

Real workloads v2

Capture & replay (C&R) is an easier way to get some of the goodness of a real workload with less effort. This avoids the overhead of dealing with the code from the real workload. This doesn't avoid issues with data privacy.

C&R can be online or offline. Online C&R does the replay online (immediately after the capture). Online C&R is a great choice when you have significant internal workloads that use the DBMS you are building. Running a DBMS in production is a great education for database developers in so many ways. This is one more perk.

Offline C&R archives the captured workload for replay at a later time. This is useful for workloads that can't be shadowed (customer workloads). This is also useful when you want a workload that can be repeated (replayed many times) as online C&R does not allow for that. An interesting problem for offline C&R is making the replay realistic -- respecting the timing and concurrency that was recorded. This is more challenging when the DBMS being tested is a lot faster than the DBMS used for the capture.

One question for C&R is the point at which the capture is done. When capture is done at the DBMS then you don't miss anything. But sometimes it is easier to do the capture at the application tier.

The Facebook MySQL team had a great setup for online C&R via a shadow tier. I don't have much experience with offline C&R but I know it has been useful for others. In my Google MySQL years I built a tool to sample production queries (SELECT, not writes) and replay them against replicas running new and old binaries to compare performance and correctness. That was a fun and useful project for me. It was mostly online C&R. One interesting bug it found was a result of changing the internal datatype for aggregation between decimal and IEEE754.

Synthetic benchmarks

Benchmarks have a bad reputation but I like them. I even enjoy benchmarketing but only when I am not involved. I rarely run benchmarks in isolation. Absolute values (QPS, TPS) don't help me. I am a big fan of relative comparisons (new binary vs old binary, this DBMS vs that DBMS).

Comparisons require sufficient expertise in all of the systems that are tested as the goal is to get results that can be explained and trusted. Expertise is hard to find. Therefore I have less faith in benchmark results that test more than one DBMS -- whether this is work done by myself or others. Perhaps one day someone will do a study and provide a function that estimates truthiness as a function of the number of DBMS tested.

Throw it over the wall

This has to be done at some point in time, otherwise you are building a DBMS that nobody uses. Hopefully the steps above have been done to increase the chance that there won't be too much drama at this point.

Thursday, February 13, 2020

Short guide to MongoDB monitoring

This is short and incomplete. But it is a good start. This is written from the perspective of someone who spends all of their time trying to explain MongoDB performance for benchmarks. I have plenty of experience with databases in production. I have no experience with MongoDB in production.

Notes:
  • One day I hope for the equivalent of user/table statistics
  • For profiling use db.setProfilingLevel and db.getProfilingStatus.
  • To enable debug level messages in the diagnostic log use db.setLogLevel and db.getLogComponents. However I think it is best to start with non-debug messages so you can ignore these for a while.
  • I wish the set and get names above were symmetric. Just remember the set*Level method names and don't worry about the get* names. 
  • db.setProfilingLevel also determines what can get written to the diagnostic log. The important parameter is slowms. My advice is to set level=0 and slowms to something > 0 and use the diagnostic log. (A sketch via pymongo is after this list.)
  • Learn how to use mtools to process entries from the diagnostic log. The tools are great and the docs are awesome.
  • The diagnostic log entries for COMMAND and TXN are the most likely entries you will examine. These lines can include many performance metrics but they are only printed when the values are > 0. Examples are below.
  • Also see db.stats, db.serverStatus, db.collStats, db.collection.stats, db.collection.latencyStats, connPoolStats and replSetGetStatus.
  • Consider saving the FTDC files from the diagnostic.data directory if you have problems. There are some tools for accessing the data from an FTDC file.
  • Use explain to understand query plans and execution
  • Use mongostat and mongotop to see performance from a high level before drilling down
  • I don't have much experience with them yet but MongoDB has additional tools and services for monitoring.
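A minimal sketch of the profiling setup via pymongo, assuming a local server and a placeholder database name; db.setProfilingLevel in the shell maps to the profile command here:

import pymongo

db = pymongo.MongoClient().linkdb0   # placeholder database name

# level 0 with a slowms threshold: nothing is written to system.profile, but operations
# slower than slowms are written to the diagnostic log
db.command("profile", 0, slowms=500)

# profile level -1 reads back the current settings (like db.getProfilingStatus)
print(db.command("profile", -1))
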
Diagnostic log examples
These show some of the counters that can be included in diagnostic log entries. Remember that many of the counters are only printed when > 0.
Metrics on writes (bytesWritten, timeWritingMicros)
2020-02-04T18:05:24.787-0800 I COMMAND [conn41] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3139, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868324, 23435), signature: { hash: BinData(0, 9E42A21BD71EA0BB8DFEAD52013CFFF325E60BB6), keyId: 6789772850304647170 } }, lsid: { id: UUID("5d900961-93cf-4f7a-9426-940a37a71eca") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesWritten: 324324, timeWritingMicros: 656 } } protocol:op_msg 401ms
Metrics on reads (bytesRead, timeReadingMicros)
2020-02-04T18:08:52.442-0800 I COMMAND [conn33] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3719, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868532, 2858), signature: { hash: BinData(0, D289506AFBD0E8C8E8184112608CFED62E9A1B7D), keyId: 6789772850304647170 } }, lsid: { id: UUID("7391b47c-fd95-491b-a640-20c953bcacc0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1369, timeReadingMicros: 10 } } protocol:op_msg 389ms

timeWaitingMicros is less frequent. In this case it was ~2.2 seconds. That seems like a lot but the operation wrote more than 4M of data.
2020-02-04T18:22:05.668-0800 I COMMAND [conn46] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 144, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580869323, 2352), signature: { hash: BinData(0, 2CF49EC657832ECB9D7DD0AC5C65E5C3F80CA10B), keyId: 6789781594858061826 } }, lsid: { id: UUID("d88d7d65-f293-4060-a894-17c58123d787") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1881, bytesWritten: 4458441, timeReadingMicros: 15, timeWritingMicros: 5071 }, timeWaitingMicros: { cache: 2215267 } } protocol:op_msg 2621ms
timeAcquiringMicros for locks is also less frequent. I have a few examples from insert stress tests. I have yet to learn whether mongod times all lock waits. It isn't easy to do that without an impact on performance.
2020-02-04T17:52:55.133-0800 I COMMAND [conn43] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 1023, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580867574, 46329), signature: { hash: BinData(0, D158AF9490A8ABAEA2969E809D2C27FE57157B41), keyId: 6789772850304647170 } }, lsid: { id: UUID("fabaaa7b-c2c4-4712-ae16-f2c0f58a03a0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 14042 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{} protocol:op_msg 376ms
Commit appears in the log via TXN

2020-02-05T17:31:54.792-0800 I TXN [conn81] transaction parameters:{ lsid: { id: UUID("cae8daad-5603-4a41-87db-3fda07e90bee"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 255, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1580952714, 1391) } }, readTimestamp:Timestamp(0, 0), ninserted:10 keysInserted:20 terminationCause:committed timeActiveMicros:604 timeInactiveMicros:1016 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 5 } }, Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 23 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 1ms

Wednesday, February 12, 2020

How long should you wait for a JVM JIT before attaching a CPU profiler?

I haven't done serious Java development since 2000. I am happy for comments that help me learn more as a lot has changed since then. I was recently confronted with Java that used too much CPU time (Linkbench client for an OSS DBMS). I started with jPMP and the results were useful. I then found VisualVM. It has a CPU sampler and profiler. The sampler uses wall clock time while the profiler uses CPU time.

How much time does the JIT need to do its magic before I attach a CPU profiler? It depends. This article is great but if some decisions depend on the number of calls to a method then you might want to wait for that method to be called enough times. Thus it depends. If the method is called once per database query then this depends on the query response time.

I added -XX:+PrintCompilation to the Linkbench client command line to observe the JIT in action. It looks like most optimization is done within 3 minutes for a workload doing ~1000 QPS based on the output and that seems to match what the article states about compilation thresholds and tiered compilation.

Be careful when comparing profile output between two runs. If method A accounts for 10% of time with server 1 and 20% of time with server 2 that doesn't always mean method A is a problem for server 2. Assume that QPS is 3X better with server 2, then normalize CPU overheads by QPS (CPU / QPS) and the method A / server 2 combination doesn't look so bad.