Wednesday, July 1, 2020

Something changed for the better in create index between MySQL 8.0.18 and 8.0.20

I include MongoDB, Postgres, MySQL/InnoDB and MySQL/RocksDB (MyRocks) in the things I test via Linkbench and the insert benchmark. Hopefully I add another test later this year. I have been using MySQL 8.0.18 this year and recently started to use 8.0.20.

The new insert benchmark workflow is create tables with PK indexes, load tables, create 3 secondary indexes on each table, continue the load and then run several rounds of reads+inserts. For this test there were 8 clients with a table per client and 10M rows/table after the initial insert.

I think something changed for the better from MySQL 8.0.18 to 8.0.20 and I didn't spot the change in the release notes. I see:
  • Create index in 8.0.20 is 10% to 20% faster (nice, but not why I wrote this)
  • Create index in 8.0.18 uses ~7X more read IO with innodb_flush_method = O_DIRECT (14G vs 2G) and ~2X more read IO with it set to O_DIRECT_NO_FSYNC (4G vs 2G). The table was cached at the start of create index and the indexed table should be less than 2G. Tests used innodb_sort_buffer_size = 64M.
    • Something changed with the sort used by create index. Maybe it does fewer passes. I am not sure there is a way to monitor that.
    • I don't understand the impact from O_DIRECT vs O_DIRECT_NO_FSYNC in 8.0.18, while it had no impact in 8.0.20.
  • When the load was restarted after create index there was initially a lot of read IO with 8.0.20. Either the indexed table or the newly created indexes were not in cache and my bet is on the new indexes. This doesn't happen with 8.0.18. 
  • For 8.0.20, create index is faster with innodb_use_native_aio set to ON, but I am ignoring that topic for now.
Tests are run for 3 configurations (c10b40, c10b40a, c10b40b). The InnoDB options for those configs are here. Perhaps because of the way I compiled MySQL from source, innodb_use_native_aio is off by default for 8.0.18 but on for 8.0.20 -- so 8.0.18 and 8.0.20 differ at runtime only for that option with the c10b40 config. For the c10b40a and c10b40b configs, the options are the same at run time because innodb_use_native_aio is set. The differences between the configs are:
  • c10b40 - use O_DIRECT
  • c10b40a - start with c10b40, add innodb_use_native_aio=FALSE
  • c10b40b - start with c10b40b, change to O_DIRECT_NO_FSYNC
Performance metrics are here. The slightly out of date legend for the metrics is here. Similar to my favorite InnoDB performance expert, the results are compressed so I can compare many configurations on one screen. The metrics don't include it but the time to create the index for 8.0.18 is (275, 281, 233) seconds and for 8.0.20 is (157, 228, 209) seconds for the (c10b40, c10b40a, c10b40b) configs. The ips metric in the tables for create index is indexed rows per second computed as (number of rows in table / time to create all indexes).

Tuesday, June 30, 2020

Java GC and Linkbench

I filed a bug for the Sun JDK a long time ago, sometime around 1998. At the time growing the heap didn't work and the workaround was to set -Xms and -Xmx to the same value. That worked as long as you knew a reasonable value -- too small and your process dies, too large and you waste memory.

Now I get to revisit Java GC. I am running Linkbench and the bin/linkbench script that starts the bench client uses -Xmx=1000 to limit the heap to 1000MB of RAM. Someone else wrote that script and I didn't know it was done until a test failed when the heap was too small.

I use Linkbench on large and small servers and need to be careful about memory usage on the small servers so I ran a few tests to understand the impact on performance and memory usage for Linkbench run with -Xmx=1000, -Xmx=2000 and -Xmx not set. I used a test with a ~10G database (maxid1=10M) and two levels of concurrency -- 16 clients, 64 clients.

For the tests I report metrics from the Linkbench client process: max value for VSZ, max value for RSS and number of CPU seconds. This is collected via "ps aux" run at 30-second intervals and now I wish I reduced that to 1-second intervals but I don't want to repeat the tests. The tests used MySQL 8.0.18 and JDBC via Connector/J 8.0.20.

Now I get to revisit Java GC on Amazon Linux 2 and Java is:
$ java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
Disclaimers -- I am far from a Java expert and the JDK I used might not be the latest and greatest. Feedback is welcome.


I noticed two things:
  1. VSZ is 5X to 10X larger than RSS. I get scared when I see a large VSZ, even when RSS is small. But some good things (jemalloc) do that so eventually I tolerate it.
  2. Unlimited heap size doesn't save much CPU time. I expected it to help more by reducing the GC overhead. It helped a bit with 64 clients, but hurt with 16 clients.
Based on these results I will update my test scripts to use -Xmx=1000 on small servers and -Xmx=2000 on large servers. Were I to use more than 64 concurrent clients then I might use a value larger than 2000.

The data

  • hMax - value for -Xmx
  • vsz - Linkbench client max VSZ in GB
  • rss.l, rss.r - Linkbench client max RSS in GB during the load (rss.l) and run (rss.r)
  • cpuSec - number of CPU seconds used by the Linkbench client

---16 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     5.8    0.5     0.9     2795
2000     6.9    0.8     1.3     2856
none    20.8    0.8     4.0     2853

--- 64 clients
hMax    vsz     rss.l   rss.r   cpuSec
1000     9.0    0.9     1.3     8104
2000    10.1    1.2     1.8     7800
none    24.0    5.2     5.8     7773

Monday, May 18, 2020

Avoiding reads in write-optimized index structures

This is a summary of features that avoid storage read IO in OSS write-optimized index structures. My scope is limited to MySQL and Tarantool -- I know there are similar features in other DBMS and am happy for comments on that:
  • RocksDB merge operator logically does read-modify-write but physically does a Put into the memtable and the read is deferred until user queries or compaction. I hope that MyRocks eventually takes advantage of the merge operator.
  • Read free replication in TokuDB and MyRocks.
  • Fast Updates in TokuDB do something similar to the merge operator for update and insert
  • Non-unique secondary index maintenance is read free in MyRocks
  • Deferred secondary index update in Tarantool is a new approach to avoiding reads for write-heavy workloads. It is worth reading. I wish it had been published by VLDB.
  • MyRocks also avoids some storage reads when validating unique constraints (for PK and unique secondary indexes) on inserts and some updates because the Get() that it does can use a bloom filter. This benefit is larger when there is a bloom filter on the max level of the LSM tree, but that has other costs is is frequently not done.
  • MyRocks uses the SingleDelete feature in RocksDB to allow tombstones to be removed earlier which decreases the CPU overhead for reads after a batch of writes.
  • A b-tree can also reduce index maintenance related storage read IO. See the InnoDB change buffer.

An incomplete list of relevant research:
  • DiffIndex - this is more about distributed systems, but still interesting
  • Techniques for reducing index maintenance overhead in an LSM by Luo & Carey (VLDB 2019)
  • Deferred Lightweight Indexing (DELI)

Thursday, April 9, 2020

Reviving mstat

A long time ago I wrote mstat to collect iostat, vmstat and MySQL (show global status) performance counters. It samples everything at the same interval (every N seconds), computes rates for all numeric values and has some support for expressions so I can define a new counter as a function of other counters.

Eventually I stopped using it and then there were problems. Recently I fixed many of them. The tool now works with Python3 and supports old and new iostat output format. This week I started to add support for Postgres and MongoDB. It gets data from pg_stat_bgwriter for Postgres and from db.serverStatus() for MongoDB. So the tool is useful to me again but it won't be useful to others until I add docs.

The tool prints one line in CSV format per time interval. The line is likely to be long because there can be hundreds of counters in the line. The name and offset for each counter is printed at startup. I extract data using bash and awk as part of my benchmark workflow.

Because of this dependency to list the offset per counter at startup the tool is not able to handle dynamic data yet. An example of dynamic data are per-DBMS counters such as the rows in pg_stat_database. That view has one row per database and the database names (datname) are not fixed. Similar problems exist for the per-table and per-index monitoring tables and views provided by MySQL, other stats views provided by Postgres and other monitoring data provided by MongoDB. So I need more time to figure out how to support them.

I have more work to do on mstat, but I am happy that I can use it again:
  • I need to figure out whether there are global counters in Postgres for things like queries and rows read
  • For MongoDB mstat has a hack to replace space, dot and parentheses with _ in key names
  • More work remains to make sure that counters are ignored unless their type is numeric or a date

Monday, April 6, 2020

Creating performance reports

I am trying to automate some of the tasks that I must do to produce performance reports. By trying I mean that I am just getting started on this and will be making decisions I might regret later. This is also a request for advice, feedback and corrections.

Some requirements, not all of which will be satisfied:
  • interactive - I prefer that the format can be hosted in a way that allows for discussions. Google Docs does a great job at this by supporting access control and threaded discussions on comments.
  • graphics - support charts and graphs. Back in the day I was a happy user of Google Image Charts that let me define a chart in a URL. Alas, the service has been turned off and the new thing from Google requires Javascript.
  • self-contained - I want to share the report as one thing. With Google Docs that thing can be the URL for the doc, even when there are additional things, like spreadsheets, linked to it. With HTML the thing is the HTML document.
  • script-able - I want to create as much as possible of the document via scripting. I frequently repeat tests and don't want to waste time reformatting tables.
  • private - some reports are not yet public and I prefer a solution that isn't always public. However I also want a solution that works for company-private reports and also for reports that are shared with the public. 
  • other - supports wide lines (via a slider, wrapping wide lines is lousy for a reader), what else?

  • Google docs
    • interactive - yes
    • graphics - yes
    • self-contained -yes
    • scriptable - no
    • privacy - yes
    • other - no support for wide lines, they are always wrapped
  • Github Pages
    • interactive - no, always public
    • graphics - maybe
    • self-contained - yes
    • scriptable - yes
    • privacy - no, always public
    • other - supports wide lines
  • Blogger
    • interactive - somewhat via comments but not as nice as Google docs, but always public
    • graphics - yes by embedding images
    • self-contained - yes
    • scriptable - maybe if I can paste an HTML doc as the content
    • privacy - no, always public
    • other - weak support for wide lines
  • Plain HTML
    • interactive - maybe*
    • graphics - maybe*
    • self-contained - maybe*
    • scriptable - yes
    • privacy - yes*
    • other - supports wide lines
From the above I am not aware of anything close to a perfect solution. I am likely to choose plain HTML. Privacy can be managed by sharing via company email and or hosting the HTML files on company-private servers.

Charts and graphs could be provided by a web-service that lets a chart be defined in a URL. Quickchart is an example of that. Of course, use of such a service means the data is no longer private. But there is a difference between posting a web-page for the world to read and sharing that data with the company that provides the URL chart service.

If I use plain HTML then I must find a way to share such pages with the public. That last sentence is amusing given that the web runs on HTML. But I have never done this. I have been a long-time, and mostly happy, user of Blogger. I have begun to use Github Pages and am a big fan of markup.

As a bonus, I prefer a solution that doesn't depend on the whims of a vendor. For example, if I commit to service X and the vendor for service X turns off that service, I don't want to spend weeks reformatting reports while I migrate to a new service.

It would be great if I could inline some of the charts in the HTML document. I need to figure out whether there are good tools for simple ascii bar charts and graphs that can be pasted into an HTML document.

I am a fan of gnuplot and need to learn more about how I can share gnuplot output in PNG format as part of an HTML document without posting all of that on the web.

Monday, March 30, 2020

Compiling from source for Amazon Linux 2

Different is worse

I used to know struggle with  make & automake but I knew enough to get things done. Now I have fun with make, cmake, maven and gradle. I am sure this proliferation solves problems for people who create the build scripts that I use. But it isn't a good experience for the intermittent user -- different is worse because my skill level per tool is lower.

What is Amazon Linux 2? It has a landing page. I call it AL2 below. It claims to be like CentOS, RHEL, Fedora and it uses yum. I know how to use Google to get answers for those distros when trying to figure out how to build something from source.

But it isn't clear to me that packages that exist for CentOS also exist for AL2. One example of the challenge is that I ended up installing jemalloc from source because I couldn't find the package without advice from an expert.

I hope the output from cat /etc/os-release on AL2 is improved. Below are examples for AL2 and Ubuntu 18.04. Ubuntu specific that this is Ubuntu 18.04. From AL2 I learn it is like CentOS. But which CentOS?

NAME="Amazon Linux" 
VERSION="2" ID="amzn"
ID_LIKE="centos rhel fedora"
PRETTY_NAME="Amazon Linux 2"

Lets compare that with Ubuntu:

VERSION="18.04.4 LTS (Bionic Beaver)"
PRETTY_NAME="Ubuntu 18.04.4 LTS"

Dependencies for AL2

This explains how to satisfy the dependencies before running cmake. I struggled figuring out the names for packages that had to be installed. Some of this is because I don't have much experience with CentOS, some of this is on AL2. Eventually I found this page which might have helped.

Add libraries for the insert benchmark, some of this is needed for the build and all is needed when I run benchmarks so I will just install before I try to build:

sudo yum install -y python3
sudo python3 -m pip install pymongo
sudo yum install -y mysql-devel python3-devel
sudo pip3 install mysqlclient
sudo pip3 install psycopg2-binary
sudo yum install -y openssl11-libs
sudo yum install -y libzstd-1.3.3
sudo yum install -y ncurses-compat-libs
# The host arrives comes with /etc/my.cnf from MariaDB

sudo rm -f /etc/my.cnf /etc/mysql/my.cnf

Install jemalloc from source because I couldn't figure out how to install the package:

cd /media/ephemeral1
rm -rf jemalloc-5.2.1*
bunzip2 jemalloc-5.2.1.tar.bz2
tar xvf jemalloc-5.2.1.tar
cd jemalloc-5.2.1
./configure --prefix=/usr > 2>
make -j4 > o.m 2> e.m
sudo make install

Install things needed to compile MySQL8 and MyRocks from FB MySQL 5.6. I didn't confirm that all of these packages exist. But I was able to build after doing this:
sudo yum install -y cmake3
sudo yum install -y
sudo yum install -y numactl-devel
sudo yum install -y libedit-devel
sudo yum install -y cmake gcc-c++ bzip2-devel libaio-devel bison 
sudo yum install -y zlib-devel snappy-devel
sudo yum install -y gflags-devel readline-devel ncurses-devel openssl-devel
sudo yum install -y lz4-devel gdb git libzstd-devel

Building MySQL8

Download, compile and install MySQL 8. It is installed at /media/ephemeral1/my8018. The upstream docs are useful.

cd /media/ephemeral1
rm -rf mysql-8.0.18*
tar xzvf mysql-boost-8.0.18.tar.gz
cd mysql-8.0.18
mkdir build
cd build
bash /media/ephemeral1/cmk80 /media/ephemeral1/my8018 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral/cmk80.

cmake3 .. \
      -DBUILD_CONFIG=mysql_release \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DWITH_SSL="system" \
      -DWITH_ZLIB="system" \
      -DMYSQL_DATADIR="${prefix}/data" \
      -DMYSQL_UNIX_ADDR="${prefix}/var/mysql.sock" \
      -DWITH_BOOST=$PWD/../boost \

Building MyRocks

Download, compile and install MyRocks from FB MySQL 5.6. It is installed at /media/ephemeral1/fbmy56. The upstream docs are useful.

First install boost 1.65 from source:

cd /media/ephemeral1
bunzip2 boost_1_65_1.tar.bz2
tar xvf boost_1_65_1.tar
cd boost_1_65_1
./ --prefix=/usr

./b2 > o.b2 2> e.b2
sudo ./b2 install

Then build MySQL 5.6

cd /media/ephemeral1
git clone fbmy56-src
cd fbmy56-src
# This is the old version that I used for a while
# git checkout 20aaaf8d

git submodule init
git submodule update
mkdir build
cd build
bash /media/ephemeral1/cmkfbmy56 /media/ephemeral1/fbmy56 > 2>
make -j8 V=1 VERBOSE=1 > 2>
make install

The contents of /media/ephemeral1/cmkfbmy56

if [ -z $1 ]; then
echo Requires prefix as arg1
exit -1


# Look at /proc/cpuinfo to determine whether these (sse, avx) are supported


# -march=native gets avx and sse if they are supported
CXXF="-DNDEBUG -march=native"
CXXF+=" -faligned-new"

# extra flags to avoid warnings with gcc on ubuntu 18
CF="-Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

CXXF+=" -Wno-implicit-fallthrough -Wno-int-in-bool-context \
  -Wno-shift-negative-value -Wno-misleading-indentation \
  -Wno-format-overflow -Wno-nonnull -Wno-unused-function"

cmake3 .. \
  -DWITH_SSL=system \
  -DWITH_ZLIB=bundled \
-DWITH_LZ4=system \

Not done yet

There is a crash in Boost code that was added in the FB MySQL branch. Boost 1.53 is installed by yum install boost-devel. I will rebuild MyRocks with Boost 1.65 as that is what Ubuntu 18.04 uses, and MyRocks works great for me there.


stack_log: _ZN5boost13property_tree11json_parser12json_grammarINS0_11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_St4lessIS9_EEEE10definitionINS_6spirit7clas

Friday, March 27, 2020

kernel lockdown vs blktrace

I am trying to use blktrace to determine which files are the source of disk reads for a database that uses buffered IO. The server runs Ubuntu 18.04 on bare metal and the boot uses UEFI not legacy.

blktrace doesn't work, time to debug. Things that don't fix it include upgrading from 4.15 to 5.3 HWE kernel and disabling apparmor. Next up is disabling kernel lockdown via mokutil --disable-validation. Alas, blktrace still fails at startup.

After running mokutil and then rebooting there are still a few messages in dmesg output about lockdown so I wonder whether it was fully disabled.
Lockdown: Hibernation is restricted; see man kernel_lockdown.7
Lockdown: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
OK, lets read the man page. Great, it doesn't exist -- not for Ubuntu nor for other distros. There is a draft but I am starting to get the impression that lockdown wasn't ready for prime time. And Linus had a strong opinion about it in 2018.

Next up is a strong opinion from Brendan Gregg.
Many distros are enabling lockdown, breaking BPF. This is the worst OS change I've ever seen.
OK, maybe my problem is lockdown and mokutil wasn't sufficient. Time to try:
echo 1 > /proc/sys/kernel/sysrq; echo x > /proc/sysrq-trigger 
And now blktrace works. Well, until I reboot. I already have a script to run after reboot to reduce security so that PMP can run. That script just got larger:
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger