Wednesday, March 28, 2018
Missing documentation
In my time with Linux there are some things that would benefit from better documentation.
- The use of per-inode mutexes for buffered IO writes. This prevents concurrent writes to the same file. The problem is easiest to see with a hard disk when the drive write cache is disabled. XFS with O_DIRECT avoids this problem; I am not sure about other filesystems with O_DIRECT (a sketch of an O_DIRECT write is below). TIL the per-inode mutex is now the per-inode rwsem. An old FB note on this is here. A paper on filesystem scalability is here.
- PTHREAD_MUTEX_ADAPTIVE_NP - this enables some busy-waiting when trying to lock a contended mutex. It isn't mentioned in the man pages. Interesting details are here. A sketch of how to request it is below.
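The O_DIRECT point is easier to see with code. This is a minimal sketch, not from the post: it opens a hypothetical file with O_DIRECT and does one aligned write, the pattern that lets XFS avoid the per-inode write lock. The path and the 4096-byte alignment are my assumptions.

/* Minimal sketch (not from the post): one O_DIRECT write. O_DIRECT
 * requires the buffer, offset and length to be aligned; 4096 bytes is
 * assumed here and the path is hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/data/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 'x', 4096);

    /* Aligned offset and length, as O_DIRECT requires. */
    ssize_t n = pwrite(fd, buf, 4096, 0);

    free(buf);
    close(fd);
    return n == 4096 ? 0 : 1;
}

And for the adaptive mutex, a sketch of how it is requested via pthread_mutexattr_settype. PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension (the _NP suffix means non-portable), hence the #ifdef fallback.

/* Minimal sketch: ask for an adaptive mutex, which busy-waits briefly
 * on contention before sleeping. Falls back to the default mutex type
 * where the glibc extension is missing. */
#define _GNU_SOURCE
#include <pthread.h>

static pthread_mutex_t lock;

int main(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
#ifdef PTHREAD_MUTEX_ADAPTIVE_NP
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
#endif
    if (pthread_mutex_init(&lock, &attr) != 0)
        return 1;
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&lock);
    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return 0;
}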
Friday, March 9, 2018
Cache amplification
How much of the database must be in cache so that a point query does at most one read from storage? I call this cache amplification. The answer depends on the index structure (b-tree, LSM, something else). Cache amplification can join read, write and space amplification. Given that RWS was renamed RUM by the excellent RUM Conjecture, we now have CRUM, which is close to crummy. I briefly wrote about this in a previous post.
To do at most 1 storage read for a point query:
- clustered b-tree - everything above the leaf level must be in cache. This is a key/pointer pair per leaf block. The InnoDB primary key index is an example.
- non-clustered b-tree - the entire index must be in cache. This is a key/pointer pair per row, which is much more memory than the cache-amplification for a clustered b-tree. Non-covering secondary indexes with InnoDB are an example, although in that case you must also consider the cache-amplification for the PK index.
- LSM - I assume there is a bloom filter per SST. Bloom filters for all levels but the max level should be in cache. Block indexes for all levels should be in cache. Data blocks don't have to be in cache. I assume there are no false positives from the bloom filter so at most one data block will be read. Note that with an LSM, more space amplification means more cache amplification. So cache-amp is worse (higher) for tiered compaction than for leveled.
- something else - there have been a few interesting variants on the same theme that I call index+log -- BitCask, ForestDB and WiscKey. These are similar to a non-clustered b-tree in that the entire index must be in cache so that the storage read can be spent on reading the data from the log.
I have ignored hash-based solutions for now but eventually they will be important. SILT is a great example of a solution with excellent cache-amplification.
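To make the list above concrete, here is a back-of-envelope sketch. Every input is an assumption I picked for illustration (1TB of data, 16KB blocks, 128-byte rows, 32 bytes of cache per key/pointer pair, bloom filters at 10 bits per key), and the LSM estimate is simplified -- it counts filter memory for every key even though the max level's filters can be skipped, so it slightly overestimates.

/* Back-of-envelope cache-amplification estimates. All inputs are
 * assumptions chosen for illustration, not measurements. */
#include <stdio.h>

int main(void) {
    double db_bytes   = 1e12;  /* 1TB of data */
    double page_bytes = 16384; /* 16KB blocks */
    double row_bytes  = 128;   /* average row size */
    double kp_bytes   = 32;    /* key + pointer kept in cache */

    double rows   = db_bytes / row_bytes;
    double blocks = db_bytes / page_bytes;

    /* Clustered b-tree: a key/pointer pair per leaf block in cache. */
    double clustered = blocks * kp_bytes;

    /* Non-clustered b-tree or index+log: a key/pointer pair per row. */
    double nonclustered = rows * kp_bytes;

    /* LSM: bloom filters (10 bits/key; a slight overestimate because
     * the max level's filters can be skipped) plus block indexes. */
    double lsm = rows * 10.0 / 8.0 + blocks * kp_bytes;

    printf("clustered b-tree: %6.1f GB in cache (%5.2f%% of data)\n",
           clustered / 1e9, 100.0 * clustered / db_bytes);
    printf("non-clustered   : %6.1f GB in cache (%5.2f%% of data)\n",
           nonclustered / 1e9, 100.0 * nonclustered / db_bytes);
    printf("LSM (leveled)   : %6.1f GB in cache (%5.2f%% of data)\n",
           lsm / 1e9, 100.0 * lsm / db_bytes);
    return 0;
}

With these made-up inputs the ordering matches the text: the clustered b-tree needs ~2GB in cache (~0.2% of the data), the LSM ~12GB (~1.2%), and the non-clustered b-tree or index+log ~250GB (25%).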
Updated to correct what should be in cache for the LSM.
Friday, February 16, 2018
Sharded replica sets - MySQL and MongoDB
MongoDB used to have a great story for sharded replica sets. But the storage engine, sharding and replica management code had significant room for improvement. Over the last few releases they made remarkable progress on that and the code is starting to match the story. I continue to be impressed by the rate at which they paid off their tech debt and transactions coming to MongoDB 4.0 is one more example.
It is time for us to do the same in the MySQL community.
I used to be skeptical about the market for sharded replica sets with MySQL. This is popular with the web-scale crowd but that is a small market. Today I am less skeptical and assume the market extends far beyond web-scale. This can be true even if the market for replica sets, without sharding, is much larger.
The market for replica sets is huge. For most users, if you need one instance of MySQL then you also need HA and disaster recovery. So you must manage failover and for a long time (before crash-proof slaves and GTID) that was a lousy experience. It is better today thanks to cloud providers and DIY solutions even if some assembly is required. Upstream is finally putting a solution together with MySQL Group Replication and other pieces.
But sharded replica sets are much harder, and even more so if you want to do cross-shard queries and transactions. While there have been many attempts at sharding solutions for the MySQL community, it is difficult to provide something that works across customers. Fortunately Vitess has shown this can be done and already has many customers in production.
ProxySQL and Orchestrator might also be vital pieces of this stack. I am curious to see how the traditional vendors (MySQL, MariaDB, Percona) respond to this progress.
Updates:
I think binlog server should be part of the solution. But for that to happen we need a GPLv2 binlog server and that has yet to be published.
Wednesday, January 17, 2018
Meltdown vs storage
tl;dr - sysbench fileio throughput for ext4 drops by more than 20% from Linux 4.8 to 4.13
I shared results from sysbench with a cached database to show a small impact from the Meltdown patch in Ubuntu 16.04. Then I repeated the test for an IO-bound configuration using a 200MB InnoDB buffer pool and a database that is ~1.5GB.
The results for read-only tests looked similar to what I saw previously so I won't share them. The results for write-heavy tests were odd as QPS for the kernel without the patch (4.8.0-36) were much better than for the kernel with the patch (4.13.0-26).
The next step was to use sysbench fileio to determine whether storage performance was OK and it was similar for 4.8 and 4.13 with read-only and write-only tests. But throughput with 4.8 was better than 4.13 for a mixed test that does reads and writes.
Configuration
I used a NUC7i5bnh server with a Samsung 960 EVO SSD that uses NVMe. The OS is Ubuntu 16.04 with the HWE kernels -- either 4.13.0-26 that has the Meltdown fix or 4.8.0-36 that does not. For the 4.13 kernel I repeat the test with PTI enabled and disabled. The test uses sysbench with one 2GB file, O_DIRECT and 4 client threads. The server has 2 cores and 4 HW threads. The filesystem is ext4.
I used these command lines for sysbench:
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
--max-requests=0 --num-threads=4 --max-time=60 prepare
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
--max-requests=0 --num-threads=4 --max-time=60 run
And I see this:
cat /sys/block/nvme0n1/queue/write_cache
write back
Results
The next step was to understand the impact of the filesystem mount options. I used ext4 for these tests and don't have much experience with it. The table has the throughput in MB/s from sysbench fileio that does reads and writes. I noticed a few things:
- Throughput is much worse with the nobarrier mount option. I don't know whether this is expected.
- There is a small difference in performance from enabling the Meltdown fix - about 3%
- There is a big difference in performance between the 4.8 and 4.13 kernels, whether or not PTI is enabled for the 4.13 kernel. I get about 25% more throughput with the 4.8 kernel.
4.13 4.13 4.8 mount options
pti=on pti=off no-pti
100 104 137 nobarrier,data=ordered,discard,noauto,dioread_nolock
93 119 128 nobarrier,data=ordered,discard,noauto
226 235 275 data=ordered,discard,noauto
233 239 299 data=ordered,discard,noauto,dioread_nolock
Is it the kernel?
I am curious about what happened between 4.8 and 4.13 to explain the 25% loss of IO throughput.
I have another set of Intel NUC servers that use Ubuntu 16.04 without the HWE kernels -- 4.4.0-109 with the Meltdown fix and 4.4.0-38 without the Meltdown fix. These servers still use XFS. I get ~2% more throughput with the 4.4.0-38 kernel than the 4.4.0-109 kernel (whether or not PTI is enabled).
The loss in sysbench fileio throughput does not reproduce for XFS. The filesystem mount options are "noatime,nodiratime,discard,noauto" and tests were run with /sys/block/nvme0n1/queue/write_cache set to write back and write through. The table below has MB/s of IO throughput.
4.13 4.13 4.8
pti=on pti=off no-pti
225 229 232 write_cache="write back"
125 168 138 write_cache="write through"
More debugging
This is vmstat output from the sysbench test. The values for id (CPU idle) are over 40 for the 4.13 kernel but less than 10 for the 4.8 kernel, which sustains about twice the IO rate (bi, bo). The ratio of cs per IO operation is similar for 4.13 and 4.8.
# vmstat from 4.13 with pti=off
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 15065620 299600 830564 0 0 64768 43940 7071 21629 1 6 42 51 0
0 4 0 15065000 300168 830512 0 0 67728 45972 7312 22816 1 3 44 52 0
2 2 0 15064380 300752 830564 0 0 69856 47516 7584 23657 1 5 43 51 0
0 2 0 15063884 301288 830524 0 0 64688 43924 7003 21745 0 4 43 52 0
# vmstat from 4.8 (no pti)
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 4 0 14998364 384536 818532 0 0 142080 96484 15538 38791 1 6 9 84 0
0 4 0 14997868 385132 818248 0 0 144096 97788 15828 39576 1 7 10 83 0
1 4 0 14997248 385704 818488 0 0 151360 102796 16533 41417 2 9 9 81 0
0 4 0 14997124 385704 818660 0 0 140240 95140 15301 38219 1 7 11 82 0
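As a rough check on the cs-per-IO claim, here is the arithmetic using one sample row from each vmstat output above. The 16KB IO size is my assumption (the sysbench fileio default block size), so treat the ratios as approximate.

/* Rough cs-per-IO estimate from the vmstat samples above. bi and bo
 * are KB/s and cs is context switches/s. Assumes 16KB per IO, the
 * sysbench fileio default block size. */
#include <stdio.h>

static double cs_per_io(double bi_kb, double bo_kb, double cs) {
    double ios = (bi_kb + bo_kb) / 16.0; /* IOs/s at an assumed 16KB each */
    return cs / ios;
}

int main(void) {
    printf("4.13: %.1f cs per IO\n", cs_per_io(64768, 43940, 21629));
    printf("4.8 : %.1f cs per IO\n", cs_per_io(142080, 96484, 38791));
    return 0;
}

This prints ~3.2 for 4.13 and ~2.6 for 4.8, which is similar given the precision of vmstat sampling.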
Output from Linux perf for 4.8 and for 4.13.
Friday, January 12, 2018
Meltdown vs MySQL part 2: in-memory sysbench and a core i5 NUC
This is my second performance report for the Meltdown patch using in-memory sysbench and a small server. In this test I used a core i5 NUC with the 4.13 and 4.8 kernels. In the previous test I used a core i3 NUC with the 4.4 kernel.
tl;dr
- results for 4.13 are mixed -- sometimes there is more QPS with the fix enabled, sometimes there is more with the fix disabled. The typical difference is small, about 2%.
- QPS for 4.8, which doesn't have the Meltdown fix, is usually better than for 4.13. The largest difference is ~10% and the differences tend to be larger at 1 client than at 2 or 8.
Configuration
My usage of sysbench is described here. The servers are described here. For this test I used the core i5 NUC (NUC7i5bnh) with Ubuntu 16.04. I have 3 such servers and ran tests with the fix enabled (kernel 4.13.0-26), the fix disabled via pti=off (kernel 4.13.0-26) and the old kernel (4.8.0-36) that doesn't have the fix. From cat /proc/cpuinfo I see pcid. This server uses the HWE kernels to make wireless work. I repeated tests after learning that 4.13 doesn't support the nobarrier mount option for XFS. My workaround was to switch to ext4 and the results here are from ext4.
The servers have 2 cores and 4 HW threads. I normally use them for low-concurrency benchmarks with 1 or 2 concurrent database clients. For this test I used 1, 2 and 8 concurrent clients to determine whether more concurrency and more mutex contention would cause more of a performance loss.
The sysbench test was configured to use 1 table with 4M rows and InnoDB. The InnoDB buffer pool was large enough to cache the table. The sysbench client runs on the same host as mysqld.
I just noticed that all servers had the doublewrite buffer and binlog disabled. This was leftover from debugging the XFS nobarrier change.
Results
My usage of sysbench is described here, which explains the tests that I list below. Each test has QPS for 1, 2 and 8 concurrent clients. Results are provided for:
- pti enabled - kernel 4.13.0-26 with the Meltdown fix enabled
- pti disabled - kernel 4.13.0-26 with the Meltdown fix disabled via pti=off
- old kernel, no pti - kernel 4.8.0-36 which doesn't have the Meltdown fix
After each of the QPS sections, there are two lines for QPS ratios. The first line compares the QPS for the kernel with the Meltdown fix enabled vs disabled. The second line compares the QPS for the kernel with the Meltdown fix vs the old kernel. A value less than one means that MySQL gets less QPS with the Meltdown fix.
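The ratio lines are simple division. Here is a sketch using the update-inlist numbers below as input; note the post appears to truncate rather than round the ratios, so a last digit can differ.

/* How the QPS ratio lines under each table are computed, using the
 * update-inlist results below as input. */
#include <stdio.h>

int main(void) {
    double pti_on[]  = {5603, 7546, 8212}; /* 1, 2, 8 clients */
    double pti_off[] = {5618, 7483, 8076};
    double old_krn[] = {5847, 7613, 8149};

    for (int i = 0; i < 3; i++) /* Meltdown fix enabled vs disabled */
        printf("%.3f ", pti_on[i] / pti_off[i]);
    printf(" qps ratio: pti on/off\n");

    for (int i = 0; i < 3; i++) /* Meltdown fix vs the old kernel */
        printf("%.3f ", pti_on[i] / old_krn[i]);
    printf(" qps ratio: pti on / old kernel\n");
    return 0;
}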
update-inlist
1 2 8 concurrency
5603 7546 8212 pti enabled
5618 7483 8076 pti disabled
5847 7613 8149 old kernel, no pti
----- ----- -----
0.997 1.008 1.016 qps ratio: pti on/off
0.958 0.991 1.007 qps ratio: pti on / old kernel
update-one
1 2 8 concurrency
11764 18880 16699 pti enabled
12074 19475 17132 pti disabled
12931 19573 16559 old kernel, no pti
----- ----- -----
0.974 0.969 0.974 qps ratio: pti on/off
0.909 0.964 1.008 qps ratio: pti on / old kernel
update-index
1 2 8 concurrency
7202 12688 16738 pti enabled
7197 12581 17466 pti disabled
7443 12926 17720 old kernel, no pti
----- ----- -----
1.000 1.000 0.958 qps ratio: pti on/off
0.967 0.981 0.944 qps ratio: pti on / old kernel
update-nonindex
1 2 8 concurrency
11103 18062 22964 pti enabled
11414 18208 23076 pti disabled
12395 18529 22168 old kernel, no pti
----- ----- -----
0.972 0.991 0.995 qps ratio: pti on/off
0.895 0.974 1.035 qps ratio: pti on / old kernel
delete
1 2 8 concurrency
19197 30830 43605 pti enabled
19720 31437 44935 pti disabled
21584 32109 43660 old kernel, no pti
----- ----- -----
0.973 0.980 0.970 qps ratio: pti on/off
0.889 0.960 0.998 qps ratio: pti on / old kernel
read-write range=100
1 2 8 concurrency
11956 20047 29336 pti enabled
12475 20021 29726 pti disabled
13098 19627 30030 old kernel, no pti
----- ----- -----
0.958 1.001 0.986 qps ratio: pti on/off
0.912 1.021 0.976 qps ratio: pti on / old kernel
read-write range=10000
1 2 8 concurrency
488 815 1080 pti enabled
480 768 1073 pti disabled
504 848 1083 old kernel, no pti
----- ----- -----
1.016 1.061 1.006 qps ratio: pti on/off
0.968 0.961 0.997 qps ratio: pti on / old kernel
read-only range=100
1 2 8 concurrency
12089 21529 33487 pti enabled
12170 21595 33604 pti disabled
11948 22479 33876 old kernel, no pti
----- ----- -----
0.993 0.996 0.996 qps ratio: pti on/off
1.011 0.957 0.988 qps ratio: pti on / old kernel
read-only.pre range=10000
1 2 8 concurrency
392 709 876 pti enabled
397 707 872 pti disabled
403 726 877 old kernel, no pti
----- ----- -----
0.987 1.002 1.004 qps ratio: pti on/off
0.972 0.976 0.998 qps ratio: pti on / old kernel
read-only range=10000
1 2 8 concurrency
394 701 874 pti enabled
389 698 871 pti disabled
402 725 877 old kernel, no pti
----- ----- -----
1.012 1.004 1.003 qps ratio: pti on/off
0.980 0.966 0.996 qps ratio: pti on / old kernel
point-query.pre
1 2 8 concurrency
18490 31914 56337 pti enabled
19107 32201 58331 pti disabled
18095 32978 55590 old kernel, no pti
----- ----- -----
0.967 0.991 0.965 qps ratio: pti on/off
1.021 0.967 1.013 qps ratio: pti on / old kernel
point-query
1 2 8 concurrency
18212 31855 56116 pti enabled
18913 32123 58320 pti disabled
17907 32941 55430 old kernel, no pti
----- ----- -----
0.962 0.991 0.962 qps ratio: pti on/off
1.017 0.967 1.012 qps ratio: pti on / old kernel
random-points.pre
1 2 8 concurrency
3043 5940 8131 pti enabled
2944 5681 7984 pti disabled
3030 6015 8098 old kernel, no pti
----- ----- -----
1.033 1.045 1.018 qps ratio: pti on/off
1.004 0.987 1.004 qps ratio: pti on / old kernel
random-points
1 2 8 concurrency
3053 5930 8128 pti enabled
2949 5756 7981 pti disabled
3058 6011 8116 old kernel, no pti
----- ----- -----
1.035 1.030 1.018 qps ratio: pti on/off
0.998 0.986 1.001 qps ratio: pti on / old kernel
hot-points
1 2 8 concurrency
3931 7522 9500 pti enabled
3894 7535 9214 pti disabled
3914 7692 9448 old kernel, no pti
----- ----- -----
1.009 0.998 1.031 qps ratio: pti on/off
1.004 0.977 1.005 qps ratio: pti on / old kernel
insert
1 2 8 concurrency
12469 21418 25158 pti enabled
12561 21327 25094 pti disabled
13045 21768 21258 old kernel, no pti
----- ----- -----
0.992 1.004 1.002 qps ratio: pti on/off
0.955 0.983 1.183 qps ratio: pti on / old kernel
XFS, nobarrier and the 4.13 Linux kernel
tl;dr
- nobarrier isn't supported as a mount option for XFS in kernel 4.13.0-26 with Ubuntu 16.04. I assume this isn't limited to Ubuntu. Read this for more detail on the change.
- write throughput is much worse on my SSD without nobarrier
- there is no error on the command line when mounting a device that uses the nobarrier option
- there is an error message in dmesg output for this
There might be two workarounds:
- switch from XFS to ext4
- echo "write through" > /sys/block/$device/queue/write_cache
The Story
I have a NUC cluster at home for performance tests with 3 NUC5i3ryh and 3 NUC7i5bnh. I recently replaced the SSD devices in all of them because previous testing wore them out. I use Ubuntu 16.04 LTS and recently upgraded the kernel on some of them to get the fix for Meltdown.
The NUC7i5bnh server has a Samsung 960 EVO SSD that uses NVMe. I use the HWE kernel to make wireless work. The old kernel without the Meltdown fix is 4.8.0-36 and the kernel with the Meltdown fix is 4.13.0-26. Note that with the old kernel I used XFS with the nobarrier option. With the new kernel I assumed I was still getting nobarrier, but I was not. I have since switched from XFS to ext4.
The NUC5i3ryh server has a Samsung 850 EVO SSD that uses SATA. The old kernel without the Meltdown fix is 4.4.0-38 and the kernel with the Meltdown fix is 4.4.0-109. I continue to use XFS on these.
Results from sysbench for the NUC5i3ryh show not much regression from the Meltdown fix. Results for the NUC7i5bnh show a lot of regression for the write-heavy tests and not much for the read-heavy tests.
My day
- I started to debug the odd 7i5bnh results and noticed that write IO throughput was much lower for servers with the Meltdown fix using 4.13.0-26.
- Then I used sysbench fileio to run IO tests without MySQL and noticed that read IO was fine, but write IO throughput was much worse with the 4.13.0-26 kernel.
- Then I consulted my local experts, Domas Mituzas and Jens Axboe.
- Then I noticed the error message in dmesg output
Meltdown vs MySQL part 1: in-memory sysbench and a core i3 NUC
This is my first performance report for the Meltdown patch using in-memory sysbench and a small server.
tl;dr
- the worst case overhead was ~5.5%
- a typical overhead was ~2%
- QPS was similar between the kernel with the Meltdown fix disabled and the old kernel
- the overhead with too much concurrency (8 clients) wasn't worse than the overhead without too much concurrency (1 or 2 clients)
Configuration
My usage of sysbench is described here. The servers are described here. For this test I used the core i3 NUC (NUC5i3ryh) with Ubuntu 16.04. I have 3 such servers and ran tests with the fix enabled (kernel 4.4.0-109), the fix disabled via pti=off (kernel 4.4.0-109) and the old kernel (4.4.0-38) that doesn't have the fix. From cat /proc/cpuinfo I see pcid.
The servers have 2 cores and 4 HW threads. I normally use them for low-concurrency benchmarks with 1 or 2 concurrent database clients. For this test I used 1, 2 and 8 concurrent clients to determine whether more concurrency and more mutex contention would cause more of a performance loss.
The sysbench test was configured to use 1 table with 4M rows and InnoDB. The InnoDB buffer pool was large enough to cache the table. The sysbench client runs on the same host as mysqld.
Results
My usage of sysbench is described here, which explains the tests that I list below. Each test has QPS for 1, 2 and 8 concurrent clients. Results are provided for:
- pti enabled - kernel 4.4.0-109 with the Meltdown fix enabled
- pti disabled - kernel 4.4.0-109 with the Meltdown fix disabled via pti=off
- old kernel, no pti - kernel 4.4.0-38 which doesn't have the Meltdown fix
After each of the QPS sections, there are two lines for QPS ratios. The first line compares the QPS for the kernel with the Meltdown fix enabled vs disabled. The second line compares the QPS for the kernel with the Meltdown fix vs the old kernel. A value less than one means that MySQL gets less QPS with the Meltdown fix.
update-inlist
1 2 8 concurrency
2039 2238 2388 pti enabled
2049 2449 2369 pti disabled
2059 2199 2397 old kernel, no pti
----- ----- -----
0.995 0.913 1.008 qps ratio: pti on/off
0.990 1.017 0.996 qps ratio: pti on / old kernel
update-one
1 2 8 concurrency
8086 11407 9498 pti enabled
8234 11683 9748 pti disabled
8215 11708 9755 old kernel, no pti
----- ----- -----
0.982 0.976 0.974 qps ratio: pti on/off
0.984 0.974 0.973 qps ratio: pti on / old kernel
update-index
1 2 8 concurrency
2944 4528 7330 pti enabled
3022 4664 7504 pti disabled
3020 4784 7555 old kernel, no pti
----- ----- -----
0.974 0.970 0.976 qps ratio: pti on/off
0.974 0.946 0.970 qps ratio: pti on / old kernel
update-nonindex
1 2 8 concurrency
6310 8688 12600 pti enabled
6103 8482 11900 pti disabled
6374 8723 12142 old kernel, no pti
----- ----- -----
1.033 1.024 1.058 qps ratio: pti on/off
0.989 0.995 1.037 qps ratio: pti on / old kernel
delete
1 2 8 concurrency
12348 17087 23670 pti enabled
12568 17342 24448 pti disabled
12665 17749 24499 old kernel, no pti
----- ----- -----
0.982 0.985 0.968 qps ratio: pti on/off
0.974 0.962 0.966 qps ratio: pti on / old kernel
read-write range=100
1 2 8 concurrency
9999 14973 21618 pti enabled
10177 15239 22088 pti disabled
10209 15249 22153 old kernel, no pti
----- ----- -----
0.982 0.982 0.978 qps ratio: pti on/off
0.979 0.981 0.975 qps ratio: pti on / old kernel
read-write range=10000
1 2 8 concurrency
430 762 865 pti enabled
438 777 881 pti disabled
439 777 882 old kernel, no pti
----- ----- -----
0.981 0.980 0.981 qps ratio: pti on/off
0.979 0.980 0.980 qps ratio: pti on / old kernel
read-only range=100
1 2 8 concurrency
10472 19016 26631 pti enabled
10588 20124 27587 pti disabled
11290 20153 27796 old kernel, no pti
----- ----- -----
0.989 0.944 0.965 qps ratio: pti on/off
0.927 0.943 0.958 qps ratio: pti on / old kernel
read-only.pre range=10000
1 2 8 concurrency
346 622 704 pti enabled
359 640 714 pti disabled
356 631 715 old kernel, no pti
----- ----- -----
0.963 0.971 0.985 qps ratio: pti on/off
0.971 0.985 0.984 qps ratio: pti on / old kernel
read-only range=10000
1 2 8 concurrency
347 621 703 pti enabled
354 633 716 pti disabled
354 638 716 old kernel, no pti
----- ----- -----
0.980 0.981 0.988 qps ratio: pti on/off
0.980 0.973 0.981 qps ratio: pti on / old kernel
point-query.pre
1 2 8 concurrency
16104 29540 46863 pti enabled
16716 30052 49404 pti disabled
16605 30392 49872 old kernel, no pti
----- ----- -----
0.963 0.982 0.948 qps ratio: pti on/off
0.969 0.971 0.939 qps ratio: pti on / old kernel
point-query
1 2 8 concurrency
16240 29359 47141 pti enabled
16640 29785 49015 pti disabled
16369 30226 49530 old kernel, no pti
----- ----- -----
0.975 0.985 0.961 qps ratio: pti on/off
0.992 0.971 0.951 qps ratio: pti on / old kernel
random-points.pre
1 2 8 concurrency
2756 5202 6211 pti enabled
2764 5216 6245 pti disabled
2679 5130 6188 old kernel, no pti
----- ----- -----
0.997 0.997 0.994 qps ratio: pti on/off
1.028 1.014 1.003 qps ratio: pti on / old kernel
random-points
1 2 8 concurrency
2763 5177 6191 pti enabled
2768 5188 6238 pti disabled
2701 5076 6182 old kernel, no pti
----- ----- -----
0.998 0.997 0.992 qps ratio: pti on/off
1.022 1.019 1.001 qps ratio: pti on / old kernel
hot-points
1 2 8 concurrency
3414 6533 7285 pti enabled
3466 6623 7287 pti disabled
3288 6312 6998 old kernel, no pti
----- ----- -----
0.984 0.986 0.999 qps ratio: pti on/off
1.038 1.035 1.041 qps ratio: pti on / old kernel
insert
1 2 8 concurrency
7612 10051 11943 pti enabled
7713 10150 12322 pti disabled
7834 10243 12514 old kernel, no pti
----- ----- -----
0.986 0.990 0.969 qps ratio: pti on/off
0.971 0.981 0.954 qps ratio: pti on / old kernel