Wednesday, January 17, 2018

Meltdown vs storage

tl;dr - sysbench fileio throughput for ext4 drops by more than 20% from Linux 4.8 to 4.13

I shared results from sysbench with a cached database to show a small impact from the Meltdown patch in Ubuntu 16.04. Then I repeated the test for an IO-bound configuration using a 200mb buffer pool for InnoDB and database that is ~1.5gb.

The results for read-only tests looked similar to what I saw previously so I won't share them. The results for write-heavy tests were odd as QPS for the kernel without the patch (4.8.0-36) were much better than for the kernel with the patch (4.13.0-26).

The next step was to use sysbench fileio to determine whether storage performance was OK and it was similar for 4.8 and 4.13 with read-only and write-only tests. But throughput with 4.8 was better than 4.13 for a mixed test that does reads and writes.

Configuration


I used a NUC7i5bnh server with a Samsung 960 EVO SSD that uses NVMe. The OS is Ubuntu 16.04 with the HWE kernels -- either 4.13.0-26 that has the Meltdown fix or 4.8.0-36 that does not. For the 4.13 kernel I repeat the test with PTI enabled and disabled. The test uses sysbench with one 2gb file, O_DIRECT and 4 client threads. The server has 2 cores and 4 HW threads. The filesystem is ext4.

I used these command lines for sysbench:
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 prepare
sysbench fileio --file-num=1 --file-test-mode=rndrw --file-extra-flags=direct \
    --max-requests=0 --num-threads=4 --max-time=60 run

And I see this:
cat /sys/block/nvme0n1/queue/write_cache
write back

Results

The next step was to understand the impact of the filesystem mount options. I used ext4 for these tests and don't have much experience with it. The table has the throughput in MB/s from sysbench fileio that does reads and writes. I noticed a few things:
  1. Throughput is much worse with the nobarrier mount option. I don't know whether this is expected.
  2. There is a small difference in performance from enabling the Meltdown fix - about 3%
  3. There is a big difference in performance between the 4.8 and 4.13 kernels, whether or not PTI is enabled for the 4.13 kernel. I get about 25% more throughput with the 4.8 kernel.

4.13    4.13    4.8    mount options
pti=on  pti=off no-pti
100     104     137     nobarrier,data=ordered,discard,noauto,dioread_nolock
 93     119     128     nobarrier,data=ordered,discard,noauto
226     235     275     data=ordered,discard,noauto
233     239     299     data=ordered,discard,noauto,dioread_nolock

Is it the kernel?

I am curious about what happened between 4.8 and 4.13 to explain the 25% loss of IO throughput.

I have another set of Intel NUC servers that use Ubuntu 16.04 without the HWE kernels -- 4.4.0-109 with the Meltdown fix and 4.4.0-38 without the Meltdown fix. These servers still use XFS. I get ~2% more throughput with the 4.4.0-38 kernel than the 4.4.0-109 kernel (whether or not PTI is enabled).

The loss in sysbench fileio throughput does not reproduce for XFS. The filesystem mount options are "noatime,nodiratime,discard,noauto" and tests were run with /sys/block/nvme0n1/queue/write_cache set to write back and write through. The table below has MB/s of IO throughput.

4.13    4.13    4.8
pti=on  pti=off no-pti
225     229     232     write_cache="write back"
125     168     138     write_cache="write through"

More debugging

This is vmstat output from the sysbench test and the values for wa are over 40 for the 4.13 kernel but less than 10 for the 4.8 kernel. The ratio of cs per IO operation is similar for 4.13 and 4.8.

# vmstat from 4.13 with pti=off

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 15065620 299600 830564    0    0 64768 43940 7071 21629  1  6 42 51  0
 0  4      0 15065000 300168 830512    0    0 67728 45972 7312 22816  1  3 44 52  0
 2  2      0 15064380 300752 830564    0    0 69856 47516 7584 23657  1  5 43 51  0
 0  2      0 15063884 301288 830524    0    0 64688 43924 7003 21745  0  4 43 52  0

# vmstat from 4.8 with pti=on

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  4      0 14998364 384536 818532    0    0 142080 96484 15538 38791  1  6  9 84  0
 0  4      0 14997868 385132 818248    0    0 144096 97788 15828 39576  1  7 10 83  0
 1  4      0 14997248 385704 818488    0    0 151360 102796 16533 41417  2  9  9 81  0
 0  4      0 14997124 385704 818660    0    0 140240 95140 15301 38219  1  7 11 82  0

Output from Linux perf for 4.8 and for 4.13.

2 comments:

  1. IBRS, which tweaks the branch predictor, seems to cause some I/O performance degradation. IBRS can be disabled via 'echo 0 >/proc/sys/kernel/ibrs_enabled'. You might also try IBPB.

    ReplyDelete
  2. Thank you for the thorough testing and concise explanation. This is helpful to characterize the impact on SSD storage-based databases.

    Adding to Bradley's comment, https://access.redhat.com/articles/3311301 details more IBRS and IBPB settings. You can turn off parts of the security depending on your needs (private network & no virtualization vs. public network with virtualization, for instance).

    ReplyDelete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...