Friday, December 2, 2022

Benchmark(et)ing RocksDB vs SplinterDB

While I am a huge fan of research papers presenting storage engines that claim to be better than RocksDB, I am always wary of the performance results. A paper can be great despite an imperfect performance evaluation, so pointing out the imperfections doesn't take away from the interesting ideas in the paper. Also, as a believer in the (C)RUM Conjecture I want to know how the new thing is both better and worse, but papers mostly focus on the better parts and don't highlight what isn't better.

One factor that determines the truthiness of a database benchmark is the number of DBMS that are compared. It is hard enough to get expertise in one DBMS and more DBMS == more chance of making a mistake. Here I present a result for RocksDB and SplinterDB. I am definitely not an expert on SplinterDB. Perhaps my results are more truthy than true.

I read the SplinterDB paper and hope their research continues. However, the paper didn't have enough detail on how the benchmark was done so I had to guess. I wish more papers published artifacts (scripts, etc) that made reproduction easier. I expect reproduction to be infrequent and don't want researchers to spend their time doing that, but the artifacts are nice to have.

tl;dr

  • For insert performance
    • It is risky to compare write performance between SplinterDB and RocksDB because SplinterDB doesn't force data to storage via fsync, fdatasync or msync, and that is documented.
    • For IO-bound - RocksDB with universal was the fastest
    • For cached - performance is similar between RocksDB and SplinterDB
  • For point query performance
    • For IO-bound - RocksDB was faster because it does less IO/query
    • For cached - SplinterDB was faster because it uses less CPU/query
  • For range query performance
    • For IO-bound - RocksDB was a lot faster because it does less IO/query
    • For cached - RocksDB was a lot faster because it uses less CPU/query

SplinterDB

I patched SplinterDB as of git hash a1833060. My patched SplinterDB branch is here. The patch includes:

  • debug printfs I added to understand the code
  • changes to stop the lookup tests after 1200 seconds
  • a fix for a memory leak that I reported and that has since been fixed upstream (issue 440)
  • a change to run only the Small range test for range queries
  • a change to make the payload size a constant rather than a range, to match what I do for RocksDB

Details

I used my small servers (4-core NUC, 16G RAM, NVMe SSD), patched SplinterDB and RocksDB 7.6.0. The benchmarks were run with 1 thread for inserts and then 1 thread for queries. My test scripts are here.

The benchmark was run for two workloads: cached and IO-bound. Cached means cached by the DBMS and IO-bound means the database is larger than memory. The workload is simple: insert in random order, do point queries, do range queries where a range query fetches 10 values. I used the benchmark client provided by each DBMS: db_bench for RocksDB and driver_test splinter_test for SplinterDB. The scripts that have the config details are here for RocksDB and SplinterDB.

For RocksDB I repeated tests with leveled and universal compaction.

Tests were done for the following sizes:
  • cached: 50M rows, 100-byte value
  • cached: 25M rows, 200-byte value
  • IO-bound: 2B rows, 100-byte value
  • IO-bound: 1B rows, 200-byte value

To get results for RocksDB with universal I had to repeat the IO-bound tests at 75% of the number of rows listed above (1.5B for 100-byte, 750M for 200-byte) because issue 10533 caused me to get unexpected disk full errors.

For RocksDB I did fillrandom to measure insert performance, then reloaded with fillseq to make sure all keys from 1..N would exist. Next time I might use filluniquerandom (see here). For RocksDB I measured performance immediately after the load (well, after waiting for compaction debt to reduce), then again after flushing the memtable and L0. And for leveled compaction I then flushed the L1 and measured perf again. Here I report performance after flushing the memtable and L0 (see here).
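
As a rough sketch of the db_bench steps (flag values here are illustrative; my real command lines are in the scripts linked above and set many more options):

# load in random order, then reload in sequential order so keys 1..N exist
./db_bench --benchmarks=fillrandom --num=2000000000 --value_size=100 --threads=1 --disable_wal=1
./db_bench --benchmarks=fillseq --use_existing_db=1 --num=2000000000 --value_size=100 --threads=1 --disable_wal=1
# point queries, then range queries that fetch 10 values per seek
./db_bench --benchmarks=readrandom --use_existing_db=1 --num=2000000000 --duration=1800 --threads=1
./db_bench --benchmarks=seekrandom --seek_nexts=10 --use_existing_db=1 --num=2000000000 --duration=1800 --threads=1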

The point and range query benchmark steps ran for 1200 seconds each with SplinterDB and 1800 seconds each with RocksDB. Were I to repeat this I would use the same duration.

Differences

Things that differ between RocksDB and SplinterDB:
  • A uniform key access distribution was used by both benchmark clients. However, RocksDB does this with an RNG while SplinterDB uses something like hash(x) for x in 1 ... N. I didn't confirm it, but I am curious whether the point and range query tests see the same key sequence. If they do and the tests run for a short time, then the point query tests warm the cache for the range query tests, making results less truthy for IO-bound workloads.
  • SplinterDB does not have redo/WAL so I disabled that for RocksDB. 
  • SplinterDB does not sync (fsync, msync, fdatasync) writes and I did not try to mimic that with RocksDB. This means that write amplification will be under-stated and write throughput will be over-stated for SplinterDB when compared to RocksDB.

Results for cached

Legend:
* insert - inserts/second
* point - point queries/second
* range - range queries/second
* wamp - write-amplification during the inserts

50M rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
533584  246964  31236   1.3     splinterdb
483519  191592  76600   4.5     rocksdb, leveled
488971  180769  57394   2.9     rocksdb, universal

25M rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
474446  261444  33538   1.1     splinterdb
495325  188851  75122   3.7     rocksdb, leveled
500862  201667  83686   2.8     rocksdb, universal

Results for IO-bound

The performance can be explained by the amount of read IO per query.

IO reads per point query for 2B rows, 100-byte values:
  • 1.29 - SplinterDB
  • 0.98 - RocksDB, leveled
  • 1.00 - RocksDB, universal, 1.5B rows

IO reads per range query for 2B rows, 100-byte values:
  • 7.42 - SplinterDB
  • 2.34 - RocksDB, leveled
  • 3.24 - RocksDB, universal, 1.5B rows

2B rows, 100-byte values
---- ops/second ----
insert  point   range   wamp    dbms
308740  6110    1060     3.7    2B rows, splinterdb
181556  8428    3056    14.1    2B rows, rocksdb, leveled
205736  8404    3029    12.7    1.5B rows, rocksdb, leveled
393873  8144    1889     6.0    1.5B rows, rocksdb, universal
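
A rough sanity check using the numbers above: multiplying queries/second by IO reads per query approximates the read IOPs each engine sustained. For point queries that is about 6110 * 1.29 ≈ 7,900 reads/s for SplinterDB vs 8428 * 0.98 ≈ 8,300 reads/s for RocksDB with leveled; for range queries it is about 1060 * 7.42 ≈ 7,900 vs 3056 * 2.34 ≈ 7,200. The engines sustain similar read IOPs, so the differences in queries/second mostly track the differences in read IO per query.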

1B rows, 200-byte values
---- ops/second ----
insert  point   range   wamp    dbms
221007  7175     908     3.7    1B rows, splinterdb
107692  8393    2436    13.0    1B rows, rocksdb, leveled
121519  8159    2519    11.7    750M rows, rocksdb, leveled
324555  8621    2763     5.3    750M rows, rocksdb, universal

Details

For SplinterDB with 2B rows and 100-byte values the tree shape was:

Space used by level: trunk_tree_height=4
0:   191168 MiB
1:    34554 MiB
2:     6586 MiB
3:     1307 MiB
4:      206 MiB

And the LSM tree shape for RocksDB with 2B rows, 100-byte values and leveled compaction uses long lines so that is in a gist.


Wednesday, November 30, 2022

Compiling MySQL 5.6 & 5.7 on Ubuntu 22.04

One of my hobbies is testing open source DBMS for CPU regressions and for that I want to compare perf between old and new versions of the DBMS. Depending on the DBMS it can be a challenge to build the old DBMS with the current (modern) compiler toolchain. 

Using open source frequently means compiling from source and compiling from source eventually means debugging a failed build. Alas, the proliferation of build tools means you are likely to be debugging a build tool you know little about. For me that included scons+MongoDB, cmake+MySQL, make/configure+MySQL, mvn+Linkbench, mvn+First_Robotics and make+RocksDB. Perhaps my debugging would be easier if there weren't as many build tools.

Postgres might be an exception WRT compiling old versions - it works great. Alas, this isn't as easy with MySQL versions 5.6.51 and 5.7.39. Note that MySQL 5.6 reached end of life in 2021 but 5.7 doesn't reach that until next year. For MySQL 5.6, 5.7 and perhaps some 8.0 releases prior to 8.0.31, there is a common error -- cmake fails with a message about being unable to find ssl and a suggestion to install it. 
CMake Error at cmake/ssl.cmake:63 (MESSAGE):
  Please install the appropriate openssl developer package.

Call Stack (most recent call first):
  cmake/ssl.cmake:306 (FATAL_SSL_NOT_FOUND_ERROR)
  CMakeLists.txt:603 (MYSQL_CHECK_SSL)
This occurs on Ubuntu 22.04.1 with gcc 11.3.0, and OpenSSL was definitely installed -- it was used when I compiled MySQL 8.0.31 from source. The problem is that several conditions trigger this error message, and one of them occurs when the version number is wrong or cannot be parsed from a header. A better error message would make debugging easier in the future.
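
Before digging into cmake it can help to confirm that the OpenSSL dev headers are installed and see how the version is defined, since the failure here is a version-parsing problem rather than a missing package (a sketch; the header path assumes Ubuntu's libssl-dev):

# confirm the dev package and look at the version macros that ssl.cmake tries to parse
dpkg -l libssl-dev
grep -E 'OPENSSL_VERSION_(NUMBER|MAJOR|MINOR|PATCH)' /usr/include/openssl/opensslv.h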

For 5.7.39 I fixed the error by backporting code from 8.0.31 that parses the SSL version numbers from openssl/opensslv.h. A diff for that is here. Perhaps the real fix needed is smaller but I was in copy/paste mode given my lack of cmake skills.

For 5.6.51 I fixed the error in the same way and then added --std=c++11 to the CXX flags because some code uses the register keyword, which was removed in C++17. A diff for that is here. Output from the compiler error before --std=c++11 was added is here.
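
For reference, the 5.6.51 build was along these lines (a sketch; the diff linked above is still required and my real command lines set more options):

cd mysql-5.6.51
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DCMAKE_CXX_FLAGS="-std=c++11"
make -j4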

Updates:
  • Filed bug 109251, must stay ahead of Vilnius DB in the bug reporter contest
  • Compiling FB MySQL 5.6.35 was even more fun, see below

Compiling FB MySQL 5.6.35

A diff to make this work is here. I had to modify cmake/ssl.cmake as explained above and storage/rocksdb/get_rocksdb_files.sh.  The cmake command line is here.

I also did cd rocksdb; make static_lib to generate rocksdb/util/build_version.cc because the automation to do that was not working, but after I applied the diff listed above that wasn't needed.

Compiling FB MySQL 8.0.28

I needed a small diff to edit storage/rocksdb/get_rocksdb_files.sh similar to what was done for FB MySQL 5.6.35 above. The cmake command line is here.

While boost is installed by git submodule update I am not sure that works for my OSS builds and/or cmake command line (-DWITH_BOOST=$PWD/../boost). So I just do:
  cd $SRC_ROOT; mv boost boost.up; cp -p -r ~/mysql-8.0.28/boost .

Tuesday, November 29, 2022

Insert benchmark: Postgres, InnoDB and MyRocks with low concurrency

This has results for the insert benchmark using Postgres, InnoDB and MyRocks. For an overview of the insert benchmark see here and here. Some information on the performance summaries generated by my test scripts is here. I used small servers and ran the test at low concurrency (1 or 2 threads) for cached and IO-bound workloads. The insert benchmark has several phases and the interesting phases are: insert-only without secondary indexes, insert-only with 3 secondary indexes and then range queries with rate-limited inserts.

Performance reports are provided for:

Disclaimer - I used a small server and will soon repeat this on larger servers and with more concurrency.

tl;dr

  • Postgres is boring: no CPU regressions from version 12 to 15 for cached and IO-bound workloads.
    • The insert benchmark found a CPU regression in 15beta1 which was quickly fixed.
  • InnoDB with a cached workload has large CPU regressions from 5.6 to 8.0.
    • This is visible here. By large I mean that the insert and query rates for InnoDB in 8.0.31 are ~64% and ~75% of the rates in 5.6. The root cause is new CPU overhead in code other than the storage engine.
  • MyRocks with a cached workload has large CPU regressions from 5.6 to 8.0
    • This is visible here. The regression for insert-only without secondary indexes is similar to InnoDB. The regression for range queries is smaller than the one above for InnoDB (throughput declines by ~8%). The root cause is new CPU overhead in upstream MySQL.
  • InnoDB with an IO-bound workload has large CPU regressions in the insert-only without secondary index phase, which is normally CPU-bound. See here.
    • The regression is similar to what I describe above for the cached workload. Throughput for the insert-only with secondary index phase and range queries with rate-limited inserts phase is better in 8.0.31 than 5.6.
  • MyRocks with an IO-bound workload has CPU regressions that are similar to MyRocks with a cached workload. See here.

Friday, November 18, 2022

SSD read response time: raw device vs a filesystem

I am trying to understand why 4kb random reads from an SSD are about 2X slower when using a filesystem vs a raw device. This reproduces across different servers, but I am only sharing the results from my home Intel NUCs. The symptoms are that the read response time is:

  • ~2X larger per iostat's r_await for reads done via a filesystem vs a raw device (.04 vs .08 millisecs)
  • ~3X larger per blkparse for reads done via a filesystem vs a raw device (~16 vs 50+ microsecs)
Once again, I am confused. Is something below user land doing (if from-filesystem -> go-slow)? In theory, if the D->C transition reported by blkparse includes a lot of CPU overhead from the filesystem then that might explain this, but the CPU per IO overhead isn't large enough for that to be true here.

My test scripts for this are rfio.sh and do_fio.sh.

Update - the mystery has been solved thanks to advice from an expert (Andreas Freund) who is in my Twitter circle, and engaging with experts I would never get to meet in real life is why I use Twitter. I had been running the raw device tests with the device mostly empty. The fix is to run the test with it mostly full; otherwise the SSD firmware can do something special (and faster) when reading data that was never written.
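
One way to precondition the raw device before the read tests (a sketch; this overwrites everything on the device, and my scripts may do it differently):

# write the device end to end once so reads hit data that was actually written
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=1M --direct=1 --ioengine=libaio --iodepth=32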

blktrace + blkparse

I used fio for 3 configurations: raw device, O_DIRECT and buffered IO. For O_DIRECT and buffered IO the filesystem is XFS and there were 8 20G files on a server that has 16G of RAM. The results for O_DIRECT and buffered IO were similar.
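
The fio invocations were roughly like this (a sketch; the real command lines are in do_fio.sh and the file names here are illustrative):

# raw device
fio --name=raw --filename=/dev/nvme0n1 --rw=randread --bs=4k --direct=1 --time_based --runtime=20 --numjobs=1
# O_DIRECT via XFS
fio --name=odirect --filename=/data/f1 --size=20g --rw=randread --bs=4k --direct=1 --time_based --runtime=20 --numjobs=1
# buffered via XFS
fio --name=buffered --filename=/data/f1 --size=20g --rw=randread --bs=4k --direct=0 --time_based --runtime=20 --numjobs=1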

The results I discuss in the next 2 paragraphs are from fio run with --numjobs=1.

Output from blkparse for a few IOs is here. The lifecycle of an IO starts with state Q (queued) and ends with state C (completed). The number in parentheses at the end of the line for state C is the response time in microseconds. The RWBS field has RA for buffered IO (R = read, A = possible readahead) and R for O_DIRECT and raw. The timestamp format is seconds.nanoseconds and I pasted examples from ~18 seconds into the measurement. Most of the time for an IO request occurs between state D (dispatch to device) and C (completed). The blkparse man page is here.

From the examples I shared the response time is ~15 microseconds for raw and 50+ microseconds for O_DIRECT and buffered IO. Averaging the values from all samples collected over 20 seconds shows the average was 15 microseconds for raw vs ~73 microseconds for O_DIRECT and buffered. The read request size is 4096 in all cases (see + 8 in the blkparse output).

The sector offsets in the blkparse output are all a multiple of 8 (8 x 512 == 4096) so the requests have that in common between raw, O_DIRECT and buffered.
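
The traces were collected with something like the following (a sketch; rfio.sh has the real commands):

# trace the device for 20 seconds, then decode the per-IO lifecycle (Q ... D ... C)
sudo blktrace -d /dev/nvme0n1 -w 20 -o trace
blkparse -i trace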

fio

Command lines and performance metrics from fio are here. I ran fio for numjobs in 1, 2, 4, 8, 16, 32, 48 and 64. A summary of the results:
After making the fix described in the Update above, the results from raw are only slightly better than O_DIRECT, while buffered gets a better throughput number because it benefits from some hits in the OS page cache.

Monday, October 31, 2022

s/optimal/better/g - on reviewing conference papers

I spent a few years reviewing papers for database conferences. I think that is winding down. I was OK as a reviewer, definitely not great, and this summarizes my experience.

For starters, I am in awe of good reviewers. As a reviewer you get to see feedback from the other reviewers after submitting your review. And I was always nervous while waiting to see the other reviews. Was my review an outlier? How much did I miss in my review? Reading good reviews after submitting a mediocre review is a great way to learn.

As always, a key to success is to choose the right base case especially if you want to show linear speedup or scaleup. Many of the papers use the DBMS that I know quite well (MySQL, RocksDB) as the base case. So I have frequent someone on the internet is wrong moments while reading such papers. My offer to provide (free) benchmark advice to research projects was ignored. But maybe that is OK, because I am already busy.

My goal was to focus on the ideas in the paper and allow that the experimental results might be truthy rather than true. There are many interesting ideas in conference papers even if the experimental results aren't perfect. The review process was better for me after I learned to appreciate the ideas and worry less about flaws in the benchmarks.

I prefer to read a paper that explains where the new idea both is and is not better, but perhaps the marketing pressure and page limit means that papers don't focus much on where the new idea isn't better. I also prefer to have an explanation of the results -- I have had to retract a few DBMS perf blog posts because I didn't explain the results and they turned out to be misleading or bogus. Alas, papers rarely explained the (disappointing) results of the base case when showing they were so much faster than the base case.

Frequent feedback I had for papers:

  • s/optimal/better/g because optimal has a high bar. Papers would show their thing improved performance in some cases but better might not be optimal. Some math would be needed for that.
  • Provide more details to make reproduction possible. I am not offering to spend the time to do a repro nor am I suggesting that others should spend their time on that. But I want to know basic things about the base case (the thing that your thing is better than) including version tested, configuration details and command lines.

A short rant on the innovation bar
  • By innovation bar I mean the requirement that a paper shows how the idea is new relative to previously published peer-reviewed research. For some research track papers that I have reviewed I feel like the bar is raised too high, but that isn't my rant. My rant is that I have seen at least one industrial track paper rejected for not being innovative and my reading of the guidelines for industrial track papers in VLDB and SIGMOD is that there isn't an innovation bar for industry papers.

Bogus feedback that I avoided giving:
  • This idea has been implemented by DBMS or documented in some blog post. Conferences place a lot of value on innovation but fortunately the scope for prior art is limited to peer reviewed publications, not random blogs like this one. And researchers should not be expected to know all of the implementation details for amazing but not always well documented open and closed source DBMS. By bogus I mean that this doesn't count against the innovation bar -- research-track papers must show that something is innovative in their work. When said idea has previously been published then the reviewed paper must show how it differs.  But the paper isn't required to show how it differs from an idea described in a blog post or implemented in a production DBMS. Sometimes I gave feedback with a strong disclaimer when ideas have previously appeared in blog posts or an existing DBMS. And by strong disclaimer I mean that I made it clear that I would not and could not hold this against the paper.

Feedback that I was extremely reluctant to provide:
  1. Run X more tests. I am happy to require that before software is deployed in production. I am wary of asking for much more unless the experiment section is sad.

Updates:
  • Clarified what I mean by bogus feedback
  • Added rant on innovation bar

Wednesday, October 26, 2022

Quantifying storage on Linux

Some things are complicated but I understand them (RocksDB). Clearly that isn't too complicated and the complexity might be a barrier to entry which boosts the demand for my skills. Other things are complicated and I don't understand them that well. Clearly those things are too complicated.

Yes, I am trying to be funny but what I wrote above might be true for many of us. In this case the thing that I don't understand that well are things that support IO for a DBMS -- filesystems, block layer and storage devices. It is likely that something in this post is factually incorrect and I am happy to be corrected. Some of my posts are thinly veiled attempts to get free advice from experts.

The problem I am trying to understand this week is the size of IO requests at different layers of the stack while running RocksDB benchmarks. To be specific: are the reads being done at a multiple of 512 or 4096 bytes? And when might that be possible (O_DIRECT vs buffered IO). From the details below I suspect I can do 512-byte reads with O_DIRECT on the v3.small and v4.small servers, but a 512-byte read on the GCP server will end up doing a 4096-byte transfer at some level of the stack.

I am trying to understand the cases when a read-only workload with RocksDB can and cannot saturate the IO capacity of a storage device. I am using 3 types of servers: home servers that I will abbreviate as v3.small and v4.small, and a c2-standard-60 server in GCP that uses SSD Persistent Disk. In all cases the filesystem is XFS and the OS is Ubuntu 22.04. You need CPU to do IO and the number of CPU cores is 4 (Intel i7 @ 2.7GHz) for v3.small, 8 (AMD Ryzen 7 at 2GHz) for v4.small and 30 (Intel Xeon @ 3.1GHz) for c2-standard-60. The storage is NVMe from Samsung 970 EVO on v3.small and Kingston on v4.small. I don't know what device is used for GCP.

The rest of this post lists the information that I found via:

  • /sys/block/$device/queue/*
  • lsblk -t $dev
  • xfs_info
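A quick way to dump the /sys values (a sketch; the device name differs per server -- nvme0n1 at home, sdb on GCP):

dev=nvme0n1
for f in physical_block_size logical_block_size hw_sector_size minimum_io_size discard_granularity max_sectors_kb scheduler; do
  echo -n "$f: "; cat /sys/block/$dev/queue/$f
done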

Details: lsblk

Note:

  • {v3,v4}.small use min-io=512, phy-sec=512, log-sec=512
  • GCP uses min-io=4096, phy-sec=4096, log-sec=512
  • From this I wonder whether there are cases where RocksDB can actually do 512-byte IO requests (logical) and whether all layers of the stack will respect that rather than doing 4096-byte requests to return the requested 512 bytes (physical). The dd sketch after this list is one way to check what the kernel accepts.
  • One guess is that when LOG-SEC < PHY-SEC (see GCP below), some layer of the stack will do an operation at the larger (PHY-SEC) size but return the smaller (LOG-SEC) size.
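One way to check what the kernel accepts (a sketch; this only shows whether a 512-byte O_DIRECT request is allowed, not what the device does internally):

# read one 512-byte logical sector with O_DIRECT; fails with "Invalid argument" if 512-byte requests are not allowed
dd if=/dev/nvme0n1 of=/dev/null bs=512 count=1 iflag=direct
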
From lsblk --help:
  • MIN-IO - minimum I/O size
  • PHY-SEC - physical sector size
  • LOG-SEC - logical sector size

And the full details on the storage that I use. The man page for lsblk is here.

# v3.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none     1023 128    0B

# v4.small
$ lsblk -t /dev/nvme0n1
NAME    ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
nvme0n1         0    512      0     512     512    0 none      255 128    0B

# GCP
$ lsblk -t /dev/sdb
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sdb          0   4096      0    4096     512    0 none     8192 128    4G

Details: /sys

Docs for these are here and several of these values are also in lsblk output above.

From /sys/block/$device/queue/$name
v3      v4      GCP     name
512     512     4096    physical_block_size
512     512     512     logical_block_size
512     512     512     hw_sector_size
512     512     4096    minimum_io_size
none    none    none    scheduler
512     512     4096    discard_granularity
1280    256     256     max_sectors_kb
nvme0n1 nvme0n1 sdb     $device

Details: xfs_info

The man page for xfs_info is here. The filesystems were created using default option for mkfs.xfs.


# v3.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# v4.small
$ xfs_info /dev/nvme0n1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=30524162 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=122096646, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# GCP
$ xfs_info /dev/sdb
meta-data=/dev/sdb               isize=512    agcount=4, agsize=196608000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=786432000, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=384000, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Monday, October 24, 2022

Small servers for performance testing, v4

I am setting up my fourth cluster of small servers to test open source database software. Cluster might be an overstatement because each cluster is limited to 2 or 3 servers. The clusters were/are:

  • v1 - Intel NUC5i3ryh (5th gen core i3), 8G RAM, SATA disk for OS, Samsung 850 EVO m.2 for db
  • v2 - Intel NUC7i5bnh (7th gen core i5), 16G RAM, Samsung 850 EVO SATA for OS, Samsung 960 EVO m.2 for db
  • v3 - Intel NUC8i7beh (8th gen core i7), 16G RAM, Samsung 860 EVO SATA for OS, Samsung 970 EVO m.2 for db
  • v4 - Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, WD Blue 1T SATA for OS, Kingston NVMe for db
More details on previous clusters are here for v1 and v2 and then for v3. I use a separate disk for the OS because I expect the database SSD to wear out and I don't want to reinstall the OS when that happens. Posts on monitoring for endurance are here and here but in the past I have neglected to catch that early enough and some SSDs greatly exceeded their endurance ratings.

The CPUs for each cluster:

  • v1 - Intel i3-5010U with 2 cores. I think I left hyperthread enabled to get 4 HW threads.
  • v2 - Intel i5-7260U with 2 cores. Again, I think I left hyperthread enabled to get 4 HW threads. But this was about 2X faster on the compile MySQL 8.1 benchmark. Turbo boost was also disabled to reduce performance variance.
  • v3 - Intel i7-8559U with 4 cores and hyperthread disabled. Turbo boost was also disabled to reduce performance variance so the CPU runs at 2.70 GHz.
  • v4 - AMD Ryzen 7 4700u with 8 cores and 8 HW threads. Turbo core was disabled to reduce performance variance. I am still figuring out what the base clock speed is.
The Beelink (v4) server comes with 16G of RAM and 512G of NVMe m.2 installed. Then I installed a 1T SATA SSD. So putting the HW together was a bit easier with Beelink than with the NUC as the NUC kits I ordered required me to install RAM, m.2 and SATA SSDs.

The SER 4700u web page claims that Crucial DDR4 DRAM and Kingston NVMe m.2 are used. I confirmed that Kingston was provided. I was happy to learn this comes with quality components, and again the price is great.

Why AMD?

I have been happy with my Intel NUC clusters but I chose AMD this time because the prices and reviews for Beelink were great and because I am not the target use case for the new Intel NUCs. The Intel NUC is currently on the 12th generation. I had to go back to the 10th gen to find a NUC that would work for me. The newer ones were either targeted at gaming, home video or had a mix of performance and efficiency cores. More than anything, I need consistent performance and can't make use of both performance and efficiency cores.

The NUCs were reliable for me. Their only weak spot was the wires attached to the base, which have to flex when you remove the base to replace SSDs, and the wires on one server eventually failed at the flex point. I shipped it back to Intel and received a new one. The Beelink server only has one ribbon for SATA connected to the base and it is much more flexible.

All of the cluster servers claim a low TDP. I like that as I don't want to trip circuit breakers or have them heat the server room. I also want to be able to use them, at least at night, in the summer when it starts to get warm. While the CPU performance is likely to have not that much variance given that I disabled turbo, I still wonder about SSD performance variance due to heat.

Setup

Setting up the Beelink was easy -- remove 4 screws on the bottom, insert a small allen wrench into a gap to pry the base off (the hardest part) and then add the SATA SSD. In the BIOS (hold "delete" on boot) I changed the boot order to move SATA before NVMe. I am not sure if I needed to reorder the USB when I used that to install Linux. 

The server comes with some flavor of Windows on the m.2 SSD and will boot into the Windows setup if you are not careful. This was confusing because the first few screens of that process don't make it clear that you are about to setup Windows. Reboot and hold f7 to quickly get to a screen where you can change the boot order.

One small feature that is extra useful is that the Beelink server lists the BIOS prompt keys on the bottom plate -- delete to get the full BIOS and f7 to get the boot order screen. I wish the NUC had that as I always relearn it by experimenting. I think it is f2. While the Intel NUC BIOS was easier to navigate it is also a visual BIOS so I get a bit more exercise finding a mouse whenever I have to fiddle with it, and I recently had to disable secure boot on the NUCs to make blktrace work.

I installed Ubuntu 22.04 Server via a thumb drive. This was easy. Soon after the install I removed cloud-init as that slows the boot process and adds a bit too much text during boot. I was able to get a wifi connection during the install, but after the install the wifi setup step would hang during boot. I am still not sure why that happened -- my unproven but educated guesses were: wifi worked better when the boxes weren't next to each other, and wifi worked better when the boxes connected to my wifi base router rather than a wifi extender.

While I can disable turbo boost in the BIOS on the Intel NUCs, with AMD there was no BIOS option to disable turbo core. But there are things that can be done after boot. This is fine for me given that I already run scripts at startup to enable the use of gdb and mount my database filesystem. The scripts do the following, and the last line disables turbo core.

# allow perf and gdb to be used without extra privileges
echo -1 > /proc/sys/kernel/perf_event_paranoid
echo 0 > /proc/sys/kernel/yama/ptrace_scope
# expose kernel addresses so perf can resolve kernel symbols
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
# enable the magic sysrq key
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger
# mount the database filesystem
mount -o noatime,nodiratime,discard,noauto /dev/nvme0n1 /data
# disable turbo core
echo '0' > /sys/devices/system/cpu/cpufreq/boost

Debugging

From the output of cpupower frequency-info and turbostat I think the base clock is 2GHz, but I am not certain yet. Modern CPUs are complicated. Example output is here. I hadn't used these tools before and this post was useful both for the tools and as an introduction to how clock frequency can change (C-states and more).
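
The commands were along these lines (a sketch; turbostat needs root and its flags vary a bit by version):

cpupower frequency-info
sudo turbostat --quiet --Summary sleep 10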

I needed to debug a performance difference -- the per-cpu IOPs from fio were ~20k on one server vs ~45k on the other. I ran perf top and the difference was obvious: read_hpet.0 was 50% of the CPU on the slow server and not visible on the fast server. Also the ratio of user to system CPU time was very different between the servers.

This line in dmesg output was a strong hint:
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'

More details are here and eventually I found this Reddit post. The fix is below. I don't have an opinion on whether this is a HW issue or whether Linux will become more tolerant when measuring TSC at startup while choosing a clock source. Regardless, it is confusing to debug.
  1. edit /etc/default/grub -> GRUB_CMDLINE_LINUX_DEFAULT="clocksource=tsc tsc=reliable"
  2. update-grub
  3. reboot

Bad SSD

Both servers are getting READ FPDMA QUEUED and WRITE FPDMA QUEUED errors from the SATA SSD (WD Blue SA510). I will guess that the problem is the WD Blue SSD and will soon replace it with a Samsung 870. The errors in dmesg look like this. The error has yet to repro after the replacement, but I also used screws to lock down the drive during the replacement and they were not used prior. So perhaps the lack of screws was the problem.