Monday, October 24, 2022

Small servers for performance testing, v4

I am setting up my fourth cluster of small servers to test open source database software. Cluster might be an overstatement because each cluster is limited to 2 or 3 servers. The clusters were/are:

  • v1 - Intel NUC5i3ryh (5th gen core i3), 8G RAM, SATA disk for OS, Samsung 850 EVO m.2 for db
  • v2 - Intel NUC7i5bnh (7th gen core i5), 16G RAM, Samsung 850 EVO SATA for OS, Samsung 960 EVO m.2 for db
  • v3 - Intel NUC8i7beh (8th gen core i7), 16G RAM, Samsung 860 EVO SATA for OS, Samsung 970 EVO m.2 for db
  • v4 - Beelink SER 4700u with Ryzen 7 4700u, 16G RAM, WD Blue 1T SATA for OS, Kingston NVMe for db
  • v5 - Beelink SER7 7840HS with Ryzen 7 7840HS, 32G RAM, 1T m.2 SSD for OS, 2T Samsung 990 Pro for the database. This is also my first mini PC with a fan, and the TDP is 65w, which is a bit larger than what came before. Passmark single-thread rating is 3771 vs 2532 for the v4 server.
  • v6 - SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, XFS). The CPU is an Intel Xeon Silver 4214R @ 2.40GHz.
  • v7 - Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM, AMD Ryzen Threadripper PRO 5975WX (32 cores), 2 m.2 SSDs (each 2TB, SW RAID 0, XFS)
  • v8 - ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and 2 m.2 slots (one for OS install, one for DB perf tests)
More details on previous clusters are here for v1 and v2 and then for v3. I use a separate disk for the OS because I expect the database SSD to wear out and I don't want to reinstall the OS when that happens. Posts on monitoring for endurance are here and here, but in the past I have neglected to catch that early enough and some SSDs greatly exceeded their endurance ratings.
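
Checking endurance is easy with smartctl; a sketch, assuming the database SSD is /dev/nvme0n1 and the OS SSD is /dev/sda:
sudo smartctl -a /dev/nvme0n1 | grep -i 'percentage used'   # NVMe reports the fraction of rated endurance consumed
sudo smartctl -A /dev/sda                                   # for SATA, look at the vendor wear attributes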

The CPUs for each cluster:

  • v1 - Intel i3-5010U with 2 cores. I think I left hyperthreading enabled to get 4 HW threads.
  • v2 - Intel i5-7260U with 2 cores. Again, I think I left hyperthreading enabled to get 4 HW threads, but this was about 2X faster on the MySQL 8.1 compile benchmark. Turbo boost was also disabled to reduce performance variance.
  • v3 - Intel i7-8559U with 4 cores and hyperthread disabled. Turbo boost was also disabled to reduce performance variance so the CPU runs at 2.70 GHz.
  • v4 - AMD Ryzen 7 4700u with 8 cores and 8 HW threads. Turbo core was disabled to reduce performance variance. I am still figuring out what the base clock speed is.
  • v5 - AMD Ryzen 7 7840HS with 8 cores and 16 HW threads. Turbo core was disabled to reduce performance variance.
The Beelink (v4) server comes with 16G of RAM and a 512G NVMe m.2 SSD installed. Then I installed a 1T SATA SSD. So putting the HW together was a bit easier with the Beelink than with the NUCs, as the NUC kits I ordered required me to install the RAM, m.2 and SATA SSDs.

The SER 4700u web page claims that Crucial DDR4 DRAM and a Kingston NVMe m.2 SSD are used. I confirmed that Kingston was provided. I was happy to learn this comes with quality components, and again the price is great. (Update) After wearing out one of the Kingston SSDs I replaced them with the Samsung 980 Pro (1TB), which is a lot faster.

Why AMD?

I have been happy with my Intel NUC clusters but I chose AMD this time because the prices and reviews for Beelink were great and because I am not the target use case for the new Intel NUCs. The Intel NUC is currently on its 12th generation, and I had to go back to the 10th gen to find one that would work for me. The newer ones are either targeted at gaming or home video, or have a mix of performance and efficiency cores. More than anything, I need consistent performance and can't make use of both performance and efficiency cores.

The NUCs were reliable for me. Their only weak spot was the wires attached to the base, which have to flex when you remove the base to replace SSDs; the wires on one server eventually failed at the flex point. I shipped it back to Intel and received a new one. The Beelink server only has one ribbon cable for SATA connected to the base and it is much more flexible.

All of the cluster servers claim a low TDP. I like that as I don't want to trip circuit breakers or have them heat the server room. I also want to be able to use them, at least at night, in the summer when it starts to get warm. While CPU performance should not vary much given that I disabled turbo, I still wonder about SSD performance variance due to heat.

Setup

Setting up the Beelink was easy -- remove 4 screws on the bottom, insert a small allen wrench into a gap to pry the base off (the hardest part) and then add the SATA SSD. In the BIOS (hold "delete" on boot) I changed the boot order to move SATA before NVMe. I am not sure if I needed to reorder the USB when I used that to install Linux. 

The server comes with some flavor of Windows on the m.2 SSD and will boot into the Windows setup if you are not careful. This was confusing because the first few screens of that process don't make it clear that you are about to set up Windows. Reboot and hold f7 to quickly get to a screen where you can change the boot order.

One small feature that is extra useful is that the Beelink server lists the BIOS prompt keys on the bottom plate -- delete to get the full BIOS and f7 to get the boot order screen. I wish the NUC had that as I always relearn it by experimenting (I think it is f2). While the Intel NUC BIOS was easier to navigate, it is also a visual BIOS, so I get a bit more exercise finding a mouse whenever I have to fiddle with it, and I recently had to disable secure boot on the NUCs to make blktrace work.

I installed Ubuntu 22.04 Server via a thumb drive. This was easy. Soon after the install I removed cloud-init (commands below) as it slows the boot process and adds a bit too much text during boot. I was able to get a wifi connection during the install, but after the install the wifi setup step would hang during boot. I am still not sure why that happened -- my unproven but educated guesses were: wifi worked better when the boxes weren't next to each other, and wifi worked better when the boxes connected to my wifi base router rather than a wifi extender.
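
For the cloud-init removal, the usual commands are roughly the following (a sketch -- purge the package, then clean up its leftover config and state):
sudo apt purge -y cloud-init
sudo rm -rf /etc/cloud /var/lib/cloud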

While I can disable turbo boost in the BIOS on the Intel NUCs, with AMD there was no BIOS option to disable turbo core. But there are things that can be done after boot. This is fine for me given that I already run scripts at startup to enable the use of gdb and mount my database filesystem. The scripts do the following, and the last line disables turbo core.

# let perf see all events without extra privileges
echo -1 > /proc/sys/kernel/perf_event_paranoid
# classic ptrace permissions so gdb can attach to any process I own
echo 0 > /proc/sys/kernel/yama/ptrace_scope
# don't hide kernel addresses (helps perf resolve kernel symbols)
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"
# enable the magic sysrq key, then trigger sysrq-x
echo 1 > /proc/sys/kernel/sysrq
echo x > /proc/sysrq-trigger
# mount the database filesystem
mount -o noatime,nodiratime,discard,noauto /dev/nvme0n1 /data
# disable turbo core
echo '0' > /sys/devices/system/cpu/cpufreq/boost
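
A quick sanity check after the script runs is to read the settings back:
cat /proc/sys/kernel/perf_event_paranoid        # expect -1
cat /sys/devices/system/cpu/cpufreq/boost       # expect 0 while turbo core is disabled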

Debugging

From the output of cpupower frequency-info and turbostat I think the base clock is 2GHz, but I am not certain yet. Modern CPUs are complicated. Example output is here. I hadn't used these tools before, and this post was useful both for the tools and as an introduction to how clock frequency can change (C-states and more).
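
For reference, the commands are roughly as follows; cpupower and turbostat come from the linux-tools packages on Ubuntu:
sudo apt install linux-tools-common linux-tools-$(uname -r)
cpupower frequency-info                  # driver in use, hardware frequency limits, boost state
sudo turbostat --quiet --interval 5      # per-core busy MHz and C-state residency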

I needed to debug a performance difference -- the per-CPU IOPs from fio were ~20k on one server vs ~45k on the other. I ran perf top and the difference was obvious: read_hpet.0 used 50% of the CPU on the slow server and was not visible on the fast server. The ratio of user to system CPU time was also very different between the servers.

This line in dmesg output was a strong hint:
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'

More details are here and eventually I found this Reddit post. The fix is below. I don't have an opinion on whether this is a HW issue or whether Linux should become more tolerant when measuring the TSC at startup while choosing a clocksource. Regardless, it is confusing to debug.
  1. edit /etc/default/grub -> GRUB_CMDLINE_LINUX_DEFAULT="clocksource=tsc tsc=reliable"
  2. update-grub
  3. reboot
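After the reboot, the kernel command line and the active clocksource can be confirmed with:
cat /proc/cmdline
cat /sys/devices/system/clocksource/clocksource0/current_clocksource    # expect tsc
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
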
Bad SSD

Both servers are getting READ FPDMA QUEUED and WRITE FPDMA QUEUED errors from the SATA SSD (WD Blue SA510). My guess is that the problem is the WD Blue SSD and I will soon replace it with a Samsung 870. The errors in dmesg look like this. The error has yet to repro after the replacement, but I also used screws to lock down the drive during the replacement and they were not used before, so perhaps the lack of screws was the problem.
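
To check whether the errors recur, something like this works (assuming the SATA SSD shows up as /dev/sda):
sudo dmesg -T | grep -i fpdma         # any FPDMA QUEUED errors since boot
sudo smartctl -l error /dev/sda       # the drive's own SMART error log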

Worn out SSD

The Kingston SSD shipped with the boxes quickly reached the endurance limit. It has been replaced with a 1TB Samsung 980 Pro.

Disappearing SSD

While the Samsung 980 Pro endurance has been great, one of the devices reached EoL too soon. The endurance consumed is likely less than 10% but the device disappeared. Time to buy a replacement. Details are here.

systemd

The default behavior for systemd is to remove your files from /dev/shm when you logout, even if a screen session is still running as you -- see here. This removes files that Postgres needs. To avoid that:
  1. add RemoveIPC=no to /etc/systemd/logind.conf
  2. sudo systemctl restart systemd-logind.service
netplan

I always forget these:
sudo vi /etc/netplan/00-installer-config-wifi.yaml
sudo netplan apply
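
For reference, the wifi config in that file looks something like this; the interface name, SSID and password here are made up:
network:
  version: 2
  wifis:
    wlp3s0:
      dhcp4: true
      access-points:
        "home-network":
          password: "wifi-password"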

Problems with the v4 server

I get XFS errors and the device either goes into read-only mode or disappears (see here for details), so I am working on monitoring temperatures on the device. The SSD temp is easy to get, but the sensors command provided by the lm_sensors package doesn't find any CPU temp when using Ubuntu 22.04 with the 5.15 kernel, so I upgraded to the HWE kernel to get a 6.2 kernel that is able to read CPU temps. To upgrade the kernel I used: sudo apt install linux-generic-hwe-22.04. I can monitor the NVMe devices via one of:
sudo smartctl -a /dev/nvme0n1
sudo nvme smart-log /dev/nvme0n1
The XFS errors occur for the Samsung 990 Pro device I added, but smartctl shows that the error log is empty and the temp never entered the warning range (see here). That device stores the benchmark database and is very busy. For the NVMe device that has the OS install and only gets benchmark output (not very busy) there are logged errors and the device enters the warning temp range (see here). The logged error message is here. Info from fwupdmgr on the two NVMe devices is here. The BIOS is set to Balanced rather than Performance which means the TDP is lower.

I have been disabling turbo on the CPU via this command. After my current attempt to repro the problem I will try another round without this:
echo '0' > /sys/devices/system/cpu/cpufreq/boost
Next up is repeating the benchmarks with scripts running in the background to monitor temperatures.
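
Such a background script doesn't need to be much more than a loop; a sketch to run as root, assuming the k10temp driver reports the CPU temp as Tctl and the busy database device is /dev/nvme0n1:
while true; do
  date
  sensors | grep -i tctl                              # CPU temp via the k10temp driver
  nvme smart-log /dev/nvme0n1 | grep -i temperature   # SSD temp plus the thermal warning counters
  sleep 60
done >> /var/tmp/temps.log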

Problems with the v5 server: part 1

The v4 Beelink has a Mediatek wifi chip that uses the mt7921e driver. The v5 Beelink uses the Intel AX200 wifi chip and there are problems -- the driver frequently crashes and restarts (see here). There was also a corruption problem with XFS (see here), but that occurred after many wifi crashes, so I am now running with wifi disabled to see if the XFS error repeats.

Possibly related error reports for the wifi problem:

Problems with the v5 server: part 2

The command to get the BIOS info while running is:  sudo dmidecode -t bios

The v5 server I purchased in 2023 has the BIOS version SER7PRO_P5C8V28. The two that I purchased in March 2024 have SER7PRO_P5C8V38. I will call these the v28 and v38 versions. With the v38 version I get many errors (see dmesg output or syslog) about correctable PCI errors when I put an m.2 device in the second m.2 slot (the one above the fan). These errors occurred with devices that support Gen4 PCIe (Samsung 980 Pro, Crucial T500) and Gen3 (Samsung 970 EVO). While the devices probably work, I worry about the performance implications and I also don't want so much log spam.

I stopped using 2 of the 3 v5 servers and will replace them all with the v8 (ASUS PN53).

Much more info on this is here.

Based on searching the web, one thing that might work is to disable Active State Power Management (ASPM), and if nothing else, disabling ASPM makes the error messages stop. Notes on that are in this thread, and the GRUB kernel options I use for the v5 server are set in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off nosmt amd_pstate=disable"

  • pcie_aspm=off is there to avoid correctable PCI errors (maybe the Beelink BIOS needs an update)
  • nosmt disables hyperthreads because the BIOS doesn't have an option for that
  • amd_pstate=disable lets me use the acpi-cpufreq driver
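To confirm the options took effect after a reboot, something like:
cat /proc/cmdline                              # should include pcie_aspm=off nosmt amd_pstate=disable
cat /sys/devices/system/cpu/smt/control        # expect forceoff when nosmt is used
cpupower frequency-info | grep driver          # expect acpi-cpufreq
sudo lspci -vv | grep -i aspm                  # links should report ASPM Disabled
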
BIOS info from the 2023 purchase:

$ sudo dmidecode -t bios
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.5.0 present.

Handle 0x0000, DMI type 0, 26 bytes
BIOS Information
Vendor: American Megatrends International, LLC.
Version: SER7PRO_P5C8V28
Release Date: 08/14/2023
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 32 MB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
ACPI is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.29

Handle 0x002B, DMI type 13, 22 bytes
BIOS Language Information
Language Description Format: Long
Installable Languages: 1
en|US|iso8859-1
Currently Installed Language: en|US|iso8859-1

Problems with the v8 (PN53) server
 
I have 3 of these which will soon replace the v5 (Beelink SER7).

So far they work great, with two odd things:
  • The power button blinks orange 3 times (pause, repeats) forever when I boot without a keyboard and/or monitor attached. I learned from here that the BIOS does this on purpose and it can't be disabled. This is an odd feature given that the PN53 is designed to be run as a headless server.
  • I installed Ubuntu 22.04 on one of the PN53 servers without a second SSD in the second m.2 slot. Then I added a second SSD and there was no wifi after reboot. The problem is that the wifi device gets named wlp3s0 when there is no second SSD and wlp4s0 when there is one, and that name is used by netplan (see the files in /etc/netplan). Editing the /etc/netplan files fixed this but it took me an hour to figure out. Perhaps this is done by systemd because I see this in dmesg: systemd-networkd[855]: wlan0: Interface name change detected, renamed to wlp4s0. Commands to find the current interface name are after this list.
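To find the current wifi interface name before editing the netplan files:
ip -br link          # brief list of interfaces and their current names
networkctl list      # systemd-networkd's view of the same
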
Copying large directories from these servers

Sometimes I need to copy large directories of performance test results from my perf servers to my laptop, and by large I mean 10GB or more. My home network isn't super fast and is shared, so I prefer to use thumb drives. Alas, copying 10G to a thumb drive with Ubuntu and then running sync on the command line gets kernel stack traces (see here).

What worked for me was to change VM settings by:
  1. Edit /etc/sysctl.conf
  2. Run: sudo sysctl -p
I used:
vm.dirty_background_bytes = 268435456
vm.dirty_bytes = 536870912
vm.dirty_writeback_centisecs = 20
vm.dirty_expire_centisecs = 50

The original values were
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
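
After running sysctl -p, the new values can be confirmed with:
sysctl vm.dirty_background_bytes vm.dirty_bytes vm.dirty_writeback_centisecs vm.dirty_expire_centisecs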

Finally, when copying files to my Mac laptop:
  1. Don't use the GUI (Finder) to drag and drop the files from the thumb drive to the laptop. MacOS will end up "preparing to copy ..." for a ridiculous amount of time.
  2. Do use the command line (cp) -- an example is below.
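
A sketch of the command-line copy on the Mac, with a made-up volume name and directory:
cp -R /Volumes/THUMBDRIVE/perf_results ~/perf_results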

2 comments:

  1. I don't know if you'll notice the same problem on your model, but with the SER-5 I noticed that usb storage performs very poorly using the front ports, compared to the rear usb ports (nearly unusable if keyboard and usb live installer are both on the front ports)
    I also had to install Fedora, as Debian 12 alpha installer versions have trouble with the wifi (modprobe hangs looking for a firmware patch)
    Otherwise, I am very happy, so far.

    1. I didn't notice the USB issue but I rarely use USB for storage except while installing Ubuntu via a thumb drive.

      I have wifi problems but all appear to be due to the servers connecting to a mesh point rather than the base router. I enabled "router steering" and hope that helps.

