Thursday, March 16, 2023

Huge pages with Postgres & InnoDB: better perf, but a bit more work

I started to use huge pages for the benchmarks I do with Postgres and MySQL/InnoDB. It helps performance but requires a few more steps to get working and this will be another source of failures for production deployments. See here for an example of the perf improvements.

Disclaimer - I am new to this and far from an expert.

Postgres

As always, the Postgres docs are useful:

  1. Figure out how many huge pages will be needed, call this X
  2. Edit /etc/sysctl.conf with vm.nr_huge_pages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the PG docs)
  5. Add huge_pages=try or huge_pages=on to the Postgres configuration
  6. Start Postgres
To estimate the number of huge pages (the value for X). First figure out the value of Hugepagesize (grep Huge /proc/meminfo) and in the example below that is 2MB. Then X = sizeof(buffer pool) / Hugepagesize.

$ grep Huge /proc/meminfo

AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
I suggest compacting the Linux VM because huge pages might not be available if you don't do this (I learned this from experience) and you can judge that based on the value of HugePages_Total from /proc/meminfo. After starting Postgres check /proc/meminfo again to confirm that huge pages were used. This post has more details

 My script for compacting the VM is:

sync; sync; sync

echo 3 > /proc/sys/vm/drop_caches

echo 1 > /proc/sys/vm/compact_memory

MySQL/InnoDB

The docs are here for InnoDB with huge pages although I think they need to be updated. AFAIK InnoDB always uses mmap now, while it used to use shm (shmget, shmat) prior to modern 8.0. When shm is used you need to worry about /proc/sys/vm/hugetlb_shm_group, /proc/sys/kernel/shmmax and /proc/sys/kernel/shmmall as mentioned in the InnoDB docs. I don't think those need to be updated now that InnoDB appears to always use mmap. For the 5.7 code that uses shm see here and for modern 8.0 that uses mmap see here.

To use huge pages with InnoDB:

  1. Figure out how many huge pages will be needed, call this X. See the Postgres section above.
  2. Edit /etc/sysctl.conf with vm.nr_huge_pages=$X
  3. sudo sysctl -p
  4. Compact the Linux VM (optional, not in the InnoDB docs). See the Postgres section above.
  5. Confirm that innodb_buffer_pool_chunk_size is larger than the huge page size. The default is 128M which is great when the huge page size is 2M.
  6. Add large_pages=ON to my.cnf
  7. Start mysqld
When I first tried this with MySQL 5.6 I forgot to edit /proc/sys/vm/hugetlb_shm_group as described in the InnoDB docs. I was able to start mysqld, but there were errors in the database error log and odd query and connection errors afterwards. Fortunately, I don't think you need to edit hugetlb_shm_group with modern 8.0 because it uses mmap instead of shm.

Update 1: watch out for OOM

Have been using this for sysbench and the insert benchmark on large GCP servers (c2-standard-60 with 240G RAM and 30 cores). Works great for Postgres but I have intermittent problems with MySQL/InnoDB -- see here.

A description of the /proc/meminfo contents is here.

The server has 240G of RAM, I set vm.nr_hugepages=108000, Hugepagesize from /proc/meminfo is 2MB. Also, a script is run to compact the Linux VM prior to starting mysqld.

From the following I don't understand why HugePages_Rsvd shrinks from 92153 to 62071, but something similar happens with Postgres and I have yet to experience intermittent problems there.

Example contents of /proc/meminfo immediately after starting mysqld is:
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   108000
HugePages_Free:    105113
HugePages_Rsvd:    92153
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        221184000 kB

And then after running a full benchmark (a few hours of work):
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:   108000
HugePages_Free:    75031
HugePages_Rsvd:    62071
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        221184000 kB

Looking at dmesg output I see the OOM killer has been active and killed mysqld -- see here.

From this I assume the server has 2 sockets and the HugePages_Total (108000) is split evenly between them. I am curious how that interacts with the large allocations done by InnoDB. Does it need to know to split them evenly between NUMA nodes? See here.
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
  Node 0 hugepages_total=54000 hugepages_free=53741 hugepages_surp=0 hugepages_size=2048kB
  Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
  Node 1 hugepages_total=54000 hugepages_free=53173 hugepages_surp=0 hugepages_size=2048kB

I assume the problem is the large value for anon-rss here and the workaround would be to reduce the size of the InnoDB buffer pool from 180G, maybe to 150G.
  Out of memory: Killed process 81153 (mysqld) total-vm:206822548kB, anon-rss:24688304kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:54752kB oom_score_adj:0

Using huge pages feels brittle, options are:
  1. Make buffer pool as large as normal, risk OOM
  2. Make buffer pool smaller than normal, avoid OOM
I prefer to use InnoDB with O_DIRECT, but if #2 is the workaround then perf suffers unless I switch to buffered IO.
Update 2: odd failures with Postgres

Huge pages continue to be a huge-ly brittle. Now I get intermittent failures in the benchmark with Postgres. The failures are from a failure to connect to Postgres, for example this happens a few hours into a benchmark after the initial load (insert without secondary indexes), index create, and then insert with secondary index steps all completed with success. The server has 240G of RAM, from ps I see that Postgres VSZ is ~160G while RSS is small. I don't see problems in dmesg output but I lost the Postgres logfile and will try to reproduce the problem.

Looks like /dev/shm is configured to get up to 119G on this server, which oversubscribes memory given that I have vm.nr_hugepages=85000 which takes 166G of the 240G on the server. But after starting PG I see that only 1.1M is used from /dev/shm. Does something try to put 80G of data there (what, why)? Or is there some source of pressure that just reclaims memory, and removes the files created in /dev/shm by Postgres long before it uses 80G of RAM?

$ bin/psql
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: could not open shared memory segment "/PostgreSQL.1400317704": No such file or directory
And from Python psycopg2.OperationalError: connection to server at "127.0.0.1", port 5432 failed: FATAL: could not open shared memory segment "/PostgreSQL.1400317704": No such file or directory
From /proc/meminfo long after the problem started ... $ grep -i huge /proc/meminfo AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 85000 HugePages_Free: 6610 HugePages_Rsvd: 90 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 174080000 kB
And /proc/meminfo immediately after starting Postgres
grep -i huge /proc/meminfo
AnonHugePages: 0 kB ShmemHugePages: 0 kB FileHugePages: 0 kB HugePages_Total: 85000 HugePages_Free: 83492 HugePages_Rsvd: 76972 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 174080000 kB

Started to read these posts:
  • https://postgrespro.com/list/thread-id/2494899
  • https://postgreshelp.com/postgresql-dynamic-shared-memory-posix-vs-mmap/
  • https://postgreshelp.com/postgresql-kernel-parameters
And I am using POSIX for dynamic_shared_memory_type.

ib=# select name, setting from pg_settings where name like 'dynamic_shared_memory_type';
            name            | setting
----------------------------+---------
 dynamic_shared_memory_type | posix

From ipcs -a
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000404 557079     mcallaghan 600        56         6

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

And from /dev/shm
$ ls -lrt /dev/shm
total 1052
-rw------- 1 mcallaghan mcallaghan   26976 May  1 15:48 PostgreSQL.4270028686
-rw------- 1 mcallaghan mcallaghan 1048576 May  1 15:50 PostgreSQL.837122846


2 comments:

  1. One of the lines above states 'Edit /etc/sysctl.conf with vm.nr_huge_pages=$X' but it should actually be vm.nr_hugepages (this is a bug in the MySQL docs too, I'm filing an issue).

    ReplyDelete

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for mak...