Friday, June 24, 2022

Fixing mmap performance for RocksDB

RocksDB inherited support for mmap from LevelDB. Performance was worse than expected because filesystem readahead fetched more data than needed, as I explained in a previous post. I am not a fan of the standard workaround, which is to tune kernel settings to reduce readahead, because that affects everything running on the server. The DBMS knows more about its IO patterns and can use madvise to provide hints to the OS, just as RocksDB uses fadvise for POSIX IO.
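
To illustrate the approach, here is a minimal Python sketch (the file path is hypothetical; Linux and Python 3.8+ are assumed for mmap.madvise) of hinting that access to a mapping will be random:

import mmap
import os

# Map a database file read-only and hint that access will be random,
# analogous to fadvise(POSIX_FADV_RANDOM) for POSIX IO.
fd = os.open("/data/db/000123.sst", os.O_RDONLY)  # hypothetical path
size = os.fstat(fd).st_size
mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
mm.madvise(mmap.MADV_RANDOM)  # suppress large readahead on this mapping

block = mm[4096:8192]  # a 4KB access no longer drags in a big readahead window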

Good news, issue 9931 has been fixed and the results are impressive. 

Benchmark

I used db_bench with an IO-bound workload - the same as was used for my previous post. Two binaries were tested:

  • old - this binary was compiled at git hash ce419c0f and does not have the fix for issue 9931
  • fix - this binary was compiled at git hash 69a32ee and has the fix for issue 9931
Note that git hash ce419c0f and 69a32ee are adjacent in the commit log.

The verify_checksums option was false for all tests. The CPU overhead would be much larger were it true because checksum verification would be done on each block access. Tests were repeated with cache_index_and_filter_blocks set to true and false. That didn't have a big impact on results.

Results

The graphs have results for these binary+config pairs:

  • cache0.old - cache_index_and_filter_blocks=false, does not have fix for issue 9931
  • cache0.fix - cache_index_and_filter_blocks=false, has fix for issue 9931
  • cache1.old - cache_index_and_filter_blocks=true, does not have fix for issue 9931
  • cache1.fix - cache_index_and_filter_blocks=true, has fix for issue 9931
The improvements from the fix are impressive for the benchmark steps that do reads for user queries -- see the green and red bars. The average read request size (rareq-sz in iostat) is:
  • for readwhilewriting: 115KB without the fix, 4KB with the fix
  • for fwdrangewhilewriting: 79KB without the fix, 4KB with the fix

Tell me how you really feel about mmap + DBMS

It hasn't been great for me. Long ago I did some perf work on an mmap DBMS, and Linux 2.6 kernels suffered from severe mutex contention in the VM code, so performance was lousy back then. But I didn't write this to condemn mmap. IO-bound workloads where the read working set is much larger than memory just might not be the best fit for mmap.

Comparing the improved mmap numbers above with the POSIX/buffered IO numbers from my previous post: peak QPS for the IO-bound tests (everything but fillseq and overwrite) is ~100k/second with mmap vs ~250k/second with buffered IO.

From the vmstat results collected during the benchmark I see:
  • more mutex contention with mmap based on the cs column
  • more CPU overhead with mmap based on the us and sy columns

Legend:
* qps - throughput from the benchmark
* cs - context switches from the cs column in vmstat
* us - user CPU from the us column in vmstat
* sy - system CPU from the sy column in vmstat

Average values
IO      qps     cs      us      sy
mmap     91757  475279  15.0    7.0
bufio   248543  572470  13.8    7.0

Values per query (column divided by QPS; for us and sy the result is then multiplied by 1,000,000 to be readable)
IO      qps     cs      us      sy
mmap    1       5.2     163     76
bufio   1       2.3      55     28
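
A small Python sketch of that normalization, using the averages from the first table:

# Normalize the vmstat averages by QPS. us and sy are CPU percentages,
# so the per-query values are scaled by 1,000,000 to be readable.
avg = {
    "mmap":  {"qps": 91757,  "cs": 475279, "us": 15.0, "sy": 7.0},
    "bufio": {"qps": 248543, "cs": 572470, "us": 13.8, "sy": 7.0},
}
for name, row in avg.items():
    qps = row["qps"]
    print(name,
          round(row["cs"] / qps, 1),   # context switches per query
          int(row["us"] / qps * 1e6),  # user CPU per query, scaled
          int(row["sy"] / qps * 1e6))  # system CPU per query, scaled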

Thursday, June 16, 2022

Insert Benchmark for Postgres 12, 13, 14 and 15: part 2

This has graphs of throughput vs time for three of the Insert Benchmark steps. The goal is to determine whether there is too much variance. A common source of variance is checkpoint stalls when using a B-Tree. This is a follow up to my blog post on the Insert Benchmark for Postgres versions 12.11, 13.7, 14.3 and 15b1. 

The benchmark steps for which graphs are provided are:

  • l.i0 - load in PK order without secondary indexes
  • l.i1 - load in PK order with 3 secondary indexes
The benchmark is repeated for two workloads -- cached and IO-bound. 

Cached

The database fits in memory for the cached workload.

There isn't much variance for the l.i0 workload.
The graph for the l.i1 workload is more exciting, which is expected. For the l.i0 workload the inserts are in PK order and there are no secondary indexes, so each insert makes R/P pages dirty, where R is the row size, P is the page size and R/P is much less than 1. But for l.i1 each insert is likely to make (3 + R/P) pages dirty, so there is much more stress on page writeback and storage. The "3" in the estimate comes from doing maintenance for each of the 3 secondary indexes. The drops in the graph are usually stalls when writeback falls behind.
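
As a worked example of the estimate (the row size is an assumption; the Postgres page size defaults to 8KB):

R = 150.0    # assumed row size in bytes
P = 8192.0   # default Postgres page size in bytes
print(R / P)      # l.i0: ~0.02 pages made dirty per insert
print(3 + R / P)  # l.i1: ~3.02 pages made dirty per insert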

IO-bound

The database is much larger than memory for the IO-bound workload.

The graph for l.i0 shows occasional drops that I assume are from writeback falling behind. However, they can also be caused by the SSD that I use.

The graph for l.i1 is more interesting, which is expected because it does more IO to storage per insert. The pattern is very regular: the insert rate gradually rises from ~2000/s to ~2800/s and this repeats every ~300 seconds. Were I to bet, I would say this is caused by Postgres rather than my SSD. Perhaps one day I will do experiments to determine whether tuning the Postgres config can reduce the variance.


Insert Benchmark for Postgres 12, 13, 14 and 15

I ran the Insert Benchmark for Postgres versions 12.11, 13.7, 14.3 and 15b1. Reports are provided for cached and IO-bound workloads. The benchmark is run on Intel NUC servers at low concurrency. The goal is to determine how performance changes over time. A description of the Insert Benchmark is here.

tl;dr

  • I am not a Postgres expert
  • regressions are small in most cases
  • the l.i0 benchmark step has regressions that are worth investigating in versions 14 and 15b1
    • these regressions might have been fixed, see the perf report for the patch (15b1p1)
Updates:
  • added links to the configuration files
  • part 2 with throughput vs time graphs is here
  • provided data on the percentage of time in parse, analyze, optimize, execute
  • added command lines 
  • added the index + relation sizes after each test step
  • added links to performance reports when prepared statements were used for queries
  • added links to performance reports with a patch to improve version 15b1
  • added a second set of results for 15b1p1 (15b1 with the patch) for cached and IO-bound. Results are similar to the first set.
  • livestreaming my debugging efforts, see Debugging the regression below
The regression

After I shared this blog post a possible cause for the regression was discussed here and the community quickly provided a patch for me to test. The results are excellent as 15b1p1 (15b1 with the patch) almost matches the results for version 13.7 on the cached workload. For the IO-bound workload the patch fixes the regression for all but one of the benchmark steps (l.i1).

The improvement is obvious in the cached workload Summary. The improvement is also there for the IO-bound workload, but less obvious. In the IO-bound workload Summary the throughput for l.i0 (load without secondary indexes) and q100.1 is better than the results for version 14.3 and almost matches version 13.7, while the throughput for l.i1 (inserts with secondary index maintenance) did not improve. From the IO-bound metrics for l.i1 I see that CPU (cpupq == CPU per statement) has been decreasing with each major release, but with the patch (15b1p1) the value increased to match the value for version 13.7. I don't see this effect in the l.i1 metrics for the cached workload.

Summer has arrived and thermal throttling might be an issue for my SSDs, which would be an unfortunate source of variance. The next step is to repeat the IO-bound workload at night when it is cooler and then look at CPU profiles for l.i1.

A guide to the performance reports is here. For explaining performance the most interesting metric is cpupq (CPU/query or CPU/insert depending on the benchmark step). From the guide, cpupq includes CPU by the client, DBMS and anything else running on the server. The benchmark client, written in Python, uses a lot of CPU. That can make it harder to see how CPU changes from the DBMS. With MySQL, I can also measure the CPU time used by mysqld. With Postgres I am still figuring out how to provide a DBMS-only measurement of the CPU overheads.

Test setup

Tests are run on Intel NUC servers I have at home. The OS is Ubuntu 20.04, hyperthreading and turbo-boost are disabled. All binaries are compiled using the same gcc/glibc toolchain and all tests used the same server.

The benchmark is repeated for a cached and IO-bound workload. The database fits in memory for the cached workload and is larger than memory for the IO-bound workload. The Postgres configuration files are here for 12.11, 13.7, 14.3 and 15b1.

The Insert Benchmark description has more details. In summary the benchmark has several steps and I focus on the ones where performance changes:

  • l.i0 - load in PK order, the table that has no secondary indexes
  • l.x - create three secondary indexes
  • l.i1 - insert in PK order, the table that has 3 secondary indexes
  • q100.1 - measure range query QPS from one client while another does 100 inserts/s
  • q500.1 - measure range query QPS from one client while another does 500 inserts/s
  • q1000.1 - measure range query QPS from one client while another does 1000 inserts/s
The l.i0 step inserts 20m (500m) rows for the cached (IO-bound) workload. The l.i1 step inserts 20m (10m) rows for the cached (IO-bound) workload. The q100.1, q500.1 and q1000.1 steps each run for 2 hours.

Command lines to try out the insert benchmark:
# old school, no prepared statements
python3 iibench.py --db_user=foo --db_host=127.0.0.1 \
  --db_password=bar --dbms=postgres --db_name=ib --setup \
  --query_threads=1 --max_rows=1000000 --secs_per_report=1

# try out the new support for prepared statements
python3 iibench.py --db_user=foo --db_host=127.0.0.1 \
  --db_password=bar --dbms=postgres --db_name=ib --setup \
  --query_threads=1 --max_rows=1000000 --secs_per_report=1 \
  --use_prepared_insert --use_prepared_query

Cached workload

The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression. Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client.

I use relative throughput to document performance changes. It is: my_throughput / base_throughput where base_throughput is the result from Postgres 13.7. The interesting changes are below. The results aren't surprising. In most cases the regressions are small because Postgres is good at avoiding them. The regressions for 15b1 are larger because it is a beta and a newer release.

Throughput relative to 13.7
version l.i0    l.i1    q100.1
14.3    0.934   0.971   0.971
15b1    0.856   0.946   0.953

The percentage of samples from pg_plan_queries, pg_parse_query, pg_analyze_and_rewrite and query execution from v14.3. Prepared statements were not used.

step    plan    parse   analyze execute
l.i0    20      11      25      44
l.i1    8       5       10      77
q100.1  33      6       10      51

Cached - l.i0

The largest regressions are from the l.i0 step. While this test doesn't run for a long time, the results are similar for the IO-bound workload. Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. For 15b1 I renamed ExecInitExprInternal to ExecInitExpr to match the names used in older versions.

I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.

Before looking at the differential flamegraphs I compared the output by hand. For functions called by PostgresMain not much changes between versions in pg_plan_queries and pg_parse_queries. There is a small improvement for pg_analyze_and_rewrite and a small regression for PortalRun. The rest of the drill down is here with a focus on functions called under PortalRun. There are regressions in a few functions when ExecScan is on the call stack.

Percentage of samples for children of PostgresMain
12.11   13.7    14.3    15b1    function
-----   -----   -----   -----   --------
21.50   20.27   20.44   19.84   pg_plan_queries
11.56   12.19   11.34   10.34   pg_parse_queries
25.76   25.70   24.92   22.64   pg_analyze_and_rewrite
33.02   33.06   34.94   37.31   PortalRun

Cached - l.i1

Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small.

Cached - q100.1

Flamegraphs are here for 12.11, 13.7, 14.3 and 15b1. Differential flamegraphs are here for 13.7 vs 14.3 and 14.3 vs 15b1. I didn't do the summary by hand because the differences are small. ReadyForQuery is much less prominent in 15b1 relative to earlier versions. It isn't clear that the differential graph for 14.3 vs 15b1 is useful. Perhaps too much code has changed.

Table and index sizes

These are the sizes from pg_indexes_size, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.
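
A sketch of how the sizes can be collected with psycopg2; the table name and connection details are assumptions:

import psycopg2

# Collect the three sizes for a table named ib (hypothetical).
conn = psycopg2.connect("host=127.0.0.1 dbname=ib user=foo password=bar")
cur = conn.cursor()
cur.execute("""select pg_indexes_size('ib'),
                      pg_relation_size('ib'),
                      pg_total_relation_size('ib')""")
index_b, rel_b, total_b = cur.fetchone()
print(index_b // 2**20, rel_b // 2**20, total_b // 2**20)  # sizes in MB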

Legend
* all sizes are in MB unless indicated with G (for GB)
* index is value from pg_indexes_size
* rel is value from pg_relation_size
* total is value from pg_total_relation_size

l.i0 - after load in PK order without secondary indexes
version index   rel     total
12.11   428     1531    1959
13.7    428     1531    1959
14.3    428     1531    1959
15b1    428     1531    1959

l.x - after create 3 secondary indexes
version index   rel     total
12.11   2243    1538    3782
13.7    2243    1538    3782
14.3    2243    1538    3782
15b1    2243    1538    3782

l.i1 - after load more data with 3 secondary indexes to maintain
version index   rel     total
12.11   5299    3069    8368
13.7    5308    3069    8378
14.3    5308    3069    8378
15b1    5308    3069    8378

q3.1000 - at the end of the read+write steps
version index   rel     total
12.11   8338    3951    12G
13.7    8333    3950    12G
14.3    8333    3950    12G
15b1    8333    3950    12G

IO-bound workload

The reports are here for ibench without and with prepared statements used for queries. Next is a report for prepared statements and a patch for 15b1 (I call it 15b1p1) that fixes a CPU regression.

Postgres query throughput improves by ~1.5X when prepared statements are used for queries. Prepared statements are not yet used for inserts because I need to debug CPU perf issues on the ibench client. I still need to figure out why more IO isn't done per query for this workload given that it is IO-bound and the secondary indexes are much larger than memory.

I don't think the IO-bound report needs additional analysis as the regression for l.i0 is analyzed above. I also have a benchmark report here for Postgres versions 14.0, 14.1, 14.2 and 14.3. There is little change across them so any regressions from 13.7 to 14.3 are likely in 13.7 to 14.0.

Table and index sizes

These are the sizes from pg_indexes_size, pg_relation_size and pg_total_relation_size. I suspect that the improvements (size reduction) in recent Postgres versions don't reproduce here because the workload is insert only.

Legend
* all sizes are in GB
* index is value from pg_indexes_size
* rel is value from pg_relation_size
* total is value from pg_total_relation_size

l.i0 - after load in PK order without secondary indexes
version index   rel     total
12.11   10      37      48
13.7    10      37      48
14.3    10      37      48
15b1    10      37      48

l.x - after create 3 secondary indexes
version index   rel     total
12.11   55      37      92
13.7    55      37      92
14.3    55      37      92
15b1    55      37      92

l.i1 - after load more data with 3 secondary indexes to maintain
version index   rel     total
12.11   55      38      94
13.7    55      38      94
14.3    55      38      94
15b1    55      38      94

q3.1000 - at the end of the read+write steps
version index   rel     total
12.11   57      39      96
13.7    57      39      96
14.3    57      39      96
15b1    57      39      96

Debugging the regression

This section has details from trying to debug the regression while I collaborate with a few experts from the Postgres community. For the regression in the l.i1 benchmark step (load in PK order with 3 secondary indexes) on the IO-bound workload there is:

  • ~7% increase in KB written per insert (wkbpi)
  • ~7% increase in KB read per insert (rkbpq)
  • ~12% increase in CPU per insert (cpupq), but note that CPU here includes everything on the host, including the benchmark client

From pg_stat_bgwriter there are differences between 15b1 and 15b1p1 ...

select * from pg_stat_bgwriter is here for pg15b1p1:

checkpoints_timed     | 32
checkpoints_req       | 6
checkpoint_write_time | 7734090
checkpoint_sync_time  | 6471
buffers_checkpoint    | 12444695
buffers_clean         | 13580345
maxwritten_clean      | 0
buffers_backend       | 7593943
buffers_backend_fsync | 0
buffers_alloc         | 19970817
stats_reset           | 2022-06-23 21:45:38.834884-07

And then for pg15b1:

checkpoints_timed     | 37
checkpoints_req       | 2
checkpoint_write_time | 8283432
checkpoint_sync_time  | 6109
buffers_checkpoint    | 12753126
buffers_clean         | 13702867
maxwritten_clean      | 0
buffers_backend       | 6500038
buffers_backend_fsync | 0
buffers_alloc         | 20002787
stats_reset           | 2022-06-20 16:55:15.35976-07
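
A sketch of how the counters can be compared, assuming psycopg2 (connection details are assumptions): snapshot pg_stat_bgwriter before and after a benchmark step, then print the per-counter deltas.

import psycopg2

def snapshot(cur):
    # return the single pg_stat_bgwriter row as a dict of column -> value
    cur.execute("select * from pg_stat_bgwriter")
    cols = [d[0] for d in cur.description]
    return dict(zip(cols, cur.fetchone()))

conn = psycopg2.connect("host=127.0.0.1 dbname=ib user=foo password=bar")
cur = conn.cursor()
before = snapshot(cur)
# ... run the benchmark step here ...
after = snapshot(cur)
for name, value in before.items():
    if isinstance(value, (int, float)):  # skips stats_reset, a timestamp
        print(name, after[name] - value)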

And finally the graph of throughput vs time for l.i1 shows some kind of stall for 15b1p1 that doesn't occur for 15b1, and this stall is repeatable. It occurs in both runs of the benchmark I did for 15b1p1.

First is the graph for 15b1, which doesn't have the stall. The x-axis is time, the y-axis is inserts/s and the results are reported by the benchmark client at 1-second intervals.

Next is the graph for 15b1p1 which has a stall.

Thursday, May 26, 2022

The benefit of lz4 and zstd for Postgres WAL compression

In 2020 I published benchmark results to understand the impact of DBMS tuning options for the insert benchmark with MySQL and Postgres (here and here). One of the results was that wal_compression = pglz used more CPU than expected. I asked aloud whether support for lz4 would help.

Well, the Postgres community has delivered and either lz4 or zstd can be used in place of pglz in the upcoming Postgres version 15. Docs are here. Support for lz4 to compress TOAST was added in version 14.

tl;dr - wal_compression=lz4 or =zstd are a big improvement over =pglz

Details

Postgres optionally writes full page images (the pre-image) to the WAL the first time a page is modified after a checkpoint. This provides protection from torn-page writes. InnoDB solves this via the doublewrite buffer. An LSM like RocksDB doesn't modify previously written database pages so it doesn't need something similar.

Writing pages to the redo log increases the frequency at which log files must be rotated. A possible side effect is that checkpoints trigger more often, which means more page writeback IO, and that can hurt performance.

Compressing those page images can reduce the write rate, which then reduces the rate at which checkpoints happen (and also the rate at which checkpoint IO happens). Prior to Postgres version 15 the only option for wal_compression was pglz. With Postgres version 15 there are new options -- lz4 and zstd. My github pages repo has a benchmark report for the Insert Benchmark run on a small server with an IO-bound workload to understand the impact of this option.
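
A sketch of enabling it on a Postgres 15 server, assuming psycopg2 and superuser access (connection details are assumptions):

import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=ib user=foo password=bar")
conn.autocommit = True  # ALTER SYSTEM can't run in a transaction block
cur = conn.cursor()
cur.execute("alter system set wal_compression = 'lz4'")  # or 'zstd', 'pglz'
cur.execute("select pg_reload_conf()")  # a reload is enough for this setting
cur.execute("show wal_compression")
print(cur.fetchone()[0])  # should report lz4 once the reload is processed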

Results for IO-bound

This section has results for a benchmark where the database is much larger than RAM. The report is here. For this benchmark I used four configurations:

  • cx5 - wal_compression=off
  • cx5a - wal_compression=pglz
  • cx5b - wal_compression=lz4
  • cx5c - wal_compression=zstd

Summarizing the summary from the (shell script generated) report:

  • Loads in PK order without secondary indexes have similar performance regardless of the wal_compression setting. This isn't a surprise as PK order loads should have the least writeback per insert. The metrics are here.
  • Create secondary index is ~1.7X faster with wal_compression=lz4 compared to =pglz. From the metrics with pglz the CPU overhead is almost 2X larger than with lz4 (see cpupq column). The performance with =lz4 is the same as with =off.
  • Inserts that are in PK order but random order for the 3 secondary indexes are faster with =lz4 or =zstd than with =off or =pglz where =lz4 is ~1.4X faster than =pglz. From the metrics the use of =lz4 or =zstd reduces IO somewhat (see wkbpi column) and CPU a lot (see cpupq column).
Results for cached

This section has results for a benchmark where the database fits in the Postgres buffer pool. The report is here. From the summary the results are similar to the IO-bound results above, as are the CPU and IO improvements from lz4 and zstd -- see the metrics sections for create secondary index (l.x) and load with secondary index maintenance (l.i1).

Defining "X has too many options"

This is another post in my Well, actually series. Sadly, I didn't think of naming that series until now so the previous posts might take a while to identify.

This post is inspired by the claims I read that RocksDB has too many options. I agree that it does, but the claim is true for every database that I have used.

This claim is too vague. I can make it less vague via multiple levels:

  • level 0 - X has too many options and they all matter
  • level 1 - X has too many options and the ones that matter change based on the workload. Figuring out which options matter per workload can be expensive.
  • level 2 - X has too many options but only a small subset matters regardless of workload. Figuring out which options matter here can be expensive but is usually a one time cost. I have done some option sensitivity analysis benchmarks for MyRocks and RocksDB.
  • level 3 - X has too many options but none of them matter
Any system in level 0 is going to be a problem. RocksDB is between levels 1 and 2; for me it is closest to level 2. A few blog posts that cover basic configuration are here and here.

Tuesday, May 10, 2022

RocksDB internals: LRU

RocksDB uses LRU to manage the size of the block cache. The LRU can suffer from mutex contention so I describe the implementation in detail here.

An interesting question is whether clock is a better choice than LRU assuming it suffers less from mutex contention. Until that is answered I hope to determine how much improvement can be made by changing the code to not move a block to the head of the LRU after each reference.

Updates:

  • fix a bogus claim about cache_index_and_filter_blocks

The big picture

The LRU is sharded and the number of shards is 2**num_shard_bits where num_shard_bits is an argument to the LRUCacheOptions constructor. With db_bench you can set the --cache_numshardbits option and the default is 6 (64 shards). The use of the LRUCacheOptions constructor in db_bench is here. The cache is created by a call to NewLRUCache.
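
My reading of the shard selection, as a Python sketch (illustrative, not the RocksDB code): the shard is chosen from the high bits of a 32-bit hash of the block's cache key.

num_shard_bits = 6                # the db_bench --cache_numshardbits default
num_shards = 2 ** num_shard_bits  # 64 shards

def shard(hash32):
    # the high num_shard_bits bits of the hash pick the shard
    return hash32 >> (32 - num_shard_bits)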

When cache_index_and_filter_blocks is true then index and filter blocks are stored in the block cache. Doing this provides better control over the amount of memory used by RocksDB. The cache can do midpoint insertion to guard frequently read blocks from being evicted by long range scans. This is enabled by setting cache_index_and_filter_blocks_with_high_priority=true and possibly setting high_pri_pool_ratio in the LRUCacheOptions constructor. From db_bench and the readrandom benchmark with 32 client threads and uniform key access I get 5% more throughput when pin_l0_index_and_filter_blocks_in_cache is true.

From a comment for NewLRUCache, the total capacity is divided and evenly assigned to each shard. Each shard also gets a mutex, a hash table in which blocks are stored and an LRU list. The mutex is locked when the cache is searched for a block, a block is inserted into the cache or blocks are evicted from the cache. 

The per-shard mutex and per-shard LRU list pointers can suffer from contention. On a search the found block is removed from the LRU list and later, when the searcher is done with the block, the block is inserted at the head of the LRU list. Thus, using a block means going through the hot spot twice -- lock/unlock the per-shard mutex each time, and update the contended LRU head pointer when re-inserting the block.

Finally, the block cache stores data blocks, index blocks and filter blocks. With leveled compaction, the work done per level of the LSM tree for a point lookup (Get operation) on levels > 0 is listed below, with a Python sketch of the loop after the list. A typical search accesses the block cache once per level for levels on which the key does not exist and then 3 times for the level on which the key exists.

  1. Search to find the SST for which the searched for key is between the min and max key for the SST
  2. Get the filter block for that SST from the block cache. If the bloom filter check means the key isn't in that SST then go to step 1 for the next level of the LSM tree.
  3. Get the index block for that SST from the block cache. Search the index block to find the data block that might contain the key.
  4. Get the data block for that SST from the block cache. If the key is found there then the search is done, else go to step 1 for the next level of the LSM tree.
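
Here is that loop as a Python sketch; the SST and block cache method names are hypothetical stand-ins:

def point_lookup(key, levels, block_cache):
    for level in levels:  # levels > 0, searched smallest to largest
        sst = level.find_sst(key)  # step 1: min/max key check
        if sst is None:
            continue
        filter_block = block_cache.get(sst.filter_block_id)  # step 2
        if not filter_block.may_contain(key):
            continue  # bloom filter says the key is not in this SST
        index_block = block_cache.get(sst.index_block_id)  # step 3
        data_block = block_cache.get(index_block.find_data_block(key))  # step 4
        value = data_block.find(key)
        if value is not None:
            return value  # 3 block cache accesses on this level
    return None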

Implementation overview

These are my notes from reading the code after looking at PMP stack traces that show contention in the 3 places where the block cache is searched - for data, index and filter blocks. Reformatted traces are here.

The code is mostly in 4 classes/structs: LRUHandle, LRUHandleTable, LRUCacheShard and LRUCache.

LRUHandle

LRUHandle has great comments that I summarize here:
  • each entry is a variable length heap-allocated structure
  • entries are referenced by cache and/or by any external entity
  • the cache keeps all of its entries in a hash table. Some entries are also stored on the LRU list.

An LRUHandle is in one of three states:
  1. Referenced externally AND in hash table
    • the entry is *not* in the LRU list
    • refs >= 1 && in_cache == true
  2. Not referenced externally AND in hash table
    • the entry is in the LRU list and can be freed
    • refs == 0 && in_cache == true
  3. Referenced externally AND not in hash table
    • entry is not in the LRU list and not in hash table
    • entry can be freed when refs becomes 0
    • refs >= 1 && in_cache == false
All newly created LRUHandles are in state 1.
  • If you call LRUCacheShard::Release on entry in state 1, it will go into state 2.
  • To move from state 1 to state 3, either call LRUCacheShard::Erase or LRUCacheShard::Insert with the same key (but possibly different value).
  • To move from state 2 to state 1, use LRUCacheShard::Lookup.
  • Before destruction, make sure that no handles are in state 1. This means that any successful LRUCacheShard::Lookup/LRUCacheShard::Insert has a matching LRUCache::Release (to move into state 2) or LRUCacheShard::Erase (to move into state 3). A toy model of these transitions is sketched below.
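
The toy model in Python; this is a sketch, not the RocksDB code. The state is implied by (refs, in_cache).

class LRUHandleModel:
    def __init__(self):
        self.refs, self.in_cache = 1, True  # new handles start in state 1

    def state(self):
        if self.in_cache:
            return 1 if self.refs >= 1 else 2  # state 2: on the LRU list
        return 3  # state 3: freed when refs drops to 0

    def release(self):  # state 1 -> state 2 when the last ref is dropped
        self.refs -= 1

    def erase(self):    # state 1 -> state 3
        self.in_cache = False

    def lookup(self):   # state 2 -> state 1
        self.refs += 1
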
Explaining the stack traces

The rest of this post explains what I saw in the stack traces that show contention for LRUCacheShard::Lookup and LRUCacheShard::Release. A block search is done in Lookup and the block is pinned (referenced) if found. The block is unpinned by Release when the search is finished with it. Both require a lock/unlock of the per-shard mutex. And Release can also write to the (contended) head of the LRU list. I also have edited PMP output for those traces.

A cache search starts with ShardedCache::Lookup. That computes the hash to determine the shard and then calls LRUCacheShard::Lookup on that shard. When the searcher is done with the block it calls ShardedCache::Release, which calls LRUCacheShard::Release on that shard.

LRUCacheShard::Lookup does:

  lock LRUCacheShard::mutex_
  if LRUHandleTable::Lookup returns an entry (e)
    if not LRUHandle::HasRefs():
      LRU_Remove(e) // entry was on the LRU list since it has no external references
    LRUHandle::Ref (for e)    // does refs++
    LRUHandle::SetHit (for e) // does flags |= ...
  unlock LRUCacheShard::mutex_


LRU_Remove does:

  removes an LRUHandle from the doubly linked LRU list
  calls LRUHandle::CalcTotalCharge
  updates memory accounting

LRUHandle::CalcTotalCharge does:

  if (metadata_charge_policy == kFullChargeCacheMetadata)
    call malloc_usable_size to update meta_charge

I was curious about the overhead from frequent calls to malloc_usable_size which can be done when the per-shard mutex is locked. Results are here. Preventing these calls doesn't solve the mutex contention problem.

LRUCacheShard::Release does:

  lock LRUCacheShard::mutex_
  call e->Unref()
  
  if ref count is 0 and e->InCache()
    if cache_full
      LRUHandleTable::Remove(e)
      e->SetInCache(False)
    else
      LRU_Insert(e)
  
  if (no longer referenced):
    call e->CalcTotalCharge
  unlock LRUCacheShard::mutex_
  
  if (no longer referenced):
    // This is done after the mutex has been unlocked
    e->Free()

LRU_Insert does:

  // LRUCacheShard::mutex_ held by caller
  call e->CalcTotalCharge()
  if high_pri_pool enabled && (e->IsHighPri() || e->HasHit())
    move e to head of LRU list
    MaintainPoolSize()
  else
    move e to head of low-pri pool

MaintainPoolSize does:

  while high pri pool too big
    move entry from tail of high pri LRU to low-pri pool

LRUHandleTable::Remove does:

  call FindPointer
  remove from the doubly linked list

LRUHandleTable::FindPointer does:

  walk a list to find the LRUHandle for a key

The number of entries on the list that FindPointer will encounter should be ~1 on average, and LRUHandleTable::Insert calls LRUHandleTable::Resize to make that likely. I am curious whether calls to Resize are a source of stalls because the per-shard mutex is locked when that happens. That won't be a common stall.

Performance bugs: p50 vs p99

One way to classify performance bugs is p50 vs p99. 

  • A p50 performance bug occurs for all workloads
  • A p99 performance bug only happens for some workloads or in some contexts. By some workloads I mean it might only occur during hot backup or load. By some contexts I mean it might only happen under high concurrency or load spikes which don't represent the steady state.
The impact from fixing a p50 perf bug is easy to predict. Everything gets slightly faster and/or more efficient. But p99 perf bugs are outliers -- they don't make everything slower so fixing them doesn't make everything faster.

The impact from fixing a p99 bug is less obvious. On average there might be little impact with respect to average throughput and efficiency. There can be a huge impact with respect to response time goals. The challenge is to quantify the benefit and I have learned over the years that motivating the fix for p99 perf bugs is a hard problem. For MySQL I wrote many blog posts to market open bug reports.

One aspect is that there are easy workarounds for p50 perf bugs. If a bug makes something 10% less efficient then buy 10% more hardware. There might be no workarounds for a p99 perf bug. If something causes a long stall once/day or ruins performance for one infrequent part of your workload then the choices are to either tolerate it (someone suffers) or replace the system.

I spent a few years iterating on p99 bugs in MySQL back in my web-scale MySQL+InnoDB days. It was interesting work and I benefited from production, which was a great source of p99 bugs, and from a colleague named Domas who was great at documenting such bugs.