Monday, June 8, 2015

RocksDB & ForestDB via the ForestDB benchmark: IO-bound and disks

This post has performance results for RocksDB and ForestDB using the ForestDB benchmark. The focus for this test is an IO-bound workload on a disk array: the database is about 3X larger than RAM. The server has 24 hyperthread cores, 144G of RAM and 6 disks (10k RPM SAS) using HW RAID 0. Background reading is in a previous post.

While RocksDB does much better in the results here, I did this work to understand the differences in performance rather than to claim that RocksDB is superior. Hopefully the results here will help make ForestDB better.

Test setup

The test pattern was described in the previous post. Here I use shorter names for each of the tests:
  • load - Load
  • ows.1 - Overwrite-sync-1
  • ows.n - Overwrite-sync-N
  • pqw.1 - Point-query-1-with-writer
  • pqw.n - Point-query-N-with-writer
  • rqw.1 - Range-query-1-with-writer
  • rqw.n - Range-query-N-with-writer
  • pq.1 - Point-query-1
  • pq.n - Point-query-N
  • rq.1 - Range-query-1
  • rq.n - Range-query-N
  • owa.1 - Overwrite-async-1
  • owa.n - Overwrite-async-N
I used these command lines with my fork of the ForestDB benchmark:
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 rocksdb 20 no 1
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 fdb 20 no 64

The common options include:
  • load 2B documents
  • use 32G for the database cache. The server has 144G of RAM.
  • use N=10 for the tests with concurrency
  • use a 600 second warmup and then run for 3600 seconds
  • limit the writer thread to 1000/second for the with-writer tests
  • range queries fetch ~20 documents
  • do not use periodic_commit for the load
The RocksDB-specific options include (a rough sketch of equivalent settings via the RocksDB API follows these lists):
  • use a 64M write buffer for all tests
  • use one LSM tree
The ForestDB-specific options include:
  • use 64 database files to reduce the max file size. This was done to give compaction a better chance of keeping up and to avoid temporarily doubling the size of the database during compaction.
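To make the configuration above concrete, here is a minimal sketch of roughly equivalent settings via the RocksDB C++ API. This is my illustration, not the benchmark client's code: the database path is a placeholder and everything not listed above is left at its default.

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // 64M write buffer (memtable), per the RocksDB-specific options above.
  options.write_buffer_size = 64UL * 1024 * 1024;

  // 32G block cache, the "database cache" from the common options.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(32UL * 1024 * 1024 * 1024);
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  // One LSM tree: a single DB with the default column family.
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/rocksdb/data", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}
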
Test results

The first result is the average throughput during each test, in operations/second. I have written previously about benchmarketing vs benchmarking, and average throughput leaves out the interesting bits like response time variance. Alas, my time to write this is limited too.

ForestDB is slightly faster for the load. Even with rate limiting, RocksDB incurs too much IO debt during this load. I don't show it here, but the compaction scores for levels 0, 1 and 2 in the LSM were higher than expected given the rate limits I used. We have work-in-progress to fix that.

For the write-only tests (ows.1, ows.n, owa.1, owa.n) RocksDB is much faster than ForestDB. From the rates below it looks like ForestDB might be doing a disk read per write, because ~200 disk reads/second is about what 1 thread can get from this disk array. I collected stack traces from other tests that showed disk reads in the commit code path, so I think that is the problem here. I will share the stack traces in a future post.
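For reference, the sync vs async distinction on the RocksDB side comes down to whether the WAL is forced to storage on every write. A minimal sketch, assuming an open rocksdb::DB* db (my illustration, not the benchmark code):

rocksdb::WriteOptions sync_writes;
sync_writes.sync = true;    // ows.*: the WAL is fsync'd on every commit
db->Put(sync_writes, "key", "value");

rocksdb::WriteOptions async_writes;
async_writes.sync = false;  // owa.*: the WAL write is buffered, no fsync per commit
db->Put(async_writes, "key", "value");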

RocksDB does much better on the range query tests (rqw.1, rqw.n, rq.1, rq.n). With ForestDB, data for adjacent keys is unlikely to be adjacent in the database file unless it was loaded in that order and not updated after the load, so range queries might do 1 disk seek per document. With RocksDB we can assume that all data was in cache except for the max level of the LSM, and in the max level data for adjacent keys is adjacent in the file, so RocksDB is unlikely to do more than 1 disk seek per short range scan.
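A short range scan in RocksDB is roughly the sketch below (my illustration; start_key is an assumed variable): seek to the start key, then iterate ~20 documents. Because keys in the max level are stored in sorted order, the Next calls read data that is adjacent in the file, which is why one disk seek per scan is a reasonable expectation.

// Fetch ~20 adjacent documents starting at start_key, like the rqw/rq tests.
std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
int fetched = 0;
for (it->Seek(start_key); it->Valid() && fetched < 20; it->Next(), ++fetched) {
  // it->key() and it->value() hold the current document.
}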

I don't have a good explanation for the ~2X difference in point query QPS (pqw.1, pqw.n, pq.1, pq.n). The database is smaller with RocksDB, but not small enough to explain this. For pq.1, the single-threaded point-query test, both RocksDB and ForestDB were doing ~184 disk reads/second with similar latency of ~5ms/read. So ForestDB was doing almost 2X more disk reads per query. I don't understand ForestDB file structures well enough to explain that.

It is important to distinguish between logical and physical IO when trying to explain RocksDB IO performance. Logical IO means that a file read is done but the data is in the RocksDB block cache or the OS page cache, so there is no disk read. Physical IO means that a file read is done and the data is not in cache, so there is a disk read. For this configuration all levels before the max level of the LSM are in cache for RocksDB, and only some of the max level is in cache because the max level has 90% of the data.
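One way to separate the two for RocksDB, sketched below under the assumption that statistics are enabled: the block cache miss counter counts file reads (logical IO), while iostat on the device counts disk reads (physical IO), and the gap between them is absorbed by the OS page cache.

// Enable statistics before opening the DB, then read the counter after a test.
options.statistics = rocksdb::CreateDBStatistics();
// ... run the workload ...
uint64_t logical_reads =
    options.statistics->getTickerCount(rocksdb::BLOCK_CACHE_MISS);
// Physical reads come from iostat (r/s); a logical read that hits the
// OS page cache never shows up there.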

For the tests that used 1 writer thread limited to 1000 writes/second, RocksDB was able to sustain that rate. For ForestDB the writer thread only did ~200 writes/second.

Operations/second for each step
        RocksDB  ForestDB
load      58137     69579
ows.1      4251       289
ows.n     11836       295
pqw.1       232       123
pqw.n      1228       654
rqw.1      3274        48
rqw.n     17770       377
pq.1        223       120
pq.n       1244       678
rq.1       2685       206
rq.n      16232       983
owa.1     56846       149
owa.n     49078       224

I looked at write-amplification for the ows.1 test. I measured the average throughput and the average write-KB/second from iostat, then divided the write rate by the throughput to get write-KB/update. For example, for RocksDB that is 189218 / 4252 ≈ 44.5. The IO write rate per update is about 2X higher with RocksDB.

           throughput  write-KB/s  write-KB/update
RocksDB        4252       189218      44.5
ForestDB        289         6099      21.1

The next result is the size of the database at the end of each test step. Both were stable for most tests, but the RocksDB database grew during the owa.1 and owa.n tests. These tests used threshold=50 for ForestDB, which allows for up to 2X space amplification per database file. There were 64 database files, but we don't see 2X growth in this configuration.
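For reference, threshold=50 is the ForestDB compaction threshold: a file becomes a compaction candidate once about half of it is stale. A minimal sketch of how that is set through the ForestDB C API, assuming automatic compaction mode (the file name is a placeholder):

#include <libforestdb/forestdb.h>

fdb_config fconfig = fdb_get_default_config();
fconfig.compaction_mode = FDB_COMPACTION_AUTO;  // background compaction daemon
fconfig.compaction_threshold = 50;              // compact when ~50% of the file is stale
fdb_file_handle *fhandle = NULL;
fdb_status status = fdb_open(&fhandle, "data.0", &fconfig);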

Size in GB after each step
        RocksDB  ForestDB
load        498       776
ows.1       492       768
ows.n       500       810
pqw.1       501       832
pqw.n       502       832
rqw.1       502       832
rqw.n       503       832
pq.1        503       832
pq.n        503       832
rq.1        503       832
rq.n        503       832
owa.1       529       832
owa.n       560       832

