While RocksDB does much better in the results here, I worked on this to understand the performance differences rather than to claim that RocksDB is superior. Hopefully the results will help make ForestDB better.
Test setup
The test pattern was described in the previous post. Here I use shorter names for each of the tests:
- load - Load
- ows.1 - Overwrite-sync-1
- ows.n - Overwrite-sync-N
- pqw.1 - Point-query-1-with-writer
- pqw.n - Point-query-N-with-writer
- rqw.1 - Range-query-1-with-writer
- rqw.n - Range-query-N-with-writer
- pq.1 - Point-query-1
- pq.n - Point-query-N
- rq.1 - Range-query-1
- rq.n - Range-query-N
- owa.1 - Overwrite-async-1
- owa.n - Overwrite-async-N
I used these command lines with my fork of the ForestDB benchmark:
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 rocksdb 20 no 1
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 fdb 20 no 64
The common options include:
- load 2B documents
- use 32G for the database cache. The server has 144G of RAM.
- use N=10 for the tests with concurrency
- use a 600 second warmup and then run for 3600 seconds
- limit the writer thread to 1000/second for the with-writer tests
- range queries fetch ~20 documents
- do not use periodic_commit for the load
- use a 64M write buffer for all tests
- use one LSM tree
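The 1000/second writer cap can be pictured as a simple token-bucket limiter. This is only an illustrative sketch (the class and fake clock are hypothetical, not code from the benchmark client):

```python
import time

class RateLimiter:
    """Permit at most `rate` operations per second (token bucket)."""
    def __init__(self, rate, clock=time.monotonic):
        self.rate = float(rate)
        self.clock = clock
        self.tokens = 0.0          # start empty: no initial burst
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Deterministic demo with a fake clock: a writer attempting ~4000/second
# against a 1000/second limit is throttled to ~1000/second.
t = [0.0]
limiter = RateLimiter(1000, clock=lambda: t[0])
allowed = 0
for _ in range(2000):            # 2000 attempts over 0.5 simulated seconds
    t[0] += 0.00025              # attempts arrive every 0.25 ms
    if limiter.try_acquire():
        allowed += 1
# allowed ends up near 500: 0.5 seconds at the 1000/second cap
```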
The ForestDB specific options include:
- use 64 database files to reduce the max file size. This was done to give compaction a better chance of keeping up and to avoid temporarily doubling the size of the database during compaction.
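Spreading the key space over 64 files is typically done by hashing the key. A minimal sketch of that routing follows; the hash choice here is my assumption for illustration, not ForestDB's actual scheme:

```python
import hashlib

NUM_FILES = 64  # matches the final argument in the fdb command line above

def file_for_key(key: bytes) -> int:
    # Hash the key so documents spread roughly evenly across the files,
    # keeping each file (and each compaction of it) smaller.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_FILES
```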
The first result is the average throughput during the test as the operations/second rate. I have written previously about benchmarketing vs benchmarking, and average throughput leaves out the interesting bits like response-time variance. Alas, my time to write this is limited too.
ForestDB is slightly faster for the load. Even with rate limiting, RocksDB incurs too much IO debt during this load. I don't show it here, but the compaction scores for levels 0, 1 and 2 in the LSM were higher than expected given the rate limits I used. We have work-in-progress to fix that.
For the write-only tests (ows.1, ows.n, owa.1, owa.n) RocksDB is much faster than ForestDB. From the rates below it looks like ForestDB might be doing a disk read per write because I can get ~200 disk reads / second from 1 thread. I collected stack traces from other tests that showed disk reads in the commit code path so I think that is the problem here. I will share the stack traces in a future post.
RocksDB does much better on the range query tests (rqw.1, rqw.n, rq.1, rq.n). With ForestDB data for adjacent keys is unlikely to be adjacent in the database file unless it was loaded in that order and not updated after the load. So range queries might do 1 disk seek per document. With RocksDB we can assume that all data was in cache except for the max level of the LSM. And for the max level data for adjacent keys is adjacent in the file. So RocksDB is unlikely to do more than 1 disk seek per short range scan.
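A back-of-envelope worst case makes the gap plausible, assuming ~5ms per disk read (the latency observed for pq.1 below) and ~20 documents per scan:

```python
DOCS_PER_SCAN = 20   # range queries fetch ~20 documents (from the setup above)
SEEK_MS = 5.0        # assumed per-read latency, matching the ~5ms seen for pq.1

# ForestDB worst case: adjacent keys are not adjacent on disk, so up to
# one seek per document fetched.
forestdb_ms = DOCS_PER_SCAN * SEEK_MS
# RocksDB: only the max level misses cache and its data is key-ordered,
# so a short scan needs at most one seek.
rocksdb_ms = 1 * SEEK_MS

print(forestdb_ms, rocksdb_ms)   # 100.0 vs 5.0 ms per scan, worst case
```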
I don't have a good explanation for the ~2X difference in point-query QPS (pqw.1, pqw.n, pq.1, pq.n). The database is smaller with RocksDB, but not small enough to explain this. For pq.1, the single-threaded point-query test, both RocksDB and ForestDB were doing ~184 disk reads/second with similar latency of ~5ms/read. So ForestDB was doing almost 2X more disk reads per query. I don't understand ForestDB file structures well enough to explain that.
It is important to distinguish between logical and physical IO when trying to explain RocksDB IO performance. Logical IO means that a file read is done but the data is in the RocksDB block cache or the OS cache. Physical IO means that a file read is done and the data is not in cache. For this configuration all levels before the max level of the LSM are in cache for RocksDB, and some of the max level is in cache as the max level has 90% of the data.
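A rough estimate of the physical-read probability for a point query to the max level, assuming most of the 144G of RAM is usable as cache (block cache plus OS cache) and that the smaller non-max levels are cached first; all numbers are approximations from this post:

```python
DB_GB = 503                # RocksDB size during the read tests
RAM_GB = 144               # server RAM; assume most of it caches the database
MAX_LEVEL_FRACTION = 0.90  # the max level holds ~90% of the data

max_level_gb = DB_GB * MAX_LEVEL_FRACTION
non_max_gb = DB_GB - max_level_gb

# The non-max levels (~50G) fit in RAM, so reads to them are logical IO.
# The remaining cache holds part of the max level.
max_level_cached_gb = RAM_GB - non_max_gb
p_physical = 1.0 - max_level_cached_gb / max_level_gb
# p_physical is roughly 0.79: about 4 in 5 point queries to the max level
# would miss cache under these assumptions.
```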
For the tests that used 1 writer thread limited to 1000 writes/second RocksDB was able to sustain that rate. For ForestDB the writer thread only did ~200 writes/second.
operations/second for each step
test     RocksDB   ForestDB
load       58137      69579
ows.1       4251        289
ows.n      11836        295
pqw.1        232        123
pqw.n       1228        654
rqw.1       3274         48
rqw.n      17770        377
pq.1         223        120
pq.n        1244        678
rq.1        2685        206
rq.n       16232        983
owa.1      56846        149
owa.n      49078        224
I looked at write-amplification for the ows.1 test. I measured the average rates for throughput and write-KB/second from iostat and divide the IO rate by the throughput as write-KB/update. The IO write-rate per update is about 2X higher with RocksDB.
engine     throughput  write-KB/s  write-KB/update
RocksDB          4252      189218             44.5
ForestDB          289        6099             21.1
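The write-KB/update column is just the iostat write rate divided by the update rate, for example:

```python
def write_kb_per_update(write_kb_per_sec, updates_per_sec):
    # Device write rate divided by update rate: a write-amplification
    # style metric (KB written to disk per logical update).
    return write_kb_per_sec / updates_per_sec

print(round(write_kb_per_update(189218, 4252), 1))  # RocksDB row: 44.5
print(round(write_kb_per_update(6099, 289), 1))     # ForestDB row: 21.1
```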
The next result is the size of the database at the end of each test step. Both were stable for most tests, but RocksDB grew during the owa.1 and owa.n tests. These tests used threshold=50 for ForestDB, which allows up to 2X space amplification per database file. There were 64 database files, but we don't see 2X growth in this configuration.
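The 2X bound follows from the compaction threshold: a file is compacted once half its contents are stale, so in the worst case it holds live data plus an equal amount of stale data. As arithmetic:

```python
THRESHOLD_PCT = 50   # compact a ForestDB file when 50% of it is stale data

# A file holding L live bytes can grow to L / (1 - threshold) before
# compaction triggers, i.e. 2X space amplification at threshold=50.
worst_case_space_amp = 1.0 / (1.0 - THRESHOLD_PCT / 100.0)
print(worst_case_space_amp)  # 2.0
```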
Size in GB after each step
test    RocksDB  ForestDB
load        498       776
ows.1       492       768
ows.n       500       810
pqw.1       501       832
pqw.n       502       832
rqw.1       502       832
rqw.n       503       832
pq.1        503       832
pq.n        503       832
rq.1        503       832
rq.n        503       832
owa.1       529       832
owa.n       560       832