While RocksDB does much better in the results here, I worked on this to understand the performance differences rather than to claim that RocksDB is superior. Hopefully the results will help make ForestDB better.
Test setup
The test pattern was described in the previous post. Here I use shorter names for each of the tests:
- load - Load
- ows.1 - Overwrite-sync-1
- ows.n - Overwrite-sync-N
- pqw.1 - Point-query-1-with-writer
- pqw.n - Point-query-N-with-writer
- rqw.1 - Range-query-1-with-writer
- rqw.n - Range-query-N-with-writer
- pq.1 - Point-query-1
- pq.n - Point-query-N
- rq.1 - Range-query-1
- rq.n - Range-query-N
- owa.1 - Overwrite-async-1
- owa.n - Overwrite-async-N
I used these command lines with my fork of the ForestDB benchmark:
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 rocksdb 20 no 1
bash rall.sh 2000000000 log data 32768 64 10 600 3600 1000 1 fdb 20 no 64
The common options include:
- load 2B documents
- use 32G for the database cache. The server has 144G of RAM.
- use N=10 for the tests with concurrency
- use a 600 second warmup and then run for 3600 seconds
- limit the writer thread to 1000/second for the with-writer tests
- range queries fetch ~20 documents
- do not use periodic_commit for the load
- use a 64M write buffer for all tests
- use one LSM tree
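The 1000/second writer cap can be pictured as a simple token-bucket limiter. This is only an illustrative sketch (the class and fake clock are hypothetical, not code from the benchmark client):

```python
import time

class RateLimiter:
    """Permit at most `rate` operations per second (token bucket)."""
    def __init__(self, rate, clock=time.monotonic):
        self.rate = float(rate)
        self.clock = clock
        self.tokens = 0.0          # start empty: no initial burst
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill tokens for the elapsed time, capped at one second's worth.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Deterministic demo with a fake clock: a writer attempting ~4000/second
# against a 1000/second limit is throttled to ~1000/second.
t = [0.0]
limiter = RateLimiter(1000, clock=lambda: t[0])
allowed = 0
for _ in range(2000):            # 2000 attempts over 0.5 simulated seconds
    t[0] += 0.00025              # attempts arrive every 0.25 ms
    if limiter.try_acquire():
        allowed += 1
# allowed ends up near 500: 0.5 seconds at the 1000/second cap
```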
The ForestDB specific options include:
- use 64 database files to reduce the max file size. This was done to give compaction a better chance of keeping up and to avoid temporarily doubling the size of the database during compaction.
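Spreading the key space over 64 files is typically done by hashing the key. A minimal sketch of that routing follows; the hash choice here is my assumption for illustration, not ForestDB's actual scheme:

```python
import hashlib

NUM_FILES = 64  # matches the final argument in the fdb command line above

def file_for_key(key: bytes) -> int:
    # Hash the key so documents spread roughly evenly across the files,
    # keeping each file (and each compaction of it) smaller.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_FILES
```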
The first result is the average throughput during the test as the operations/second rate. I have written previously about benchmarketing vs benchmarking, and average throughput leaves out the interesting bits like response-time variance. Alas, my time to write this is limited too.
ForestDB is slightly faster for the load. Even with rate limiting, RocksDB incurs too much IO debt during this load. I don't show it here, but the compaction scores for levels 0, 1 and 2 in the LSM were higher than expected given the rate limits I used. We have work-in-progress to fix that.
For the write-only tests (ows.1, ows.n, owa.1, owa.n) RocksDB is much faster than ForestDB. From the rates below it looks like ForestDB might be doing a disk read per write because I can get ~200 disk reads / second from 1 thread. I collected stack traces from other tests that showed disk reads in the commit code path so I think that is the problem here. I will share the stack traces in a future post.
RocksDB does much better on the range query tests (rqw.1, rqw.n, rq.1, rq.n). With ForestDB data for adjacent keys is unlikely to be adjacent in the database file unless it was loaded in that order and not updated after the load. So range queries might do 1 disk seek per document. With RocksDB we can assume that all data was in cache except for the max level of the LSM. And for the max level data for adjacent keys is adjacent in the file. So RocksDB is unlikely to do more than 1 disk seek per short range scan.
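A back-of-envelope worst case makes the gap plausible, assuming ~5ms per disk read (the latency observed for pq.1 below) and ~20 documents per scan:

```python
DOCS_PER_SCAN = 20   # range queries fetch ~20 documents (from the setup above)
SEEK_MS = 5.0        # assumed per-read latency, matching the ~5ms seen for pq.1

# ForestDB worst case: adjacent keys are not adjacent on disk, so up to
# one seek per document fetched.
forestdb_ms = DOCS_PER_SCAN * SEEK_MS
# RocksDB: only the max level misses cache and its data is key-ordered,
# so a short scan needs at most one seek.
rocksdb_ms = 1 * SEEK_MS

print(forestdb_ms, rocksdb_ms)   # 100.0 vs 5.0 ms per scan, worst case
```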
I don't have a good explanation for the ~2X difference in point-query QPS (pqw.1, pqw.n, pq.1, pq.n). The database is smaller with RocksDB, but not small enough to explain this. For pq.1, the single-threaded point-query test, both RocksDB and ForestDB were doing ~184 disk reads/second with similar latency of ~5ms/read. So ForestDB was doing almost 2X more disk reads per query. I don't understand ForestDB file structures well enough to explain that.
It is important to distinguish between logical and physical IO when trying to explain RocksDB IO performance. Logical IO means that a file read is done but the data is in the RocksDB block cache or the OS cache. Physical IO means that a file read is done and the data is not in cache. For this configuration all levels before the max level of the LSM are in cache for RocksDB, and some of the max level is in cache as the max level has 90% of the data.
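A rough estimate of the physical-read probability for a point query to the max level, assuming most of the 144G of RAM is usable as cache (block cache plus OS cache) and that the smaller non-max levels are cached first; all numbers are approximations from this post:

```python
DB_GB = 503                # RocksDB size during the read tests
RAM_GB = 144               # server RAM; assume most of it caches the database
MAX_LEVEL_FRACTION = 0.90  # the max level holds ~90% of the data

max_level_gb = DB_GB * MAX_LEVEL_FRACTION
non_max_gb = DB_GB - max_level_gb

# The non-max levels (~50G) fit in RAM, so reads to them are logical IO.
# The remaining cache holds part of the max level.
max_level_cached_gb = RAM_GB - non_max_gb
p_physical = 1.0 - max_level_cached_gb / max_level_gb
# p_physical is roughly 0.79: about 4 in 5 point queries to the max level
# would miss cache under these assumptions.
```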
For the tests that used 1 writer thread limited to 1000 writes/second RocksDB was able to sustain that rate. For ForestDB the writer thread only did ~200 writes/second.
operations/second for each step
test     RocksDB   ForestDB
load       58137      69579
ows.1       4251        289
ows.n      11836        295
pqw.1        232        123
pqw.n       1228        654
rqw.1       3274         48
rqw.n      17770        377
pq.1         223        120
pq.n        1244        678
rq.1        2685        206
rq.n       16232        983
owa.1      56846        149
owa.n      49078        224
I looked at write-amplification for the ows.1 test. I measured the average rates for throughput and write-KB/second from iostat and divide the IO rate by the throughput as write-KB/update. The IO write-rate per update is about 2X higher with RocksDB.
engine     throughput  write-KB/s  write-KB/update
RocksDB          4252      189218             44.5
ForestDB          289        6099             21.1
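The write-KB/update column is just the iostat write rate divided by the update rate, for example:

```python
def write_kb_per_update(write_kb_per_sec, updates_per_sec):
    # Device write rate divided by update rate: a write-amplification
    # style metric (KB written to disk per logical update).
    return write_kb_per_sec / updates_per_sec

print(round(write_kb_per_update(189218, 4252), 1))  # RocksDB row: 44.5
print(round(write_kb_per_update(6099, 289), 1))     # ForestDB row: 21.1
```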
The next result is the size of the database at the end of each test step. Both were stable for most tests, but RocksDB grew during the owa.1 and owa.n tests. These tests used threshold=50 for ForestDB, which allows up to 2X space amplification per database file. There were 64 database files, but we don't see 2X growth in this configuration.
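The 2X bound follows from the compaction threshold: a file is compacted once half its contents are stale, so in the worst case it holds live data plus an equal amount of stale data. As arithmetic:

```python
THRESHOLD_PCT = 50   # compact a ForestDB file when 50% of it is stale data

# A file holding L live bytes can grow to L / (1 - threshold) before
# compaction triggers, i.e. 2X space amplification at threshold=50.
worst_case_space_amp = 1.0 / (1.0 - THRESHOLD_PCT / 100.0)
print(worst_case_space_amp)  # 2.0
```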
Size in GB after each step
test    RocksDB  ForestDB
load        498       776
ows.1       492       768
ows.n       500       810
pqw.1       501       832
pqw.n       502       832
rqw.1       502       832
rqw.n       503       832
pq.1        503       832
pq.n        503       832
rq.1        503       832
rq.n        503       832
owa.1       529       832
owa.n       560       832