Friday, April 12, 2019

A research paper on Optane performance

I just read "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module", published by the NVSL at UCSD. It is worth reading. I appreciate the rigor in testing and the separation of the summary (first 10 pages) from the many details. This post is too incomplete to be a review of the paper; it is really a collection of my comments.

Comments:

  • When the device is used in cached mode, where RAM is the cache, the cache block size is 4KB. I assume that a cache miss does 16 256-byte reads (4096 / 256) from the Optane device before returning to the user.
  • The paper doesn't explain the endurance for the device. The word "endurance" doesn't occur in the paper. I read elsewhere that Optane might provide ~60 DWPD. Update: I assume that endurance isn't mentioned because the vendor has yet to disclose that info.
  • The paper states that the Optane DIMM uses a protocol that supports variable response time but doesn't explain how much it varies. How does response time variance in Optane compare to a NAND-flash SSD where the stalls can be bad?
  • The Optane DIMM does 256-byte reads and writes. I wonder whether that prevents 4KB page writes from being atomic when the device is used for a filesystem, assuming copy-on-write isn't done internally, as it might be for Nova.
  • There is wear-leveling. I am not sure whether that has a name yet. I saw one blog post that called it the XTL to match the FTL used by NAND flash. I am also curious about the latency impact from doing lookups on the XTL to determine locations for 256-byte blocks. The XTL is cached in RAM and a 256GB device needs ~4GB of RAM, assuming each 256-byte block uses 4 bytes in the XTL (2^30 blocks x 4 bytes).
  • Nova does much better than XFS and ext4 on Optane. Nova is a research filesystem from NVSL that exploits new features in Optane.
  • They modified RocksDB to make the memtable persistent and avoid the need for a WAL. It will be interesting to learn whether that turns out to be useful.
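
The arithmetic behind two of the assumptions above (16 reads per cache miss, ~4GB of RAM for the XTL) can be checked with a short sketch. The constants are the sizes I assumed in the comments, not vendor specifications, and the function names are made up for illustration. The flat one-entry-per-block table is also an assumption about how an XTL might work.

```python
# Back-of-envelope arithmetic for the cache-miss and XTL comments.
# All sizes are assumptions from the post, not vendor specifications.

KB = 1024
GB = 1024 ** 3

OPTANE_ACCESS_BYTES = 256     # access granularity mentioned in the post
CACHE_BLOCK_BYTES = 4 * KB    # cache block size in cached mode
XTL_ENTRY_BYTES = 4           # assumed size of one mapping entry


def reads_per_cache_miss(block_bytes=CACHE_BLOCK_BYTES,
                         access_bytes=OPTANE_ACCESS_BYTES):
    """Number of device reads needed to fill one cache block."""
    return block_bytes // access_bytes


def xtl_bytes(device_bytes,
              access_bytes=OPTANE_ACCESS_BYTES,
              entry_bytes=XTL_ENTRY_BYTES):
    """RAM needed for a flat table with one entry per device block (hypothetical layout)."""
    return (device_bytes // access_bytes) * entry_bytes


if __name__ == "__main__":
    print(reads_per_cache_miss())       # 16
    print(xtl_bytes(256 * GB) / GB)     # 4.0
```

A 4-byte entry can address 2^32 blocks, so it is large enough for the 2^30 blocks on a 256GB device under these assumptions.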

Requests I have for the next Optane performance paper:
  • For mixed and concurrent workloads include response times: average and variance, or a histogram. This paper reports the average latency for single-threaded read-only and write-only workloads. For mixed and concurrent workloads it reports average throughput, which combines read and write performance, so it is hard to determine whether reads or writes degrade more from concurrency and a mixed workload.
  • For any workload include information about response time variation, whether that is variance or a histogram.
  • Provide numbers to accompany the graphs. Some graphs are hard to read without them because the range for the y-axis is large and the lines converge in one part and diverge in another. Figure 18 is one example.
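
To show why I ask for variance or a histogram, and for reads and writes to be reported separately, here is a minimal sketch. The samples are made up and are not measurements from the paper. The helper names (`percentile`, `summarize`, `histogram`) are my own, and the power-of-two bucketing is just one reasonable choice.

```python
import statistics
from collections import Counter


def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    rank = max(1, -(-len(sorted_values) * p // 100))  # ceil(n * p / 100)
    return sorted_values[rank - 1]


def summarize(latencies_us):
    """Mean, population standard deviation, p50, p99 and max for a list of latencies."""
    ordered = sorted(latencies_us)
    return {
        "mean": statistics.mean(ordered),
        "stddev": statistics.pstdev(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


def histogram(latencies_us):
    """Count samples per power-of-two bucket, keyed by the exclusive upper bound in usecs."""
    return Counter(1 << int(x).bit_length() for x in latencies_us)


if __name__ == "__main__":
    # Made-up samples in microseconds, not measurements from the paper.
    reads = [1] * 98 + [51] * 2
    writes = [2] * 100
    for name, samples in (("read", reads), ("write", writes)):
        print(name, summarize(samples), sorted(histogram(samples).items()))
```

Both sample sets have the same average (2 usecs), but the reads have a p99 of 51 usecs while the writes never exceed 2. An average alone, or a combined throughput number, hides that difference.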
