The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm; it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.
The patent is for a new in-memory sort algorithm that needs a name. Features include:
- common prefix skipping
- skips comparing the common prefix of the key bytes when possible
- adaptive
- switches between quicksort and most-significant digit radix sort
- key substring caching
- reduces CPU cache misses by caching the next few bytes of the key
- produces results before sort is done
- sorted output can be produced (to the rest of the query, or spilled to disk for an external sort) before the sort is finished.
- the sort algorithm needs a name, and common prefix skipping adaptive quicksort is much too long, so I suggest Orasort
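A rough sketch of the hybrid, prefix-skipping idea (not the patented Oracle implementation): the closely related three-way radix quicksort of Bentley and Sedgewick tracks the recursion depth d as the number of leading bytes every key in the partition is already known to share, so comparisons skip those bytes and the byte at position d serves as the radix digit.

```python
# A sketch in the spirit of the algorithm, not the patented Oracle code.
# Keys are bytes objects; depth d is the length of the known common prefix.

def prefix_skip_sort(keys):
    """Sort a list of bytes keys in place and return it."""
    _sort(keys, 0, len(keys), 0)
    return keys

def _byte(k, d):
    # -1 sorts keys that end at depth d before all longer keys
    return k[d] if d < len(k) else -1

def _sort(a, lo, hi, d):
    if hi - lo <= 1:
        return
    v = _byte(a[lo], d)          # pivot digit: byte d of the first key
    lt, i, gt = lo, lo + 1, hi - 1
    while i <= gt:               # three-way partition on byte d only
        t = _byte(a[i], d)
        if t < v:
            a[lt], a[i] = a[i], a[lt]
            lt += 1
            i += 1
        elif t > v:
            a[gt], a[i] = a[i], a[gt]
            gt -= 1
        else:
            i += 1
    _sort(a, lo, lt, d)          # keys with a smaller byte at d
    if v >= 0:                   # the equal partition shares one more byte,
        _sort(a, lt, gt + 1, d + 1)  # so the skipped prefix grows
    _sort(a, gt + 1, hi, d)      # keys with a larger byte at d
```

In the real algorithm the switch between quicksort on whole key suffixes and MSD radix partitioning is adaptive; this sketch always partitions on a single byte.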
How it came to be
From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.
While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.
So I thought about this while falling asleep, and after many nights realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition were likely to grow:
- if the algorithm remembers the length of the common prefix as it descends, then it can skip the common prefix during comparisons to save CPU overhead
- if the algorithm learns when the length of the common prefix grows, then it can switch from quicksort to most-significant digit (MSD) radix sort on the next byte beyond the common prefix, and then switch back to quicksort after that
- the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort, as it descends it can cache the next few bytes it will need to compare, rather than only the first few bytes of the key. This gives much better memory system behavior (fewer cache misses).
- my new sort was much faster than other algorithms when keys were larger than 8 bytes
- my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV
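To illustrate the key substring caching bullet (my own sketch, not the patented code): the sort scans a small contiguous array of (cached bytes, row id) pairs and only dereferences the full key when the cached bytes tie. The cache width of 4 and the insertion sort are arbitrary choices to keep this short:

```python
CACHE_BYTES = 4  # assumed cache width, not a number from the patent

def cached_sort(keys, prefix=0):
    """Return keys sorted, skipping the first `prefix` bytes (assumed common)."""
    # the sort mostly scans this small contiguous array instead of chasing
    # pointers to full keys, which is the cache-miss win described above
    entries = [(k[prefix:prefix + CACHE_BYTES], i) for i, k in enumerate(keys)]

    def less(a, b):
        if a[0] != b[0]:              # common fast path: cache decides
            return a[0] < b[0]
        off = prefix + CACHE_BYTES    # rare tie: touch the full keys
        return keys[a[1]][off:] < keys[b[1]][off:]

    for i in range(1, len(entries)):  # insertion sort keeps the sketch short
        e = entries[i]
        j = i - 1
        while j >= 0 and less(e, entries[j]):
            entries[j + 1] = entries[j]
            j -= 1
        entries[j + 1] = e
    return [keys[i] for _, i in entries]
```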
Real implementation
- the old sort was stable, the new sort was not
- I don't remember how this concern was addressed
- the new sort has a bad, but unlikely, worst-case
- The worst case occurs when quicksort picks the worst pivot every time it selects one. The new sort wasn't naive: it selected each pivot as the median of a sample of keys (the sample size might have been 5). So I did the math to estimate the risk. Because the numbers are big and the probabilities are small, I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases.
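I no longer have that math, but the shape of the estimate is easy to reproduce with exact rational arithmetic (Python's fractions standing in for the Scheme bignums). The 5-key sample and the bottom-decile cutoff below are assumptions for illustration, not the original numbers:

```python
from fractions import Fraction
from math import comb

def p_median_in_tail(sample=5, tail=Fraction(1, 10)):
    # probability that the median of `sample` uniform draws lands in the
    # bottom `tail` fraction of the partition: a majority of the draws
    # (at least sample // 2 + 1 of them) must fall in that tail
    need = sample // 2 + 1
    return sum(Fraction(comb(sample, k)) * tail ** k * (1 - tail) ** (sample - k)
               for k in range(need, sample + 1))

p_bad = p_median_in_tail()   # exactly 107/12500, under 1% per level
p_run = p_bad ** 40          # 40 terrible pivots in a row: vanishingly rare
```

Exact fractions keep the tiny tail probabilities from underflowing to zero, which is why arbitrary precision mattered.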
After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMSs like MySQL and PostgreSQL were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.
> A promotion or bonus would have to wait as you had to play the long-game in your career at Oracle
What's that long game then?
Ignoring the good times during the dot-com boom, rewards (mostly promotions) tended to be delayed relative to what I experienced elsewhere. So if you do good work and stay there long enough then you will be promoted, but your peers at other companies were moving up faster.
The parallel sort problem is still interesting, one of my most recent changes to RonDB was to implement a parallel sorted index scan. RonDB uses sharded partitions (from an ordered index point of view). So each partition delivers rows in sorted order, but a merge sort is required to provide the results in global sorted order. Previously this led to a "single-threaded" sorted index scan, but now it is parallelised by implementing a capability for partitions to continue scanning while waiting for the "next" message. However the merge sort itself can probably be improved and made multi-threaded. I will read your blog to see if there are some interesting tidbits that can improve the speed even more.
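A generic sketch of that k-way merge of per-partition sorted streams (not RonDB's implementation), using Python's lazy heap-based merge:

```python
import heapq

# each partition delivers (key, row) pairs already sorted by key;
# heapq.merge performs the k-way merge lazily, so globally sorted rows
# stream out while the partitions keep scanning
partition_streams = [
    iter([(1, "a"), (4, "d"), (9, "i")]),
    iter([(2, "b"), (3, "c")]),
    iter([(5, "e"), (8, "h")]),
]
merged = list(heapq.merge(*partition_streams))
# merged is globally sorted: keys 1, 2, 3, 4, 5, 8, 9
```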
Yes, parallel is interesting. Given the Oracle architecture for parallel operations, I didn't have to solve for that.
For fast merge:
1) there are optimizations for an N-way heap and RocksDB uses some of them; Google AI had a nice overview for me via https://www.google.com/search?q=rocksdb+n-way+heap+merge
2) offset-value coding can help, but wasn't widely known. Thankfully, Goetz Graefe is making it known again, and if you check the references to IBM at the end of the paper, there are things that make merges faster, especially reference 7 -- https://arxiv.org/abs/2210.00034
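To sketch the offset-value coding idea (my simplified reading, not Graefe's exact formulation): when two keys are each known to be greater than the same base key, a small integer code derived from where each first diverges from the base usually decides the comparison without touching the key bytes.

```python
MAX_OFF = 1 << 16  # assumed upper bound on key length, for the encoding

def ovc(key, base):
    """Offset-value code of `key` relative to `base`, assuming key > base.
    Among keys coded against the same base, a larger code means a larger key."""
    off = 0
    while off < len(base) and key[off] == base[off]:
        off += 1
    # a shorter shared prefix means the key diverges upward earlier,
    # so it is larger: invert the offset so less sharing => bigger code
    return ((MAX_OFF - off) << 8) | key[off]

def less_with_ovc(a, code_a, b, code_b):
    # codes relative to the same base decide most comparisons outright
    if code_a != code_b:
        return code_a < code_b
    # tie: same divergence offset and same byte there; compare the rest
    off = MAX_OFF - (code_a >> 8)
    return a[off + 1:] < b[off + 1:]
```

In a merge, each run's next key can carry a code relative to the run's previous key, so the heap mostly compares small integers instead of long key prefixes.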
Thx, will have a look