Thursday, June 13, 2019

Interesting indexing in Rockset and MongoDB

I was lazy today and asked about new indexing features in Rockset and MongoDB. They share a valuable goal which is better indexing for the document data model (think less schema, not schema-less). How do you index documents when you don't know all of the attributes that will be used? MongoDB now supports this via a wildcard index and Rockset via converged indexing.

Wildcard indexing in MongoDB lets you specify that an index should be maintained on all, most or some attributes in a document. By most I mean there are options to exclude attributes from a wildcard index. By some I mean there are options to limit this to attributes that start with certain prefixes. Read the docs for more.

Converged indexing in Rockset indexes all attributes. There are no options to include or exclude attributes. This makes the product easier to use at the cost of more IO for index maintenance and more storage for larger indexes. Note that Rockset uses the RocksDB LSM which reduces the cost of index maintenance and might also use the excellent ZStandard compression.

Wildcard and converged indexes do not support compound indexes. For the document { a:1, b:2 } there will be two index entries: a=1 and b=2. There is no way to get an index entry for (a=1, b=2) or (b=2, a=1). If you want a compound index with MongoDB the existing index features can be used. See below (editorial 1) for compound indexes and Rockset.

Implementation details

This section is an educated guess. I don't know enough MongoDB and Rockset internals to claim this with certainty. I ignore the complexity of support for datatypes. In the ideal world all values can be compared via memcmp.

For a traditional index limited to a specified attribute the index entries are of the form (value, pointer) where pointer points to the row and can be the primary key value or (file name, file offset).

This is more interesting for wildcard/converged indexes. I assume that the attribute name is the leading field in each index entry so that the entry is of the form (attribute name, value, pointer). The common way to use such an index is to have an equality predicate on attribute name which is satisfied when the index is queried with predicates like attributeName relOp value. Examples of such predicates are a=2, a>2 and a<=2.

A smart person (Dr Pavlo) mentioned the use of skip scan for these indexes. That could be used to query the index and find documents with any attribute equal to a specific value. That is a less likely use case but still interesting.

Wildcard/converged indexes aren't free. Putting the attribute name in every index entry makes index entries larger and consume more space in memory and on storage. Block compression reduces some of this overhead. Index prefix compression in WiredTiger and RocksDB also helps but at the cost of more CPU overhead.

Storage differences

Up to now I have been describing the search index. In this section I will describe the document storage.

MongoDB stores the document via the storage engine which will soon be WiredTiger only although I hope MongoRocks returns. I assume that WiredTiger with MongoDB is row-wise so that each document is (usually) a contiguous sequence of bytes on some disk page.

Rockset stores each document twice -- row-wise and column-wise. Alas, this gets complicated. The row-wise format is not the traditional approach with one thing in the storage engine per document. Instead there is one thing per attribute per document. This is similar to the CockroachDB approach. I prefer to still call this row-wise given that attributes from a document will be co-located in the LSM SSTs. I am also grateful for the many great blog posts from CockroachDB that I can reference.

With two copies of each document in the base storage there is more storage overhead. Fortunately that overhead is reduced courtesy of the write efficiency and compression friendliness of an LSM.

The Rockset blog post does a great job of explaining this with pictures. I do a worse job here without pictures. For the document { pk:1, a:7, b:3 } when the primary key is pk then the keys for row-wise are R.1.a and R.1.b and for column-wise are C.a.1 and C.b.1. The row-wise format clusters all attributes for a given document. The column-wise format clusters all values across documents for a given attribute. The row-wise format is efficient when most attributes for a document must be retrieved. The column-wise format is efficient for analytics when a given attribute across all documents must be retrieved.

Editorial 1

I interpret the MongoDB docs to mean that when a query uses a wildcard index it cannot use any other index and the wildcard index will only be used for a predicate on a single attribute. I expect that to limit the utility of wildcard indexes. I also expect MongoDB to fix that given how fast they reduce their tech debt. The limitations are listed below. The 1st won't be fixed. The 2nd and 3rd can be fixed.
  1. Compound wildcard indexes are not supported
  2. MongoDB cannot use a non-wildcard index to satisfy one part of a query predicate and a wildcard index to satisfy another.
  3. MongoDB cannot use one wildcard index to satisfy one part of a query predicate and another wildcard index to satisfy another.
I assume that Rockset can combine indexes during query evaluation given their focus on analytics. Thanks to the Rockset team I learned it supports index intersection. It also supports composite indexes via field mappings (functional indexes).

Editorial 2

An open question is whether an LSM can do clever things to support analytics. There has been some work to show the compression benefit from using column-wise storage within a RocksDB SST for the larger levels of the RocksDB LSM. Alas, the key RocksDB workloads have been OLTP. With the arrival of Rockset there is more reason to reconsider this work. There can be benefits in the compression ratio and reduced overhead during query processing. Vertica showed that it was useful to combine a write-optimized store for recent writes with a read-optimized store for older writes. An LSM already structures levels by write recency. Perhaps it is time to make the larger levels read-optimized especially when column-wise data is inserted to the LSM.

Update - read paper years ago then forgot that Kudu combines LSM + columnar.

Editorial 3

The previous section is mostly about being clever when storing column-wise data in an LSM to get better compression and use less CPU during query evaluation. This section is about being clever when storing the search index. 

The search index is likely to have many entries for some values a given attribute. Can an LSM be enhanced to take advantage of that for analytics workloads? In other storage engines there are two approaches -- bitmap indexes and RID-lists. Adapting these for an LSM is non-trivial but not impossible. It is likely that such an adaptation would only be done for the larger levels of the LSM tree.

2 comments:

  1. If resource is not a problem, I would just replicate my data to X different databases (row based, column based, inverted index based, etc). When reading the data, I can send the query to all the databases, and wait for the fastest response...

    ReplyDelete
    Replies
    1. Rockset does that within the same instance and stores in row-wise and column-wise format in addition to the search index. See above. I think that is an interesting idea.

      For workloads I cared about that solution used too much storage, but not everyone has my constraints.

      Delete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...