Thursday, June 20, 2019

Open Core Summit and OSS glue code

I am optimistic about the Open Core Summit. It can be something that benefits users, startups, the source-available community and the OSS crowd. The summit has many of the important people from the open core community. It is an opportunity for them to collaborate -- form a foundation, reduce open core license proliferation, discuss the next license to be reviewed by OSI and most importantly -- fix the open core marketing approach.

I appreciate that startups put so much time and money into building interesting system software that is shared either as OSS or source-available. I haven't enjoyed much of the marketing about the challenges that cloud creates for VC-funded OSS. I am sure cloud makes it harder but the approach has been too negative with too much snark directed towards OSI and AWS. There is a better way.

Glue code

It is all about the glue code.

Stateful services are a hard problem. It is not trivial to scale MySQL so that a small DBA team can support a growing database deployment. I am sure the same is true for other DBMSs. This is done with some changes to MySQL and a huge amount of glue code. While the GPL doesn't require sharing the diffs to MySQL, my teams have always been happy to do that. But the glue code is the secret sauce and neither the GPL nor the AGPL requires it to be shared. The SSPL would change that, although I am wary of the SSPL given the uncertainty about how far down the stack the sharing requirement extends.

While the glue code is the secret sauce, I wonder whether it has any value outside of a web-scale company.
  1. The glue code isn't portable as it depends on other systems internal to a web-scale company.
  2. Documentation for internal systems is frequently not good and success depends on the shared knowledge of the current teams. Even with access to the internal dependencies the glue code is unusable without the team that knows how to use it.
Therefore I am more interested in OSS glue code that is useful to a community and less interested in licenses that force the (unusable to me) glue code to be published. The glue code should assume environments open to the public -- public clouds and k8s.

What is the state of OSS glue code for MySQL? MySQL needs it to remain competitive. Much of the glue code is baked into MongoDB so that it is easier to scale MongoDB clusters even if you aren't an Atlas customer.

Thursday, June 13, 2019

Interesting indexing in Rockset and MongoDB

I was lazy today and asked about new indexing features in Rockset and MongoDB. They share a valuable goal which is better indexing for the document data model (think less schema, not schema-less). How do you index documents when you don't know all of the attributes that will be used? MongoDB now supports this via a wildcard index and Rockset via converged indexing.

Wildcard indexing in MongoDB lets you specify that an index should be maintained on all, most or some attributes in a document. By most I mean there are options to exclude attributes from a wildcard index. By some I mean there are options to limit this to attributes that start with certain prefixes. Read the docs for more.
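As a rough sketch of the all/most/some options, assuming MongoDB 4.2+ and pymongo (the database, collection and attribute names below are made up):

  from pymongo import MongoClient

  db = MongoClient()["test"]

  # All attributes: a wildcard index on every field in every document.
  db["docs_all"].create_index([("$**", 1)])

  # Most attributes: exclude some fields via wildcardProjection.
  db["docs_most"].create_index([("$**", 1)], wildcardProjection={"bigBlob": 0})

  # Some attributes: only fields under a given prefix.
  db["docs_some"].create_index([("userMetadata.$**", 1)])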

Converged indexing in Rockset indexes all attributes. There are no options to include or exclude attributes. This makes the product easier to use at the cost of more IO for index maintenance and more storage for larger indexes. Note that Rockset uses the RocksDB LSM which reduces the cost of index maintenance and might also use the excellent ZStandard compression.

Wildcard and converged indexes do not support compound indexes. For the document { a:1, b:2 } there will be two index entries: a=1 and b=2. There is no way to get an index entry for (a=1, b=2) or (b=2, a=1). If you want a compound index with MongoDB the existing index features can be used. See below (editorial 1) for compound indexes and Rockset.
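For example, with pymongo a traditional compound index on known attributes is still created the usual way (names made up):

  from pymongo import MongoClient

  coll = MongoClient()["test"]["docs_all"]
  # A traditional compound index on (a, b) via the existing index features.
  coll.create_index([("a", 1), ("b", 1)])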

Implementation details

This section is an educated guess. I don't know enough about MongoDB and Rockset internals to claim this with certainty. I ignore the complexity of supporting datatypes. In the ideal world all values can be compared via memcmp.

For a traditional index limited to a specified attribute the index entries are of the form (value, pointer) where pointer points to the row and can be the primary key value or (file name, file offset).

This is more interesting for wildcard/converged indexes. I assume that the attribute name is the leading field in each index entry, so the entry is of the form (attribute name, value, pointer). The common way to use such an index is with an equality predicate on the attribute name, which is what a query predicate of the form attributeName relOp value provides. Examples of such predicates are a=2, a>2 and a<=2.
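A minimal sketch of that guess in Python, not of how either product actually lays out its index: each document produces one entry per attribute, and an equality predicate on one attribute becomes a seek plus a range scan over the sorted entries.

  from bisect import bisect_left

  def index_entries(doc, pk_field="pk"):
      # One (attribute name, value, pointer) entry per non-key attribute.
      pk = doc[pk_field]
      return [(attr, val, pk) for attr, val in doc.items() if attr != pk_field]

  docs = [{"pk": 1, "a": 7, "b": 3}, {"pk": 2, "a": 2, "b": 9}]
  index = sorted(e for d in docs for e in index_entries(d))

  def lookup_eq(index, attr, value):
      # Satisfy "attr = value" by seeking to (attr, value) and scanning the run.
      i = bisect_left(index, (attr, value))
      out = []
      while i < len(index) and index[i][0] == attr and index[i][1] == value:
          out.append(index[i][2])
          i += 1
      return out

  print(lookup_eq(index, "a", 7))   # -> [1]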

A smart person (Dr Pavlo) mentioned the use of skip scan for these indexes. That could be used to query the index and find documents with any attribute equal to a specific value. That is a less likely use case but still interesting.
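A sketch of that idea over the same toy (attribute name, value, pointer) entries, assuming Python 3.10+ for the key argument to bisect: for each distinct attribute name, seek to (attribute, value), scan the matching run, then skip ahead to the next attribute name rather than scanning the whole index.

  from bisect import bisect_left, bisect_right

  def skip_scan_eq(index, value):
      # Pointers for documents where *any* attribute equals value.
      out, i = [], 0
      while i < len(index):
          attr = index[i][0]
          # End of the run of entries for this attribute name.
          end = bisect_right(index, attr, lo=i, key=lambda e: e[0])
          # Seek to (attr, value) within the run and scan the matches.
          j = bisect_left(index, (attr, value), lo=i, hi=end)
          while j < end and index[j][1] == value:
              out.append(index[j][2])
              j += 1
          i = end   # skip the rest of this attribute's entries
      return out

  index = [("a", 2, 2), ("a", 7, 1), ("b", 3, 1), ("b", 9, 2)]
  print(skip_scan_eq(index, 3))   # -> [1]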

Wildcard/converged indexes aren't free. Putting the attribute name in every index entry makes index entries larger, so they consume more space in memory and on storage. Block compression reduces some of this overhead. Index prefix compression in WiredTiger and RocksDB also helps, but at the cost of more CPU overhead.
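A toy sketch of the prefix compression idea, storing each sorted key as (bytes shared with the previous key, remaining suffix); WiredTiger and RocksDB do something conceptually similar within blocks, with much more care:

  def prefix_compress(sorted_keys):
      out, prev = [], ""
      for key in sorted_keys:
          n = 0
          while n < min(len(prev), len(key)) and prev[n] == key[n]:
              n += 1
          out.append((n, key[n:]))   # (shared prefix length, suffix)
          prev = key
      return out

  keys = ["attr=a:val=7:pk=1", "attr=a:val=7:pk=2", "attr=b:val=3:pk=1"]
  print(prefix_compress(keys))
  # -> [(0, 'attr=a:val=7:pk=1'), (16, '2'), (5, 'b:val=3:pk=1')]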

Storage differences

Up to now I have been describing the search index. In this section I will describe the document storage.

MongoDB stores the document via the storage engine, which will soon be WiredTiger only, although I hope MongoRocks returns. I assume that WiredTiger with MongoDB is row-wise so that each document is (usually) a contiguous sequence of bytes on some disk page.

Rockset stores each document twice -- row-wise and column-wise. Alas, this gets complicated. The row-wise format is not the traditional approach with one key-value pair in the storage engine per document. Instead there is one key-value pair per attribute per document. This is similar to the CockroachDB approach. I prefer to still call this row-wise given that attributes from a document will be co-located in the LSM SSTs. I am also grateful for the many great blog posts from CockroachDB that I can reference.

With two copies of each document in the base storage there is more storage overhead. Fortunately that overhead is reduced courtesy of the write efficiency and compression friendliness of an LSM.

The Rockset blog post does a great job of explaining this with pictures. I do a worse job here without pictures. For the document { pk:1, a:7, b:3 } when the primary key is pk then the keys for row-wise are R.1.a and R.1.b and for column-wise are C.a.1 and C.b.1. The row-wise format clusters all attributes for a given document. The column-wise format clusters all values across documents for a given attribute. The row-wise format is efficient when most attributes for a document must be retrieved. The column-wise format is efficient for analytics when a given attribute across all documents must be retrieved.
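A sketch of those two key encodings in Python, mimicking the notation above rather than Rockset's actual on-disk format:

  def rockset_style_keys(doc, pk_field="pk"):
      # Row-wise keys cluster by document; column-wise keys cluster by attribute.
      pk = doc[pk_field]
      row_wise, col_wise = {}, {}
      for attr, val in doc.items():
          if attr == pk_field:
              continue
          row_wise[("R", pk, attr)] = val
          col_wise[("C", attr, pk)] = val
      return row_wise, col_wise

  row, col = rockset_style_keys({"pk": 1, "a": 7, "b": 3})
  # row -> {("R", 1, "a"): 7, ("R", 1, "b"): 3}
  # col -> {("C", "a", 1): 7, ("C", "b", 1): 3}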

Editorial 1

I interpret the MongoDB docs to mean that when a query uses a wildcard index it cannot use any other index and the wildcard index will only be used for a predicate on a single attribute. I expect that to limit the utility of wildcard indexes. I also expect MongoDB to fix that given how fast they reduce their tech debt. The limitations are listed below. The 1st won't be fixed. The 2nd and 3rd can be fixed.
  1. Compound wildcard indexes are not supported.
  2. MongoDB cannot use a non-wildcard index to satisfy one part of a query predicate and a wildcard index to satisfy another.
  3. MongoDB cannot use one wildcard index to satisfy one part of a query predicate and another wildcard index to satisfy another.
I assume that Rockset can combine indexes during query evaluation given their focus on analytics. Thanks to the Rockset team I learned it supports index intersection. It also supports composite indexes via field mappings (functional indexes).

Editorial 2

An open question is whether an LSM can do clever things to support analytics. There has been some work showing the compression benefit of using column-wise storage within a RocksDB SST for the larger levels of the RocksDB LSM. Alas, the key RocksDB workloads have been OLTP. With the arrival of Rockset there is more reason to reconsider this work. There can be benefits in the compression ratio and reduced overhead during query processing. Vertica showed that it was useful to combine a write-optimized store for recent writes with a read-optimized store for older writes. An LSM already structures levels by write recency. Perhaps it is time to make the larger levels read-optimized, especially when column-wise data is inserted into the LSM.

Update - I read the paper years ago and then forgot that Kudu combines an LSM with columnar storage.

Editorial 3

The previous section is mostly about being clever when storing column-wise data in an LSM to get better compression and use less CPU during query evaluation. This section is about being clever when storing the search index. 

The search index is likely to have many entries for some values of a given attribute. Can an LSM be enhanced to take advantage of that for analytics workloads? In other storage engines there are two approaches -- bitmap indexes and RID-lists. Adapting these for an LSM is non-trivial but not impossible. It is likely that such an adaptation would only be done for the larger levels of the LSM tree.
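As a toy sketch of the RID-list idea (not a proposal for how RocksDB would implement it): collapse the many (attribute, value, pointer) entries for a popular (attribute, value) pair into one entry that stores a delta-encoded sorted list of pointers, which compresses well.

  from collections import defaultdict

  def build_rid_lists(entries):
      # entries: (attribute, value, pointer) tuples with integer pointers.
      lists = defaultdict(list)
      for attr, val, ptr in sorted(entries):
          lists[(attr, val)].append(ptr)
      # Delta-encode each sorted pointer list so it compresses well.
      return {k: [v[0]] + [b - a for a, b in zip(v, v[1:])]
              for k, v in lists.items()}

  rids = build_rid_lists([("a", 7, 10), ("a", 7, 13), ("a", 7, 19), ("b", 3, 10)])
  # rids -> {("a", 7): [10, 3, 6], ("b", 3): [10]}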
