Small Datum: mongodb

Showing posts with label mongodb. Show all posts

Wednesday, February 19, 2025

My database communities

I have been working on databases since 1996. In some cases I just worked on the product (Oracle & Informix), in others I consider myself a member of the community (MySQL, Postgres & RocksDB). And for MongoDB I used to be in the community.

I worked on Informix XPS in 1996. I chose Informix because I could live in Portland OR and walk to work. I was fresh out of school, didn't know much about DBMS, but got a great starter project (star query optimization). The company wasn't in great shape so I left by 1997 for Oracle. I never used Informix in production and didn't consider myself as part of the Informix community.

I was at Oracle from 1997 to 2005. The first 3 years were in Portland implementing JMS for the app server team and the last 5 years at Oracle HQ working on query execution. I fixed many bugs, added support for ieee754 types, rewrote sort and maintained the sort and bitmap index row sources. The people there were great and I learned a lot but I did not enjoy the code base and left for a startup. I never used Oracle in production and don't consider myself as part of the Oracle community.

I lead the MySQL engineering teams at Google for 4 years and at Facebook/Meta for 10 years. I was very much immersed in production and have been active in the community since 2006. The MySQL teams got much done at both Google (GTID, semi-sync, crash-safe replication, rewrote the InnoDB rw lock) and Facebook/Meta (MyRocks and too many other things to mention). Over the years at FB/Meta my job duties got in the way of programming so I used performance testing as a way to remain current. I also filed many bugs might still be in the top-10 for bug reports. While Oracle has been a great steward for the MySQL project I have been critical about the performance regressions from older MySQL to newer MySQL. I hope that eventually stops because it will become a big problem.

I contributed some code to RocksDB, mostly for monitoring. I spent much more time doing performance QA for it, and filing a few bugs. I am definitely in the community.

I don't use Postgres in production but have spent much time doing performance QA for it over the past ~10 years. A small part of that was done while at Meta, I had a business case, and was able to use some of their HW and my time. But most of this has been a volunteer effort -- more than 100 hours of my time and 10,000+ hours of server time. Some of those server hours are in public clouds (Google, Hetzner) so I am also spending a bit on this. I found a few performance bugs. I have not found large performance regressions over time which is impressive. I have met many of the contributors working on the bits I care about, and that has been a nice benefit.

I used to be a member of the MongoDB community. Like Postgres, I never supported it in production but I spent much time doing performance QA with it. I wrote mostly positive blog posts, filed more than a few bugs and even won the William Zola Community Award. But I am busy enough with MySQL, Postgres and RocksDB so I haven't tried to use it for years. Regardless, I continue to be impressed by how fast they pay down tech debt, with one exception (no cost-based optimizer).

Wednesday, August 11, 2021

On storage engines

I wish it were easier to implement new storage engines for MySQL, Postgres and MongoDB and other OSS databases. There is so much innovation that we miss out on - FASTER is one example. All (MySQL, MongoDB, Postgres) have storage engine APIs but there are not many OSS implementations of them.

MyRocks, MySQL Aurora and MySQL HeatWave are examples of the benefits. But they also show that it helps to have the backing of a well-funded company because this is a huge undertaking.

The API for Postgres is the most recent and perhaps that will be the most popular. The API for MySQL is the oldest and has become harder as more requirements (like partitioning) are pushed into it. The MongoDB API is friendlier than the MySQL API, but MongoRocks was deprecated when the API was enhanced to support transactions and RocksDB wasn't able to support user-provided commit timestamps.

One problem is naming. MongoDB and WiredTiger are databases, but WiredTiger is also a component of MongoDB. To avoid confusion I will use storage engine for things like WiredTiger, RocksDB and InnoDB and database for the systems that provide query languages, replication and more.

I have been involved in three such engines -- MyRocks, MongoRocks and a read-only MySQL engine used long-ago at Google. MyRocks benefited from a large team and Sergey Petrunya at MariaDB and MongoRocks benefited from amazing support from MongoDB the company.

Two interesting things are:

The impact of database design decisions on the storage engine API. See what is needed for transactions in the MongoDB API.
The impact of the original storage engine on the storage engine API.

MongoDB doesn't take advantage of clustered indexes in WiredTiger or RocksDB. An extra (hidden) index must be used to map between DiskLoc and the PK index. See SERVER-14569.
I have more research to do before I understand the impact of vacuum and 32-bit transactions IDs on the Postgres API.

I also wonder whether the RocksDB API could be the universal storage engine API. In theory any storage engine implementing that would be able to replace RocksDB in MyRocks or MongoRocks. The goal is then to implement the RocksDB API glue once per database and then be able to use a variety of storage engines based on that. Of course, XKCD has a great take on standards and this might just be naive and wishful thinking on my part.

Thursday, August 6, 2020

Over fetching in a DBMS

By over fetching I mean fetching irrelevant documents or fields while processing a query. By fetching I mean the data examined by the DBMS, not the result set returned to the user. By irrelevant I mean documents and fields that won't change the query result if they don't exist. This also applies to a SQL DBMS after substituting column for field, row for document and collection for table. In the rest of this post I mostly use MongoDB names (documents, fields and collections). I might reuse some of the terminology from a post on predicates from Use The Index, Luke. That post is interesting.

I can refine the over fetching definition into four parts:

co-located
single-table
join
aggregation
other

Co-located

Co-located over fetching occurs when irrelevant fields are read from a document or indexes. Columnar storage is a one way to avoid this for analytic workloads. Sometimes a covering index can reduce this for OLTP workloads.

For a collection C with fields a1, a2, ..., aN then db.C.find({}, {a1:1, a2:1, _id:0}) does co-located over fetching when a collection scan is used. An index on (a1, a2) avoids over fetching because the fields a3 ... aN are not in the index. The SQL version of the query is select a1, a2 from C.

Co-located over fetching also occurs when an index has fields that are needed by the query. For the query in the previous paragraph if there is an index on (a1, a2, a3, a4) and the index is used for the query then over fetching occurs because a3 and a4 are read by the DBMS but not needed.

Single-table

Single-table over fetching occurs when irrelevant documents are read. A good index is one way to avoid this, especially for OLTP workloads.

For this example the collection C has the documents:

{ _id: 0, a1:7, a2:8, a3:9 }
{ _id: 1, a1:7, a2:18, a3:19 }
{ _id: 2, a1:27, a2:28, a3:29 }

The query db.C.find({a1: 7}) does single-table over fetching when there isn't an index on a1. In that case a collection scan is done that examines all docs but the doc with _id:2 is irrelevant. With an index on a1 then only the docs with _id:0 and _id:1 are examined and there is no over fetching. The SQL version of this query is select * from C where a1=7.

Single-table over fetching can occur with indexes. For the query db.C.find({ a1: 7, a2: 8}) with an index on a1 then docs with _id:0 and _id:1 are examined but _id:1 is irrelevant because the predicate on a2 excludes it. An index on (a1, a2) avoids the over fetching. The SQL version of this query is select * from C where a1=7 and a2=8.

Join

Join over fetching occurs when join predicates filter documents. The docs that were filtered by the join predicate don't change the query result and figuring out how to avoid examining them prior to the join might improve performance.

With MongoDB $lookup is a left outer join and over fetching cannot occur for the input documents. But it can occur for docs in the from collection when there is no index, or no good index, on it. But explain doesn't show the access path for the from collection -- until SERVER-22622 is fixed.

Whether join over fetching occurs in SQL depends on the join (inner, left outer, right outer, etc). The example uses tables C1 and C2 and the query select * from C1, C2 where C1.x1 = C2.y1

C1 has the rows: (x1:1, x2:2), (x1:11, x2:12), (x1:21, x2:22)
C2 has the rows: (y1:1, y2:2), (y1:11, y2:12), (y1:31, y2:32)

Assume this is evaluated by scanning C1 and then probing C2 (nested loops join). If there is no index on C2.y1 then join over fetching occurs for C2 because (y1:31, y2:32) is examined by a table scan but filtered by the join predicate. Join over fetching occurs for C1 whether or not there is an index on C2.y1 because (x1:21, x2:22) is examined but filtered by the join predicate.

Aggregation

Aggregation over fetching occurs when docs are filtered by the aggregation operator semantics. The obvious ones are $max and $min in MongoDB (max and min in SQL).

For this example the collection C has the documents { _id: 0, a1:7, a2:8 }, { _id:1, a1:7, a2:9 } and the queries are:

MongoDB: db.C.aggregate([ { $group : { _id: "$a1", maxval : { $max : "$a2" } } } ])
SQL: select max(a2), a1 from C group by a1

Aggregation over fetching occurs for the doc with _id:0 because it doesn't have the max value of a2 for the group with a1:7. An index on (a1, a2) can avoid the aggregation over fetching but you should consult the DBMS documentation to understand whether that optimization has been implemented. MySQL has loose index scan for min/max and distinct, Postgres has recursive CTE (here and here) while MongoDB has DISTINCT_SCAN (for $first but not for $min or $max - see SERVER-40090).

Other

Sometimes index-only queries aren't as index-only as you want them to be. Postgres relies on bits being set in the visibility map or it will fetch the base row from the heap. InnoDB relies on a different mechanism but there are cases where it too must fetch base row images from the PK (clustered) index for queries that appear to be index only.

Wednesday, July 15, 2020

Indexing and write-heavy workloads

When I see impressive numbers for the insert rate that a DBMS can sustain I wonder what indexes exist and whether the inserts are in sequential or random order with respect to each index. One way to explain this is in terms of the numbers of points in the index at which the inserts occur. Although I use streams rather than insert points in what follows.

I am writing this in part so that I can reference this post in future performance reports when describing workloads. It isn't sufficient to state that inserts are in PK order. They can be in ascending or descending PK order. When ascending the point at which the inserts are done can be at the right end of the index (inserted keys > than existing keys) or somewhere in the middle of the index. When descending the inserts can be done at the left end of the index (inserted keys < existing keys) or somewhere in the middle of the index.

Explaining insert patterns

There are four attributes per index that can explain such insert patterns. The attributes are:

nAsc - number of streams for which inserts occur in ascending order WRT the index
nDesc - number of streams for which inserts occur in descending order WRT the index
nLHS - the number of descending streams that are at the left end of the index
nRHS - the number of ascending streams that are at the right end of the index

Constraints:

nAsc >= 0, nDesc >= 0 and (nAsc + nDesc) >= 1
nLHS and nRHS must be 0 or 1
if nLHS is 1 then nDesc must be >= 1
if nRHS is 1 then nAsc must be >= 1

There is one exception. When the insert pattern is random WRT the index then inf is used instead of the four attributes.

Geek Code

This is a geek code for explaining insert patterns. The attributes are specified per index. When there is only a PK index, named pk, and inserts occur in PK order at the right end of the index (right growing) then the geek code is:

pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)

When there is only a PK index but the inserts are in random order WRT the PK then the geek code is:

pk=(inf)

To improve readability I omit attributes for which the value is 0. So these mean the same thing:

pk=(nAsc:1, nDesc:0, nLHS:0, nRHS:1)
pk=(nAsc:1, nRHS:1)

Why?

I am interested in this for three reasons. First, index maintenance has a big impact on insert performance whether or not the working set is in memory. Second, there are optimizations that a DBMS can do for some insert patterns and I suspect there is room for even more optimizations. Many storage engines optimize for right-growing inserts. In that case RocksDB with leveled compaction will have write amplification of 2 -- write once for the WAL, write again for the memtable flush, no compaction. Finally, this makes it easier to explain write-heavy workloads.

Steams and insert points

I use ordered arrays rather than indexes to explain streams (insert points). Assume the array starts as: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], this represents the keys in the PK index and there are no secondary indexes. Some of the examples showwhy I use streams to describe this.

Examples:

Random

pk=(inf)
insert sequence: 1.5, 6.5, 1.7, 8.1, 0.0, 4.5, ...

Right growing

pk=(nAsc:1, nRHS:1)
insert sequence: 10.0, 11.0, 12.0, ...

Left growing

pk=(nDesc:1, nLHS:1)
insert sequence: 0.0, -1.0, -2.0, ...

Left & right growing

pk=(nAsc:1, nDesc:1, nLHS:1, nRHS:1)
insert sequence: 10.0, 0.0, -1.0, 11.0, 12.0, -2.0
insert sequence as interleaved streams: [10.0, 11.0, 12.0] and [0.0, -1.0, -2.0]

1 middle ascending

pk=(nAsc:1, nRHS:0)
insert sequence: 8.1, 8.11, 8.111, 8.1111, ...

1 middle descending

pk=(nDesc:1, nLHS:1)
insert sequence: 7.9, 7.89, 7.889, 7.8889, ...

1 middle ascending, 1 middle descending

pk=(nAsc:1, nDesc:1)
insert sequence: 8.1, 7.9, 8.11, 7.89, 8.111, 7.889, ...
insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [7.9, 7.89, 7.889]

2 middle ascending:

pk=(nAsc:2)
insert sequence: 8.1, 6.1, 8.11, 6.11, 8.111, 6.111, ...
insert sequence as interleaved streams: [8.1, 8.11, 8.111] and [6.1, 6.11, 6.111]

N middle ascending

pk=(nAsc:N) for some finite value N

Explaining the insert benchmark

Until recently I ran the insert benchmark by first creating a PK index and 3 secondary indexes per table (or collection) and then doing inserts. Informally, the inserts were in PK order but random WRT to each secondary index. More formally, the insert pattern is the following when the secondary indexes are named s1, s2 and s3:

pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)

The insert benchmark can become extremely IO-bound because of the random insert patterns for each of the secondary indexes. In the worst case with a B-Tree there is one page read and one page written back per secondary index per insert (3 pages read, 3 pages written back with 3 secondary indexes).

I recently changed the way that I run the insert benchmark to create the PK index, load some data, create secondary indexes and then load more data. In this case the insert pattern during load some data is:

pk=(nAsc:1, nRHS:1)

And then during load more data (with secondary indexes in place) is:

pk=(nAsc:1, nRHS:1)
s1=(inf)
s2=(inf)
s3=(inf)

Monday, March 16, 2020

Insert Benchmark v3

I expect to replace the insert benchmark later this year. The insert benchmark I have been using might be ibench v2 as ibench v1 came from Tokutek. So the replacement I write would be ibench v3.

The reasons for a replacement include:

Switch from Python to something more performant (Java or Go) as it is difficult to sustain high QPS rates (100k / second or more) with Python and a Java/Go client will consume less CPU. I assume the DBMS I care about all have good golang clients (MongoDB, MySQL, Postgres).
Make the benchmark less synthetic. Early ibench was insert-only for a table with a PK and 3 secondary indexes. Eventually I added support for short range queries. It has always used uniform distribution. The plan for ibench v3 is to model a monitoring workload -- many inserts, some updates, many queries. It will use a PK and 1 or 2 secondary indexes.
Continue to support multiple DBMS -- MongoDB, MySQL and Postgres.
As a bonus, don't ignore coordinated omission

I still need to confirm whether this is worth implementing or whether I have a case of NIH and should be using TSBS. I browsed TSBS and the indexes I describe here might match the indexes that TSBS would use. However, TSBS has more query diversity. My focus is on the storage engine and a smaller set of queries is sufficient. Regardless, I need to spend more time reading about TSBS.

Logical Schema

The logical schema describes the data in one insert operation. The physical schema describes how that is stored in the DBMS. I focus on the logical schema here.

The logical schema is: timestamp, deviceID, [metricID, metricValue]+

This is the data from one insert for a given device. It has data for 1+ metrics.
timestamp is a 64-bit integer that will be mostly increasing on insert. Clocks are not synchronized across clocks, some devices will have bad/stuck clocks and some devices will have clock drift.
deviceID is a 64-bit integer. The number of devices will be used to scale the workload. There can be millions of devices for large scale factors.
Each (metricID, metricValue) pair represents the value for a metricID from a given device at a given point in time. metricID is a 32-bit integer. The number of metrics will also be used to scale the workload. There will be fewer metrics than devices. metricValue is a 64-bit integer

Workload

Fields are abbreviated when describing the workload: t=timestamp, d=deviceID, m=metricID, v=metricValue. Operations are described in SQL, except for insert.

The workload is a combination of:

GetOne: select v, t from C where m=M and d=D and t between T1 and T2 order by t

Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N

GetAll: select v, t from C where d=D and t between T1 and T2 order by t

Query variants are: t >= T1 order by t limit N, t <= T1 order by t desc limit N
Like GetOne but missing the equality predicate on m
If m isn't known then an extra secondary index on (d,t) might be needed unless the DBMS supports index skip scan via an index on (d,m,t,...).

Max, Min: select v, t, d from C where m=M and t between T1 and T2 order by v asc/desc limit N
Compare: call GetOne for m in (...) and d=D and t between T1 and T2
Insert: insert into C values (t, d, [(m1, v1),(m2, v2),...])

SQL above is not standard and assumes a schema that might not be used.
(t,d,m) is unique. I am still deciding whether the benchmark client can generate intermittent duplicate inserts and how that should be handled.

Rollup: Uses Compare to get all data for d in (...) in time range, then replaces that data with values aggregated over coarser granularity. Assume that base data is rolled up into 5 minute, then 1 hour and then 1 day granularity.
Backfill: inserts missing data that might be hours or days old

The value of N in GetOne, GetAll, Max and Min above will vary. For Max and Min it is likely to be <= 100 and frequently <= 10. For GetOne/GetAll in some cases the user wants all data in a time range, but in other cases they will set N to be <= 10,000 and not get all data in the time range.

Open Issues

Spammy data - is it worth simulating devices that misbehave? This includes sending data too frequently, sending incorrect values and timestamps and not sending data for some time periods.
Is the aggregated data stored inline with the base data or in a separate table? If stored inline do entries need a tag to indicate granularity like base, hourly, daily, etc?
Should the benchmark include queries and writes to correct bad data?
Can there be a way for the benchmark to run faster than wall clock time so that per-hour and per-day rollup is done more frequently than per real hour and per real day? This avoids the need to run the benchmark for too long.
How does a scale factor affect the size of the database, the write load and the query load?

Physical Schema - Write Optimized

The goal here is to make writes fast. That is done by using a unique index for (t,d). Inserts are to the right end of the index assuming they usually arrive in t order. While this makes inserts fast it makes queries too slow. The (t,d) index over-fetches because only the range predicate on t can be used for the index access path:

For the GetOne and Compare queries it over-fetches by a factor of |m| * |d|. If there are 1M devices and 10 metrics per insert it fetches 10M times more index entries than needed.
For the Max/Min queries it over-fetches by a factor of |m|.

The large over-fetching for GetOne and Compare makes it infeasible to run them with the write optimized schema. The Max and Min queries could still be run -- it would be painful but not infeasible given that |m| is likely to be <= 100.

Implementation notes:

MongoDB - There are 3 choices for the unique index. The first choice is to use the subdocument {t, d} as the value for _id and make sure that all clients encode it in the same way. The second choice is to concatenate t and d into a string for the value of _id. This avoids the need to worry about clients encoding as in the first choice at the cost of storing t and d twice per doc. The third choice is to use ObjectId as the value of _id and then have a unique compound index on {t:1, d:1}. The cost for this is an extra index but inserts are in index order for both.
SQL with document datatypes - Use one row per insert and document datatype support in MySQL or Postgres to store all of the (m,v) pairs in that row.
SQL without document datatypes - Use one row per metric per insert. The unique index is defined on (t,d,m) rather than (t,d). The columns in the row are t, d, m, v.

Physical Schema - Read Optimized

These indexes are important to make queries efficient.

GetOne and Compare need one of (d,m,t) or (m,d,t).
Max/Min need one of (m,t,v,d) or (m,v,t,d)

From the perspective of one GetOne query there isn’t a difference between the index choices because there are equality predicates on d and m. But from a cache perspective one might be better than the other as they cluster on the first indexed field -- by d or m. If some metrics are less likely to be queried then clustering by m can be a better choice.

The unique constraint is on (d,m,t) or (m,d,t). If v is also in that index then uniqueness cannot be enforced. If v is not in the index then it might not be covering. With Postgres I can use the INCLUDE clause to get v in the index. With InnoDB and MyRocks the table is clustered on the PK, so it is sufficient to make (d,m,t) or (m,d,t) the PK. With MongoDB if I want a covering index then I have to create two indexes -- unique on (m,d,t) and then non-unique but covering with (m,d,t,v) or I can skip the covering index to reduce the write cost but make queries slower. Finally, by unique for MongoDB I mean that this is the value for _id, and then we are back to the question of using a subdocument for _id.

For Max and Min:

The (m,t,v,d) index uses index predicates for m and t and then does a sort. So this has an extra cost from the sort and can’t benefit from stopping the index scan once the limit N has been satisfied. Some DBMS have optimizations to make top-N sort fast, but there is no way to avoid the cost of not stopping the index scan early.
The (m,v,t,d) index uses index predicates only for m but avoids the sort and can stop the index scan once the limit N has been satisfied. The range predicate on t would be a filter predicate - evaluated for each entry read from the index.

There are cases where (m,t,v,d) is a better choice, but I think that (m,v,t,d) will be better more of the time. The index on (m,t,v,d) would be better when the selectivity from using t as an index predicate offsets the cost of the sort and the data is skewed so that limit N isn’t satisfied until most of the index has been scanned.

Write efficiency in the read optimized schema

Below I use streams of writes to mean the number of leaf pages that will be subject to read-modify-write from index maintenance during inserts at any point in time. If all inserts are to the right end of the index then there is 1 stream. If the index is on (client-ID, time), there are 10 clients active concurrently and each client inserts in time order then there are 10 streams of writes. When there is one leaf page in cache for each stream then IO efficiency is better. Otherwise each insert is more likely to force a page write back followed by a page read.

For the indexes used by GetOne and Compare -- (d,m,t,v) or (m,d,t,v) -- there will be |d|*|m| streams of writes. The cache demand from this is |d|*|m| pages. A small workload might use |d|=1M and |m|=10 and 10M 8kb pages needs 80G of RAM. Both indexes have a similar amount of cache amplification.

For the indexes used by Max and Min, the (m,t,v,d) index has |m| streams of writes vs |m|*|v| for the (m,v,t,d) index. Thus the (m,v,t,d) provides better query performance at the cost of more random IO and more cache amplification. One index has much less cache amplification.

For the non-unique secondary index used for Max and Min, MyRocks can use read-free index maintenance and the number of write streams isn't an issue for it. But it will be an issue for other DBMS.

Partitioning

Partitioning the tables and collections by time can help with write and read performance as well as making other tasks easier -- removing and rolling up old data. For example there could be a partition per hour for the last N hours. However if a DBMS doesn’t support partitions then there is a burden on the application developer to support cross-partition queries. As a user I expect to run queries across hourly boundaries.

Friday, February 14, 2020

Describing replication

There is an opportunity for confusion when describing replication including physical vs logical and synchronous vs asynchronous. I prefer describing replication as physical, statement or document/row rather than logical vs physical and briefly define them below.

In physical replication the log records the changes to the database pages (change bytes starting at offset X on page Y from this sequence to that sequence). A benefit of this approach is that the database pages will be the same between primary and replica. This approach also avoids the overhead (more CPU than IO) of evaluating the queries that lead to the page changes. A problem from this approach is that the pages need to be the same between the primary and replica. Changes to pages must be deterministic, long running reporting queries on a replica might block replication apply, MVCC GC requires coordination, etc.

In statement based replication (SBR) the log records the statements that modified the database. The benefit from this approach is that it is easier to implement. When one statement changes many rows then this also reduces the size of the log. But for OLTP statements are less likely to change many rows. Unfortunately long running write statements on the primary will be repeated on the slave and that can waste CPU and cause replication lag. It is also difficult to detect dependencies between transactions which makes it harder to replay in parallel on a replica. I did web-scale MySQL for many years with SBR. It worked, but the move to RBR is a good thing.

In document/row replication (aka RBR or DBR) the log has one entry per changed doc/row. This avoids the need for the replica to reevaluate the statement that generated the changes -- it can just fetch the docs/rows by ID and apply changes to them. When the doc/row entries in the log also include PK (_id) values then it is possible to determine which transactions can be replayed in parallel. It is possible to make these changes idempotent in contrast to SBR which reduces the burden of making replay crash safe.

MongoDB oplog

This is an example of the oplog contents for MongoDB. This is document based based on the terminology above:

db.bar.insertMany([{a:1}, {a:2}, {a:3}])

use local
db.oplog.rs.find({"o.msg": {$ne: "periodic noop"}},
{op:1, o:1}).sort({$natural:-1}).limit(3)

{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 3 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 2 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 1 } }

Benchmarks vs real workloads

How do you test a DBMS that you are building for internal usage or the world? This is a good question whether you are the upstream DBMS or maintaining a downstream fork. The choices include real workloads, synthetic benchmarks and throw it over the wall.

There is a great paper from Microsoft on doing live capture & replay at scale for real customers. I hope to add a link for a talk on the live capture & replay that was done for testing at Facebook.

Real workloads

Real workloads are great. Alas real workloads are hard to get (code IP, data privacy). Even if these are resolved a bigger problem is that real workloads are complex and buggy (all software is buggy). If you get a real workload without a team to support it then you will spend a lot of time debugging it and otherwise trying to figure it out. This is probably more true for OLTP than analytics.

Real workloads v2

Capture & replay (C&R) is an easier way to get some of the goodness of a real workload with less effort. This avoids the overhead of dealing with the code from the real workload. This doesn't avoid issues with data privacy.

C&R can be online or offline. Online C&R does the replay online (immediately after the capture). Online C&R is a great choice when you have significant internal workloads that use the DBMS you are building. Running a DBMS in production is a great education for database developers in so many ways. This is one more perk.

Offline C&R archives the captured workload for replay at a later time. This is useful for workloads that can't be shadowed (customer workloads). This is also useful when you want a workload that can be repeated (replayed many times) as online C&R does not allow for that. An interesting problem for offline C&R is making the replay realistic -- respecting the timing and concurrency that was recorded. This is more challenging when the DBMS being tested is a lot faster than the DBMS used for the capture.

One question for C&R is the point at which the capture is done. When capture is done at the DBMS then you don't miss anything. But sometimes it is easier to do the capture at the application tier.

The Facebook MySQL team had a great setup for online C&R via a shadow tier. I don't have much experience with offline C&R but I know it has been useful for others. In my Google MySQL years I built a tool to sample production queries (SELECT, not writes) and replay them against replicas running new and old binaries to compare performance and correctness. That was a fun and useful project for me. It was mostly online C&R. One interesting bug it found was a result of changing the internal datatype for aggregation between decimal and IEEE754.

Synthetic benchmarks

Benchmarks have a bad reputation but I like them. I even enjoy benchmarketing but only when I am not involved. I rarely run benchmarks in isolation. Absolute values (QPS, TPS) don't help me. I am a big fan of relative comparisons (new binary vs old binary, this DBMS vs that DBMS).

Comparisons require sufficient expertise in all of the systems that are tested as the goal is to get results that can be explained and trusted. Expertise is hard to find. Therefore I have less faith in benchmark results that test more than one DBMS -- whether this is work done by myself or others. Perhaps one day someone will do a study and provide a function that estimates truthiness as a function of the number of DBMS tested.

Throw it over the wall

This has to be done at some point in time, otherwise you are building a DBMS that nobody uses. Hopefully the steps above have been done to increase the chance that there won't be too much drama at this point.

Thursday, February 13, 2020

Short guide to MongoDB monitoring

This is short and incomplete. But it is a good start. This is written from the perspective of someone who spends all of their time trying to explain MongoDB performance for benchmarks. I have plenty of experience with databases in production. I have no experience with MongoDB in production.

Notes:

One day I hope for the equivalent of user/table statistics
For profiling use db.setProfilingLevel and db.getProfilingStatus.
To enable debug level messages in the diagnostic log use db.setLogLevel and db.getLogComponents. However I think it is best to start with non-debug messages so you can ignore these for a while.
I wish the set and get names above were symmetric. Just remember the set*Level method names and don't worry about the get* names.
db.setProfilingLevel also determines what can get written to the diagnostic log. The important parameter is slowms. My advice is to set level=0 and slowms to something > 0 and use the diagnostic log.
Learn how to use mtools to process entries from the diagnostic log. The tools are great and the docs are awesome.
The diagnostic log entries for COMMAND and TXN are the most likely entries you will examine. These lines can include many performance metrics but they are only printed when the values are > 0. Examples are below.
Also see db.stats, db.serverStatus, db.collStats, db.collection.stats, db.collection.latencyStats, connPoolStats, replSetGetStatus,
Consider saving the FTDC files from the diagnostic.data directory if you have problems. There are some tools for accessing the data from an FTDC file.
Use explain to understand query plans and execution
Use mongostat and mongotop to see performance from a high level before drilling down
I don't have much experience with them yet but MongoDB has additional tools and services for monitoring.

Diagnostic log examples

These show some of the counters that can be included in diagnostic log entries. Remember that many of the counters are only printed when > 0.

Metrics on writes (bytesWritten, timeWritingMicros)

2020-02-04T18:05:24.787-0800 I COMMAND [conn41] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3139, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868324, 23435), signature: { hash: BinData(0, 9E42A21BD71EA0BB8DFEAD52013CFFF325E60BB6), keyId: 6789772850304647170 } }, lsid: { id: UUID("5d900961-93cf-4f7a-9426-940a37a71eca") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesWritten: 324324, timeWritingMicros: 656 } } protocol:op_msg 401ms

Metrics on reads (bytesRead, timeReadingMicros)

2020-02-04T18:08:52.442-0800 I COMMAND [conn33] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3719, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868532, 2858), signature: { hash: BinData(0, D289506AFBD0E8C8E8184112608CFED62E9A1B7D), keyId: 6789772850304647170 } }, lsid: { id: UUID("7391b47c-fd95-491b-a640-20c953bcacc0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1369, timeReadingMicros: 10 } } protocol:op_msg 389ms

timeWaitingMicros is less frequent. In this case it was ~2.2 seconds. That seems like a lot but the operation wrote more than 4M of data.

2020-02-04T18:22:05.668-0800 I COMMAND [conn46] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 144, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580869323, 2352), signature: { hash: BinData(0, 2CF49EC657832ECB9D7DD0AC5C65E5C3F80CA10B), keyId: 6789781594858061826 } }, lsid: { id: UUID("d88d7d65-f293-4060-a894-17c58123d787") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1881, bytesWritten: 4458441, timeReadingMicros: 15, timeWritingMicros: 5071 }, timeWaitingMicros: { cache: 2215267 } } protocol:op_msg 2621ms

timeAcquiringMicros for locks is also less frequent. I have a few examples from insert stress tests. I have yet to learn whether mongod times all lock waits. It isn't easy to do that without an impact on performance.

2020-02-04T17:52:55.133-0800 I COMMAND [conn43] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 1023, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580867574, 46329), signature: { hash: BinData(0, D158AF9490A8ABAEA2969E809D2C27FE57157B41), keyId: 6789772850304647170 } }, lsid: { id: UUID("fabaaa7b-c2c4-4712-ae16-f2c0f58a03a0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 14042 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{} protocol:op_msg 376ms

Commit appears in the log via TXN

2020-02-05T17:31:54.792-0800 I TXN [conn81] transaction parameters:{ lsid: { id: UUID("cae8daad-5603-4a41-87db-3fda07e90bee"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 255, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1580952714, 1391) } }, readTimestamp:Timestamp(0, 0), ninserted:10 keysInserted:20 terminationCause:committed timeActiveMicros:604 timeInactiveMicros:1016 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 5 } }, Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 23 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 1ms

Friday, January 31, 2020

Durability vs Availability

I don't consider myself an expert on distributed systems but sometimes I try to write on the topic. I hope any mistakes here aren't too serious. I like to describe things so here is my attempt.

Durability

Durability is about the probability of losing data whether that data was written to the DBMS yesterday or recently. I describe this as a probability rather than systems that can or cannot lose data. I have spent too much time with web-scale deployments to claim that transactions will never be lost.

Backups prevent loss of data added yesterday or last year. Making recent changes durable is harder and comes at a cost in complexity and commit latency. The solution is to get commit log entries onto multiple servers before something is considered committed. There are fascinating and complicated ways to do this. Fortunately most of us can make use of Raft and Paxos implementations written by experts.

While my team at Google was the first to implement semisync replication, someone else gets credit for the idea to make it lossless. And that turned out to be a great idea. Another interesting idea is the way that MongoDB has separated the read and write paths to provide strongly consistent reads and highly durable writes without implementing synchronous replication on the commit path (Raft is used elsewhere). This is clever and flexible. Users get the semantics they need while only paying for the stronger semantics when desired.

Availability

Availability is about avoiding big and small downtime. An obvious source of big downtime is the time required to restore from a backup and replay archived commit logs after a server fails. The cost of better availability is more hardware because this is frequently done via replica sets, something that looks like synchronous replication and one of:

Single-primary elected via consensus with fast & automated failover and failure detection
Multi-master - I assume this means no failover, which might be the fastest kind of failover

Small Downtime

Small downtime is another way to reduce availability and deserves more PR. Sources of small downtime include:

Too slow commit - not all clients can wait, so too slow might also imply data loss.
Java GC stalls - I am happy to have little experience with servers written in Java
Flash GC stalls - NAND flash response time latency isn't great once you add enough 9s to percentile response time latency
Flash TRIM stalls - we rarely discuss this in public
Other stalls - more things that don't get enough discussion
Overloaded server - who is at fault for this. It seems like users are more willing to blame the DBMS for being too slow when subject to too much concurrent load than they are willing to blame a storage device for lousy response times when subject to ridiculous numbers of concurrent requests

Implementation Details

I once worked on a self-service MySQL project that was ahead of its time. Unfortunately I left the company before it was deployed. I think it succeeded for many years. Since I was a part time product manager on the effort I think of the cross-product of durability and availability via two levels of durability (more lossy, less lossy) and three levels of availability (low, medium, high).

For durability more lossy is provided via async replication or async log shipping. There are weak bounds on the number of commits that might be lost when the primary disappears. Less lossy durability usually requires synchronous replication.

The availability levels are implemented by:

low - on failure of the primary restore from backup and replay the archived commit log. This can take hours.
medium - use HA storage (EBS, Google Persistent Disk, S3, etc). On failure of the primary setup a new compute node, run crash recovery and continue. This takes less time than the low availability solution. You still need backups and commit log archives as the HA storage will occasionally have problems.
high - use replication to maintain a failover target. Hopefully the failover target can handle queries to get more value from the extra hardware.

At this point most of the combinations of the following are useful. However resources are finite and it might not be a good idea to provide all of them (jack of all trades vs master of none):

(more, less lossy durability) X (low, medium, high availability)

Friday, December 20, 2019

Readahead

Q: What is the best readahead size?
A: O_DIRECT

Perhaps I agree with Dr. Stonebraker. This is my answer which might not be the correct answer. My reasons for O_DIRECT are performance, quality of service (QoS) and manageability and performance might get too much attention. I don't dislike Linux but the VM, buffered IO, readahead and page cache are there for all Linux use cases. They must be general purpose. Complex system software like a DBMS isn't general purpose and can do its own thing when needed. Also, I appreciate that kernel developers have done a lot to make Linux better for a DBMS. One of the perks at FB was easy access to many kernel developers.

Most of my web-scale MySQL/InnoDB experience is with O_DIRECT. While InnoDB can use buffered IO we always chose O_DIRECT. Eventually, RocksDB arrived and it only did buffered IO for a few years. Then O_DIRECT support was added and perhaps one day the web-scale MyRocks team will explain what they use.

I deal with readahead when running benchmarks and a common experience is using the wrong (too large) value and then repeating tests which means I spend more time and more SSD endurance thanks to buffered IO. I have many blog posts with performance results for readahead including at least one for MongoDB. Usually my goal was to find which small value is good enough. I learned that 0 is too small. Readahead can help scan-heavy workloads, but my focus is on OLTP where we avoided most scans except for logical backup.

I understand why buffered IO is used by some DBMS. Early in the product lifecycle it can be a placeholder until more features are added to make O_DIRECT performant. The benefits of the OS page cache include:

Filesystem readahead can be used before the DBMS adds support for prefetching when doing scans. But filesystem readahead is a black box, might differ between filesystems, provides no metrics and will do the wrong thing for some workloads. InnoDB provides a prefetch feature which can help when O_DIRECT is used. I disabled it because OLTP. The Facebook MySQL team (thanks Nizam) added logical readahead to make logical backup faster and more efficient. Filesystem readahead is likely to struggle with index structure fragmentation, so it is best suited for heap-organized tables and will suffer with index scans.
Doing writes to the OS page cache followed by fsync can be used before the DBMS adds support for async IO or background write threads. But Postgres suffered for so long from this approach because calling fsync with an unknown amount of dirty pages in the OS page cache can starve all other pending IO requests for many seconds. The situation is less dire today thanks to work by Jens Axboe to make writeback less annoying. There was much discussion in 2014 at a summit that included Postgres and Linux kernel developers. In addition to Linux improvements, features have been added to Postgres to reduce the impact from writeback storms -- read this to learn about spread checkpoints.
For a DBMS that does compression it is easier to use the DBMS cache for uncompressed pages and the OS page cache for compressed pages. I am familiar with amazing work in InnoDB to manage both in the DBMS cache. We all agree the code is very complex. RocksDB also has an option to cache both in its block cache but I have little experience with the feature. It is hard to figure out the best way to divide the DBMS cache between compressed and uncompressed pages.

Performance advantages for O_DIRECT include:

Does one memory copy to move data from storage to the DBMS while buffered needs two
Avoids CPU and mutex contention overhead in the OS page cache
Avoids wasting memory from double buffering between the DBMS cache and OS page cache

QoS advantages for O_DIRECT include:

Filesystem readahead is frequently wrong and either wastes IO or favors the wrong user leading to worse IO response times for other users
OS page cache will get trashed by other services sharing the host
Writeback storms starve other IO requests. Writeback is usually a background task and can tolerate response time variance. Too much writeback makes user reads and log fsync slower and those operations don't want response time variance.
Reduces stalls - this is a placeholder because my local expert has yet to explain this in public. But you will have a better time with Linux when moving less data through the OS page cache, especially with modern storage devices that can sustain many GB/sec of throughput. And when you aren't having a good time then you can fix the DBMS. The DBMS is involved whether or not it relies on the OS page cache so you always have to make it work.

Manageability advantages for O_DIRECT include:

DBMS prefetch and writeback are documented, tunable and provide metrics. Such claims are less true for filesystem readahead and VM writeback. There is a lot of advice on the web and much disagreement especially on the topic of min_free_kbytes. Domas used to be my source on this but he doesn't blog enough about the Linux VM.

Monday, December 9, 2019

Still web scale

I am excited to start a new job next week working on performance at MongoDB. I have been a fan of the people and product for years and I look forward to contributing from the inside. The reasons I have been a fan include the rate at which the product has improved, WiredTiger and their contribution to MongoRocks.

I look forward to learning the modern performance analysis tool chain courtesy of Brendan Gregg. His BPF book should be ready soon and there is much content on his web site. When I had to understand off-cpu stalls from IO and mutex contention there wasn't much available 10 years ago, thus PMP was born. While it served me well it is time to move on.

I will continue to blog including performance comparisons between database engines. I look forward to writing about MongoDB, especially WiredTiger internals, but I will exclude MongoDB from the performance comparisons on my blog.

I left FB in January and spent most of the year reading applied math books and being a caregiver. I appreciate the opportunity I had there and the people who helped me -- both managers and many awesome technical contributors. I wouldn't have made it here without Domas.

FB asked one trick question in my job interview -- do I want to be hands on or an architect. I gave the correct answer -- hands on. Small & fast-growing companies need people who are hands-on. More architects will be needed post-IPO when the company gets larger. I interviewed at FB because the MySQL team at Google was ended and I had to find a new project. That was a source of stress but it turned out OK as MySQL at FB was my new project.

MyRocks and RocksDB have been in capable hands for a long time, so I expect more good things from those teams.

The MySQL community has been wonderful to me. My career wouldn't have been possible without the contributions of so many people who made MySQL better. Fortunately, clever people arrive to replace the people who leave and MySQL will continue to improve. Perhaps one day upstream will get a write-optimized storage engine. Dare to dream.

Monday, November 25, 2019

Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:

For a given write rate stalls are more likely with a B-Tree than an LSM
Many RocksDB write stalls can be avoided via configuration
Write stalls with a B-Tree are smaller but more frequent versus an LSM
Write stalls are more likely when the redo log isn't forced on commit
The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
Life is easier but more expensive when the working set fits in cache
Less write amplification saves IO for other uses

Less short summary:

Write stalls for an LSM occur when compaction has trouble keeping up with the incoming write rate. The worst stalls occur at write rates that a B-Tree could not sustain. One way to mitigate stalls is to reduce the write rate. Another way is to use an index structure that doesn't support or is inefficient for range scans (see index+log).
The cost from configuring RocksDB to avoid write stalls is more CPU overhead on reads as there will be more data in the upper levels of the LSM. I am partly to blame for the default configuration in RocksDB that throttles writes when the LSM tree gets too much data in the L0, L1 and L2. But that configuration can be changed.
SQLite4 has a clever LSM designed for systems that don't allow background threads. It implements a pay as you go approach to durability debt. A traditional LSM takes the opposite approach - it defers the IO cost to the background. RocksDB has optional write throttling and work has been done to smooth the impact from it but it is not solved. A B-Tree in the worst-case (buffer pool full & mostly dirty, working set not cached) also implements pay as you go approach.
I almost always disable sync-on-commit for benchmarks because I want to observe how the DBMS observes under stress and less commit latency means more writes/second and more IO stress.
See item #6 where I argue that it is good to not have the working set cached.
A common rule of thumb has been to keep all indexes in cache or all of the working set in cache. That simplifies tuning and makes it easier to avoid performance problems. But that also might force a deployment to use 2X more HW than it needs because NAND flash SSDs are everywhere and the response time difference between reading from RAM and reading from NAND flash might not matter for many applications. But if you are using a DBMS in the cloud that charges by the IO, then keeping the working set in RAM might be a good idea.
An LSM usually has less write-amp than a B-Tree. So the IO capacity it saves from that can be used elsewhere to support more read or write transactions.

Worst case behavior

I am wary of faster is better. I prefer nuance but I also know that people don't have time to read long blog posts like this or long performance reports. Here I explain worst case behavior in terms of IO overheads. Worst case behavior isn't the only way to judge an index structure but it helps me to explain performance. Another way is to measure the average amount of IO per transaction (in operations and KB) and treat IO efficiency as important.

I describe worst case behavior for a write operation under a few scenarios. By worst case I mean the largest amount of IO done in the foreground (the thread handling the write) as that determines the response time. I ignore the work done in the background which favors an LSM because that defers more work to the background. For a B-Tree I ignore undo and page splits. The write is a SQL update which is read-modify-write, as opposed to a blind-write like a Put with RocksDB. Finally, I assume the update isn't to an indexed column. The scenarios are:

Cached, PK only - working set cached, PK index only
Not cached, PK only - working set not cached, PK index only
Cached, PK and secondary index - working set cached, PK and non-unique secondary index
Not cached, PK and secondary index - working set not cached, PK and non-unique secondary index

PK only

For the cached, PK only scenario neither an LSM nor a B-Tree do IO in the foreground with the exception of the redo log fsync. Stalls are unlikely for both but more likely with a B-Tree especially when the DBMS storage uses a spinning disk.

An LSM writes the redo log buffer, optionally syncs the redo log and then does an insert into the memtable. Both memtable flush and Ln:Ln+1 compaction are deferred to background threads. If memtable flush were too slow then there are write stalls until flush catches up to avoid too many memtables wasting memory.
A B-Tree modifies a page in the buffer pool, writes the redo log buffer and optionally syncs the redo log. If checkpoint were too slow a full redo log can't be rotated until checkpoint catches up and there are write stalls.

For the not cached, PK only scenario the work done in the foreground is 1 IO/update for an LSM and 2 IO/update for a B-Tree. Here a B-Tree uses a pay as you go model.

An LSM reads a page into the block cache and then repeats the work described in cached, PK only.
A B-Tree finds a dirty page to evict, writes that page back to storage, then reads the desired page into that slot in the buffer pool and repeats the work described in cached, PK only.

PK and secondary index

For the cached, PK and secondary index scenario there is approximately twice as much work to be done per update compared to the cached, PK only scenario. Thus stalls are more likely here. But other than the optional redo fsync there is no foreground IO for the LSM and B-Tree.

An LSM repeats the work explained in the cached, PK only scenario. For the secondary index it does an additional insert to the memtable which is also logged as redo. This can double the demand for compaction.
A B-Tree repeats the work explained in the cached, PK only scenario. For the secondary index it makes an additional page dirty in the buffer pool. This can double the demand for page write back.

For the not cached, PK and secondary index scenario the foreground IO difference between an LSM and B-Tree is more significant -- 1 IO for the LSM vs 4 IO for the B-Tree -- ignoring the redo log overhead. The IO difference is reduced from 1:4 to approximately 1:2 for a B-Tree like InnoDB that implements a change buffer.

An LSM does the union of the work described in not cached, PK only and cached, PK and secondary index scenarios. Ignoring the optional redo fsync the cost is 1 read IO for the PK index and no reads for the secondary index because non-unique secondary index maintenance is read-free.
A B-Tree repeats the work explained in the cached, PK only scenario but this is done for both the PK and secondary indexes. Thus the cost is 2 IOs to write back dirty pages and then 2 IOs to read pages from the PK and secondary indexes into the buffer pool and then make them dirty -- which then requires redo log writes. So the cost for this is 4 IOs ignoring the redo log.

Make writes fast: LSM

Writes can be fast with an LSM because most of the IO cost is deferred but that also increases the need to throttle writes. Life is good as long as that deferred cost can be repaid fast enough, otherwise there will be more response time variance.

Flush and compaction are the deferred cost for an LSM write. Flush means writing the memtable to an SST on storage. Compaction means merging SSTs to move flushed data from the root to leaf of the LSM tree. Compaction costs more than flush. RocksDB can stall writes when compaction doesn't keep up with ingest. Ingest creates durability debt, compaction reduces it and write stalls are there to bound the debt. Write stalls are enabled by default but can be disabled by configuration. Putting a bound on durability debt also puts a bound on read latency by reducing the number of SSTs that can exist in the L0, L1 and L2. So if you want to support extremely high write rates than choose one of: read stalls, write stalls.

Make writes fast: B-Tree

Writes can also be fast with a B-Tree as there are no page reads/writes to/form storage when the working set is cached and background page write back is fast enough. In that case the only IO work in the foreground is the optional redo log fsync.

Page write back is the primary deferred cost for a B-Tree write. Most of my B-Tree experience is with InnoDB which does fuzzy checkpoint. The goal is to flush dirty pages before the current redo log segment gets full. Using larger redo log segments lets InnoDB defer write back for a longer time increasing the chance that more transactions will modify the page -- reducing write amplification and helping performance.

Purge can be an additional deferred cost for a B-Tree write. I use the InnoDB name here as Postgres calls this vacuum. This is the process of reclaiming space from deleted rows that are no longer visible by open MVCC snapshots. The LSM equivalent of purge is checking the open snapshot list during compaction for KV pairs that are not the latest version of a given key to determine whether that version is still needed.

When write back and purge are fast enough then write stalls should be infrequent with a B-Tree. But write back isn't always fast enough. A B-Tree write stall occurs when a write transaction must read a page into the buffer pool prior to modifying that page but 1) the buffer pool is full and 2) write back must be done for a dirty page before the memory can be reused.

Other

A few other comments that didn't have a place above:

In this post I assume the B-Tree uses no-force, but there is at least one nice OSS B-Tree that uses force.
Making commit slower is another way to throttle writes and reduce the chance of stalled writes. Examples of this include redo log fsync, semisync or synchronous replication.
The InnoDB change buffer is a wonderful feature that reduces the IO overhead for write-heavy workloads.
NAND flash GC stalls are another source of write stalls. I wish more were written about this topic.
Stalls during TRIM when using an LSM with NAND flash are another source of stalls. I wish there were more TRIM benchmarks. Smart friends tell me that NAND flash devices vary widely in their ability to handle TRIM. And they display different stall behavior when their TRIM capacity has been exceeded. Some of us were spoiled by FusionIO.

Friday, November 15, 2019

The joy of database configuration

I am wary of papers with performance results for too many products.Too many means including results from systems for which you lack expertise. Wary means I have less faith in the comparison even when the ideas in the paper are awesome. I have expertise in MySQL, MongoDB, RocksDB, WiredTiger and InnoDB but even for them I have made and acknowledged ridiculous mistakes.

Database configuration is too hard. There are too many options, most of them aren't significant and the approach is bottom-up. I an expert on this -- in addition to years of tuning I have added more than a few options to RocksDB and MySQL.

This post was motivated by PostgreSQL. I want to run the insert benchmark for it and need a good configuration. I have nothing against PG with the exception of a few too many why not Postgres comments. The community is strong, docs are great and the product is still improving. But I think PostgreSQL configuration has room to improve -- just like RocksDB (here, here) and MySQL/InnoDB.

Too many options

A non-expert user lacks both the ability to choose good values for options and the ability to understand which options might be useful to set. My solution to too many options and most aren't significant is to use good defaults and split the option name space into two parts -- regular and expert. Regular options are set by most users because they matter for performance and don't have good default values. The amount of memory the DBMS can use is one such option - the default will be small.

Everything else is an expert option. These include options for which the default is great and options that rarely impact performance. There is a reason for expert options -- some workloads benefit from their existence and being able to set that option at runtime might avoid downtime. Options are also added early in the lifecycle of new features to allow developers to evaluate the new feature and choose good default values. But such options don't need to be exposed to all users.

The benefit from doing this is to avoid presenting a new user with tens or hundreds of options to consider. That is a lousy experience. And while X is too hard isn't always a valid complaint -- language (human and database query) is complex because they let us express complex idea -- I don't think we gain much from the current approach.

RocksDB has added functions that simplify configuration and even split the option namespace into two parts -- regular and advanced. This is a step in the right direction but I hope for more. I confirmed that most RocksDB options either have good defaults or aren't significant for my workloads and then published advice on tuning RocksDB.

The performance configurations I use for MongoDB/WiredTiger and MySQL/InnoDB are similar to my experience with RocksDB. I don't have to set too many options to get great performance. Alas, it took a long time to figure that out.

Top-down configuration

Top-down configuration is another approach that can help. The idea is simple - tell the DBMS about the hardware it can use and optionally state a few constraints.

The basic hardware configuration is empty which implies the DBMS gets everything it can find -- all memory, all CPU cores, all IO capacity. When a host does more than run a DBMS it should be easy to enforce that limit with one option for memory consumption, one for CPU, etc. The user shouldn't have to set ten options for ten different memory consumers. It is even worse when these limits are per instance -- limiting how much memory each sort buffer gets is a lousy way to manage total memory usage. IO capacity is interesting. AFAIK there was a tool included in RethinkDB that characterized IO capacity, PostgreSQL has a tool for fsync performance and we can't forget fio. But it is easy to be mislead about SSD performance.

The constraints cover things that are subjective. What is the max recovery time objective? How do you rank read, write, space and memory efficiency?

A great example of this is SQL Memory Management in Oracle 9i -- tell the DBMS how much memory it can use and let it figure out the best way to use it.

What about ML

I hope that ML makes it easier to discover the options that aren't significant and can be moved into the expert options namespace. But I prefer a solution with fewer tuning knobs, or at least fewer visible tuning knobs. I hope to avoid too many knobs (status quota) combined with ML. Lets make smarter database algorithms. If nothing else this should be a source of research funding, interesting PhDs and many papers worth reading.

Update

While I appreciate that someone made the MySQL memory calculator available I wish this weren't needed. Setting memory limits based on peak concurrency means you will under-allocate memory in the normal case or instead you can over-allocate at peak concurrency and get OOM.

Thursday, September 5, 2019

Adapting TPC-C for MongoDB - reviewing a VLDB paper

This is a review of Adapting TPC-C Benchmark to Measure Performance of Multi-Document Transactions in MongoDB which was published in VLDB 2019. I appreciate that MongoDB and Asya Kamsky took the time to get this published. That can be a weekend and nights project when in industry. I also appreciate that this not a benchmarketing effort. The purpose wasn't to overstate performance. The purpose was to show how to get good performance on a TPC-C like workload with MongoDB and realistic hardware and configurations. I hope for a similar effort on MongoDB with Linkbench.

My comments:

Work was done to reduce write-write conflicts which will be more likely given the extra commit latency from using w:majority writeConcern on a 3-node cluster. That work included 1) moving conflicting writes early in the transaction 2) moving writes before reads 3) using findAndModify instead of select/update and 4) batching writes. I wonder if non-stored procedures will be useful.
A small amount of denormalization was done by storing order lines in the order document. Denormalize everything isn't feasible here or in Linkbench because that leads to too-big documents.
Code and details were shared that will allow you to reproduce results.
w:majority was used on a 3-node cluster. The goal was to get realistic results, not a benchmarketing special.

I was confused by two things. First, section 3.3 states that majority write concern guarantees that a write is flushed to disk by any replica that ack'd the write. I thought this was determined by the value of the j option in writeConcern. Second, section 3.5.2 is about causal consistency and that (causal reads feature and logical clocks) seems like overkill when shards aren't used. If you want to avoid going back in time when moving from a primary to a secondary isn't it sufficient to remember the point-in-time at which primary queries are done? But maybe that is just a simple logical clock.

Wednesday, September 4, 2019

Tunable Consistency in MongoDB - reviewing a VLDB paper

This is a review of Tunable Consistency in MongoDB from VLDB 2019. It is worth reading and I appreciate that MongoDB has recently published several conference papers. I am not an expert on this topic. For expertise see Daniel Abadi, Kyle Kingsbury and Peter Bailis. Henrik can be added to the list with a few more blog posts.

MongoDB vs MySQL

MongoDB is a NoSQL DBMS that makes it easy to run sharded replicasets. While the NoSQL part of it is nice the sharded replicaset part of it is amazing. I hope that MySQL eventually gets similar support for sharded replicasets including readConcern and writeConcern options.

I previously compared MySQL semisync replication with MongoDB. With MongoDB the behavior for reads can be tuned separate from writes while MySQL combines them. In the MySQL implementation for lossless semisync a write is not locally visible until a replica acks. In the MongoDB implementation the write is committed locally and then replicated. The replicas apply writes as soon as possible without waiting for a distributed commit protocol. The key to making all of this work is controlling read visibility via an MVCC database engine courtesy of point-in-time reads.

The Review

Highlights:

With many options comes a need for more documentation and tutorials. While I appreciate splitting read and write semantics into separate options, the feature space is large and figuring this out is harder.
The paper states that the gold standard is linearizability. Well, this is a DBMS paper so maybe that should be changed to serializability.
I was confused by the difference between majority and linearizable reads. AFAIK snapshot reads are similar to linearizable (both wait for confirmation that the master really is the master) while majority reads don't have that extra wait.
I was confused by "The transaction commit operation accepts a write concern, which determines ... and its constituent read and write operations" because commit occurs after the reads so how could it effect them. As the paper promises, that is explained later.
MongoDB is a great case study in adding stronger consistency to an async replication system. It continues to use async replication, yet it now provides stronger consistency on demand.
I think that propagated means applied in the description of the w option for writeConcern. This means that a replica acks after applying a change -- either before or after making redo durable depending on the j option. AFAIK the more common option is to ack after shipping the change to the replicas log but before applying the change. However, I prefer what MongoDB does. Maybe commenters will correct my perception of what is more common.
To reduce write-write conflicts MongoDB uses the latest locally committed transaction as the point-in-time to do reads for read/write operations that use w:majority and for multi-statement transactions that use snapshots. The alternative was to use the older latest majority commit point-in-time. See section 5 from the paper. Therefore there is a wait before returning to the user for that locally committed timestamp to be committed to a majority of the replica set. This is true even for read-only transactions. So MongoDB can make reads wait. Obviously it can make writes wait before returning to the user doing the write for w:majority. An excellent CockroachDB blog post explains that it too can make reads wait while Spanner can make writes wait.
Consistency is tunable in MongoDB. With writeConcern you can determine whether a write might wait. With readConcern you can determine whether a read might wait.
Some of the wait for reads is to confirm that the master from which the read has been done is still the master. I wonder if a master lease could have been used to avoid that wait at the cost of making failover slower. Which cost do you prefer?
Replication remains async. Writes are committed locally (and visible to others with the appropriate readConcern options) regardless of the write concern. These are shipped to replicas ASAP and applied by replicas ASAP. This means that a replica can apply a change that has to be undone during master failover because it was never committed to a majority. MongoDB has two ways to rollback that replica -- rollback WiredTiger to an older point in time or undo the extra replication events. Rollback sounds easy while undo is complicated.
Strongly consistent writes don't delay weakly consistent writes. A strongly consistent write is done locally then releases row locks then waits for replicas to ack.
MongoDB doesn't like long running transactions because WiredTiger must keep undo in memory to satisfy all active transactions. MongoDB kills snapshot reads that have been running longer than one minute. For non-snapshot reads it can advance the snapshot in the middle of the read operation. One side-effect of this is that you will have a hard time implementing a consistent logical backup tool like mysqldump.

Thursday, June 20, 2019

Open Core Summit and OSS glue code

I am optimistic about the Open Core Summit. It can be something that benefits users, startups, the source-available community and the OSS crowd. The summit has many of the important people from the open core community. It is an opportunity for them to collaborate -- form a foundation, reduce open core license proliferation, discuss the next license to be reviewed by OSI and most importantly -- fix the open core marketing approach.

I appreciate that startups put so much time and money into building interesting system software that is shared either as OSS or source-available. I haven't enjoyed much of the marketing about the challenges that cloud creates for VC-funded OSS. I am sure cloud makes it harder but the approach has been too negative with too much snark directed towards OSI and AWS. There is a better way.

Glue code

It is all about the glue code.

Stateful services are a hard problem. It is not trivial to scale MySQL by enabling a small DBA team to support a growing database deployment. I am sure the same is true for other DBMS. This is done with some changes to MySQL and a huge amount of glue code. While the GPL doesn't require sharing of the diffs to MySQL my teams have always been happy to do that. But the glue code is the secret sauce and neither the GPL nor the AGPL require it to be shared. The SSPL would change that, although I am wary of the SSPL given uncertainty about how far down the stack the sharing requirement extends.

While the glue code is the secret sauce I wonder whether it has any value outside of a web-scale company.

The glue code isn't portable as it depends on other systems internal to a web-scale company.
Documentation for internal systems is frequently not good and success depends on the shared knowledge of the current teams. Even with access to the internal dependencies the glue code is unusable without the team that knows how to use it.

Therefore I am more interested in OSS glue code that is useful to a community and less interested in licenses that force the (unusable to me) glue code to be published. The glue code should assume environments open to the public -- public clouds and k8s.

What is the state of OSS glue code for MySQL? MySQL needs it to remain competitive. Much of the glue code is baked into MongoDB so that it is easier to scale MongoDB clusters even if you aren't an Atlas customer.

Thursday, June 13, 2019

Interesting indexing in Rockset and MongoDB

I was lazy today and asked about new indexing features in Rockset and MongoDB. They share a valuable goal which is better indexing for the document data model (think less schema, not schema-less). How do you index documents when you don't know all of the attributes that will be used? MongoDB now supports this via a wildcard index and Rockset via converged indexing.

Wildcard indexing in MongoDB lets you specify that an index should be maintained on all, most or some attributes in a document. By most I mean there are options to exclude attributes from a wildcard index. By some I mean there are options to limit this to attributes that start with certain prefixes. Read the docs for more.

Converged indexing in Rockset indexes all attributes. There are no options to include or exclude attributes. This makes the product easier to use at the cost of more IO for index maintenance and more storage for larger indexes. Note that Rockset uses the RocksDB LSM which reduces the cost of index maintenance and might also use the excellent ZStandard compression.

Wildcard and converged indexes do not support compound indexes. For the document { a:1, b:2 } there will be two index entries: a=1 and b=2. There is no way to get an index entry for (a=1, b=2) or (b=2, a=1). If you want a compound index with MongoDB the existing index features can be used. See below (editorial 1) for compound indexes and Rockset.

Implementation details

This section is an educated guess. I don't know enough MongoDB and Rockset internals to claim this with certainty. I ignore the complexity of support for datatypes. In the ideal world all values can be compared via memcmp.

For a traditional index limited to a specified attribute the index entries are of the form (value, pointer) where pointer points to the row and can be the primary key value or (file name, file offset).

This is more interesting for wildcard/converged indexes. I assume that the attribute name is the leading field in each index entry so that the entry is of the form (attribute name, value, pointer). The common way to use such an index is to have an equality predicate on attribute name which is satisfied when the index is queried with predicates like attributeName relOp value. Examples of such predicates are a=2, a>2 and a<=2.

A smart person (Dr Pavlo) mentioned the use of skip scan for these indexes. That could be used to query the index and find documents with any attribute equal to a specific value. That is a less likely use case but still interesting.

Wildcard/converged indexes aren't free. Putting the attribute name in every index entry makes index entries larger and consume more space in memory and on storage. Block compression reduces some of this overhead. Index prefix compression in WiredTiger and RocksDB also helps but at the cost of more CPU overhead.

Storage differences

Up to now I have been describing the search index. In this section I will describe the document storage.

MongoDB stores the document via the storage engine which will soon be WiredTiger only although I hope MongoRocks returns. I assume that WiredTiger with MongoDB is row-wise so that each document is (usually) a contiguous sequence of bytes on some disk page.

Rockset stores each document twice -- row-wise and column-wise. Alas, this gets complicated. The row-wise format is not the traditional approach with one thing in the storage engine per document. Instead there is one thing per attribute per document. This is similar to the CockroachDB approach. I prefer to still call this row-wise given that attributes from a document will be co-located in the LSM SSTs. I am also grateful for the many great blog posts from CockroachDB that I can reference.

With two copies of each document in the base storage there is more storage overhead. Fortunately that overhead is reduced courtesy of the write efficiency and compression friendliness of an LSM.

The Rockset blog post does a great job of explaining this with pictures. I do a worse job here without pictures. For the document { pk:1, a:7, b:3 } when the primary key is pk then the keys for row-wise are R.1.a and R.1.b and for column-wise are C.a.1 and C.b.1. The row-wise format clusters all attributes for a given document. The column-wise format clusters all values across documents for a given attribute. The row-wise format is efficient when most attributes for a document must be retrieved. The column-wise format is efficient for analytics when a given attribute across all documents must be retrieved.

Editorial 1

I interpret the MongoDB docs to mean that when a query uses a wildcard index it cannot use any other index and the wildcard index will only be used for a predicate on a single attribute. I expect that to limit the utility of wildcard indexes. I also expect MongoDB to fix that given how fast they reduce their tech debt. The limitations are listed below. The 1st won't be fixed. The 2nd and 3rd can be fixed.

Compound wildcard indexes are not supported
MongoDB cannot use a non-wildcard index to satisfy one part of a query predicate and a wildcard index to satisfy another.
MongoDB cannot use one wildcard index to satisfy one part of a query predicate and another wildcard index to satisfy another.

I assume that Rockset can combine indexes during query evaluation given their focus on analytics. Thanks to the Rockset team I learned it supports index intersection. It also supports composite indexes via field mappings (functional indexes).

Editorial 2

An open question is whether an LSM can do clever things to support analytics. There has been some work to show the compression benefit from using column-wise storage within a RocksDB SST for the larger levels of the RocksDB LSM. Alas, the key RocksDB workloads have been OLTP. With the arrival of Rockset there is more reason to reconsider this work. There can be benefits in the compression ratio and reduced overhead during query processing. Vertica showed that it was useful to combine a write-optimized store for recent writes with a read-optimized store for older writes. An LSM already structures levels by write recency. Perhaps it is time to make the larger levels read-optimized especially when column-wise data is inserted to the LSM.

Update - read paper years ago then forgot that Kudu combines LSM + columnar.

Editorial 3

The previous section is mostly about being clever when storing column-wise data in an LSM to get better compression and use less CPU during query evaluation. This section is about being clever when storing the search index.

The search index is likely to have many entries for some values a given attribute. Can an LSM be enhanced to take advantage of that for analytics workloads? In other storage engines there are two approaches -- bitmap indexes and RID-lists. Adapting these for an LSM is non-trivial but not impossible. It is likely that such an adaptation would only be done for the larger levels of the LSM tree.