Tuesday, February 18, 2020

New school vs old school DBMS

There are two approaches to DBMS deployment
  1. Old school - pay whatever it takes to keep that one instance running
  2. New school - allow for failure and make failover reliable
At what point do you stop paying to increase availability and durability of one DBMS instance and instead spend that money and energy elsewhere? The old school approach was more popular in the days before cloud and web-scale. Of course early in the cloud and web-scale days we allowed for failure without making failover reliable. Those weren't fun times.

That y must design for failures also doesn't mean you must tolerate lousy hardware. Just because failover should be fast, reliable and lossless doesn't mean you want it running too frequently. There is a difference between commodity and cheap hardware. On the other hand, asking for too much special hardware won't win you friends on the datacenter HW team. If the DBMS is the only thing asking for redundant power, redundant network, special cooling or HW RAID with battery-backed write cache then the DBMS is a problem for web-scale. I know this from experience.

One reason for the old school approach was the large cost of the DBMS licenses and the SMP server on which it ran. Users are motivated to not buy 3X the licenses when the cost is sufficiently large and instead invest in an environment that helps that DBMS keep running (more HA hardware). Any comparison between new school and old school should consider whether the goal is durability or availability because availability costs more than durability. You can try to provide one or both. In this post I am writing about something that needs both.

Storage is one place where you can spend a lot of money to keep that one instance running. That can be done via RAID-10 where the impact is buying twice more storage devices. Or it can be done via a HA storage solution that provides impressive levels of performance, availability and durability.

One thing I am curious about is whether SSD devices need RAID-10. All storage devices have some chance of failure so the question isn't whether RAID-10 is useful. The question is whether the usefulness outweighs the cost. I assume the answer will depend on the device. Samsung advertises fail in place (FIP) as a new feature. But chip fail protection and wear leveling have been here for a long time. For which devices is that sufficient so that RAID-10 isn't recommended? Do vendors make this clear in their docs?

New School

The new school approach must do two things: make durable commits fast, make failover fast and lossless.

Fast durable commits are usually done via sync replication. I think of sync log shipping as a variant of sync replication where the end point is a log archive rather than a replica. But some sync replication solutions already support some of that via a witness.

MongoDB provides fast durable commits via async replication. I was impressed when I first learned of the implementation. The property needed for durable commits is to avoid making writes visible before they are durable. Sync replication is an implementation detail. With MongoDB and the majority read concern a snapshot on the primary is advanced to track the point-in-time at which commits are durable (applied on enough replicas).

Whether durable commits are fast depends on where you place the replicas. Speed of light matters. When replicas are far apart then there will be more commit latency. When replicas are not far apart there might be a larger HW bill. Witnesses (log only replicas) can help here.

Lossless failover assumes a solution for durable commit. Once durable commit has been provided then fast failover is a matter of detecting failures, electing a new primary, promoting the replica to be a primary and then directing traffic to the new primary. There are many details here and plenty of opportunities for mistakes. I know from having helped make some of them. Fortunately the rise of web-scale DBMS means that we get solutions that work.

Things that I ignore in this post:
  • Fast failure is a challenging problem depending on how fast you want it to be. 
  • Systems that allow multiple replicas to initiate writes might not expose failover to a client, but many of the problems described here are still solved under the covers by such systems.
  • Even for systems that don't have explicit failover there is still an impact on clients from failed in-progress transactions. Although Comdb2 hides that from clients.

Friday, February 14, 2020

Describing replication

There is an opportunity for confusion when describing replication including physical vs logical and synchronous vs asynchronous. I prefer describing replication as physical, statement or document/row rather than logical vs physical and briefly define them below.

In physical replication the log records the changes to the database pages (change bytes starting at offset X on page Y from this sequence to that sequence). A benefit of this approach is that the database pages will be the same between primary and replica. This approach also avoids the overhead (more CPU than IO) of evaluating the queries that lead to the page changes. A problem from this approach is that the pages need to be the same between the primary and replica. Changes to pages must be deterministic, long running reporting queries on a replica might block replication apply, MVCC GC requires coordination, etc.

In statement based replication (SBR) the log records the statements that modified the database. The benefit from this approach is that it is easier to implement. When one statement changes many rows then this also reduces the size of the log. But for OLTP statements are less likely to change many rows. Unfortunately long running write statements on the primary will be repeated on the slave and that can waste CPU and cause replication lag. It is also difficult to detect dependencies between transactions which makes it harder to replay in parallel on a replica. I did web-scale MySQL for many years with SBR. It worked, but the move to RBR is a good thing.

In document/row replication (aka RBR or DBR) the log has one entry per changed doc/row. This avoids the need for the replica to reevaluate the statement that generated the changes -- it can just fetch the docs/rows by ID and apply changes to them. When the doc/row entries in the log also include PK (_id) values then it is possible to determine which transactions can be replayed in parallel. It is possible to make these changes idempotent in contrast to SBR which reduces the burden of making replay crash safe.

MongoDB oplog

This is an example of the oplog contents for MongoDB. This is document based based on the terminology above:

db.bar.insertMany([{a:1}, {a:2}, {a:3}])

use local
db.oplog.rs.find({"o.msg": {$ne: "periodic noop"}},
                 {op:1, o:1}).sort({$natural:-1}).limit(3)

{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 3 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 2 } }
{ "op" : "i", "o" : { "_id" : ObjectId("..."), "a" : 1 } }

Benchmarks vs real workloads

How do you test a DBMS that you are building for internal usage or the world? This is a good question whether you are the upstream DBMS or maintaining a downstream fork. The choices include real workloads, synthetic benchmarks and throw it over the wall.

There is a great paper from Microsoft on doing live capture & replay at scale for real customers. I hope to add a link for a talk on the live capture & replay that was done for testing at Facebook.

Real workloads

Real workloads are great. Alas real workloads are hard to get (code IP, data privacy). Even if these are resolved a bigger problem is that real workloads are complex and buggy (all software is buggy). If you get a real workload without a team to support it then you will spend a lot of time debugging it and otherwise trying to figure it out. This is probably more true for OLTP than analytics.

Real workloads v2

Capture & replay (C&R) is an easier way to get some of the goodness of a real workload with less effort. This avoids the overhead of dealing with the code from the real workload. This doesn't avoid issues with data privacy.

C&R can be online or offline. Online C&R does the replay online (immediately after the capture). Online C&R is a great choice when you have significant internal workloads that use the DBMS you are building. Running a DBMS in production is a great education for database developers in so many ways. This is one more perk.

Offline C&R archives the captured workload for replay at a later time. This is useful for workloads that can't be shadowed (customer workloads). This is also useful when you want a workload that can be repeated (replayed many times) as online C&R does not allow for that. An interesting problem for offline C&R is making the replay realistic -- respecting the timing and concurrency that was recorded. This is more challenging when the DBMS being tested is a lot faster than the DBMS used for the capture.

One question for C&R is the point at which the capture is done. When capture is done at the DBMS then you don't miss anything. But sometimes it is easier to do the capture at the application tier.

The Facebook MySQL team had a great setup for online C&R via a shadow tier. I don't have much experience with offline C&R but I know it has been useful for others. In my Google MySQL years I built a tool to sample production queries (SELECT, not writes) and replay them against replicas running new and old binaries to compare performance and correctness. That was a fun and useful project for me. It was mostly online C&R. One interesting bug it found was a result of changing the internal datatype for aggregation between decimal and IEEE754.

Synthetic benchmarks

Benchmarks have a bad reputation but I like them. I even enjoy benchmarketing but only when I am not involved. I rarely run benchmarks in isolation. Absolute values (QPS, TPS) don't help me. I am a big fan of relative comparisons (new binary vs old binary, this DBMS vs that DBMS).

Comparisons require sufficient expertise in all of the systems that are tested as the goal is to get results that can be explained and trusted. Expertise is hard to find. Therefore I have less faith in benchmark results that test more than one DBMS -- whether this is work done by myself or others. Perhaps one day someone will do a study and provide a function that estimates truthiness as a function of the number of DBMS tested.

Throw it over the wall

This has to be done at some point in time, otherwise you are building a DBMS that nobody uses. Hopefully the steps above have been done to increase the chance that there won't be too much drama at this point.

Thursday, February 13, 2020

Short guide to MongoDB monitoring

This is short and incomplete. But it is a good start. This is written from the perspective of someone who spends all of their time trying to explain MongoDB performance for benchmarks. I have plenty of experience with databases in production. I have no experience with MongoDB in production.

Notes:
  • One day I hope for the equivalent of user/table statistics
  • For profiling use db.setProfilingLevel and db.getProfilingStatus.
  • To enable debug level messages in the diagnostic log use db.setLogLevel and db.getLogComponents. However I think it is best to start with non-debug messages so you can ignore these for a while.
  • I wish the set and get names above were symmetric. Just remember the set*Level method names and don't worry about the get* names. 
  • db.setProfilingLevel also determines what can get written to the diagnostic log. The important parameter is slowms. My advice is to set level=0 and slowms to something > 0 and use the diagnostic log. 
  • Learn how to use mtools to process entries from the diagnostic log. The tools are great and the docs are awesome.
  • The diagnostic log entries for COMMAND and TXN are the most likely entries you will examine. These lines can include many performance metrics but they are only printed when the values are > 0. Examples are below.
  • Also see db.stats, db.serverStatus, db.collStatsdb.collection.statsdb.collection.latencyStats, connPoolStats, replSetGetStatus,
  • Consider saving the FTDC files from the diagnostic.data directory if you have problems. There are some tools for accessing the data from an FTDC file.
  • Use explain to understand query plans and execution
  • Use mongostat and mongotop to see performance from a high level before drilling down
  • I don't have much experience with them yet but MongoDB has additional tools and services for monitoring.
Diagnostic log examples
These show some of the counters that can be included in diagnostic log entries. Remember that many of the counters are only printed when > 0.
Metrics on writes (bytesWritten, timeWritingMicros)
2020-02-04T18:05:24.787-0800 I COMMAND [conn41] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3139, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868324, 23435), signature: { hash: BinData(0, 9E42A21BD71EA0BB8DFEAD52013CFFF325E60BB6), keyId: 6789772850304647170 } }, lsid: { id: UUID("5d900961-93cf-4f7a-9426-940a37a71eca") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesWritten: 324324, timeWritingMicros: 656 } } protocol:op_msg 401ms
Metrics on reads (bytesRead, timeReadingMicros)
2020-02-04T18:08:52.442-0800 I COMMAND [conn33] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 3719, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580868532, 2858), signature: { hash: BinData(0, D289506AFBD0E8C8E8184112608CFED62E9A1B7D), keyId: 6789772850304647170 } }, lsid: { id: UUID("7391b47c-fd95-491b-a640-20c953bcacc0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1369, timeReadingMicros: 10 } } protocol:op_msg 389ms

timeWaitingMicros is less frequent. In this case it was ~2.2 seconds. That seems like a lot but the operation wrote more than 4M of data.
2020-02-04T18:22:05.668-0800 I COMMAND [conn46] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 144, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580869323, 2352), signature: { hash: BinData(0, 2CF49EC657832ECB9D7DD0AC5C65E5C3F80CA10B), keyId: 6789781594858061826 } }, lsid: { id: UUID("d88d7d65-f293-4060-a894-17c58123d787") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{ data: { bytesRead: 1881, bytesWritten: 4458441, timeReadingMicros: 15, timeWritingMicros: 5071 }, timeWaitingMicros: { cache: 2215267 } } protocol:op_msg 2621ms
timeAcquiringMicros for locks is also less frequent. I have a few examples from insert stress tests. I have yet to learn whether mongod times all lock waits. It isn't easy to do that without an impact on performance.
2020-02-04T17:52:55.133-0800 I COMMAND [conn43] command linkdb0.linktable command: insert { insert: "linktable", ordered: false, writeConcern: { w: 1, j: false }, txnNumber: 1023, $db: "linkdb0", $clusterTime: { clusterTime: Timestamp(1580867574, 46329), signature: { hash: BinData(0, D158AF9490A8ABAEA2969E809D2C27FE57157B41), keyId: 6789772850304647170 } }, lsid: { id: UUID("fabaaa7b-c2c4-4712-ae16-f2c0f58a03a0") } } ninserted:1024 keysInserted:1024 numYields:0 reslen:230 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2048 } }, ReplicationStateTransition: { acquireCount: { w: 2049 } }, Global: { acquireCount: { w: 2048 } }, Database: { acquireCount: { w: 2048 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 14042 } }, Collection: { acquireCount: { w: 2048 } }, Mutex: { acquireCount: { r: 3072 } } } flowControl:{ acquireCount: 1024 } storage:{} protocol:op_msg 376ms
Commit appears in the log via TXN

2020-02-05T17:31:54.792-0800 I TXN [conn81] transaction parameters:{ lsid: { id: UUID("cae8daad-5603-4a41-87db-3fda07e90bee"), uid: BinData(0, 6399AB0DAC62F20BFC466753B10FB58FB7E692BEC952C69B84D997021794D1F8) }, txnNumber: 255, autocommit: false, readConcern: { level: "snapshot", afterClusterTime: Timestamp(1580952714, 1391) } }, readTimestamp:Timestamp(0, 0), ninserted:10 keysInserted:20 terminationCause:committed timeActiveMicros:604 timeInactiveMicros:1016 numYields:0 locks:{ ReplicationStateTransition: { acquireCount: { w: 5 } }, Global: { acquireCount: { r: 3, w: 1 } }, Database: { acquireCount: { r: 2, w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 23 } }, oplog: { acquireCount: { r: 2 } } } storage:{} wasPrepared:0, 1ms

Wednesday, February 12, 2020

How long should you wait for a JVM JIT before attaching a CPU profiler?

I haven't done serious Java development since 2000. I am happy for comments that help me learn more as a lot has changed since then. I was recently confronted with Java that used too much CPU time (Linkbench client for an OSS DBMS). I started with jPMP and the results were useful. I then found VisualVM. It has a CPU sampler and profiler. The sampler uses wall clock time while the profiler uses CPU time.

How much time does the JIT need to do its magic before I attach a CPU profiler? It depends. This article is great but if some decisions depend on the number of calls to a method then you might want to wait for that method to be called enough times. Thus it depends. If the method is called once per database query then this depends on the query response time.

I added -XX:+PrintCompilation to the Linkbench client command line to observe the JIT in action. It looks like most optimization is done within 3 minutes for a workload doing ~1000 QPS based on the output and that seems to match what the article states about compilation thresholds and tiered compilation.

Be careful when comparing profile output between two runs. If method A accounts for 10% of time with server 1 and 20% of time with server 2 that doesn't always mean method A is a problem for server 2. Assume that QPS is 3X better with server 2, then normalize CPU overheads by QPS (CPU / QPS) and the method A / server 2 combination doesn't look so bad.

Tuesday, February 11, 2020

Websites that report your browsing to Facebook

There are many things I like about the Facebook app but recording my browsing history via Off Facebook Activity isn't one of them. I know that Google and Microsoft also have the ability to collect my browsing history but I assume they do it directly via their browsers (Edge, Chrome) and I can easily disable that by using Firefox or logging out when using their browser. With Facebook I don't have that option. I can and do opt-out but most users won't.

Facebook doesn't sell your data, data brokers do that. That is worthy of attention but off topic here.

Facebook has collected call logs in the past. It still collects contacts, as do many other apps. This rant is only about the collection of browsing history. I think it is creepy. I hope one day that it becomes illegal. All of this data (call logs, address book, browsing history) is high-risk as in there are external parties (lawyers, insurance companies, government agencies) that would love to access that data.

I get why websites send data back to Facebook (FB). It helps them buy ads targeting FB users who visited their website. I get why FB does this. It helps them sell targeted ads. Neither of these benefit the user.

I don't want websites for hospitals and disease advocacy to report visits to FB. But some do. I hope the law changes to prevent this but GDPR and CCPA have not stopped it.

Who does this?

The short answer is many but not all web sites. I did an experiment and visited web sites that are relevant to me to see who reported my visits to FB. I was happy to learn that some web sites did not.

Some results are ironic.

Some results are ironic, but less than above. Progressive advocacy websites that report my visits to FB include actblue.com, emilyslist.orgcitizensforethics.orgmotherjones.com, hrw.org, amnesty.org, greenpeace.org, democrats.org, prochoiceamerica.org, Planned Parenthood Action Fund and thinkprogress.org. I am happy, but not surprised, to learn that eff.org doesn't report visits to FB.

Some results made me sad. Web sites that provide health care services should not report my visits to FB. I prefer that disease advocacy web sites not report visits. The sites listed here are relevant to me. I have visited in the past to make donations and because I was a caregiver for someone with cancer. All of them reported my visits to FB:
I won't list them here but many addiction recovery businesses report visits to FB and so do advocacy web sites for other diseases.

Friday, February 7, 2020

Podcasts I love

There are so many amazing history podcasts. This is a list of some that I have enjoyed and will listen to again. The content is amazing and free but you can also support the creators via Patreon or their web sites.

There is not enough time to listen to all of the great podcasts. The creators have so much talent. I prefer podcasts done by one person with content that provides a big picture so I don't get lost in the details. I am not picky about the part of the world covered or time time period as long as the content is engaging. I am picky about the presentation -- voice and editing.

I still use the Apple Podcasts app on my iPhone. Maybe other apps are better but I am used to it. I gave up using my laptop. Syncing my phone with the laptop was a hassle and It is easy to (re)download content from iTunes. I subscribe and then use custom settings to play from oldest to newest and don't delete after playing (because I frequently repeat episodes). I manually delete episodes otherwise when a podcast has 100+ episodes the Podcast app requires too much scrolling.

So much work goes into these podcasts but when finished there is something that brings much joy to many people, something that will be here for a long time.

The list:
I have a vague memory of listening to a podcast on the history of Korea but I can't find it. I think I read a book on the topic as well so it all blends together at this point. The stories about the naval battles, maybe Admiral Yi Sun-Shin, remain with me.

Evaluating vector indexes in MariaDB and pgvector: part 2

This post has results from the ann-benchmarks with the   fashion-mnist-784-euclidean  dataset for MariaDB and Postgres (pgvector) with conc...