Comments on Small Datum: Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

hingo (2014-05-05 12:38):
I like this test because it is something probably nobody has published anything on before :-)

Are you sure this measures what you think it measures? I would expect the BSON parsing code to be exactly the same in both mongo and toku. If I read your notes correctly, it seems the documents also grow for each test (i.e. the first test has 3 fields, the last test has 1002 fields?). If that is the case, I would interpret this to say that with 1000 fields -- regardless of which fields you are fetching -- you see overhead from toku decompression (or something else, like a poorer cache hit rate, probably combined with decompression...) whereas mongodb degrades more slowly.

Mark Callaghan (2014-05-05 12:50):
Everything is in cache -- the OS filesystem cache for MongoDB, the buffer pool for TokuMX. For TokuMX, documents are uncompressed when in the buffer pool. I am repeating the test with a "y" attribute added before all of the x${K} attributes. The expected result in that case is better QPS as K grows.

Mark Callaghan (2014-05-05 12:56):
But I like your point that there is more overhead from BSON parsing when there are more attributes. I hadn't considered that, and the overhead from it usually isn't mentioned (or is conveniently ignored) when extolling the virtues of the document data model.

hingo (2014-05-05 13:08):
Well, the document data model is great. But it's well known that it is certainly possible to optimize the "bson encoding" from what it is today. It's just good enough for most use cases, but adding 1k fields on a single level will slow you down. (The workaround is to group things into sub-levels, essentially creating your own b-tree...)

Anyway, testing the interpretation you wrote would require fetching the first (and tenth) field from a document which has 1002 fields. Then you'd see the difference between the overhead from seeking and the overhead from the document size.

hingo (2014-05-05 13:16):
Btw, these results are not "kind of like" your previous read-only test. In the previous test mongo24 was clearly faster; here you start with toku leading for small documents. The difference seems large enough to be significant (which is surprising).

Mark Callaghan (2014-05-05 14:02):
Are you sure that is the workaround?

hingo (2014-05-06 01:16):
Let me elaborate... The street-smart wisdom claims that it is the number of keys on the same level that makes seeking slow (it's a linear search). So documents like yours with 1000 keys together, or an array of 1000 elements. However, if you group them into groups of 10, then you only seek through 10 keys per level, at most 30 keys in total, to find what you want.

I.e. instead of accessing { x123 : 1 } you would get the same value much more efficiently from { x1.y2.z3 : 1 }.
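
A minimal sketch of the flat vs. grouped layouts described above, assuming pymongo; the collection name and the calls in the trailing comments are illustrative, not the benchmark code from the post.

    # Illustrative sketch only -- not the benchmark code from the post.
    # Flat layout: 1000 keys at one level, so reading x999 can mean a
    # linear scan over up to 1000 BSON fields.
    flat = {"x%d" % i: 1 for i in range(1000)}

    # Grouped layout: 10 x 10 x 10 nesting, so reaching the same value
    # means scanning at most ~10 keys per level, ~30 keys in total.
    nested = {
        "x%d" % i: {
            "y%d" % j: {"z%d" % k: 1 for k in range(10)} for j in range(10)
        }
        for i in range(10)
    }

    # Hypothetical pymongo usage (collection name is made up):
    # from pymongo import MongoClient
    # coll = MongoClient().test.nth_attr
    # coll.insert_one(dict(flat, _id=1))
    # coll.insert_one(dict(nested, _id=2))
    # coll.find_one({"_id": 1}, {"x123": 1})      # flat: scan up to ~1000 keys
    # coll.find_one({"_id": 2}, {"x1.y2.z3": 1})  # nested: scan ~30 keys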
Mark Callaghan (2014-05-06 04:36):
Is this something you know to be true because you tested it (like I do here), or just something people think is true?

nelhage (2014-05-06 11:24):
If the overhead is in malloc, it might be interesting to benchmark with documents that have a single large attribute so that the total document size matches the N=1000 document size -- that might help isolate whether it's actually the BSON parsing or just the total document size stressing the allocator in some way.

hingo (2014-05-09 05:22):
It's something people think is true, which is why I phrased it as "word on the street". However, if you read bsonspec.org, you will see that the BSON structure itself makes the belief credible. Basically, each field, including a list or subdocument, starts with its length.

I have not personally verified that libbson in fact takes advantage of this opportunity to skip unneeded sub-documents, but I find it likely.
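
A rough sketch of the skip those length prefixes make possible, assuming pymongo's bson module (bson.encode in newer pymongo; older versions expose bson.BSON.encode) and the element type codes from bsonspec.org. It is an illustration, not libbson.

    # Illustrative only: walk a BSON document and jump over an embedded
    # document using its int32 length prefix (type codes per bsonspec.org).
    # Assumes pymongo's bson module with bson.encode (pymongo >= 3.9).
    import struct
    import bson

    doc = {"a": {"k%d" % i: i for i in range(100)}, "b": 1}
    data = bson.encode(doc)

    offset = 4                      # skip the outer document's int32 length
    while data[offset] != 0x00:     # 0x00 terminates the element list
        el_type = data[offset]
        offset += 1
        key_end = data.index(b"\x00", offset)
        key = data[offset:key_end].decode()
        offset = key_end + 1
        if el_type == 0x03:         # embedded document: value starts with its own int32 length
            sub_len = struct.unpack_from("<i", data, offset)[0]
            offset += sub_len       # skip the whole subdocument without parsing it
            print("skipped subdocument %r (%d bytes)" % (key, sub_len))
        elif el_type == 0x10:       # int32 value
            offset += 4
            print("read int32 field %r" % key)
        else:
            break                   # other element types omitted from this sketch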