Monday, May 5, 2014

Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

I used the MongoDB sysbench client to measure the overhead from skipping N attributes in a document to access the (N+1)th attribute, for N set to 1, 10, 100 and 1000. As expected, the overhead grows with N; for small N the overhead is less for TokuMX and for large N the overhead is less for MongoDB.

My test used 1 collection with 1M documents and 1 client thread. The sysbench client was changed to add extra attributes immediately after the _id attribute. The extra attributes used names of the form x${K}, for K in 1 to 1, 1 to 10, 1 to 100 and 1 to 1000. Each of these attributes was assigned the value of K (x1=1, x2=2, ...). Immediately after these, another attribute named y was added and assigned the same value as the _id attribute. Each test query matches one document on the _id and y attributes and returns the values of the _id and y attributes. The purpose of this test is to determine the overhead from skipping the extra attributes to find the y attribute. For N=1 the test and results were kind-of similar to a recent cached & read-only test.
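
For reference, this is roughly what a document and the test query look like in the mongo shell for N=10 (a sketch only -- the collection name here is just for illustration, not the one used by the sysbench client):

    // document shape for N=10: the extra attributes x1..x10 come right after _id
    // and y gets the same value as _id
    db.sbtest1.insert({
      _id: 42,
      x1: 1, x2: 2, x3: 3, x4: 4, x5: 5,
      x6: 6, x7: 7, x8: 8, x9: 9, x10: 10,
      y: 42
    })

    // the test query: match on _id and y, return only _id and y
    db.sbtest1.find({ _id: 42, y: 42 }, { _id: 1, y: 1 })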

This is the data for the chart above. Note that TokuMX does much better for small N while MongoDB does much better for large N. I repeated the tests with the y attribute placed before the x${K} attributes and the results were about the same as those below, so my assumption about the cost of the attribute search was wrong -- the real overhead is from BSON parsing. Note that I ran extra tests for 200, 400, 500 and 750 attributes to understand the rate at which QPS decreases. A sketch of the two document layouts follows the table.

queries per second by #attributes
    1     10    100   200   400   500   750  1000   number of x${K} attributes
14326  13643  10317  8638  6543  5832  4515  3306   tokumx141
 9866   9398   8473  8305  7242  6903  6164  4954   mongo249
 9119   9530   8547  8613  7890  7577  6602  5289   mongo260
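
This is a minimal mongo shell sketch of that layout experiment (collection names are just for illustration): the same query shape runs against a document with y after the x${K} attributes and against one with y right after _id.

    // build one document with y after the x attributes (the original layout)
    // and one with y right after _id (the variant layout)
    var N = 1000;
    var docYLast = { _id: 1 };
    var docYFirst = { _id: 2, y: 2 };
    for (var k = 1; k <= N; k++) {
      docYLast["x" + k] = k;
      docYFirst["x" + k] = k;
    }
    docYLast.y = 1;
    db.sbtest_ylast.insert(docYLast);
    db.sbtest_yfirst.insert(docYFirst);

    // the same query shape against both layouts; QPS was about the same for both,
    // which points at parsing the whole BSON document rather than the linear
    // search for the y attribute
    db.sbtest_ylast.find({ _id: 1, y: 1 }, { _id: 1, y: 1 })
    db.sbtest_yfirst.find({ _id: 2, y: 2 }, { _id: 1, y: 1 })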

Overhead for 1000 attributes

I looked at top while the tests were running and in all cases mongod was using ~90% of 1 CPU core and the Java client was using ~10% of another. So MongoDB was using CPU at the same rate as TokuMX while serving many more QPS. I then used the Linux perf utility to understand what was using CPU in mongod. The CPU overhead for TokuMX is from calls to free & malloc. If I had an easy way to get hierarchical profiles I would include them, alas I don't. I do have the -g option with perf, but the output isn't very interesting. It would be nice if TokuMX figured out how to get more debug symbols into their binary without expecting users to download/install a separate debug package.
This is the output from "perf -g" for TokuMX 1.4.1


    14.56%   mongod  mongod                 [.] free                             
             |
             --- free
                |          
                |--91.43%-- 0x7fe27c1c09d8
                |          (nil)
                |          
                |--7.66%-- (nil)
                 --0.91%-- [...]

    12.71%   mongod  mongod                 [.] malloc                                                                               
             |
             --- malloc
                |          
                |--76.17%-- mongo::ElementMatcher::~ElementMatcher()
                |          (nil)
                |          
                |--16.90%-- operator new(unsigned long)
                |          0x100000000
                |          0x3031317473657462
                |          
                |--4.22%-- (nil)
                |          |          
                |           --100.00%-- 0xffb
                |          
                |--1.37%-- 0xffffffff00000000
                |          0x3031317473657462
                |          
                 --1.34%-- mongo::BSONObj::init


And this is the non-hierarchical output from Linux perf during a 5-second interval of the test.

TokuMX 1.4.1
    15.83%   mongod  mongod                 [.] free
    11.64%   mongod  mongod                 [.] malloc
    11.23%   mongod  mongod                 [.] mongo::Projection::transform
     6.42%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_disposeERKSaIcE.part.13
     6.40%   mongod  mongod                 [.] std::_Rb_tree<>::find
     4.81%   mongod  libc-2.5.so            [.] strlen
     4.01%   mongod  libstdc++.so.6.0.17    [.] std::basic_string::basic_string
     3.49%   mongod  libstdc++.so.6.0.17    [.] char* std::string::_S_construct
     2.43%   mongod  libc-2.5.so            [.] memcmp
     2.38%   mongod  libc-2.5.so            [.] memcpy
     2.00%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     1.93%   mongod  mongod                 [.] mongo::Projection::append
     1.27%   mongod  libstdc++.so.6.0.17    [.] std::string::_Rep::_S_create

MongoDB 2.4.9
    13.78%   mongod  mongod               [.] mongo::Projection::transform
    11.75%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     7.95%   mongod  libc-2.13.so         [.] __strlen_sse42
     7.28%   mongod  mongod               [.] mongo::Projection::append
     4.21%   mongod  mongod               [.] (anonymous namespace)::cpp_alloc
     3.86%   mongod  mongod               [.] mongo::BSONElement::size()
     2.07%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback
     1.18%   mongod  mongod               [.] mongo::KeyV1::toBson()
     0.92%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     0.86%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     0.71%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.70%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     0.69%   mongod  mongod               [.] tc_malloc
     0.68%   mongod  libc-2.13.so         [.] __memcpy_ssse3

MongoDB 2.6.0
    14.28%   mongod  mongod               [.] mongo::ProjectionStage::transform
    10.03%   mongod  mongod               [.] MurmurHash3_x64_128
     4.45%   mongod  mongod               [.] mongo::BSONElement::size()
     3.93%   mongod  libc-2.13.so         [.] __strlen_sse42
     2.83%   mongod  mongod               [.] _ZNKSt3tr110_HashtableIN5mongo10StringDataES2_SaIS2_ESt9_IdentityIS2_ESt8equal_toIS2_ENS2_6HasherENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE12_M_find_nodeEPNS9_10_Hash_nodeIS2_Lb0EEERKS2_m.isra.162.constprop.255
     2.65%   mongod  mongod               [.] operator new(unsigned long)
     1.98%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.74%   mongod  mongod               [.] mongo::StringData::Hasher::operator()
     1.62%   mongod  mongod               [.] operator delete(void*)
     1.46%   mongod  mongod               [.] mongo::KeyV1::toBson()
     1.20%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.06%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     0.83%   mongod  mongod               [.] tc_malloc
     0.82%   mongod  mongod               [.] mongo::ps::Rolling::access
     0.78%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock

Overhead for 10 attributes

This lists the CPU profile from Linux perf during 5 seconds of the test. Unlike the 1000 attribute result above, where the projection code (mongo::Projection::transform, mongo::ProjectionStage::transform) was the largest consumer for MongoDB, here malloc/free or their equivalents are the top two sources of CPU for TokuMX and for both MongoDB releases.


TokuMX 1.4.1
     4.53%   mongod  mongod                 [.] malloc
     4.12%   mongod  mongod                 [.] free
     2.89%   mongod  libtokufractaltree.so  [.] _ZNK4toku3omtIP13klpair_structS2_Lb0EE24find_internal_plus_arrayIR9ft_searchXadL_Z15wrappy_fun_findIS6_XadL_ZL23heaviside_from_search_tRK10__toku_db
     1.80%   mongod  libc-2.5.so            [.] strlen
     1.53%   mongod  mongod                 [.] mongo::storage::KeyV1::woCompare
     1.14%   mongod  libpthread-2.5.so      [.] pthread_mutex_lock
     1.02%   mongod  libc-2.5.so            [.] memcpy
     0.99%   mongod  mongod                 [.] mongo::Projection::transform
     0.96%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     0.86%   mongod  mongod                 [.] mongo::Projection::init
     0.85%   mongod  libtokufractaltree.so  [.] _Z26toku_ft_search_which_childP17__toku_descriptorPFiP9__toku_dbPK10__toku_dbtS5_EP6ftnodeP9ft_search
     0.85%   mongod  libc-2.5.so            [.] memcmp

MongoDB 2.4.9
     7.47%   mongod  mongod               [.] cpp_alloc
     3.94%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback    
     2.24%   mongod  mongod               [.] mongo::KeyV1::toBson()
     2.21%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.58%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     1.51%   mongod  mongod               [.] mongo::BSONElement::size()
     1.48%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.40%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.39%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_disposeERKSaIcE
     1.22%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     1.22%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.19%   mongod  mongod               [.] tc_malloc
     0.99%   mongod  mongod               [.] operator new(unsigned long)
     0.95%   mongod  mongod               [.] boost::intrusive_ptr::~intrusive_ptr()
     0.94%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.89%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.77%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.69%   mongod  mongod               [.] mongo::BtreeBucket::customBSONCmp
     0.68%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     0.64%   mongod  [ip_tables]          [k] ipt_do_table
     0.64%   mongod  mongod               [.] CoveredIndexMatcher::matches

MongoDB 2.6.0
     4.26%   mongod  mongod               [.] operator new(unsigned long)
     2.86%   mongod  mongod               [.] operator delete(void*)
     2.70%   mongod  mongod               [.] mongo::KeyV1::toBson() const
     1.62%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.57%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.56%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.14%   mongod  libc-2.13.so         [.] vfprintf
     1.13%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     1.13%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.06%   mongod  mongod               [.] mongo::BSONElement::size()
     0.92%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  mongod               [.] tc_malloc
     0.84%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()  

10 comments:

  1. I like this test because it is something probably nobody has published anything on before :-)

    Are you sure this measures what you think it measures? I would expect the bson parsing code to be exactly the same in both mongo and toku. If I read your notes correctly, it seems the documents also grow for each test (ie the first test has 3 fields, the last test has 1002 fields?) If that is the case, I would interpret this to say that with 1000 fields - regardless of which fields you are fetching - you see overhead from toku de-compression (or something else, like poorer cache hit rate, probably combined with de-compression...) whereas mongodb degrades slower.

    Replies
    1. Everything is in cache -- the OS filesystem cache for MongoDB, the TokuMX buffer pool for TokuMX. For TokuMX, documents are uncompressed when in the buffer pool. I am repeating the test with the "y" attribute added before all of the x${K} attributes. If the overhead is from the attribute search, the expected result in that case is better QPS as K grows.

    2. But I like your point that there is more overhead from BSON parsing when there are more attributes. I hadn't considered that and the overhead from that usually isn't mentioned (or is conveniently ignored) when extolling the virtues of the document data model.

    3. Well, the document data model is great. But it's well known that it's certainly possible to optimize the "bson encoding" from what it is today. It's just good enough for most use cases, but adding 1k fields on a single level will slow you down. (The workaround is to group things into sub-levels, essentially creating your own b-tree...)

      Anyway, checking what you wrote now as your interpretation would require you to fetch the first (and tenth) field of a document which has 1002 fields. Then you'd see the difference between the overhead from seeking and the overhead from the document size.

    4. Are you sure that is the workaround?

    5. Let me elaborate... The street smart wisdom claims that it is the number of keys on the same level that makes seeking slow (it's a linear search). So documents like yours with 1000 keys together, or an array of 1000 elements. However, if you'd group them into groups of 10, then you only seek 10 keys per level = max 30 keys to find what you want.

      I.e. instead of accessing { x123 : 1 } you would get the same value much more efficiently from { x1.y2.z3 : 1 }.

    6. Is this something you know to be true because you tested it (like I do here) or just something people think is true?

    7. It's something people think is true, which is why I phrased it as "word on the street". However, if you read bsonspec.org, you will see that the bson structure itself makes the belief credible. Basically, each field, including a list or subdocument, starts with its length.

      I have not personally verified that libbson in fact takes advantage of this opportunity to skip unneeded sub-documents, but I find it likely.

  2. Btw, these results are not "kind of like" your previous read-only test. In the previous test mongo24 was clearly faster; here you start with toku leading for small documents. The difference seems large enough to be significant (which is surprising).

  3. If the overhead is in malloc, it might be interesting to benchmark with documents that have a single large attribute so that the total document size matches the N=1000 document size -- that might help isolate whether it's actually the BSON parsing or just the total document size stressing the allocator in some way.
