Monday, May 5, 2014

Overhead of reading the Nth attribute in a document for MongoDB & TokuMX

I used the MongoDB sysbench client to measure the overhead from skipping N attributes in a document to access the N+1th attribute for N set to 1, 10, 100 and 1000. As expected there is overhead as N grows, for small N the overhead is less for TokuMX and for large N the overhead is less for MongoDB.

My test used 1 collection with 1M documents and 1 client thread. The sysbench client was changed to add extra attributes immediately after the _id attribute. The extra attributes used names of the form x${K} for K in 1 to 1, 1 to 10, 1 to 100 and 1 to 1000. Each of these attributes were assigned the value of K (x1=1, x2=2, ...). Immediately after these another attribute was added with the name y and assigned the same value as the _id attribute. Each test query matches one document and returns the values of the _id and y attributes after matching the _id and y attributes. The purpose of this test is to determine the overhead from skipping the extra attributes to find the y attribute. For N=1 the results and test was kind-of similar to what I did for a recent cached & read-only test.

This is data for the chart above. Note that TokuMX does much better for small N but MongoDB does much better for large N. I repeated the tests with the y attribute before the x${K} attributes and the results are about the same as below so my assumption was wrong about the overhead for attribute searches and the real overhead is from BSON parsing. Note that I ran extra tests for 200, 400, 500 and 750 attributes to understand the rate at which QPS decreases.

queries per second by #attributes
    1     10    100   200   400   500   750  1000   number of x${K} attributes
14326  13643  10317  8638  6543  5832  4515  3306   tokumx141
 9866   9398   8473  8305  7242  6903  6164  4954   mongo249
 9119   9530   8547  8613  7890  7577  6602  5289   mongo260

Overhead for 1000 attributes

I looked at top when the tests were running and for all cases mongod was using ~90% of 1 CPU core and Java was using ~10 of another. So MongoDB was using CPU at the same rate as TokuMX while serving many more QPS. I then used the Linux perf utility to understand what was using CPU in mongod. The CPU overhead for TokuMX is from calls to free & malloc. If I had an easy way to get hierarchical profiles I would include that, alas I don't have that. I do have the -g option with perf but the output isn't very interesting. It would be nice if TokuMX figured out how to get more debug symbols into their binary without expecting users to download/install a separate debug package.
This is the output from "perf -g" for TokuMX 1.4.1


    14.56%   mongod  mongod                 [.] free                             
             |
             --- free
                |          
                |--91.43%-- 0x7fe27c1c09d8
                |          (nil)
                |          
                |--7.66%-- (nil)
                 --0.91%-- [...]

    12.71%   mongod  mongod                 [.] malloc                                                                               
             |
             --- malloc
                |          
                |--76.17%-- mongo::ElementMatcher::~ElementMatcher()
                |          (nil)
                |          
                |--16.90%-- operator new(unsigned long)
                |          0x100000000
                |          0x3031317473657462
                |          
                |--4.22%-- (nil)
                |          |          
                |           --100.00%-- 0xffb
                |          
                |--1.37%-- 0xffffffff00000000
                |          0x3031317473657462
                |          
                 --1.34%-- mongo::BSONObj::init


And this is the non-hierarchical output from Linux perf during a 5 second interval of the test.

TokuMX 141
    15.83%   mongod  mongod                 [.] free
    11.64%   mongod  mongod                 [.] malloc
    11.23%   mongod  mongod                 [.] mongo::Projection::transform
     6.42%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_disposeERKSaIcE.part.13
     6.40%   mongod  mongod                 [.] std::_Rb_tree<>::find
     4.81%   mongod  libc-2.5.so            [.] strlen
     4.01%   mongod  libstdc++.so.6.0.17    [.] std::basic_string::basic_string
     3.49%   mongod  libstdc++.so.6.0.17    [.] char* std::string::_S_construct
     2.43%   mongod  libc-2.5.so            [.] memcmp
     2.38%   mongod  libc-2.5.so            [.] memcpy
     2.00%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     1.93%   mongod  mongod                 [.] mongo::Projection::append
     1.27%   mongod  libstdc++.so.6.0.17    [.] std::string::_Rep::_S_create

MongoDB 2.4.9
    13.78%   mongod  mongod               [.] mongo::Projection::transform
    11.75%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     7.95%   mongod  libc-2.13.so         [.] __strlen_sse42
     7.28%   mongod  mongod               [.] mongo::Projection::append
     4.21%   mongod  mongod               [.] (anonymous namespace)::cpp_alloc
     3.86%   mongod  mongod               [.] mongo::BSONElement::size()
     2.07%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback
     1.18%   mongod  mongod               [.] mongo::KeyV1::toBson()
     0.92%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     0.86%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     0.71%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.70%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     0.69%   mongod  mongod               [.] tc_malloc
     0.68%   mongod  libc-2.13.so         [.] __memcpy_ssse3

MongoDB 2.6.0
    14.28%   mongod  mongod               [.] mongo::ProjectionStage::transform
    10.03%   mongod  mongod               [.] MurmurHash3_x64_128
     4.45%   mongod  mongod               [.] mongo::BSONElement::size()
     3.93%   mongod  libc-2.13.so         [.] __strlen_sse42
     2.83%   mongod  mongod               [.] _ZNKSt3tr110_HashtableIN5mongo10StringDataES2_SaIS2_ESt9_IdentityIS2_ESt8equal_toIS2_ENS2_6HasherENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE12_M_find_nodeEPNS9_10_Hash_nodeIS2_Lb0EEERKS2_m.isra.162.constprop.255
     2.65%   mongod  mongod               [.] operator new(unsigned long)
     1.98%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.74%   mongod  mongod               [.] mongo::StringData::Hasher::operator()
     1.62%   mongod  mongod               [.] operator delete(void*)
     1.46%   mongod  mongod               [.] mongo::KeyV1::toBson()
     1.20%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.06%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     0.83%   mongod  mongod               [.] tc_malloc
     0.82%   mongod  mongod               [.] mongo::ps::Rolling::access
     0.78%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock

Overhead for 10 attributes

This lists the CPU profile from Linux perf during 5 seconds of the test. Unlike the 1000 attribute result above, here malloc/free or their equivalent are the top two sources for TokuMX and MongoDB.


TokuMX 1.4.1
     4.53%   mongod  mongod                 [.] malloc
     4.12%   mongod  mongod                 [.] free
     2.89%   mongod  libtokufractaltree.so  [.] _ZNK4toku3omtIP13klpair_structS2_Lb0EE24find_internal_plus_arrayIR9ft_searchXa
dL_Z15wrappy_fun_findIS6_XadL_ZL23heaviside_from_search_tRK10__toku_db
     1.80%   mongod  libc-2.5.so            [.] strlen
     1.53%   mongod  mongod                 [.] mongo::storage::KeyV1::woCompare
     1.14%   mongod  libpthread-2.5.so      [.] pthread_mutex_lock
     1.02%   mongod  libc-2.5.so            [.] memcpy
     0.99%   mongod  mongod                 [.] mongo::Projection::transform
     0.96%   mongod  libstdc++.so.6.0.17    [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  libstdc++.so.6.0.17    [.] operator new(unsigned long)
     0.86%   mongod  mongod                 [.] mongo::Projection::init
     0.85%   mongod  libtokufractaltree.so  [.] _Z26toku_ft_search_which_childP17__toku_descriptorPFiP9__toku_dbPK10__toku_dbt
S5_EP6ftnodeP9ft_search
     0.85%   mongod  libc-2.5.so            [.] memcmp

MongoDB 2.4.9
     7.47%   mongod  mongod               [.] cpp_alloc
     3.94%   mongod  mongod               [.] _ZN12_GLOBAL__N_121do_free_with_callback    
     2.24%   mongod  mongod               [.] mongo::KeyV1::toBson()
     2.21%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.58%   mongod  mongod               [.] boost::detail::shared_count::~shared_count()
     1.51%   mongod  mongod               [.] mongo::BSONElement::size()
     1.48%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.40%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.39%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_disposeERKSaIcE
     1.22%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()
     1.22%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.19%   mongod  mongod               [.] tc_malloc
     0.99%   mongod  mongod               [.] operator new(unsigned long)
     0.95%   mongod  mongod               [.] boost::intrusive_ptr::~intrusive_ptr()
     0.94%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     0.89%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.77%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::basic_string
     0.69%   mongod  mongod               [.] mongo::BtreeBucket::customBSONCmp
     0.68%   mongod  mongod               [.] mongo::UnorderedFastKeyTable::Area::find
     0.64%   mongod  [ip_tables]          [k] ipt_do_table
     0.64%   mongod  mongod               [.] CoveredIndexMatcher::matches

MongoDB 2.6.0
     4.26%   mongod  mongod               [.] operator new(unsigned long)
     2.86%   mongod  mongod               [.] operator delete(void*)
     2.70%   mongod  mongod               [.] mongo::KeyV1::toBson() const
     1.62%   mongod  mongod               [.] mongo::BucketBasics::KeyNode::KeyNode
     1.57%   mongod  libc-2.13.so         [.] __memcpy_ssse3
     1.56%   mongod  libc-2.13.so         [.] __strlen_sse42
     1.14%   mongod  libc-2.13.so         [.] vfprintf
     1.13%   mongod  libpthread-2.13.so   [.] pthread_mutex_lock
     1.13%   mongod  libc-2.13.so         [.] __memcmp_sse4_1
     1.06%   mongod  mongod               [.] mongo::BSONElement::size()
     0.92%   mongod  libstdc++.so.6.0.16  [.] _ZNSs4_Rep10_M_dispose
     0.87%   mongod  mongod               [.] tc_malloc
     0.84%   mongod  libstdc++.so.6.0.16  [.] std::basic_string::~basic_string()  

10 comments:

  1. I like this test because it is something probably nobody has published anything on before :-)

    Are you sure this measures what you think it measures? I would expect the bson parsing code to be exactly the same in both mongo and toku. If I read your notes correctly, it seems the documents also grow for each test (ie the first test has 3 fields, the last test has 1002 fields?) If that is the case, I would interpret this to say that with 1000 fields - regardless of which fields you are fetching - you see overhead from toku de-compression (or something else, like poorer cache hit rate, probably combined with de-compression...) whereas mongodb degrades slower.

    ReplyDelete
    Replies
    1. Everything is in cache -- OS filesystem cache for MongoDB, TokuMX buffer pool. For TokuMX documents are uncompressed when in the buffer pool. Repeating test with "y" attribute added before all of the x${K} attributes. Expected result in that case is better QPS as K grows.

      Delete
    2. But I like your point that there is more overhead from BSON parsing when there are more attributes. I hadn't considered that and the overhead from that usually isn't mentioned (or is conveniently ignored) when extolling the virtues of the document data model.

      Delete
    3. Well, the document data model is great. But it's well known that it's certainly possible to optimize the "bson encoding" from what it is today. It's just good enough for most use cases, but adding 1k fields on a single level will slow you down. (The workaround is to group things into sub-levels, essentially creating your own b-tree...)

      Anyway, what you wrote now as your interpretation would require you to fetch the first (and tenth) field in a document which has 1002 fields. Then you'd see the difference between overhead from seeking vs overhead from the document size.

      Delete
    4. Are you sure that is the workaround?

      Delete
    5. Let me elaborate... The street smart wisdom claims that it is the number of keys on the same level that makes seeking slow (it's a linear search). So documents like yours with 1000 keys together, or an array of 1000 elements. However, if you'd group them into groups of 10, then you only seek 10 keys per level = max 30 keys to find what you want.

      Ie instead of accessing { x123 : 1 } you would get the same value much more efficiently from { x1.y2.z3 : 1 }.

      Delete
    6. Is this something you know to be true because you tested it (like I do here) or just something people think is true?

      Delete
    7. It's something people think is true, which is why I phrased it as "word on the street". However, if you read bsonspec.org, you will see that the bson structure itself makes the belief credible. Basically, each field, including a list or subdocument, starts with its length.

      I have not personally verified that the libbson in fact takes advantage of this opportunity to skip unneeded sub-documents, but I find it's likely.

      Delete
  2. Btw, these results are not "kind of like" your previous read-only test. In the previous test mongo24 was clearly faster, here you start with toku leading for small documents. The difference seems large enough to be significant. (which is surprising)

    ReplyDelete
  3. If the overhead is in malloc, it might be interesting to benchmark with documents that have a single large attribute so that the total document size matched the N=1000 document size -- that might help isolate whether it's actually the BSON parsing or just the total document size stressing the allocator in some way.

    ReplyDelete

RocksDB on a big server: LRU vs hyperclock, v2

This post show that RocksDB has gotten much faster over time for the read-heavy benchmarks that I use. I recently shared results from a lar...