Showing posts from October, 2015

Losing it?

Many years ago the MySQL team at Google implemented semi-sync replication, courtesy of Wei Li. The use case for it was limited and the community was disappointed that it did not provide the semantics they wanted. Eventually group commit was implemented for binlog+InnoDB: first by the MySQL team at Facebook, then much better by MariaDB, and finally in upstream MySQL. With group commit some magic was added to give us lossless semisync, and now we have automated, lossless and fast failover in MySQL without using extra replicas. This feature is a big deal. I look forward to solutions for MariaDB (via binlog server and MaxScale) and MySQL (via Fabric and Proxy). MongoDB is ahead of MySQL in features that make scale-out easier to manage, and the next release (3.2) adds a few more features and more robust code for existing features. I hope that some of the features that have been sacrificed in the name of scale-out will eventually arrive in a MongoDB release: per-shard transactions, per-s…

Problems not worth fixing

I worked on a closed-source DBMS years ago and the development cycle was 1) code for 6 months, 2) debug for 18 months. Part 2 was longer than part 1 because customers had high expectations for quality and the product delivered on that. But it was also longer because some co-workers might not have been code complete after 6 months and were quietly extending part 1 into part 2. I fixed many bugs during part 2. One that I remember was in the optimizer. Someone added code to detect and remove duplicate expressions (A and A and B --> A and B). Alas, the algorithm was O(N*N). Analytics queries can have complex WHERE clauses, and all queries were forced to pay the O(N*N) cost whether or not they had any duplicate expressions. The overhead for this feature was reasonable after I fixed the code, but it raises an interesting question: are some problems not worth the cost of prevention? Several times a year I see feature requests that make me think of this. Forgot to add the bug th…
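To make the optimizer example concrete, here is a minimal sketch of the fix, not the actual optimizer code: deduplicating the conjuncts of a WHERE clause in O(N) with a hash set instead of O(N*N) pairwise comparison. The string-keyed expression representation is an assumption for illustration.

// Hypothetical sketch: remove duplicate conjuncts from a WHERE clause.
// Assumes each expression can be keyed by a canonical string form.
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> dedup_conjuncts(const std::vector<std::string>& conjuncts) {
    std::unordered_set<std::string> seen;  // expressions already kept
    std::vector<std::string> result;
    for (const std::string& expr : conjuncts) {
        // insert() returns {iterator, bool}; the bool is true for a new
        // entry, so each conjunct costs O(1) expected and the scan is
        // O(N) total instead of comparing every pair.
        if (seen.insert(expr).second)
            result.push_back(expr);
    }
    return result;  // e.g. {"A", "A", "B"} -> {"A", "B"}
}

With this shape, queries that have no duplicates pay only a hash probe per conjunct rather than the full pairwise cost.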

MyRocks versus allocators: glibc, tcmalloc, jemalloc

I used a host with Fedora 19 and Linkbench to determine the ability of memory allocators to avoid fragmentation with MyRocks (MySQL+RocksDB). RocksDB can create pressure on an allocator because allocations for the block cache can have a short lifetime: memory is allocated on page read and released when a page is evicted from the cache. The allocation is the size of the decompressed page and sometimes the size of the compressed page. RocksDB puts more pressure on the allocator than InnoDB because with InnoDB the buffer pool allocation is done once at process start. After the Linkbench load finished I ran the query test for 24 hours and report the values of VSZ and RSS from ps at the end. From the results below, jemalloc and tcmalloc do much better at avoiding fragmentation (the value of RSS) compared to glibc 2.17. I am still adjusting to larger values of VSZ with jemalloc. It isn't a problem, but it takes time to accept.

VSZ       RSS       allocator
22246784  10062100  jemalloc
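For reference, a minimal sketch of where these numbers come from: ps reads them from /proc, where VmSize is VSZ and VmRSS is RSS. This is illustrative Linux-only code, not part of the benchmark harness.

// Sketch: print VSZ and RSS for the current process on Linux,
// the same counters ps reports (requires /proc to be mounted).
#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        // VmSize is the virtual size (VSZ); VmRSS is resident memory (RSS).
        // A large gap between them with a stable RSS is reserved address
        // space, not fragmentation.
        if (line.rfind("VmSize:", 0) == 0 || line.rfind("VmRSS:", 0) == 0)
            std::cout << line << '\n';
    }
    return 0;
}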

Wanted: a file system on which InnoDB transparent page compression works

I work on MyRocks, which is a MySQL storage engine with RocksDB. This is an alternative to InnoDB. It might be good news for MyRocks if transparent page compression is the future of InnoDB compression. I got more feedback from the MySQL team that, despite the problems I have reported, transparent page compression works; I was just testing the wrong systems. So I asked core developers from Btrfs whether it was OK to do a punch hole per write and they politely told me to go away. Bcachefs might be a great way to add online compression to a b-tree without doing a punch hole per write, but it is not ready for production. Someone from MySQL suggested I try ext4, so I set up two servers with Ubuntu 14.04, which is on the list of supported systems. I used XFS on one and ext4 on the other. XFS was still a problem and ext4 was just as bad. The problem is that the unlink() system call takes a ridiculous amount of time after a multi-GB file has been subject to many punch-hole writes. By ridiculous…
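For context, this is the hole-punch pattern in question, sketched under assumptions (the path, page size, and compressed length are illustrative): after writing a compressed page, transparent page compression asks the file system to release the unused tail of the page via fallocate().

// Sketch of punch-hole-per-write (Linux; error handling omitted).
// The path, page size, and compressed length below are illustrative.
#include <fcntl.h>
#include <unistd.h>

int main() {
    const off_t page_size = 16 * 1024;      // logical page size
    const off_t page_offset = 0;            // page position in the file
    const off_t compressed_len = 5 * 1024;  // bytes the page shrank to
    char page[16 * 1024] = {};              // stand-in compressed payload

    int fd = open("/data/example.ibd", O_RDWR);  // hypothetical file
    if (fd < 0) return 1;
    pwrite(fd, page, compressed_len, page_offset);

    // Release the unused tail of the page back to the file system.
    // FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
              page_offset + compressed_len, page_size - compressed_len);
    close(fd);
    return 0;
}

Doing this on every page write is what builds up the per-file extent bookkeeping that makes the later unlink() so slow on the file systems tested here.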