Wednesday, September 11, 2019

Optimal Tuning for an LSM

For a few years I have wanted to use math to determine optimal tuning for an LSM given a specific workload and constraints on efficiency. First I had to (re)learn math. I like calculus now while I didn't enjoy it in high school -- just like spinach.

By specific workload I mean the fraction of operations that are inserts, point queries and range reads. Note that if the workload has no range reads then an LSM might not be the best choice. By constraints on efficiency I mean the max write and space amplification that would be tolerated.

I decided to try constrained optimization and Lagrange Multipliers where the objective function was the average number of CPU comparisons per operation and the decision variables were the per-level fanouts for the LSM tree. In this case I am only optimizing for the CPU overhead, but I have to start somewhere. The eventual goal is to include CPU, memory and storage in the cost function.

I previously used Lagrange Multipliers to show how to set the per-level fanout to minimize write amplification. That result depends on assumptions about the per-level write amplification, and the dependency is a new result, but the conclusion is the familiar one: the per-level fanout should be the same for all levels. Here I show that CPU overhead can be reduced by using larger per-level fanouts for larger levels.

Summary:
  • Books with titles like Math for Economists are a great resource for learning math
  • Choosing a good value for the number of levels in the LSM tree is more significant than figuring out the optimal per-level fanouts.
  • Take time to understand when a problem is, or is not, worth optimizing

Is this worth solving?

To minimize the CPU overhead of a workload it is much more important to choose the number of levels in the LSM tree and the per-level compaction algorithms (tiered, leveled, leveled-N) than it is to optimize the per-level fanout as I describe here. While the math I describe below isn't the best way to optimize an LSM, the math was interesting to me so this blog post lets me revisit my work.

In the cost function below the CPU cost of a read, Cr, is dominated by the size of the memtable. The per-level fanout values, f1 and f2, won't have a significant impact on the value of Cr, so it isn't clear that it is worth figuring out the optimal values for the per-level fanout. To see why they are not significant, assume that B=2^20 while f1 and f2 are likely to be <= 32. Then the contribution to Cr from B is 60 compares while the contribution from f1 and f2 is at most 15. For Ci, the cost of an insert, the contribution from B is likely to be similar to the contribution from f1 and f2.
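That arithmetic as a quick Matlab sketch (the variable names are mine; Cr and Ci are defined in the next section):

% contributions to Cr and Ci, assuming B = 2^20 and f1 = f2 = 32
B = 2^20; f1 = 32; f2 = 32;
Cr_from_B = 3*log2(B)               % 60 compares from the memtable size
Cr_from_f = 2*log2(f1) + log2(f2)   % at most 15 compares from the fanouts
Ci_from_B = log2(B)                 % 20 compares for the memtable insert
Ci_from_f = f1 + f2                 % 64 here, closer to Ci_from_B when fanouts are ~10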

The problem

The problem below has a nonlinear objective function, nonlinear equality constraint and simple inequality constraints.

Assume:
* LSM with a memtable, L1 and L2 where there are B entries in the memtable, f1 is the fanout from the memtable to L1 and f2 is the fanout from L1 to L2. There are B * f1 * f2 rows in the LSM and that total is the K used in the constraint below.

* The cost of a read is Cr and the cost of an insert is Ci, both counted as the number of compares per operation. A bloom filter is not used and Ci includes the compares for the memtable insert and for compaction.
* The fractions of reads and inserts in the workload are Pr and Pi where Pr + Pi = 1
* The value for X is somewhat arbitrary although X should be >= 2. B should be an integer but I ignore that for now.


Minimize Pr * Cr + Pi * Ci
such that B * f1 * f2 = K
and B >= X, B <= mK where m is between 0 and 1
and f1 >= 2 and f2 >= 2

# compares per point-read without a bloom filter
# assume key is found in max level of LSM tree
Cr = memtable-cmp + l1-cmp + l2-cmp

Cr = log2(B) + log2(B*f1) + log2(B*f1*f2)
# Cr can also be expressed as
Cr = 3*log2(B) + 2*log2(f1) + log2(f2)

# compares per insert
Ci = memtable-insert-cmp + compaction-cmp
Ci = log2(B) + f1 + f2
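
As a sanity check, the cost model above can be written as Matlab anonymous functions. This is just a sketch with names I chose:

% compares per point-read and per insert from the formulas above
Cr = @(B, f1, f2) 3*log2(B) + 2*log2(f1) + log2(f2);
Ci = @(B, f1, f2) log2(B) + f1 + f2;
% objective: average compares per operation for a given read/insert mix
cost = @(B, f1, f2, Pr, Pi) Pr * Cr(B, f1, f2) + Pi * Ci(B, f1, f2);

% example: 1M entries in the memtable, fanout of 10 per level, 90% reads
cost(2^20, 10, 10, 0.9, 0.1)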


Step 1

First, write the L function that includes the constraints:

# Note that I wrote u in place of lambda
L = Cr * Pr + Ci * Pi + u(K - B*f1*f2)

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
dL/df2 = Pr / (f2 * ln(2)) + Pi - u*B*f1
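
The partial derivatives can be double checked with the Symbolic Math Toolbox, assuming it is installed (a sketch; u again stands in for lambda):

syms B f1 f2 u Pr Pi K positive
Cr = 3*log2(B) + 2*log2(f1) + log2(f2);
Ci = log2(B) + f1 + f2;
L = Pr*Cr + Pi*Ci + u*(K - B*f1*f2);
simplify(diff(L, B))    % should equal (3*Pr + Pi)/(B*ln(2)) - u*f1*f2
simplify(diff(L, f1))   % should equal (2*Pr)/(f1*ln(2)) + Pi - u*B*f2
simplify(diff(L, f2))   % should equal Pr/(f2*ln(2)) + Pi - u*B*f1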

Then realize that the inequality constraints make this a lot harder to solve and that the complexity gets worse with more levels in the LSM tree. Then admit that you haven't done all of the work to confirm that the math techniques can be used on this problem (constraint qualifications, convexity of the objective function).

Step 2

Without the constraints the goal is to find the min and max values for the objective function by finding values for which the partial derivatives (dL/dB, dL/df1 and dL/df2) are zero. I will do that here and then stop. Note that when the partial derivatives are zero the solution might violate the constraints and thus not be valid. Also note that when the partial derivatives are zero the extremum can be a min or a max. Regardless, the results suggest that fanout should increase from smaller to larger levels to minimize the cost.

First solve for u when dL/dB = 0.

dL/dB = (3*Pr) / (B * ln(2)) + Pi/(B * ln(2)) - u*f1*f2
dL/dB = (3*Pr + Pi) / (B * ln(2)) - u*f1*f2
# given Pr + Pi = 1 ...

dL/dB = (2*Pr + 1) / (B * ln(2)) - u*f1*f2

# now solve when dL/dB = 0
(2*Pr + 1) / (B * ln(2)) = u*f1*f2
# given f1*f2 = K/B
(2*Pr + 1) / (B * ln(2)) = u*K/B
# multiply LHS and RHS by B
(2*Pr + 1) / ln(2) = u*K

# solve for u
u = (2*Pr + 1) / (K * ln(2))

Then solve for f1 when dL/df1 = 0.

dL/df1 = (2*Pr) / (f1 * ln(2)) + Pi - u*B*f2
# when dL/df1 = 0 then
(2*Pr) / (f1 * ln(2)) + Pi = u*B*f2
# given B*f2 = K/f1

(2*Pr) / (f1 * ln(2)) + Pi = u*K/f1
# multiply LHS and RHS by f1
(2*Pr)/ln(2) + f1*Pi = u*K
f1*Pi = u*K - (2*Pr)/ln(2)

# substitute u = (2*Pr + 1) / (K * ln(2))

f1*Pi = (2*Pr + 1) / ln(2) - (2*Pr)/ln(2)
f1*Pi = (2*Pr + 1 - 2*Pr) / ln(2)
f1*Pi = 1 / ln(2)
f1 = 1 / (Pi * ln(2)) 


Then solve for f2 when dL/df2 = 0

# skip the math for now, trust me :)
f2 = (1 + Pr) / (Pi * ln(2))

# were there an f3 then the answer would be...
f3 = (1 + 2*Pr) / (Pi * ln(2))


Then solve for B, f1 and f2 given values for Pi and Pr

B = K / (f1 * f2)
B = K / (1 / (Pi * ln(2)) * (1+Pr) / (Pi * ln(2)))
B = K / ((1 + Pr) / (Pi * ln(2))^2)
B = K * (Pi * ln(2))^2 / (1+Pr)

# with Pi=0.5 and Pr=0.5
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.080, f1=2.89, f2=4.33


# with Pi=0.1 and Pr=0.9
# constraints on f1, f2 are OK
# constraints for B are probably OK
B = K * 0.003, f1=14.43, f2=27.41


# with Pi=0.9 and Pr=0.1
# note that f1, f2 are < 2 which violates a constraint
B = K * 0.353, f1=1.60, f2=1.76

# with Pi=0 and Pr=1 the closed forms divide by zero
# --> flush the memtable, no need to search it
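
The closed forms are easy to evaluate. A Matlab sketch (the *_opt names are mine) that reproduces the three cases above:

f1_opt = @(Pi)        1 ./ (Pi * log(2));
f2_opt = @(Pi, Pr)    (1 + Pr) ./ (Pi * log(2));
B_opt  = @(Pi, Pr, K) K .* (Pi * log(2)).^2 ./ (1 + Pr);

% with K=1, B_opt gives B as a fraction of K
[B_opt(0.5, 0.5, 1), f1_opt(0.5), f2_opt(0.5, 0.5)]   % 0.080, 2.89, 4.33
[B_opt(0.1, 0.9, 1), f1_opt(0.1), f2_opt(0.1, 0.9)]   % 0.003, 14.43, 27.41
[B_opt(0.9, 0.1, 1), f1_opt(0.9), f2_opt(0.9, 0.1)]   % 0.353, 1.60, 1.76 (violates f >= 2)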

Comparing optimal solution with typical config


For the 2 cases above for which an optimal solution was found without violating the constraints, I compare the value of the objective function (compares/operation) for the optimal config vs a typical config (f1=f2 in the typical config), assuming K=1B (1 billion rows in the LSM).

These results explain my earlier comment about confirming whether a problem is worth optimizing. The impact from an optimal config isn't significant in these cases.

# with Pi=0.5 and Pr=0.5
# f1=f2=3.54 in default config
B = K * 0.080, f1=2.89, f2=4.33
optimal and default config do ~47 compares/operation

# with Pi=0.1 and Pr=0.9
# f1=f2=19.89 in default config
B = K * 0.003, f1=14.43, f2=27.41
optimal and default config do ~63 compares/operation


Using solvers

I tried both Mathematica and Matlab and had more success with Matlab. The Minimize function in Mathematica can get unhappy with fractional coefficients in the cost function (the values for Pr and Pi).

Below are two functions and the script I used with Matlab.

# costf.m is the objective function
function [f,g] = costf(x)
% objective: expected compares/operation for an LSM with a memtable and 3 levels
% B = memtable entries, x(1..3) = per-level fanouts
% Pq, Pu = fraction of point queries and of updates (inserts)
B = 1000000;
Pq = 0.99;
Pu = 0.01;

% Pq * (point-read compares) + Pu * (memtable insert + compaction compares)
f = Pq * (4*log2(B) + 3*log2(x(1)) + 2*log2(x(2)) + log2(x(3))) + Pu * (log2(B) + x(1) + x(2) + x(3));

% gradient of f with respect to x, used by fmincon when gradients are enabled
if nargout > 1
    g = [((Pq * 3) / (x(1) * log(2))) + Pu;
         ((Pq * 2) / (x(2) * log(2))) + Pu;
         ((Pq * 1) / (x(3) * log(2))) + Pu];
end

# nlcon.m is the nonlinear constraint

function [c, ceq] = nlcon(x)
% no inequality constraints; the equality constraint fixes the product of
% the per-level fanouts at 1000, so the LSM has B * 1000 = 1e9 rows
c = [];
ceq = x(1) * x(2) * x(3) - 1000;

# the script
% lower bound of 2 for each per-level fanout
lb = [2,2,2];
% starting point
x0 = [10,10,10];

S = fmincon(@costf, x0, [], [], [], [], lb, [], @nlcon);
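
After the script runs, S holds the fanouts chosen by fmincon. A sketch of one way to inspect the result:

S          % optimized fanouts; the earlier result suggests they grow toward larger levels
prod(S)    % should be ~1000, so the row-count constraint holds
costf(S)   % compares/operation at the solution
costf(x0)  % compares/operation at the starting point, for comparison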
