Possible regression of date histogram starting from 2.12 #13171
Labels
bug
Something isn't working
Performance
This is for any performance related enhancements or bugs
Search:Aggregations
GitHub issue: #13087
pmc workload: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/pmc
Regression query
note: by default, the request cache is disabled on the target index
Reproduce the regression
1 r6g.xlarge node, 5p0r
Using the opensearch benchmark tool
rerun 3 times, get relatively consistent results. One example result:
For 2.11, the result is
Note: service time is the query took time.
Impact
The code for the optimization was released in 2.12.
The pre-conditions for this optimization include:
Profiling results
Using async profiler, 60s when sending the query in a while true loop
2.11
https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/211flamegraph.html (GH issue doesn't support html upload, you may need to download and see)
2.12
https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/212flamegraph.html
Qualitative Analysis
By default, aggregation iterate over the documents and poll the value of the aggregated field from DocValues and bucketize it. From the profiling result of 2.11, the time consuming parts are reading doc value, doing rounding of the value and adding the value to the results.8
In our optimization, we don’t iterate doc values. Instead we create the bucket ranges first. Then use PointRangeQuery to get the doc count for each bucket. From the profiling result of 2.12, the time consuming part is the ponit count of PointRangeQuery. And inside it, the visitDocValues method takes most of the time. This visitDocValues is to iterate the points of leaf node in BKD and this is the most heavy part when using BKD.
With respect to doing aggregation on one segment, first define some expressions for later deduction
N_R: number of ranges or buckets
N_C: number of leaf nodes that need visitDocValues.
N_Leaf: total number of leaf nodes that have matching documents
N_Doc: total number of documents that matches
The approximate time cost of default method
T_default = C_default x N_Doc
C_default is the cost of handling a doc in default method or DocValues iteration, the heavy part is related to how doc value is read from the file and get added into the result.
The approximate time cost of optimized method
T_optim = C_optim x ((N_C/N_Leaf) x N_Doc)
C_optim is the cost of handling a doc in optimized method or BKD traversal, the heavy part is related to how point value is read read from leaf nodes and a good way to count them into the result.The problem is to decide when to use the optimized method
First important element is
N_C
. N_C is related to N_R, generally more ranges of smaller interval lead to more leaf node visitDocValues. Considering the ranges rewritten from date histogram are connected with each other, we are probably traverse many leaf nodes twice, while ideally we should only do it once. I am looking into a way to not doing range query one by one but just one bkd traversal for all ranges. We cannot avoid traverse some leaf nodes when using the optimized method, so how to calculate theN_C/N_Leaf
is the key. In 1-D scenario, assume we are doing multi-range traversal, N_C is bounded by N_R+1.Now the question become how to estimate
N_Leaf
, if it's a segment level match all scenario, N_Leaf is about (segment's maxDoc/512). If the top level query has a range query, N_leaf calculation is not straightforward. There's an existing light-weighted method of PointValues, estimatePointCount, that can tell how many points falls in the range. Similarly, we can also write a custom intersector to tell how many leaf nodes fall into the range query of top level query. Since this won't involve any leaf node visitDocValues, only tree traversal, I don't see a performance issue here.On the other hand, I distinguish
C_default
andC_optim
because they may have noticeable difference. Currently in IndexOrDocValuesQuery, C_default = 8 x C_optim, which means using DV is considered 8 times costly than using BKD in terms of verifing the matched documents of the lead iterator in a conjunction. The number is probably different in our case since the logic here are very different. We will carry out some experiments to get their ratio.First Experiment
The pmc dataset has 123 months (2006/2/1 ~ 2016/4/1) of data.
Based on the approximate time cost formulas, we can see for the optimized method, the monthly aggregation leads to a higher N_R. Considering the time field is 1 dimensional, an approximate of N_C is 2xN_R. So if we decrease the N_R by increasing the time interval of the agg query, we should observe the performance of optimized method becomes better.
I increase the time interval of the aggregation query from month to year and 1000d, and manually run the query multiple times against the cluster. This is the query took time in ms.
Fix Plan
Update the current threshold 1024 to 24, which is the a relatively safer number in terms of pmc workload.
Add a cluster setting that can be used to easily turn off the optimization. #13179
Develop a clever BKD traversal that only traverse once for any number of bucket ranges and able to count the points in a light weight. #13317
The text was updated successfully, but these errors were encountered: