The idea of the filter rewrite optimization is to use the index structure, instead of iterating over documents, to compute the bucket results. We know how many buckets there are before the actual aggregation execution logic begins.
As the bucket count increases, or the number of documents to aggregate over decreases, the iterative method may become faster and the filter rewrite method slower.
Currently we have a cluster setting to define the supported bucket count, but it does not always work well. For example, if the dataset only has 3k distinct values and the aggregation query asks for 1024 buckets, that is too many buckets and the rewrite wouldn't beat simply iterating; on the other hand, if the dataset has 100k distinct values, we can probably support more than 1024 buckets.
This task is to investigate rules for deciding dynamically, based on the dataset or the index, whether the optimization should be used. A rough illustration of such a rule is sketched below.
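As a rough illustration of the kind of rule this could be, the sketch below only enables the rewrite when each requested bucket would cover enough distinct values. The class name, the threshold, and the idea of passing in a distinct-value count are all assumptions for illustration; none of this is an existing OpenSearch setting or API.

```java
/**
 * Hypothetical decision rule for the filter rewrite optimization: only use it
 * when each bucket would cover "enough" distinct values, so that one range
 * query per bucket pays off compared to iterating over the documents.
 */
public final class FilterRewriteHeuristic {

    // Illustrative threshold: require at least this many distinct values per bucket.
    private static final double MIN_VALUES_PER_BUCKET = 8.0;

    public static boolean shouldUseFilterRewrite(long requestedBuckets, long distinctValues) {
        if (requestedBuckets <= 0 || distinctValues <= 0) {
            return false;
        }
        // With ~3k distinct values and 1024 buckets this is ~3 values per bucket,
        // so the rewrite is rejected; with 100k distinct values it is accepted.
        double valuesPerBucket = (double) distinctValues / requestedBuckets;
        return valuesPerBucket >= MIN_VALUES_PER_BUCKET;
    }
}
```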
The biggest source of overhead is normally reading the values from documents. The BKD index structure stores all documents in leaf nodes, and a leaf only needs to be traversed when it intersects the query.
One idea is to do a dummy traversal of the BKD tree to determine how many leaf nodes will be intersected and how many inner nodes will be skipped; based on these two numbers, we can get a relatively accurate idea of the cost of a given range query.
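A minimal sketch of such a dummy traversal, assuming Lucene 9's `PointValues.PointTree` API and a single-dimension range: it descends the BKD tree the same way a real intersect would, but only counts cells instead of reading any document values. The class name, counters, and the way the range bounds are supplied are illustrative, not existing OpenSearch code.

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.index.PointValues;
import org.apache.lucene.index.PointValues.PointTree;
import org.apache.lucene.index.PointValues.Relation;

/**
 * Dummy traversal of the BKD tree for a single-dimension range [lower, upper]
 * that counts how the tree's cells relate to the range, without reading any
 * document values. The counts can feed a cost estimate for the filter rewrite.
 */
public class RangeTraversalCostEstimator {

    private final byte[] lower;      // packed lower bound of the range
    private final byte[] upper;      // packed upper bound of the range
    private final int bytesPerDim;

    long outsideCells;               // subtrees skipped entirely
    long insideCells;                // subtrees fully inside the range (cheap: doc IDs only)
    long crossingLeaves;             // leaves crossing the range boundary (expensive: values read)

    public RangeTraversalCostEstimator(byte[] lower, byte[] upper, int bytesPerDim) {
        this.lower = lower;
        this.upper = upper;
        this.bytesPerDim = bytesPerDim;
    }

    public void estimate(PointValues pointValues) throws IOException {
        traverse(pointValues.getPointTree());
    }

    private void traverse(PointTree tree) throws IOException {
        Relation r = compare(tree.getMinPackedValue(), tree.getMaxPackedValue());
        if (r == Relation.CELL_OUTSIDE_QUERY) {
            outsideCells++;          // whole subtree skipped
            return;
        }
        if (r == Relation.CELL_INSIDE_QUERY) {
            insideCells++;           // whole subtree matches; no values need to be read
            return;
        }
        // CELL_CROSSES_QUERY: descend if this is an inner node, otherwise count the leaf.
        if (tree.moveToChild()) {
            do {
                traverse(tree);
            } while (tree.moveToSibling());
            tree.moveToParent();
        } else {
            crossingLeaves++;        // leaf whose values would be inspected one by one
        }
    }

    /** Single-dimension comparison of a cell [min, max] against [lower, upper]. */
    private Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
        if (Arrays.compareUnsigned(maxPackedValue, 0, bytesPerDim, lower, 0, bytesPerDim) < 0
                || Arrays.compareUnsigned(minPackedValue, 0, bytesPerDim, upper, 0, bytesPerDim) > 0) {
            return Relation.CELL_OUTSIDE_QUERY;
        }
        if (Arrays.compareUnsigned(minPackedValue, 0, bytesPerDim, lower, 0, bytesPerDim) >= 0
                && Arrays.compareUnsigned(maxPackedValue, 0, bytesPerDim, upper, 0, bytesPerDim) <= 0) {
            return Relation.CELL_INSIDE_QUERY;
        }
        return Relation.CELL_CROSSES_QUERY;
    }
}
```

The expensive part of the filter rewrite is the `crossingLeaves` count, since those are the leaf blocks whose values must actually be read; `insideCells` can be resolved from doc IDs alone, and `outsideCells` are free.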
Follow up tasks for #13317