Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Date Histogram] Investigate the safe number of buckets for which filter rewrite optimization can be applied #13549

Closed
bowenlan-amzn opened this issue May 6, 2024 · 2 comments
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance

Comments

@bowenlan-amzn
Copy link
Member

Follow up tasks for #13317

The idea of the filter rewrite optimiaztion is to utilize the index structure instead of iterating over documents to get the buckets results. We are able to know how many buckets before the actual aggregate execution logic begins.

As the bucket counts increase or the number of documents that should be aggregated on decrease, the iterative method may become faster and the filter rewrite method may become slower.
Currently we have a cluster setting to define the supported bucket count but it may not always work. For example, if the dataset only has 3k different values and the aggregation query asks for 1024 buckets, it is too high and wouldn't be better than just iterating over; on the other hand, if the dataset has 100k different values, we can probably support more than 1024 buckets.

This task is to investigate some rules to decide whether the optimization should be used, dynamically depending on the dataset or the index.

The biggest part of overhead normally is when reading the values from documents. The bkd index structure has all the documents as leaf nodes and will only need to be traversed through when the leaf node is intersected with the query.
One idea here is to do a dummy traversal on the bkd tree to tell how many leaf node will be intersected, and how many middle node will be skipped, based on these 2 numbers, we can get a relatively accurate idea about the cost of certain range query.

@bowenlan-amzn bowenlan-amzn self-assigned this May 6, 2024
@bowenlan-amzn bowenlan-amzn converted this from a draft issue May 6, 2024
@andrross
Copy link
Member

andrross commented May 8, 2024

[Triage - attendees 1 2 3 4]
@bowenlan-amzn Thanks for filing.

@bowenlan-amzn
Copy link
Member Author

close because new issue #14438 will include this.

@github-project-automation github-project-automation bot moved this from Todo to Done in Performance Roadmap Jun 18, 2024
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Search Project Board Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
Archived in project
Status: Done
Development

No branches or pull requests

2 participants