perf(rust, python): Speed up .is_in and in #10214
Looks interesting... Can you put together some benchmarks to demonstrate/validate the speedups and the choice of boundary conditions that enable them? 👍
@@ -63,6 +63,7 @@ rolling_window = []
rank = []
diff = []
pct_change = ["diff"]
search_sorted = []
Search sorted should not be moved into polars-core. Rather, is_in should be moved into polars-ops.
Happy to! I have some synthetic microbenchmarks that I used during development, but could you suggest some more realistic ones?
Some benchmarking results:
Results for binary search (this script): I hid all rows with a ratio between 0.85 and 1.15 because I consider them measurement artefacts. Whenever the formula says TRUE (= we should use binary search), actually using binary search is faster, and vice versa, apart from a tiny number of outliers. The "New with limit vs old" column is measured with the formula taken into account, while the "New no limit vs old" column always uses binary search.
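To make the idea of a "formula" that gates binary search concrete, here is a minimal sketch of one possible cost model. It is an assumption for illustration, not the formula used in this PR (which is not shown in this thread): `should_use_binary_search`, its parameters, and the cost estimates are all hypothetical.

```python
import math


def should_use_binary_search(n_values: int, n_other: int) -> bool:
    """Hypothetical heuristic, NOT the PR's actual formula.

    Assumed costs:
      * binary search: one O(log n_other) probe per value
      * hash set: O(n_other) to build, then O(1) per value
    """
    if n_other <= 1:
        # Nothing to search through; the equality special case applies here.
        return False
    binary_cost = n_values * math.log2(n_other)
    hash_cost = n_other + n_values
    return binary_cost < hash_cost
```

Under this assumed model, looking up a handful of values in a large sorted series favours binary search, while looking up many values favours building a hash set once. Real thresholds would have to come from benchmarks like the ones discussed above.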
I'll need some help with moving
Why do these variants exist? When would you take a Series and when a ChunkedArray as input/output? When would you write a free function and when would you write a trait?
This PR is outdated as
This adds some special cases to is_in for speedup:
== . It's faster.
== for ordered series where min/max are cheap.
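The kind of special-casing described here can be sketched in plain Python. This is a rough illustration under stated assumptions, not the PR's Rust implementation: `is_in_sketch` and the `other_is_sorted` flag are hypothetical names, and the assumption is that the equality case applies when the right-hand side has a single element and that the ordered case applies when the right-hand side is sorted ascending (so min and max are simply its first and last elements).

```python
from bisect import bisect_left


def is_in_sketch(values, other, other_is_sorted=False):
    """Return a list of booleans, one per element of `values`.

    Hypothetical sketch; `other_is_sorted` means `other` is sorted ascending.
    """
    if len(other) == 0:
        return [False] * len(values)

    # Special case 1 (assumed): a single candidate reduces to `==`.
    if len(other) == 1:
        needle = other[0]
        return [v == needle for v in values]

    # Special case 2 (assumed): sorted input, so min/max are cheap.
    if other_is_sorted:
        lo, hi = other[0], other[-1]
        out = []
        for v in values:
            if v < lo or v > hi:
                # Outside the [min, max] range: no search needed.
                out.append(False)
                continue
            i = bisect_left(other, v)
            out.append(i < len(other) and other[i] == v)
        return out

    # General case: build a hash set once, then probe it per value.
    lookup = set(other)
    return [v in lookup for v in values]
```

For example, `is_in_sketch([0, 3, 5, 9], [1, 3, 5, 7], other_is_sorted=True)` rejects `0` and `9` with the range check alone and only runs a binary search for `3` and `5`.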