Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(rust, python): Speed up .is_in and in #10214

Closed
wants to merge 1 commit into from

Conversation

jonashaag
Copy link
Contributor

@jonashaag jonashaag commented Jul 31, 2023

This adds some special cases to is_in for speedup:

  • For small lhs, just use a bunch of ==. It's faster.
  • For ordered rhs, use binary search if we estimate it to be faster than constructing the hash set. It uses a complexity theory based formula to choose the algorithm. I experimentally verified that this formula holds surprisingly accurately.
  • For the hash set algorithm, keep the min and max of the values and use that to short circuit hash set lookup. This speeds up lookups when there is little range overlap and doesn't slow down the other cases. NB: This could even be added to == for ordered series where min/max are cheap.

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jul 31, 2023
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Aug 1, 2023

Looks interesting... Can you put together some benchmarks to demonstrate/validate the speedups and the choice of boundary conditions that enable them? 👍

@@ -63,6 +63,7 @@ rolling_window = []
rank = []
diff = []
pct_change = ["diff"]
search_sorted = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Search sorted should not be moved into polars-core. Rather is_in should be moved into polars-ops.

@jonashaag
Copy link
Contributor Author

Looks interesting... Can you put together some benchmarks to demonstrate/validate the speedups and the choice of boundary conditions that enable them? 👍

Happy to! I have some synthetic microbenchmarks used for dev but I wonder if you could suggest some maybe more realistic ones?

@jonashaag
Copy link
Contributor Author

jonashaag commented Aug 1, 2023

Some benchmarking results:

  • I wasn't able to come up with a useful microbenchmark to prove this change, so I'll revert it:

    For the hash set algorithm, keep the min and max of the values and use that to short circuit hash set lookup

  • For the equal() change, see this gist. Results:

    lhs size geomean (lower is better)
    5 0.57
    10 0.67
    15 0.81
    20 0.93
    25 1.02
    30 1.10
    50 1.49

    Notes: rhs size doesn't matter (I checked [1, 1e6])

  • Binary search: TBD

@jonashaag
Copy link
Contributor Author

Results for binary search (this script):

Screenshot 2023-08-01 at 16 51 59

I hid all rows between 0.85 and 1.15 because I consider them measurement artefacts.

You can see that whenever the formula says TRUE (= we should use bin search), actually using bin search is faster, and vice-versa. There is a tiny number of outliers.

The "New with limit vs old" column is measured with taking into account the formula, while the "New no limit vs old" always uses binary search.

@jonashaag
Copy link
Contributor Author

jonashaag commented Aug 1, 2023

I'll need some help with moving is_in to polars-ops. There seem to be various ways to implement a series operation:

  • Free function that returns a series: pub fn approx_unique(s: &Series) -> PolarsResult<Series> (here)

  • Trait on Series

    pub trait ArgAgg {
      fn arg_min(&self) -> Option<usize>;
    }
    
    impl ArgAgg for Series {

    (here)

  • Free function that returns a ChunkedArray: pub fn is_unique(s: &Series) -> PolarsResult<BooleanChunked> (here)

  • Trait on ChunkedArray:

    trait IsIn {
      fn is_in(&self, _other: &Series) -> PolarsResult<BooleanChunked>;
    }
    ...
    impl IsIn for StructChunked {
      fn is_in(&self, other: &Series) -> PolarsResult<BooleanChunked> {

    (here)

Why do these variants exist? When would you take as input/out a Series/ChunkedArray? When would you make a free function and when would make a trait?

@stinodego
Copy link
Member

This PR is outdated as is_in has now moved to polars-ops. The performance benefits of the adjustments in this PR might still be relevant though. If you think so, please rebase your work and open a fresh PR!

@stinodego stinodego closed this Jan 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants