perf(rust, python): Speed up `.is_in` and `in` #10214

jonashaag · 2023-07-31T22:23:43Z

This adds some special cases to is_in for speedup:

For small lhs, just use a bunch of ==. It's faster.
For ordered rhs, use binary search if we estimate it to be faster than constructing the hash set. It uses a complexity theory based formula to choose the algorithm. I experimentally verified that this formula holds surprisingly accurately.
For the hash set algorithm, keep the min and max of the values and use that to short circuit hash set lookup. This speeds up lookups when there is little range overlap and doesn't slow down the other cases. NB: This could even be added to == for ordered series where min/max are cheap.

alexander-beedie · 2023-08-01T04:06:14Z

Looks interesting... Can you put together some benchmarks to demonstrate/validate the speedups and the choice of boundary conditions that enable them? 👍

ritchie46 · 2023-08-01T05:52:29Z

crates/polars-core/Cargo.toml

@@ -63,6 +63,7 @@ rolling_window = []
 rank = []
 diff = []
 pct_change = ["diff"]
+search_sorted = []


Search sorted should not be moved into polars-core. Rather is_in should be moved into polars-ops.

jonashaag · 2023-08-01T07:24:57Z

Looks interesting... Can you put together some benchmarks to demonstrate/validate the speedups and the choice of boundary conditions that enable them? 👍

Happy to! I have some synthetic microbenchmarks used for dev but I wonder if you could suggest some maybe more realistic ones?

jonashaag · 2023-08-01T11:49:41Z

Some benchmarking results:

I wasn't able to come up with a useful microbenchmark to prove this change, so I'll revert it:

For the hash set algorithm, keep the min and max of the values and use that to short circuit hash set lookup
For the equal() change, see this gist. Results:

lhs size geomean (lower is better)

5 0.57

10 0.67

15 0.81

20 0.93

25 1.02

30 1.10

50 1.49

Notes: rhs size doesn't matter (I checked [1, 1e6])
Binary search: TBD

jonashaag · 2023-08-01T14:54:00Z

Results for binary search (this script):

I hid all rows between 0.85 and 1.15 because I consider them measurement artefacts.

You can see that whenever the formula says TRUE (= we should use bin search), actually using bin search is faster, and vice-versa. There is a tiny number of outliers.

The "New with limit vs old" column is measured with taking into account the formula, while the "New no limit vs old" always uses binary search.

jonashaag · 2023-08-01T21:26:08Z

I'll need some help with moving is_in to polars-ops. There seem to be various ways to implement a series operation:

Free function that returns a series: pub fn approx_unique(s: &Series) -> PolarsResult<Series> (here)

Trait on Series

pub trait ArgAgg {
  fn arg_min(&self) -> Option<usize>;
}

impl ArgAgg for Series {

(here)

Free function that returns a ChunkedArray: pub fn is_unique(s: &Series) -> PolarsResult<BooleanChunked> (here)

Trait on ChunkedArray:

trait IsIn {
  fn is_in(&self, _other: &Series) -> PolarsResult<BooleanChunked>;
}
...
impl IsIn for StructChunked {
  fn is_in(&self, other: &Series) -> PolarsResult<BooleanChunked> {

(here)

Why do these variants exist? When would you take as input/out a Series/ChunkedArray? When would you make a free function and when would make a trait?

stinodego · 2024-01-14T22:18:26Z

This PR is outdated as is_in has now moved to polars-ops. The performance benefits of the adjustments in this PR might still be relevant though. If you think so, please rebase your work and open a fresh PR!

WIP

06e13e9

jonashaag requested review from ritchie46, stinodego and alexander-beedie as code owners July 31, 2023 22:23

github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Jul 31, 2023

ritchie46 reviewed Aug 1, 2023

View reviewed changes

stinodego requested review from MarcoGorelli and orlp as code owners November 12, 2023 04:36

stinodego closed this Jan 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(rust, python): Speed up `.is_in` and `in` #10214

perf(rust, python): Speed up `.is_in` and `in` #10214

jonashaag commented Jul 31, 2023 •

edited

Loading

alexander-beedie commented Aug 1, 2023 •

edited

Loading

ritchie46 Aug 1, 2023

jonashaag commented Aug 1, 2023

jonashaag commented Aug 1, 2023 •

edited

Loading

jonashaag commented Aug 1, 2023

jonashaag commented Aug 1, 2023 •

edited

Loading

stinodego commented Jan 14, 2024

perf(rust, python): Speed up .is_in and in #10214

perf(rust, python): Speed up .is_in and in #10214

Conversation

jonashaag commented Jul 31, 2023 • edited Loading

alexander-beedie commented Aug 1, 2023 • edited Loading

ritchie46 Aug 1, 2023

Choose a reason for hiding this comment

jonashaag commented Aug 1, 2023

jonashaag commented Aug 1, 2023 • edited Loading

jonashaag commented Aug 1, 2023

jonashaag commented Aug 1, 2023 • edited Loading

stinodego commented Jan 14, 2024

perf(rust, python): Speed up `.is_in` and `in` #10214

perf(rust, python): Speed up `.is_in` and `in` #10214

jonashaag commented Jul 31, 2023 •

edited

Loading

alexander-beedie commented Aug 1, 2023 •

edited

Loading

jonashaag commented Aug 1, 2023 •

edited

Loading

jonashaag commented Aug 1, 2023 •

edited

Loading