Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve ArrowGroupBy.__iter__ performances #1334

Merged
merged 4 commits into from
Nov 8, 2024

Conversation

FBruzzesi
Copy link
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

According to Marco's performance benchmarking for plotly, the bottleneck for a few functions seems to be the call we do to ArrowGroupBy.__iter__.

Since pyarrow does not natively support iterating over groups, we (actually pointing finger to myself) implemented a (let's say naive) way of still allowing for that - I remember the use case was for scikit-lego to fully support arrow as well.

This PR tries to improve those performances using native arrow methods and no simple shortcuts. Steps are as follow:

  • Create an array containing the string concatenation of the key values (after casting to string). Null handling is required.
  • Add the column to the original table
  • Return the pair of :
    • key values, obtained as first (and unique) value of filtered table for the key names.
    • sliced dataframe, obtained as filtered table, and dropping the temporary column with string concatenation.

*[nw_namespace.col(k) == v for k, v in zip(self._keys, key_value)]
next(
(
t := self._df._from_native_frame(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quite proud of this walrus πŸ˜‡

import pyarrow.compute as pc # ignore-banned-import

col_token = generate_temporary_column_name(n_bytes=8, columns=self._df.columns)
null_token = "__null_token_value__" # noqa: S105
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S105 is "hardcoded-password-string" - this is definitly not a password, but it is hardcoded 😁

Comment on lines +91 to +92
*[pc.cast(table[key], pa.string()) for key in self._keys],
"",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the last *string argument is used as separator

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well done, this is really impressive!

@MarcoGorelli MarcoGorelli merged commit 7696b6e into main Nov 8, 2024
22 checks passed
@FBruzzesi FBruzzesi deleted the perf/pyarrow-groupby-iter branch November 8, 2024 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants