perf: improve ArrowGroupBy.__iter__ performances #1334
*[nw_namespace.col(k) == v for k, v in zip(self._keys, key_value)]
next(
(
t := self._df._from_native_frame(
quite proud of this walrus
import pyarrow.compute as pc  # ignore-banned-import

col_token = generate_temporary_column_name(n_bytes=8, columns=self._df.columns)
null_token = "__null_token_value__"  # noqa: S105
S105 is "hardcoded-password-string". This is definitely not a password, but it is hardcoded.
*[pc.cast(table[key], pa.string()) for key in self._keys],
"",
This is because the last *string argument is used as the separator.
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below.
According to Marco's performance benchmarking for plotly, the bottleneck for a few functions seems to be the call we make to ArrowGroupBy.__iter__. Since pyarrow does not natively support iterating over groups, we (actually, pointing the finger at myself) implemented a (let's say naive) way of still allowing for that. I remember the use case was for scikit-lego to fully support arrow as well.
This PR tries to improve that performance using native arrow methods and no simple shortcuts. The steps are as follows: