perf: improve `ArrowGroupBy.__iter__` performances #1334

FBruzzesi · 2024-11-07T22:15:19Z

What type of PR is this? (check all applicable)

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

According to Marco's performance benchmarking for plotly, the bottleneck for a few functions seems to be the call we do to ArrowGroupBy.__iter__.

Since pyarrow does not natively support iterating over groups, we (actually pointing the finger at myself) implemented a (let's say naive) way of still allowing for that. I remember the use case was for scikit-lego to fully support arrow as well.

This PR tries to improve that performance using native arrow methods and no simple shortcuts. The steps are as follows:

1. Create an array containing the string concatenation of the key values (after casting each to string). Null handling is required.
2. Add that array to the original table as a temporary column.
3. Return, for each group, the pair of:
   - key values, obtained as the first (and unique) value of the filtered table for the key names.
   - sliced dataframe, obtained as the filtered table with the temporary string-concatenation column dropped.

FBruzzesi · 2024-11-07T22:15:54Z

narwhals/_arrow/group_by.py

-                    *[nw_namespace.col(k) == v for k, v in zip(self._keys, key_value)]
+                next(
+                    (
+                        t := self._df._from_native_frame(


quite proud of this walrus 😇

FBruzzesi · 2024-11-07T22:17:32Z

narwhals/_arrow/group_by.py

+        import pyarrow.compute as pc  # ignore-banned-import
+
+        col_token = generate_temporary_column_name(n_bytes=8, columns=self._df.columns)
+        null_token = "__null_token_value__"  # noqa: S105


S105 is "hardcoded-password-string" - this is definitely not a password, but it is hardcoded 😁

FBruzzesi · 2024-11-07T22:38:10Z

narwhals/_arrow/group_by.py

+            *[pc.cast(table[key], pa.string()) for key in self._keys],
+            "",


The trailing `""` is there because the last `*strings` argument is used as the separator.

MarcoGorelli

well done, this is really impressive!

FBruzzesi added 4 commits November 7, 2024 22:25

perf: arrow group by iter

24dd1c4

drop_columns -> drop

98fcc97

no kw

1f73403

str -> list[str]

9a1bbf5

github-actions bot added the performance label Nov 7, 2024

MarcoGorelli approved these changes Nov 8, 2024


MarcoGorelli merged commit 7696b6e into main Nov 8, 2024
22 checks passed

FBruzzesi deleted the perf/pyarrow-groupby-iter branch November 8, 2024 10:01
