chore: refactor `ArrowDataFrame.with_columns` #1345

FBruzzesi · 2024-11-10T11:31:24Z

What type of PR is this? (check all applicable)

Checklist

Code follows style guide (ruff)
Tests added
Documented the changes

If you have comments or can explain your changes, please do so below.

While profiling for plotly, py-spy indicates that we spend a lot of time in `validate_dataframe_comparand` in the pyarrow case. This function is called only in the `with_columns` methods.

This PR proposes two changes:

1. Create the constant array via `np.full` in `validate_dataframe_comparand` instead of `[const] * length`.
2. Change the logic in `with_columns` to use pyarrow native methods to insert a column. I expect this to be faster than the current approach of concatenating the existing columns with the new ones. The caveat is the case where the number of new columns is order(s) of magnitude greater than the number of existing ones. I would expect this not to be the case in the majority of scenarios, but it is the reason I am opening this as an RFC.

FBruzzesi · 2024-11-10T11:33:03Z

narwhals/_arrow/dataframe.py

+            native_frame = (
+                native_frame.set_column(
+                    columns.index(col_name), field_=col_name, column=column
                )
+                if col_name in columns
+                else native_frame.append_column(field_=col_name, column=column)
            )


If the column name already exists, we replace that column; otherwise we append the new column at the end.

FBruzzesi · 2024-11-10T11:34:06Z

narwhals/_arrow/utils.py

-            value = other.item()
+            value = other[0]


This avoids `item` checking the length of the array again; we already know it is 1.
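The idea can be sketched with a toy class. `TinySeries` and `extract_scalar` are hypothetical stand-ins, not narwhals code; the sketch only assumes that `item()` validates the length before returning the single element.

```python
class TinySeries:
    """Hypothetical stand-in for a series wrapper."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        return self._values[index]

    def item(self):
        # Validates the length on every call before returning the element.
        if len(self) != 1:
            raise ValueError("can only call item() on a series of length 1")
        return self._values[0]


def extract_scalar(other: TinySeries):
    if len(other) == 1:
        # The branch condition already guarantees length 1,
        # so plain indexing skips the redundant check inside item().
        return other[0]
    return other


value = extract_scalar(TinySeries([7]))
```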

MarcoGorelli · 2024-11-10T11:44:43Z

nice - there may be some issue on py38, but the rest looks good!

should we wait until altair / marimo's CIs are fixed to merge?

FBruzzesi · 2024-11-10T16:31:49Z

Fixed old versions, what's the deal with TPCH taking so long now 🙈?

MarcoGorelli · 2024-11-10T16:38:18Z

🤔 looks like yesterday it went from 2 minutes to 15 minutes

MarcoGorelli · 2024-11-10T16:53:07Z

ok it was dask, i've removed them from the tpch ci and opened an issue on their repo

MarcoGorelli

really nice, thanks @FBruzzesi - did you test this against the plotly branch locally?

FBruzzesi · 2024-11-10T16:57:35Z

> did you test this against the plotly branch locally?

Yes, though the change is not as big as I would have expected - e.g. I would expect `np.full(1_000_000, 1)` to be visibly faster than `[1] * 1_000_000`.
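One way to check that expectation in isolation is a small `timeit` harness like the sketch below. The function name and defaults are assumptions. It times only the construction of the constant container, not the later conversion to a pyarrow array, and the numbers will vary by machine.

```python
import timeit

import numpy as np


def time_constant_builders(length: int = 1_000_000, repeats: int = 5) -> dict:
    # Take the best of `repeats` single runs to reduce noise.
    list_time = min(
        timeit.repeat(lambda: [1] * length, number=1, repeat=repeats)
    )
    numpy_time = min(
        timeit.repeat(lambda: np.full(length, 1), number=1, repeat=repeats)
    )
    return {"list": list_time, "np.full": numpy_time}


timings = time_constant_builders(length=100_000, repeats=3)
for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f}s")
```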

Edit: Is there a way to run it in isolation with your kaggle notebook? I can clone that

MarcoGorelli · 2024-11-10T17:00:27Z

yeah, where there's `pip install git+https://github.com/narwhals-dev/narwhals`, change that to, for example, `pip install git+https://github.com/narwhals-dev/narwhals@perf/pyarrow-with-columns`

FBruzzesi · 2024-11-12T20:30:39Z

@MarcoGorelli for 1M rows and 50 columns I cannot see any change in performance when:

- only editing existing columns
- only adding columns
- editing and adding columns in the same with_columns statement

This holds both for chunked arrays and for scalars. The two approaches seem to be equivalent.
I leave it up to you whether this syntax is any better than the previous one.

MarcoGorelli · 2024-11-13T08:17:52Z

thanks for checking! i think I prefer this one, if you agree let's ship it

FBruzzesi added 3 commits November 9, 2024 10:00

WIP

0597bdd

Merge branch 'main' into perf/pyarrow-with-columns

8c4b62a

use pa.array

f8a591f

FBruzzesi commented Nov 10, 2024

FBruzzesi added 2 commits November 10, 2024 15:56

Merge branch 'main' into perf/pyarrow-with-columns

bf5d0fd

merge main, fix lit doctest, rm type

ae9791a

MarcoGorelli closed this Nov 10, 2024

MarcoGorelli reopened this Nov 10, 2024

MarcoGorelli approved these changes Nov 10, 2024

FBruzzesi changed the title ~~RFC, perf: pyarrow DataFrame.with_columns~~ chroe: refactor pyarrow DataFrame.with_columns Nov 13, 2024

FBruzzesi changed the title ~~chroe: refactor pyarrow DataFrame.with_columns~~ chroe: refactor ArrowDataFrame.with_columns Nov 13, 2024

MarcoGorelli merged commit 8f2e4a9 into main Nov 13, 2024
30 checks passed

FBruzzesi deleted the perf/pyarrow-with-columns branch November 13, 2024 11:45
