chore: refactor ArrowDataFrame.with_columns #1345

Merged: 5 commits into main from perf/pyarrow-with-columns, Nov 13, 2024

FBruzzesi
Member

@FBruzzesi FBruzzesi commented Nov 10, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

While profiling for plotly, py-spy indicates that we spend a lot of time in validate_dataframe_comparand in the pyarrow case. This function is called only in the with_columns methods.

This PR proposes two changes:

  • Creates the constant array via np.full in validate_dataframe_comparand instead of [const] * length
  • Changes the logic in with_columns to use pyarrow-native methods to replace or append a column. I expect this to be faster than the current approach of concatenating the existing columns with the new ones. The caveat is the case where the number of new columns is order(s) of magnitude greater than the number of existing ones. I would not expect that in the majority of scenarios, but it is the reason I am opening this as an RFC

Comment on lines +323 to 329
native_frame = (
    native_frame.set_column(
        columns.index(col_name), field_=col_name, column=column
    )
    if col_name in columns
    else native_frame.append_column(field_=col_name, column=column)
)
Member Author

If the column name already exists, we replace that column; otherwise we append the new column at the end.

Comment on lines 186 to 184
- value = other.item()
+ value = other[0]
Member Author

This avoids having item check the length of the array again; we already know it is 1.

@MarcoGorelli
Member

nice - there may be some issue on py38, but the rest looks good!

should we wait until altair / marimo's cis are fixed to merge?

@FBruzzesi
Member Author

Fixed old versions, what's the deal with TPCH taking so long now 🙈?

@MarcoGorelli
Member

🤔 looks like yesterday it went from 2 minutes to 15 minutes

@MarcoGorelli
Member

ok it was dask, i've removed them from the tpch ci and opened an issue on their repo

Member

@MarcoGorelli MarcoGorelli left a comment

really nice, thanks @FBruzzesi - did you test this against the plotly branch locally?

@FBruzzesi
Member Author

FBruzzesi commented Nov 10, 2024

> did you test this against the plotly branch locally?

Yes, though it is not as big a change as I would expect. E.g. I would expect np.full(1_000_000, 1) to be visibly faster than [1] * 1_000_000.

Edit: Is there a way to run it in isolation with your kaggle notebook? I can clone that

@MarcoGorelli
Member

yeah where there's pip install git+https://github.com/narwhals-dev/narwhals change that to for example pip install git+https://github.com/narwhals-dev/narwhals@perf/pyarrow-with-columns

@FBruzzesi
Member Author

@MarcoGorelli for 1M rows and 50 columns, I cannot see any change in performance for:

  • only editing existing columns
  • only adding columns
  • editing and adding columns in the same with_columns statement

This holds both when working with chunked arrays and with scalars. The two approaches seem to be equivalent.
I leave it up to you to decide whether this syntax is any better than the previous one.

@MarcoGorelli
Member

thanks for checking! i think I prefer this one, if you agree let's ship it

@FBruzzesi FBruzzesi changed the title from "RFC, perf: pyarrow DataFrame.with_columns" to "chore: refactor pyarrow DataFrame.with_columns" Nov 13, 2024
@FBruzzesi FBruzzesi changed the title from "chore: refactor pyarrow DataFrame.with_columns" to "chore: refactor ArrowDataFrame.with_columns" Nov 13, 2024
@MarcoGorelli MarcoGorelli merged commit 8f2e4a9 into main Nov 13, 2024
30 checks passed
@FBruzzesi FBruzzesi deleted the perf/pyarrow-with-columns branch November 13, 2024 11:45