chore: refactor ArrowDataFrame.with_columns
#1345
Conversation
```python
native_frame = (
    native_frame.set_column(
        columns.index(col_name), field_=col_name, column=column
    )
    if col_name in columns
    else native_frame.append_column(field_=col_name, column=column)
)
```
If the column name already exists, we replace the column in place; otherwise we append it at the end.
narwhals/_arrow/utils.py
```diff
- value = other.item()
+ value = other[0]
```
Avoids `item` checking the length of the array again; we already know it is 1.
nice - there may be some issue on py38, but the rest looks good! should we wait until altair / marimo's CIs are fixed to merge?
Fixed old versions, what's the deal with TPCH taking so long now?
looks like yesterday it went from 2 minutes to 15 minutes
ok, it was dask - I've removed it from the TPCH CI and opened an issue on their repo
really nice, thanks @FBruzzesi - did you test this against the plotly branch locally?
Yes, though not as big of a change as I would expect. Edit: is there a way to run it in isolation with your Kaggle notebook? I can clone that
yeah where there's
@MarcoGorelli for 1M rows, 50 columns I cannot see any changes in performance, both when working with chunked arrays and scalars. The two approaches seem to be equivalent.
thanks for checking! I think I prefer this one - if you agree, let's ship it
DataFrame.with_columns
ArrowDataFrame.with_columns
What type of PR is this? (check all applicable)
Checklist
If you have comments or can explain your changes, please do so below.
While profiling for plotly, py-spy indicates that we spend a lot of time in `validate_dataframe_comparand` for the pyarrow case. This is called only in `with_columns` methods.

This PR proposes two changes:

- Use `np.full` in `validate_dataframe_comparand` instead of `[const] * length`.
- Refactor `with_columns` to use pyarrow native methods to insert a column value. I expect this to be faster than the current approach of concatenating the already existing columns with the new ones - the caveat is if the number of new columns is order(s) of magnitude greater than the number of existing ones. In the majority of scenarios I would expect this not to be the case, but this is the reason I am opening this as an RFC.