Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support passing index object directly into maybe_set_index #1319

Merged

Conversation

Riik
Copy link
Contributor

@Riik Riik commented Nov 4, 2024

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

Pandas set_index docs
Pandas set_index test cases
I've renamed the column_names: str | list[str] parameter to keys: str | Series | list[Series | str]. I referenced the Pandas docs where which have keys: label or array-like or list of labels/arrays. Since this is a backwards-compatability function, I figured it is good for it to be close to the Pandas spec, and the original param name column doesn't cover all cases anymore.

@github-actions github-actions bot added the enhancement New feature or request label Nov 4, 2024
@MarcoGorelli
Copy link
Member

thanks for your PR!

this would break backwards-compatibility, so I don't think we can just rename to keys unfortunately

@Riik
Copy link
Contributor Author

Riik commented Nov 5, 2024

That makes sense, I wondered if this was part of the "stable API" or not. Would you advise to stick with the original name (columns) and expand the functionality/meaning, or should we not add this feature and close the issue / PR?

@MarcoGorelli
Copy link
Member

one idea could be to add an extra keyword argument index which you can pass an index object directly to (and the disallow passing both columns and index), that would still be backwards-compatible

@Riik
Copy link
Contributor Author

Riik commented Nov 9, 2024

Right, so we preserve the column_names for backwards comparability, add index, make both optional, and check that they are never both specified. They new signature will then be

def maybe_set_index(df: T, column_names: str | list[str] | None, index: Series | list[Series] | None) -> T

I'll give that a try!

@Riik Riik force-pushed the maybe_set_index_support_direct_index branch from aaa1780 to 90a048d Compare November 9, 2024 20:18
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Riik thanks for the effort! I left a minor comment and a suggestion to simplify the logic a bit.

Additionally, could we handle the case for which df is actually a series? That's the use case I had in mind when I opened the related issue. I.e. if the native_object is a pandas like series, then I would like to modify its index and clearly that's not possible to do unless another Series/Index/iterable is passed.

Since that's the case, I think we should also raise an error if native_object is a series but column_names is provided

narwhals/utils.py Outdated Show resolved Hide resolved
narwhals/utils.py Outdated Show resolved Hide resolved
@Riik
Copy link
Contributor Author

Riik commented Nov 9, 2024

Additionally, could we handle the case for which df is actually a series? That's the use case I had in mind when I opened the related issue. I.e. if the native_object is a pandas like series, then I would like to modify its index and clearly that's not possible to do unless another Series/Index/iterable is passed.

Ah yes, I see the idea now! I think that should work with the current implementation, let me add a test for it

@Riik
Copy link
Contributor Author

Riik commented Nov 9, 2024

@FBruzzesi I've added a test case and also support for setting the index on a Series. It didn't work out of the box since pd.Series (docs) doesn't have a set_index() method, but you can set the index directly by assignment.

Also, there is the extra case where passing in column_names in combination with a Series is now invalid.

I feel like this function is getting a bit unwieldy, i.e. there a lot of conditionals inside the function. Would it be an option to make a separate function maybe_set_index_series instead? That would clean up the logic of this function and keep it simpler. Or maybe there is other ways to clean up this function a bit more. Let me know if you have any ideas!

narwhals/utils.py Outdated Show resolved Hide resolved
@FBruzzesi
Copy link
Member

FBruzzesi commented Nov 9, 2024

@FBruzzesi I've added a test case and also support for setting the index on a Series. It didn't work out of the box since pd.Series (docs) doesn't have a set_index() method, but you can set the index directly by assignment.

That's great, thanks! It looks good, I just tagged Marco asking if Series.set_axis should be prefered instead.

Also, there is the extra case where passing in column_names in combination with a Series is now invalid.

I feel like this function is getting a bit unwieldy, i.e. there a lot of conditionals inside the function. Would it be an option to make a separate function maybe_set_index_series instead? That would clean up the logic of this function and keep it simpler. Or maybe there is other ways to clean up this function a bit more. Let me know if you have any ideas!

It still seems pretty reasonable in scope and size. I will take a deeper look tomorrow to see if there is a way to cut some corners :)

@FBruzzesi
Copy link
Member

FBruzzesi commented Nov 10, 2024

@Riik I just merged with main branch and pushed a few changes to parse index and column_names into keys, so that we avoid branching out for dataframe/series in multiple places.

@Riik
Copy link
Contributor Author

Riik commented Nov 10, 2024

@Riik I just merged with main branch and pushed a few changes to parse index and column_names into keys, so that we avoid branching out for dataframe/series in multiple places.

Thank you! I think the function reads quite nicely now, with the keys internally it is easier to follow the flow.

Anything else that still needs to happen on this PR?

Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am (biased and) happy with the state of this! I will leave the last word to Marco for the set_axis vs s.index=index question before merging!

DogHighFiveGIF

narwhals/utils.py Outdated Show resolved Hide resolved
narwhals/utils.py Outdated Show resolved Hide resolved
narwhals/utils.py Outdated Show resolved Hide resolved
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for updating - implementation looks good, just not sure about the argument name df

I did a search on https://github.com/search?q=%22maybe_set_index%28%22+language%3APython&type=code&p=1, and nobody would be affected by such a renaming. I think this is rare enough as a function that we can just change it - the only user I found it scikit-lego, who write

nw.maybe_set_index(result, ["fold", "part"])

(i.e. they pass the first argument positionally, not by name)


@FBruzzesi 's suggested obj, as that's what we use in maybe_get_index

Comment on lines 287 to 288
df: object for which maybe set the index (can be either a Narwhals `DataFrame`
or `Series`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a bit odd for this to be called df when it can be a dataframe or series?

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @Riik and @FBruzzesi !

@MarcoGorelli MarcoGorelli merged commit 0929d52 into narwhals-dev:main Nov 12, 2024
22 checks passed
@rik-vandervlist-sympower

πŸ™Œ Thanks for the feedback and help, it was a fun PR to work on!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants