Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Schema.to_(arrow|pandas|polars) #1924

Merged
merged 39 commits into from
Feb 8, 2025

Conversation

dangotbanned
Copy link
Member

@dangotbanned dangotbanned commented Feb 3, 2025

Will close #1912

What type of PR is this? (check all applicable)

  • ✨ Feature

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

  • Started with porting nw.functions._from_dict_impl
  • Extended Schema w/ ._version

Will close narwhals-dev#1912

- Starting with porting `nw.functions._from_dict_impl`
- Thinking that `Schema` should have `._version: ClassVar[Version]` to remove the need for user-facing arg (narwhals-dev#1912 (comment))
@dangotbanned
Copy link
Member Author

dangotbanned commented Feb 3, 2025

@MarcoGorelli I've pushed this early to get some thoughts on handling Version.

To avoid needing a user-provided arg, I think we'd need to approach it differently than in here with the public/private function pair:

return _from_dict_impl(
data,
schema,
native_namespace=native_namespace,
version=Version.MAIN,
)
def _from_dict_impl(
data: dict[str, Any],
schema: dict[str, DType] | Schema | None = None,
*,
native_namespace: ModuleType | None = None,
version: Version,
) -> DataFrame[Any]:
from narwhals.series import Series

If you have something like:

class Schema(BaseSchema):
    _version: ClassVar[Version] = Version.MAIN

That could easily be overridden for v1:

class Schema(NwSchema):
    _version = Version.V1

@dangotbanned

This comment was marked as resolved.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 4, 2025

thanks Dan! Yes, having a private _version sounds good to me

Some other comments:

  • I don't think you need backend for to_pandas. I think you just need dtype_backend, whose signature should match the one in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html
  • I'm not sure about to_native, because I might then expect to do:
    df= pd.DataFrame({'a': [1,2,3]})
    nw.from_native(df).schema.to_native()
    and expect to get a pandas schema out the other end. However, that's not what happens, the nw.Schema has no knowledge of the underlying implementation. And I think that's OK. Shall we start with just to_pandas / to_pyarrow / to_polars?

@dangotbanned

This comment was marked as outdated.

@MarcoGorelli
Copy link
Member

yup, thanks!

@dangotbanned
Copy link
Member Author

thanks Dan! Yes, having a private _version sounds good to me

Great, I'll get that added soon

Some other comments:

df= pd.DataFrame({'a': [1,2,3]})
nw.from_native(df).schema.to_native()

and expect to get a pandas schema out the other end.
However, that's not what happens, the nw.Schema has no knowledge of the underlying implementation. And I think that's OK.

I think you've stumbled into an interesting topic here @MarcoGorelli.
So the current signatures deviate from the similar methods for nw.DataFrame - as here backend is required to account for this lack of knowledge.

Mainly this was from adapting nw.from_dict - which (prior to #1931) required a ModuleType and that acted as a guarantee that we had the import.
Now, the main difference between this and from_dict is that backend can't be None - since there isn't some context to do this trick

if backend is None:
for val in data.values():
if isinstance(val, Series):
native_namespace = val.__native_namespace__()
break
else:
msg = "Calling `from_dict` without `backend` is only supported if all input values are already Narwhals Series"
raise TypeError(msg)


Shall we start with just to_pandas / to_pyarrow / to_polars?

Yeah definitely!
Now that I understand whats going on behind the scenes, to_native seems less helpful.

Question

If we drop the backend requirement and to_native(), do we need handling in nw.Implementation for missing imports?

I can keep this local to nw.Schema.to_*.
But I feel like with the push (#1917, #1931) towards backend: ModuleType | Implementation | str - there might be a benefit in some "infallible" paths.
If only to provide consistent errors?

@MarcoGorelli
Copy link
Member

I think in general we need a import_optional_dependency utility function, which can unify error messages for missing imports

but nw.Schema.to_pandas can import pandas (and let it raise if it's not available), nw.Schema.to_polars can import Polars (and let it raise if it's not available), etc...

@dangotbanned
Copy link
Member Author

dangotbanned commented Feb 4, 2025

I think in general we need a import_optional_dependency utility function, which can unify error messages for missing imports

but nw.Schema.to_pandas can import pandas (and let it raise if it's not available), nw.Schema.to_polars can import Polars (and let it raise if it's not available), etc...

https://results.pre-commit.ci/run/github/760058710/1738698879.vjLSyWjmSYiQy0B_ViLJlA

@MarcoGorelli should this fall under the same exception-to-the-rule as the dtypes modules?

Edit

Seems I misunderstood how the check works https://github.com/narwhals-dev/narwhals/blob/71a5bc5d6848da10e658db9d68a4a139f37c099b/utils/import_check.py

Not sure how this could fit in

narwhals/schema.py Outdated Show resolved Hide resolved
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 4, 2025

thanks, looking good!

you can add a # ignore-banned-import where necessary, the check is only there go nudge devs towards double-checking when they perform certain imports that they're really only doing them when strictly needed

@dangotbanned
Copy link
Member Author

dangotbanned commented Feb 4, 2025

#1924 (comment)

Thanks @MarcoGorelli, will give this a try tomorrow

narwhals/typing.py Outdated Show resolved Hide resolved
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dangotbanned

just made some minor edits, and got dtype_backend to match how it currently is in pandas

happy with this?

@dangotbanned
Copy link
Member Author

thanks @dangotbanned

just made some minor edits, and got dtype_backend to match how it currently is in pandas

happy with this?

Thanks @MarcoGorelli, just taking a look now

Comment on lines 208 to 210
if parse_version(pl.__version__) < (1, 0, 0): # pragma: no cover
return dict(schema) # type: ignore[return-value]
return pl.Schema(schema)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli why is the order reversed now?

This is just one small example from typing_extensions:

https://github.com/python/typing_extensions/blob/8184ac61398c187203dad819eb5b9d34005a96ae/src/typing_extensions.py#L547-L552

I find the pattern there is easier to follow:

if parse_version(pl.__version__) >= (1, 0, 0):
    # happy path first
    ...
else:
    # compat path
    ...

Copy link
Member Author

@dangotbanned dangotbanned Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import of cast was blocking me adding it as a suggestion, but this is my suggestion
(3bcebc1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it just became easier to not have type ignore and pragma no cover on the same line (we measure coverage in the ci job with the latest versions - i've found combining coverage jobs to be too unreliable and flaky)

sure, no objections to 3bcebc1


def test_schema_to_pandas_invalid() -> None:
schema = nw.Schema({"a": nw.Int64()})
msg = "Expected one of {None, 'pyarrow', 'numpy_nullable'}, got: 'cabbage'"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

@dangotbanned
Copy link
Member Author

dangotbanned commented Feb 8, 2025

Ready to merge if you're onboard w/ (#1924 (comment)) @MarcoGorelli

Thanks for helping me through this 🎉

Note

Maybe rename the PR to make it more discoverable?
feat: add nw.Schema.to_(arrow|pandas|polars)

@MarcoGorelli
Copy link
Member

sure, looks good, feel free to apply that and ship it

@dangotbanned
Copy link
Member Author

sure, looks good, feel free to apply that and ship it

Ah great, sorry last note in (#1924 (comment)) (if you didn't see it)

@MarcoGorelli
Copy link
Member

changing the title? sure, i think you should have permission to edit

@dangotbanned
Copy link
Member Author

changing the title? sure, i think you should have permission to edit

Oh yeah 🎉

@dangotbanned dangotbanned changed the title feat: add nw.Schema.to_* methods feat: add Schema.to_(arrow|pandas|polars) Feb 8, 2025
@dangotbanned dangotbanned merged commit 365cdbd into narwhals-dev:main Feb 8, 2025
23 of 24 checks passed
@dangotbanned dangotbanned deleted the schema-convert-api branch February 8, 2025 17:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enh]: nw.(DType|Schema) conversion API
2 participants