[Enh]: Construct DataFrame from Arrow PyCapsule object #1158

Closed
jonmmease opened this issue Oct 9, 2024 · 9 comments · Fixed by #1181
Labels: enhancement (New feature or request), high priority

Comments

@jonmmease

jonmmease commented Oct 9, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

This request is aimed at using Narwhals to remove the pandas/pyarrow dependencies from VegaFusion 2.0.

Please describe the purpose of the new feature or describe the problem to solve.

The flow I'm aiming for with VegaFusion 2.0 is that I'd like to use Narwhals for basic column projection and schema inspection and then use the Arrow PyCapsule API to pass the result to Rust. Then in some cases, the Rust logic will return a new Arrow result in PyCapsule form, and it would be great to be able to use Narwhals to wrap this result using the same backend as the input.

Suggest a solution if possible.

I was picturing perhaps a constructor method in the same family as from_dict, accepting an arrow PyCapsule object.

nw.from_arrow_capsule(cap, native_namespace=nw.get_native_namespace(input_df))

cc @kylebarron for all things Arrow PyCapsule 😄

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

@kylebarron

I think there can be a limited use case for passing around raw capsules, but the more general API is to export an object from your Rust code with an __arrow_c_stream__ dunder method, which could then be imported into Narwhals using its existing PyCapsule Interface support. This also reduces the user's reliance on Narwhals, and lets them use any Arrow-compatible library of their choosing.

In my own libraries, when I control both sides of the connection, I sometimes do have a from_arrow_capsule method. This can be useful when I want to ensure the user only has one version of arro3.core in their environment, and when I'm using arro3 as the transport to the user's library of choice.

@jonmmease
Author

> which then could be imported into narwhals using its existing PyCapsule Interface support

Is import already possible in Narwhals? I was under the impression that it was currently only supported on export.

A from_arrow method like you have in arro3 (with an additional namespace argument) would work. But since Narwhals supports wrapping pyarrow already, it seemed like it could be confusing for end users. But maybe this would be a powerful way to convert between libraries.

@kylebarron

Oh maybe it's only for export? I'm not up to date.

@MarcoGorelli
Member

This is definitely in scope, thanks for the request! I'll try to put something together soon-ish and we can figure out the details.

@MarcoGorelli
Member

MarcoGorelli commented Oct 16, 2024

I've given this a go in #1181. Does it look alright, and is it what you were looking for?

For libraries which don't (yet?) support the PyCapsule interface for import, I first import the data with PyArrow (if installed and at least version 14) and then convert from there. The currently supported libraries all have a way of going directly from PyArrow tables.

@jonmmease
Author

Awesome, this looks great. Thanks so much for the quick implementation!

@kylebarron

kylebarron commented Oct 16, 2024

@jonmmease are you looking for an API that takes a capsule or an API that takes a Python object that exports a capsule object? Your first example seemed to be the former, but #1181 implements the latter.

I think the latter is more useful for more end users, while the former could be useful from native code specifically. In arro3 I have both from_arrow and from_arrow_pycapsule.

@jonmmease
Author

Either is fine on my end. In practice I think I'll be returning an arro3.core.Table from Rust to Python, so I'd defer to you and @MarcoGorelli on the best API for building a Narwhals DataFrame from an arro3.core.Table instance.

@kylebarron

In that case just the from_arrow constructor would be fine

3 participants