Arrow PyCapsule Interface support #12530

Open · 3 tasks
wjones127 opened this issue Nov 17, 2023 · 14 comments

Labels: enhancement (New feature or an improvement of an existing feature)

Comments

@wjones127 commented Nov 17, 2023

Description

In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is to allow exporting/importing Arrow data in Python without necessarily using PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. Once this protocol is implemented, it would be possible to accomplish that integration without PyArrow.

This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods (see the detection sketch after the list below).

Polars could implement this in a few ways:

  • Add Arrow PyCapsule dunder methods to Polars objects
    • That would be: DataFrame, Series, DataType
  • Support Arrow PyCapsules in polars.from_arrow
  • Support Arrow PyCapsules in polars.DataFrame constructor
    • You already support pd.DataFrame, so it would make logical sense to support reading rectangular-shaped Arrow data.
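
As a rough sketch (the three dunder names come from the PyCapsule Interface spec; everything else here is illustrative, not Polars API), a consumer can detect Arrow-exportable objects like this:

# Minimal detection sketch for the Arrow PyCapsule Interface.
def is_arrow_exportable(obj) -> bool:
    return (
        hasattr(obj, "__arrow_c_stream__")     # stream of batches / chunked data
        or hasattr(obj, "__arrow_c_array__")   # a single array plus its schema
        or hasattr(obj, "__arrow_c_schema__")  # a schema or data type only
    )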

I'd be happy to contribute this to the repo, if these ideas sound good.

@wjones127 added the enhancement label on Nov 17, 2023
@wjones127 (Author)

Looking through the codebase, it seems some basic work is needed to make Arrow interoperability more generic. Right now the import implementation relies on PyArrow-specific APIs:

def arrow_to_pydf(
    data: pa.Table,
    schema: SchemaDefinition | None = None,
    *,
    schema_overrides: SchemaDict | None = None,
    rechunk: bool = True,
) -> PyDataFrame:
    """Construct a PyDataFrame from an Arrow Table."""
    original_schema = schema
    column_names, schema_overrides = _unpack_schema(
        (schema or data.column_names), schema_overrides=schema_overrides
    )
    try:
        if column_names != data.column_names:
            data = data.rename_columns(column_names)
    except pa.lib.ArrowInvalid as e:
        raise ValueError("dimensions of columns arg must match data dimensions") from e

    data_dict = {}
    # dictionaries cannot be built in different batches (categorical does not allow
    # that) so we rechunk them and create them separately.
    dictionary_cols = {}
    # struct columns don't work properly if they contain multiple chunks.
    struct_cols = {}
    names = []
    for i, column in enumerate(data):
        # extract the name before casting
        name = f"column_{i}" if column._name is None else column._name
        names.append(name)
        column = coerce_arrow(column)
        if pa.types.is_dictionary(column.type):
            ps = arrow_to_pyseries(name, column, rechunk=rechunk)
            dictionary_cols[i] = wrap_s(ps)
        elif isinstance(column.type, pa.StructType) and column.num_chunks > 1:
            ps = arrow_to_pyseries(name, column, rechunk=rechunk)
            struct_cols[i] = wrap_s(ps)
        else:
            data_dict[name] = column

    if len(data_dict) > 0:
        tbl = pa.table(data_dict)

        # path for table without rows that keeps datatype
        if tbl.shape[0] == 0:
            pydf = pl.DataFrame(
                [pl.Series(name, c) for (name, c) in zip(tbl.column_names, tbl.columns)]
            )._df
        else:
            pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
    else:
        pydf = pl.DataFrame([])._df

    if rechunk:
        pydf = pydf.rechunk()

    reset_order = False
    if len(dictionary_cols) > 0:
        df = wrap_df(pydf)
        df = df.with_columns([F.lit(s).alias(s.name) for s in dictionary_cols.values()])
        reset_order = True

    if len(struct_cols) > 0:
        df = wrap_df(pydf)
        df = df.with_columns([F.lit(s).alias(s.name) for s in struct_cols.values()])
        reset_order = True

    if reset_order:
        df = df[names]
        pydf = df._df

    if column_names != original_schema and (schema_overrides or original_schema):
        pydf = _post_apply_columns(
            pydf, original_schema, schema_overrides=schema_overrides
        )
    elif schema_overrides:
        for col, dtype in zip(pydf.columns(), pydf.dtypes()):
            override_dtype = schema_overrides.get(col)
            if override_dtype is not None and dtype != override_dtype:
                pydf = _post_apply_columns(
                    pydf, original_schema, schema_overrides=schema_overrides
                )
                break

    return pydf
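
For contrast, a fully generic import path would only touch the protocol itself. A minimal sketch, assuming the producer implements __arrow_c_stream__ (the from_arrow_c_stream binding named here is hypothetical):

# Hypothetical PyArrow-free import path: call the protocol dunder and hand the
# resulting PyCapsule (named "arrow_array_stream", wrapping an ArrowArrayStream
# pointer) straight to native code such as Polars' Rust core.
def generic_import(obj) -> PyDataFrame:
    if not hasattr(obj, "__arrow_c_stream__"):
        raise TypeError("object does not export an Arrow C stream")
    capsule = obj.__arrow_c_stream__()               # requested_schema omitted
    return PyDataFrame.from_arrow_c_stream(capsule)  # hypothetical native binding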

@ritchie46 (Member)

Sorry for the delay; somehow I missed this. I think this sounds great. Being agnostic to the Arrow consumer, without a hard PyArrow dependency, sounds good.

Does your offer still stand on this?

@wjones127 (Author)

Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)

@eitsupi (Contributor) commented Feb 6, 2024

Related to #14208

@paleolimbot

I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: apache/arrow#39455.
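
For example, with a PyArrow build that exposes the Python side of this (an assumption; that part was still in progress at the time of this comment), a multi-chunk ChunkedArray would round-trip through the stream interface without being flattened:

import pyarrow as pa

ca = pa.chunked_array([[1, 2], [3, 4, 5]])  # two chunks
capsule = ca.__arrow_c_stream__()           # each chunk becomes one batch in the
                                            # stream, so a Series consumer can keep
                                            # the original chunking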

@eitsupi (Contributor) commented May 7, 2024

FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (pola-rs/r-polars#1078 (comment)).

@deanm0000 (Collaborator) commented Jun 5, 2024

I wonder if using the __arrow_c_stream__ method would obviate #16614.

@deanm0000 (Collaborator)

@wjones127 curious if this is still something you're working on?

@wjones127 (Author)

curious if this is still something you're working on?

I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.

@kylebarron (Contributor)

I started a PR for data export in #17676
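
Once export is in place, any PyCapsule-aware consumer should be able to ingest a Polars DataFrame without special-casing it. A sketch, assuming a PyArrow version whose pa.table() accepts protocol objects (14+):

import polars as pl
import pyarrow as pa

df = pl.DataFrame({"a": [1, 2, 3]})
tbl = pa.table(df)  # consumes df.__arrow_c_stream__(); no Polars-specific
                    # code needed on the PyArrow side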

@kylebarron (Contributor)

And a PR for DataFrame import via the C Stream in #17693

@kylebarron (Contributor)

This is mostly resolved by #17676, #17693, and #17935. Potential follow ups include:

  • Implementing the interface for Schema and DataType objects
  • Casting exported data according to requested_schema
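
On that second point, requested_schema is the protocol's negotiation hook: the consumer passes a schema capsule and the producer may cast to it on export, or ignore it (which is what makes the cast a follow-up here). An illustrative sketch using PyArrow to build the request:

import polars as pl
import pyarrow as pa

df = pl.DataFrame({"a": ["x", "y"]})
requested = pa.schema({"a": pa.large_string()}).__arrow_c_schema__()  # ArrowSchema capsule
stream = df.__arrow_c_stream__(requested_schema=requested)  # producer may cast "a",
                                                            # or ignore the request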

@MarcoGorelli (Collaborator) commented Oct 17, 2024

As mentioned in the Narwhals PR and in the original post:

Support Arrow PyCapsules in polars.from_arrow

I think this is still missing in polars.from_arrow, right? I could put up a PR for that later this week (unless anyone has time first, in which case, feel free to take it!)

@kylebarron (Contributor) commented Oct 17, 2024

Supporting the PyCapsule Interface via a top-level from_arrow isn't strictly possible, because you can't know how to handle struct-typed arrays. A struct Series with two float fields, x and y, is transported via the Arrow C Data/Stream interface exactly the same way as a DataFrame/Table with two float columns, x and y, so a general from_arrow function can't tell whether the user wants to materialize the data as a Series or a DataFrame. That's why I only implemented support for this in DataFrame.__init__ and Series.__init__.
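
To make the ambiguity concrete, a sketch with PyArrow (assuming a version where ChunkedArray implements __arrow_c_stream__):

import pyarrow as pa

struct_ca = pa.chunked_array([pa.array([{"x": 1.0, "y": 2.0}])])  # struct<x: double, y: double>
tbl = pa.table({"x": [1.0], "y": [2.0]})                          # two double columns

# Both export a stream whose top-level schema is a struct with fields x and y,
# so a generic from_arrow cannot tell whether the caller wants a struct Series
# or a two-column DataFrame.
struct_capsule = struct_ca.__arrow_c_stream__()
table_capsule = tbl.__arrow_c_stream__()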
