Arrow PyCapsule Interface support #12530

wjones127 · 2023-11-17T01:07:05Z

Description

In the Arrow project, we recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is to allow exporting / importing Arrow data in Python without necessarily having to use PyArrow as an intermediary. For example, DuckDB can read from Polars DataFrames and LazyFrames, but only if PyArrow is installed. Once this protocol is implemented, it would be possible to accomplish that integration without PyArrow.

This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
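
As a rough illustration of that recognition step, a consumer might probe an object for the dunder methods like this. The three method names come from the Arrow PyCapsule Interface; the helper `arrow_export_methods` and the `FakeStreamProducer` class are hypothetical names made up for this sketch and are not part of Polars or PyArrow.

```python
# Dunder names defined by the Arrow PyCapsule Interface.
ARROW_DUNDERS = ("__arrow_c_stream__", "__arrow_c_array__", "__arrow_c_schema__")


def arrow_export_methods(obj):
    """Return the PyCapsule dunder names that `obj` exposes (hypothetical helper)."""
    return [name for name in ARROW_DUNDERS if callable(getattr(obj, name, None))]


class FakeStreamProducer:
    """Stand-in producer for illustration; a real one would return a PyCapsule."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_export_methods(FakeStreamProducer()))
print(arrow_export_methods(object()))
```

Note that the check only looks at attribute names, so no import of PyArrow (or of the producer's library) is needed to decide whether an object can export Arrow data.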

Polars could implement this in a few ways:

1. Add Arrow PyCapsule dunder methods to Polars objects
   - That would be: DataFrame, Series, DataType
2. Support Arrow PyCapsules in polars.from_arrow
3. Support Arrow PyCapsules in polars.DataFrame constructor
   - You already support pd.DataFrame, so it would make logical sense to support reading rectangular-shaped Arrow data.
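
To make the constructor option more concrete, here is a minimal sketch of how a constructor could dispatch on the presence of `__arrow_c_stream__`. The `Frame` and `StubProducer` classes are hypothetical stand-ins invented for this example; this is not how Polars actually implements it, and the stub returns a plain placeholder object where a real producer would return a PyCapsule.

```python
class StubProducer:
    """Hypothetical producer; a real one would return an Arrow stream PyCapsule."""

    def __arrow_c_stream__(self, requested_schema=None):
        return object()  # placeholder standing in for the capsule


class Frame:
    """Hypothetical DataFrame-like class, used only to show the dispatch."""

    def __init__(self, data):
        if hasattr(data, "__arrow_c_stream__"):
            # Ask the producer for its stream; no PyArrow import is involved.
            self.source = "arrow_c_stream"
            self.payload = data.__arrow_c_stream__()
        elif isinstance(data, dict):
            self.source = "dict"
            self.payload = dict(data)
        else:
            raise TypeError(f"unsupported input type: {type(data).__name__}")


frame = Frame(StubProducer())
print(frame.source)
```

A real implementation would hand the capsule to native code that consumes the stream; the sketch stops at the point where the capsule is obtained.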

I'd be happy to contribute this to the repo, if these ideas sound good.


wjones127 · 2023-11-18T03:01:55Z

Looking through the codebase, it seems there is some basic work that needs to be done to make the Arrow interoperability more generic. Right now the import implementation seems to rely on PyArrow-specific APIs:

polars/py-polars/polars/utils/_construction.py

Lines 1472 to 1555 in e461ffc

    
    def arrow_to_pydf(
        data: pa.Table,
        schema: SchemaDefinition | None = None,
        *,
        schema_overrides: SchemaDict | None = None,
        rechunk: bool = True,
    ) -> PyDataFrame:
        """Construct a PyDataFrame from an Arrow Table."""
        original_schema = schema
        column_names, schema_overrides = _unpack_schema(
            (schema or data.column_names), schema_overrides=schema_overrides
        )
        try:
            if column_names != data.column_names:
                data = data.rename_columns(column_names)
        except pa.lib.ArrowInvalid as e:
            raise ValueError("dimensions of columns arg must match data dimensions") from e

        data_dict = {}
        # dictionaries cannot be built in different batches (categorical does not allow
        # that) so we rechunk them and create them separately.
        dictionary_cols = {}
        # struct columns don't work properly if they contain multiple chunks.
        struct_cols = {}
        names = []
        for i, column in enumerate(data):
            # extract the name before casting
            name = f"column_{i}" if column._name is None else column._name
            names.append(name)

            column = coerce_arrow(column)
            if pa.types.is_dictionary(column.type):
                ps = arrow_to_pyseries(name, column, rechunk=rechunk)
                dictionary_cols[i] = wrap_s(ps)
            elif isinstance(column.type, pa.StructType) and column.num_chunks > 1:
                ps = arrow_to_pyseries(name, column, rechunk=rechunk)
                struct_cols[i] = wrap_s(ps)
            else:
                data_dict[name] = column

        if len(data_dict) > 0:
            tbl = pa.table(data_dict)

            # path for table without rows that keeps datatype
            if tbl.shape[0] == 0:
                pydf = pl.DataFrame(
                    [pl.Series(name, c) for (name, c) in zip(tbl.column_names, tbl.columns)]
                )._df
            else:
                pydf = PyDataFrame.from_arrow_record_batches(tbl.to_batches())
        else:
            pydf = pl.DataFrame([])._df
        if rechunk:
            pydf = pydf.rechunk()

        reset_order = False
        if len(dictionary_cols) > 0:
            df = wrap_df(pydf)
            df = df.with_columns([F.lit(s).alias(s.name) for s in dictionary_cols.values()])
            reset_order = True

        if len(struct_cols) > 0:
            df = wrap_df(pydf)
            df = df.with_columns([F.lit(s).alias(s.name) for s in struct_cols.values()])
            reset_order = True

        if reset_order:
            df = df[names]
            pydf = df._df

        if column_names != original_schema and (schema_overrides or original_schema):
            pydf = _post_apply_columns(
                pydf, original_schema, schema_overrides=schema_overrides
            )
        elif schema_overrides:
            for col, dtype in zip(pydf.columns(), pydf.dtypes()):
                override_dtype = schema_overrides.get(col)
                if override_dtype is not None and dtype != override_dtype:
                    pydf = _post_apply_columns(
                        pydf, original_schema, schema_overrides=schema_overrides
                    )
                    break

        return pydf

ritchie46 · 2024-01-07T20:02:28Z

Sorry for the delay. Somehow I missed this. I think this sounds great. Being agnostic to the Arrow consumer without a hard PyArrow dependency sounds good.

Does your offer still stand on this?

wjones127 · 2024-01-08T01:37:41Z

Yes, I’ve started work on this locally but got distracted. I’ll try to get back to it soon :)

eitsupi · 2024-02-06T13:52:01Z

Related to #14208

paleolimbot · 2024-02-07T13:52:31Z

I'm still working on the Python part, but ChunkedArray import/export to ArrowArrayStream in C++ just merged, which should make this more useful when applied to a Series: apache/arrow#39455 .

eitsupi · 2024-05-07T14:35:57Z

FYI, I tried to implement ArrayStream import functionality in r-polars, but found a considerable speed reduction compared to the previous implementation (copied from py-polars), so I reverted (pola-rs/r-polars#1078 (comment)).

deanm0000 · 2024-06-05T18:29:40Z

I wonder if using the __arrow_c_stream__ method would obviate this #16614

deanm0000 · 2024-06-25T00:58:32Z

@wjones127 curious if this is still something you're working on?

wjones127 · 2024-06-25T01:54:46Z

> curious if this is still something you're working on?

I haven't had time to finish this, no. I may return to this later this year, if someone else hasn't gotten to it.

kylebarron · 2024-07-16T19:46:32Z

I started a PR for data export in #17676

kylebarron · 2024-07-18T04:52:30Z

And a PR for DataFrame import via the C Stream in #17693

kylebarron · 2024-07-31T14:01:25Z

This is mostly resolved by #17676, #17693, and #17935. Potential follow-ups include:

- Implementing the interface for Schema and DataType objects
- Casting exported data according to requested_schema

MarcoGorelli · 2024-10-17T07:49:01Z

As mentioned in the Narwhals PR, and in the original post

> Support Arrow PyCapsules in polars.from_arrow

I think this is still missing in polars.from_arrow, right? I could put up a PR for that later this week (unless anyone has time first, in which case, feel free to take it!)

kylebarron · 2024-10-17T13:36:02Z

Supporting the PyCapsule Interface via a top-level from_arrow isn't strictly possible because you don't know how to handle struct-typed arrays.

A struct Series with two float fields, x and y, is transported via the Arrow C Data/Stream interface exactly the same as a DataFrame/Table with two float columns, x and y. So supporting this in a general from_arrow function isn't strictly possible because you don't know whether the user wants to materialize this data as a Series or DataFrame. That's why I only implemented support for this in DataFrame.__init__ and Series.__init__.
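
The ambiguity can be illustrated with a toy model of what crosses the interface. The sketch below does not use the real C structs; `ToySchema`, `describe_struct_series`, and `describe_dataframe` are hypothetical names invented for this example. It assumes float64 fields and an empty top-level name for both producers, and it borrows the Arrow C Data Interface format strings `+s` (struct) and `g` (float64).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySchema:
    """Toy stand-in for an exported schema: format string, name, children."""

    format: str
    name: str = ""
    children: tuple = ()


def describe_struct_series(field_names):
    """Toy description of a struct Series with float64 fields."""
    return ToySchema("+s", children=tuple(ToySchema("g", n) for n in field_names))


def describe_dataframe(column_names):
    """Toy description of a DataFrame with float64 columns.

    Tabular data is exported as a struct whose children are the columns.
    """
    return ToySchema("+s", children=tuple(ToySchema("g", n) for n in column_names))


series_view = describe_struct_series(["x", "y"])
frame_view = describe_dataframe(["x", "y"])
print(series_view == frame_view)
```

Under these assumptions the two descriptions compare equal, so a consumer that sees only the exported data has nothing to tell it which container the user intended. An explicit constructor call (DataFrame or Series) supplies that missing intent.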

wjones127 added the enhancement New feature or an improvement of an existing feature label Nov 17, 2023

jorisvandenbossche mentioned this issue Dec 13, 2023

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks

WillAyd mentioned this issue Jan 25, 2024

Pyarrow directly innobi/pantab#242

Merged

jorisvandenbossche mentioned this issue Mar 25, 2024

[Python] Add date32 support to __dataframe__ protocol apache/arrow#39539

Open

eitsupi mentioned this issue May 7, 2024

feat: import_stream internal method for Series to support Arrow C stream interface pola-rs/r-polars#1078

Merged

This was referenced Jul 11, 2024

Implement Arrow PyCapsule Interface apache/datafusion-python#752

Closed

Support for Arrow PyCapsule Interface Eventual-Inc/Daft#2504

Open

jonmmease mentioned this issue Jul 13, 2024

Support ingesting objects that support the Arrow PyCapsule API vega/vegafusion#498

Open

kylebarron mentioned this issue Jul 16, 2024

feat(python): Implement Arrow PyCapsule Interface for Series/DataFrame export #17676

Merged

kylebarron mentioned this issue Jul 18, 2024

feat(python): Support PyCapsule Interface in DataFrame & Series constructors #17693

Merged

WillAyd mentioned this issue Sep 27, 2024

Allow sinking and scanning of lazyframes innobi/pantab#346

Closed

MarcoGorelli mentioned this issue Oct 17, 2024

feat: add from_arrow (which uses the PyCapsule Interface) narwhals-dev/narwhals#1181

Merged

10 tasks
