docs(python): Documentation for Arrow PyCapsule interface integration #17935

Merged
merged 3 commits into from
Jul 31, 2024
28 changes: 28 additions & 0 deletions docs/src/python/user-guide/misc/arrow_pycapsule.py
@@ -0,0 +1,28 @@
# --8<-- [start:to_arrow]
import polars as pl
import pyarrow as pa

df = pl.DataFrame({"foo": [1, 2, 3], "bar": ["ham", "spam", "jam"]})
arrow_table = pa.table(df)
print(arrow_table)
# --8<-- [end:to_arrow]

# --8<-- [start:to_polars]
polars_df = pl.DataFrame(arrow_table)
print(polars_df)
# --8<-- [end:to_polars]

# --8<-- [start:to_arrow_series]
arrow_chunked_array = pa.chunked_array(df["foo"])
print(arrow_chunked_array)
# --8<-- [end:to_arrow_series]

# --8<-- [start:to_polars_series]
polars_series = pl.Series(arrow_chunked_array)
print(polars_series)
# --8<-- [end:to_polars_series]

# --8<-- [start:to_arrow_array_rechunk]
arrow_array = pa.array(df["foo"])
print(arrow_array)
# --8<-- [end:to_arrow_array_rechunk]
98 changes: 97 additions & 1 deletion docs/user-guide/misc/arrow.md
@@ -31,10 +31,106 @@ bar: [["ham","spam","jam"]]

Importing from pyarrow can be achieved with `pl.from_arrow`.

## Using the Arrow PyCapsule Interface

As of v1.3, Polars implements the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html), a protocol for sharing Arrow data across Python libraries.

### Exporting data from Polars to pyarrow

To convert a Polars `DataFrame` to a `pyarrow.Table`, use the `pyarrow.table` constructor:

!!! note

This requires pyarrow v15 or higher.

{{code_block('user-guide/misc/arrow_pycapsule','to_arrow',[])}}

```
pyarrow.Table
foo: int64
bar: string_view
----
foo: [[1,2,3]]
bar: [["ham","spam","jam"]]
```

To convert a Polars `Series` to a `pyarrow.ChunkedArray`, use the `pyarrow.chunked_array` constructor.

{{code_block('user-guide/misc/arrow_pycapsule','to_arrow_series',[])}}

```
[
[
1,
2,
3
]
]
```

You can also pass a `Series` to the `pyarrow.array` constructor to create a contiguous array. Note that this will not be zero-copy if the underlying `Series` had multiple chunks.

{{code_block('user-guide/misc/arrow_pycapsule','to_arrow_array_rechunk',[])}}

```
[
1,
2,
3
]
```

### Importing data from pyarrow to Polars

We can pass the pyarrow `Table` back to Polars by using the `polars.DataFrame` constructor:

{{code_block('user-guide/misc/arrow_pycapsule','to_polars',[])}}

```
shape: (3, 2)
┌─────┬──────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ ham │
│ 2 ┆ spam │
│ 3 ┆ jam │
└─────┴──────┘
```

Similarly, we can pass the pyarrow `ChunkedArray` or `Array` back to Polars by using the `polars.Series` constructor:

{{code_block('user-guide/misc/arrow_pycapsule','to_polars_series',[])}}

```
shape: (3,)
Series: '' [i64]
[
1
2
3
]
```

### Usage with other arrow libraries

There's a [growing list](https://github.com/apache/arrow/issues/39195#issuecomment-2245718008) of libraries that support the PyCapsule Interface directly. Polars `Series` and `DataFrame` objects work automatically with every such library.

### For library maintainers

If you're developing a library that you wish to integrate with Polars, it's suggested to implement the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) yourself. This comes with a number of benefits:

- Zero-copy exchange for both Polars `Series` and `DataFrame`.
- No required dependency on pyarrow.
- No direct dependency on Polars.
- Harder to cause memory leaks than handling pointers as raw integers.
- Automatic zero-copy integration with other libraries that support the PyCapsule Interface.
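The "no direct dependency" points follow from the protocol being duck-typed: a consumer only checks for the dunder methods named in the PyCapsule Interface specification. The sketch below shows that check using the standard library alone. The helper `arrow_capsule_kind` and the class `FakeStreamProducer` are hypothetical names invented for this illustration, and the sketch does not import any actual Arrow data.

```python
from typing import Any, Optional


def arrow_capsule_kind(obj: Any) -> Optional[str]:
    """Return which PyCapsule export method `obj` offers, if any."""
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream"
    if hasattr(obj, "__arrow_c_array__"):
        return "array"
    if hasattr(obj, "__arrow_c_schema__"):
        return "schema"
    return None


class FakeStreamProducer:
    """Hypothetical stand-in; a real producer would return a PyCapsule."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


kinds = [arrow_capsule_kind(FakeStreamProducer()), arrow_capsule_kind([1, 2, 3])]
print(kinds)  # ['stream', None]
```

A Polars `Series` or `DataFrame` passed to such a check would be detected through its `__arrow_c_stream__` method, without the consuming library ever importing Polars.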

## Using Polars directly

Polars can also export to and import from the [Arrow C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html)
-directly. This is recommended for library maintainers that want to interop with Polars without requiring a pyarrow installation.
+directly. This is recommended for libraries that don't support the Arrow PyCapsule Interface and want to interop with Polars without requiring a pyarrow installation.

- To export an `ArrowArray` C struct, Polars exposes `Series._export_arrow_to_c`.
- To import an `ArrowArray` C struct, Polars exposes `Series._import_arrow_from_c`.