You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The Arrow project recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is allow exporting / importing Arrow data in Python without having to necessarily use PyArrow as an intermediary.
This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
A growing number of Python-Arrow libraries are aware of the PyCapsule interface, and then would be able to read from fastexcel directly, without needing to go through pyarrow or even have it installed in the environment.
For example, I have a PR open for polars in pola-rs/polars#17693, but you could also pass the fastexcel object directly into constructors from pyarrow, nanoarrow, arro3. I'm advocating for more projects to adopt the PyCapsule interface directly, including duckdb, datafusion, vegafusion, and daft.
In terms of implementation, currently fastexcel uses arrow-rs' default pyarrow integration. Instead you need to define one or more dunder methods, probably on the ExcelSheet. If you always return a RecordBatch, then you could implement __arrow_c_array__, but if you ever wanted to expose a lazy stream, you could implement __arrow_c_stream__, which would export multiple batches of data.
I have a helper library, pyo3-arrow, that you can use to implement this, separate from arrow-rs for a few reasons. Or the relevant code is pretty small and self contained to vendor if you don't want to add an external dependency.
The text was updated successfully, but these errors were encountered:
The Arrow project recently created a new protocol for sharing Arrow data in Python. One of the goals of the protocol is allow exporting / importing Arrow data in Python without having to necessarily use PyArrow as an intermediary.
This allows Arrow-exportable objects to be recognized based on the presence of one of several dunder methods.
A growing number of Python-Arrow libraries are aware of the PyCapsule interface, and then would be able to read from fastexcel directly, without needing to go through pyarrow or even have it installed in the environment.
For example, I have a PR open for polars in pola-rs/polars#17693, but you could also pass the fastexcel object directly into constructors from pyarrow, nanoarrow, arro3. I'm advocating for more projects to adopt the PyCapsule interface directly, including duckdb, datafusion, vegafusion, and daft.
In terms of implementation, currently fastexcel uses arrow-rs' default pyarrow integration. Instead you need to define one or more dunder methods, probably on the
ExcelSheet
. If you always return aRecordBatch
, then you could implement__arrow_c_array__
, but if you ever wanted to expose a lazy stream, you could implement__arrow_c_stream__
, which would export multiple batches of data.I have a helper library, pyo3-arrow, that you can use to implement this, separate from arrow-rs for a few reasons. Or the relevant code is pretty small and self contained to vendor if you don't want to add an external dependency.
The text was updated successfully, but these errors were encountered: