Support ingesting objects that support the Arrow PyCapsule API #498

Open
jonmmease opened this issue Jul 13, 2024 · 6 comments

Comments

@jonmmease
Collaborator

We could support ingesting objects that implement the Arrow PyCapsule API.

Compared to the current support for the DataFrame Interchange Protocol, accepting objects that implement the Arrow PyCapsule API wouldn't require pyarrow, and wouldn't require converting to a pyarrow Table on the Python side.
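
For context, the PyCapsule interface is a set of dunder methods on the producer object, so a consumer only needs to check whether they exist. Here is a minimal Python sketch of that check. The method name `__arrow_c_stream__` and its optional `requested_schema` argument come from the Arrow PyCapsule Interface specification; the `ArrowStreamExportable` protocol name, the `ToyTable` class, and its placeholder return value are illustrative assumptions, not VegaFusion code.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    """Structural type for objects that can export an Arrow C stream."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        ...


class ToyTable:
    """Hypothetical producer used only to exercise the check."""

    def __arrow_c_stream__(self, requested_schema: Optional[object] = None) -> object:
        # Placeholder: a real producer returns a PyCapsule named
        # "arrow_array_stream" that wraps an ArrowArrayStream struct.
        return object()


def supports_arrow_stream(obj: object) -> bool:
    # isinstance() on a runtime_checkable Protocol only verifies that the
    # method is present; it does not validate the returned capsule.
    return isinstance(obj, ArrowStreamExportable)
```

Since the check is purely structural, it works for any producer (pyarrow, polars once it exports capsules, and so on) without importing that library.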

I think we could use @kylebarron's new pyo3-arrow crate for this (since it doesn't require the pyarrow dependency). In fact, I think we could drop pyarrow as a hard dependency using this approach, since pyarrow itself supports the PyCapsule API.


cc @MarcoGorelli based on comment in vega/altair#3452 (comment)

> I've caught up with Polars devs, and they're on board with using Altair in polars.DataFrame.plot if the plots can be done directly without going via pandas

VegaFusion powers Vega-Altair's optional "vegafusion" data transformer, which lets operations like histogram binning and aggregation run in the Python kernel rather than in the browser. For VegaFusion to support polars without pyarrow, I think we'll need polars to support the PyCapsule API, as discussed in pola-rs/polars#12530.

@kylebarron

> I think we could drop pyarrow as a hard dependency using this approach, since pyarrow itself supports the PyCapsule API.

Yes, though it does require the user to have a relatively recent version of pyarrow.

Let me know if I can help with pyo3-arrow at all! I've only published a version supporting arrow version 52, so you'll have to upgrade before you can use pyo3-arrow. I figure you only care about importing data, not exporting data?

@jonmmease
Collaborator Author

Thanks for chiming in @kylebarron

> I've only published a version supporting arrow version 52

Yeah, I need to update DataFusion and Arrow soon anyway.

> I figure you only care about importing data, not exporting data?

That's correct.

> Yes, though it does require the user to have a relatively recent version of pyarrow.

Thanks for the call out, that's a good point.

@kylebarron

I started a PR for polars pycapsule export here: pola-rs/polars#17676

@kylebarron

If you pointed me to where the arrow ingest happens, I could probably make a PR for this if you'd like

@jonmmease
Collaborator Author

Thanks for the offer!

Here is where the pyarrow tables are imported:

#[cfg(feature = "pyarrow")]
pub fn from_pyarrow(py: Python, pyarrow_table: &PyAny) -> std::result::Result<Self, PyErr> {
    // Extract table.schema as a Rust Schema
    let getattr_args = PyTuple::new(py, vec!["schema"]);
    let schema_object = pyarrow_table.call_method1("__getattribute__", getattr_args)?;
    let schema = Schema::from_pyarrow(schema_object)?;

    // Extract table.to_batches() as a Rust Vec<RecordBatch>
    let batches_object = pyarrow_table.call_method0("to_batches")?;
    let batches_list = batches_object.downcast::<PyList>()?;
    let batches = batches_list
        .iter()
        .map(|batch_any| Ok(RecordBatch::from_pyarrow(batch_any)?))
        .collect::<Result<Vec<RecordBatch>>>()?;

    Ok(VegaFusionTable::try_new(Arc::new(schema), batches)?)
}

This is invoked from the PyO3 Rust code in:

} else {
    // Assume PyArrow Table
    // We convert to ipc bytes for two reasons:
    // - It allows VegaFusionDataset to compute an accurate hash of the table
    // - It works around https://github.com/hex-inc/vegafusion/issues/268
    let table = VegaFusionTable::from_pyarrow(py, inline_dataset)?;
    VegaFusionDataset::from_table_ipc_bytes(&table.to_ipc_bytes()?)?
};

I'm imagining there would be a VegaFusionDataset.from_arrow_pycapsule or something similar, plus another branch in process_inline_datasets that checks for the PyCapsule interface and uses this API. I'm definitely happy for you to do the whole PR, but even if you only implement this from_arrow_pycapsule method, that would be really helpful, and I can do the rest of the routing later.
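
The routing described above could look roughly like the following, sketched in Python for brevity (the real branch would live in the PyO3 Rust code). This is only an illustration under assumptions: `choose_ingest_path`, the returned path names, the toy classes, and the order of the checks are all hypothetical. Only the `__arrow_c_stream__` and `__dataframe__` attribute names come from the Arrow PyCapsule Interface and the DataFrame Interchange Protocol respectively.

```python
def choose_ingest_path(dataset: object) -> str:
    """Return the (hypothetical) name of the ingest path a dataset would take."""
    if hasattr(dataset, "__arrow_c_stream__"):
        # Arrow PyCapsule interface: no pyarrow needed on the Python side.
        return "arrow_pycapsule"
    if hasattr(dataset, "__dataframe__"):
        # DataFrame Interchange Protocol: currently converted through pyarrow.
        return "dataframe_interchange"
    # Fallback mirrors the existing "Assume PyArrow Table" branch.
    return "pyarrow_table"


class CapsuleOnly:
    """Toy object exposing only the PyCapsule stream method."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("placeholder for a real capsule export")


class InterchangeOnly:
    """Toy object exposing only the interchange protocol method."""

    def __dataframe__(self, allow_copy=True):
        raise NotImplementedError("placeholder for a real interchange object")


paths = [choose_ingest_path(d) for d in (CapsuleOnly(), InterchangeOnly(), object())]
```

Checking for the capsule method first means an object that implements both protocols would take the path that avoids pyarrow, which seems to be the point of this issue.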

If this is blocked by updating arrow-rs, I can ping this thread once that's done.

@kylebarron

> If this is blocked by updating arrow-rs, I can ping this thread once that's done

I think that's primarily a question of whether you're ok vendoring the relevant PyCapsule code (on top of arrow-rs' FFI code). It's a relatively small amount of code (polars pr for reference), and then you don't have to add a dependency on pyo3-arrow if you don't want.
