Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow c data interface #79

Merged
merged 18 commits into from
Jun 14, 2024

Conversation

Alex-PLACET
Copy link
Collaborator

@Alex-PLACET Alex-PLACET commented Apr 18, 2024

Fix #69

@Alex-PLACET Alex-PLACET marked this pull request as ready for review April 19, 2024 14:49
@Alex-PLACET Alex-PLACET force-pushed the arrow_c_data_interface branch 2 times, most recently from c17a20d to 65eacf4 Compare April 29, 2024 07:41
include/sparrow/c_interface.hpp Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Show resolved Hide resolved
test/test_c_data_interface.cpp Show resolved Hide resolved
test/test_c_data_interface.cpp Show resolved Hide resolved
Copy link

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! Just some notes from writing a few things along these lines in nanoarrow 🙂

include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
*/
template <template <typename> class Allocator>
requires sparrow::allocator<Allocator<char>>
std::unique_ptr<ArrowSchema> make_arrow_schema(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would highly recommend avoiding std::unique_ptr<ArrowSchema/ArrowArray>, unless used with a deleter that ensures that any unreleased array that it is pointing to is released when the unique pointer is deleted. In cudf they use a unique pointer with a deleter:

https://github.com/rapidsai/cudf/blob/e58838b6cc820fc89f1f67eb9117a3ee6ddeaa47/cpp/src/interop/to_arrow_device.cu#L601-L604

...in nanoarrow we have a UniqueArray/Schema class that wraps a stack-allocated ArrowArray/Schema:

https://github.com/apache/arrow-nanoarrow/blob/a0632177c27cb52880c62533fcd441613f41815b/src/nanoarrow/nanoarrow.hpp#L265-L272

..but the idea is the same: always make a home for the struct such that it is guaranteed to be cleaned up, and "move" it between safe homes (e.g., using something like https://github.com/apache/arrow-nanoarrow/blob/a0632177c27cb52880c62533fcd441613f41815b/src/nanoarrow/nanoarrow_types.h#L334-L340 ).

(Apologies if there's some C++ magic going on here that already ensures this)

Copy link
Collaborator Author

@Alex-PLACET Alex-PLACET Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in replying.
Thank you for your comments. The PR was improved and use a deleter 👍

test/test_c_data_interface.cpp Outdated Show resolved Hide resolved
child->release(child);
SPARROW_ASSERT_TRUE(child->release == nullptr);
}
delete child;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the implementations of an ArrowArray/ArrowSchema I've read choose to own their own memory for the ArrowArray/ArrowSchema structs. In this case, it might mean an std::move() of the std::vector<std::unique_ptr<ArrowArray>> into the private data and rely on the C++ deleter to call the release callbacks (see my comment below about always making sure that std::unique_ptr<ArrowThing> is guaranteed to call the release callback if present).

Alternatively, one could use new ArrowArray[] and something like https://github.com/apache/arrow-nanoarrow/blob/a0632177c27cb52880c62533fcd441613f41815b/src/nanoarrow/nanoarrow_types.h#L334-L340 to "move" the struct from foreign memory.

@Alex-PLACET Alex-PLACET marked this pull request as draft May 6, 2024 06:50
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
@Alex-PLACET Alex-PLACET marked this pull request as ready for review May 17, 2024 11:35
@Alex-PLACET Alex-PLACET self-assigned this Jun 7, 2024
Copy link
Collaborator

@JohanMabille JohanMabille left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to split the declaration and implementation of public API functions? Ideally all the public API functions should be at the top of the file, so one can quicky check the API without scroling the whole file.

include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
include/sparrow/c_interface.hpp Outdated Show resolved Hide resolved
Comment on lines +30 to +31

extern "C"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it may still be needed to include:

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE

...to prevent a duplicate definition in the case where somebody does:

#include <adbc.h>
#include <sparrow/c_interface.hpp>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, I fixed that

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have noticed this on the first round, but if one includes adbc.h first, the helpful initializers you've added to the definition won't be there (I am not sure if this affects the code you've written here but might have unexpected consequences for users of this header).

int64_t null_count,
int64_t offset,
const R& buffer_sizes,
std::vector<std::unique_ptr<ArrowArray>>&& children,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be std::vector<std::unique_ptr<ArrowArray, arrow_array_custom_deleter>>? (It may help reduce the number of lines of code where an exception or early return results in leaked ArrowArrays)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Fixed

std::string_view name,
std::optional<std::span<char>> metadata,
std::optional<ArrowFlag> flags,
std::vector<std::unique_ptr<ArrowSchema>>&& children,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be std::vector<std::unique_ptr<ArrowSchema, arrow_schema_custom_deleter>>? (It may help reduce the number of lines of code where an exception or early return results in leaked ArrowSchemas)

Copy link

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@JohanMabille JohanMabille merged commit 1a12b1a into man-group:main Jun 14, 2024
31 checks passed
@Alex-PLACET Alex-PLACET deleted the arrow_c_data_interface branch June 14, 2024 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement the Arrow C data interface
4 participants