nano-arrow #11179
Hi, given that the name "nanoarrow" is already in use by the Apache Arrow project (https://github.com/apache/arrow-nanoarrow), you might want to rename it before the first release to crates.io.
That release already happened 21 days ago: https://crates.io/crates/nano-arrow. The project you point to is C++. I don't believe there is a Rust counterpart to it. That's what I want nano-arrow to be: a very minimal implementation that only implements the memory spec. I could rename it, but I don't think there is any conflict at the moment.
The name isn't important to us. Let's move the code into
Disclaimer: I just wanted to let you know that "nanoarrow" already exists because I happened to find this PR, and I don't want to use this name myself. I think the name changes here are helpful to make things easier to understand for users. Thank you for your prompt reply.
@ritchie46 Just learned about this whole ordeal by stumbling into this after bumping to 0.34.
There's a subset of users (myself and my team included) who have code written for arrow2 that interoperates with polars-core on the Rust side. For example, if you want to parallelize chunk loading in a particular way, where chunks are organized in some non-automatic, problem-specific fashion, writing your own low-level arrow routines can become critical when dealing with very large amounts of arrow data. We also have Rust projects (e.g. ones writing arrow data) that don't depend on polars at all and simply use the low-level API of arrow2, because of the amount of data being processed in a streaming fashion (hence my recent PRs in arrow2 fixing mutable dicts that were not behaving correctly). We are quite concerned, given that there are no explicit statements on the matter, hence a few questions:
This one. There will be a public API, but it will be limited in scope. For polars we want to adhere to the Arrow memory format, but keep compute in polars. Consumers/producers of a different arrow implementation should still be able to move the data into polars zero copy. Either via
No guarantees, but we want to keep all the builders for the data types we support in polars. The compute and IO may be removed. What do you use mostly?
@ritchie46 Totally forgot to reply to this one:
Here's a sample use case: you have a custom parallelized chunk reader written in arrow2, you end up with a bunch of chunks, and you want to create a polars dataframe out of them. So you can (a) rely on all the low-level tools available for IO in arrow2 and (b) interface with the outside world via polars. This used to be possible, but I believe it no longer is (see my code snippet posted above). I guess, to formalize this question: right now, for most of the low-level arrow2 code, if you simply replace
Yes, you can expect that. Though parquet is moved to
Polars has done great on arrow2, but now that Jorge has stepped back, the benefits of using arrow(2) (with some choices not being ideal for our use case) are much smaller. This will fork and continue the great work of arrow2 and remove almost everything, keeping only the memory specification, IPC, and interop with arrow-rs intact.
All compute and IO will eventually be implemented or integrated within polars. Arrow is such a big dependency of polars, and we are so tightly integrated with it, that we want it in the same repo.
We can share dependencies within the same workspace and keep versions and CI tightly coupled.