nano-arrow #11179

ritchie46 · 2023-09-18T12:54:47Z

Polars has done well on arrow2, but now that Jorge has stepped back, the benefits of using arrow(2) (some of whose choices are not ideal for our use case) are much smaller. This will fork/continue the great work of arrow2 and remove almost everything, keeping only the memory specification, IPC, and interop with arrow-rs intact.

All compute and IO will eventually be implemented/integrated within polars. Arrow is such a big dependency of polars, and we are so tightly integrated with it, that we want it in the same repo.

We can share dependencies within the same workspace and keep versions and CI tightly coupled.

eitsupi · 2023-10-09T05:04:48Z

Hi, given that the name "nanoarrow" is already in use by Apache Arrow project https://github.com/apache/arrow-nanoarrow, you might want to rename it before the first release to crates.io.

ritchie46 · 2023-10-09T10:15:16Z

That release already happened 21 days ago: https://crates.io/crates/nano-arrow.

The project you point to is C++. I don't believe there is a Rust counterpart of that. That's what I want for nano-arrow: a very minimal implementation that only implements the memory spec. I could rename it, but I don't think there is any conflict at the moment.
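For readers unfamiliar with what "only the memory spec" covers, the following is a minimal, std-only sketch of how the Arrow columnar format lays out a nullable primitive array: one contiguous values buffer plus a validity bitmap with one bit per slot (least-significant bit first, 1 meaning valid). The type `Int32Array` and its methods are hypothetical names chosen for illustration; this is not the nano-arrow or polars-arrow API.

```rust
/// Hypothetical illustration of an Arrow-style nullable primitive array.
/// Not the nano-arrow / polars-arrow API.
pub struct Int32Array {
    /// Contiguous values buffer; null slots still occupy one element.
    pub values: Vec<i32>,
    /// Validity bitmap: bit `i % 8` of byte `i / 8` is 1 when slot `i` is valid.
    /// `None` means the array has no nulls, so no bitmap is stored.
    pub validity: Option<Vec<u8>>,
}

impl Int32Array {
    /// Build the two buffers from a slice of optional values.
    pub fn from_options(items: &[Option<i32>]) -> Self {
        let mut values = Vec::with_capacity(items.len());
        let mut bitmap = vec![0u8; (items.len() + 7) / 8];
        let mut has_null = false;
        for (i, item) in items.iter().enumerate() {
            match item {
                Some(v) => {
                    values.push(*v);
                    // Set the validity bit for this slot (LSB-first numbering).
                    bitmap[i / 8] |= 1u8 << (i % 8);
                }
                None => {
                    // Placeholder value; the cleared validity bit marks it as null.
                    values.push(0);
                    has_null = true;
                }
            }
        }
        let validity = if has_null { Some(bitmap) } else { None };
        Self { values, validity }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    pub fn is_valid(&self, i: usize) -> bool {
        match &self.validity {
            None => true,
            Some(bits) => bits[i / 8] & (1u8 << (i % 8)) != 0,
        }
    }

    pub fn get(&self, i: usize) -> Option<i32> {
        if self.is_valid(i) {
            Some(self.values[i])
        } else {
            None
        }
    }

    pub fn null_count(&self) -> usize {
        (0..self.len()).filter(|&i| !self.is_valid(i)).count()
    }
}
```

A crate limited to the memory spec would stop roughly here (buffers, validity, data types), leaving kernels such as casts, comparisons, and file readers to the consumer.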

ritchie46 · 2023-10-09T10:44:29Z

The name isn't important to us. Let's move the code into polars-arrow. The name nano-arrow can be used by apache. If they want to have the crates.io handle, please ping me. :)

eitsupi · 2023-10-09T10:52:01Z

Disclaimer: I just wanted to let you know that "nanoarrow" already exists because I happened to find this PR, and I don't want to use this name myself.
I just thought the similar names were confusing. (For example, the R polars package has the nanoarrow R package as an optional dependency and is used for data conversion with the arrow package. (pola-rs/r-polars#5))

I think the name changes here are helpful to make things easier to understand for users. Thank you for your prompt reply.

aldanor · 2023-11-11T16:20:07Z

@ritchie46 Just learned about this whole ordeal by stumbling into this after bumping to 0.34:

error[E0277]: the trait bound `polars_core::frame::DataFrame: std::convert::From<(arrow2::chunk::Chunk<std::boxed::Box<dyn arrow2::array::Array>>, &[arrow2::datatypes::Field])>` is not satisfied
    |
    |         Ok(DataFrame::try_from((chunk, fields.as_slice()))?)
    |            ------------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `std::convert::From<(arrow2::chunk::Chunk<std::boxed::Box<dyn arrow2::array::Array>>, &[arrow2::datatypes::Field])>` is not implemented for `polars_core::frame::DataFrame`
    |            |

There's a subset of users (myself and my team included) who have code written for arrow2 that interops with polars-core on the Rust side (e.g. if you want to parallelize chunk loading in a particular way, where chunks are organized in some non-automatic, problem-specific fashion). Writing your own low-level arrow routines may become somewhat critical when dealing with very large amounts of arrow data. We also have Rust projects (e.g. ones writing arrow data) that don't depend on polars at all and simply use the low-level API of arrow2, because of the amount of data being processed in a streaming fashion (hence my recent PRs in arrow2 fixing mutable dicts that were not behaving correctly).

We are very concerned about this, given that there are no explicit statements about it, hence a few questions:

Does polars-arrow become a purely 'internal' crate for polars/polars-core, which implies the polars team could then throw out unneeded parts, reshuffle the code as they see fit, so it aligns better with polars goals?
Or, does polars-arrow become an arrow2 replacement even for projects not using polars on the Rust side? If it will be better maintained than arrow2, this is a legitimate and pretty important question. If true, then the arrow2 repo should definitely be marked as deprecated with a link to polars-arrow.
If the latter is the case, is it worth moving polars-arrow out into a separate crate, so as not to mix commits and issues with the massive amount of polars/py-polars commits/issues?
Are there any guarantees that the arrow2/polars-arrow code will stay (somewhat) intact? (i.e. can we s/arrow2/polars_arrow/g and hope that this code will survive for non-zero amount of time?)

ritchie46 · 2023-11-15T12:33:38Z

Does polars-arrow become a purely 'internal' crate for polars/polars-core, which implies the polars team could then throw out unneeded parts, reshuffle the code as they see fit, so it aligns better with polars goals?

This one. There will be a public API, but it will be limited in goal. For polars we want to adhere to arrow memory, but have compute in polars. Consumers/producers of a different arrow implementation should still be able to move the data into polars zero copy. Either via polars-arrow or via the arrow C-ffi.
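To make the "zero copy via the arrow C-ffi" route concrete, here is a std-only sketch of exporting an `i64` buffer through the `ArrowArray` struct defined by the Arrow C data interface. The struct fields mirror that specification; everything else (`Owner`, `export_i64`, `release_i64`) is a hypothetical helper written for this illustration and is not the polars-arrow API. The sketch assumes a non-nullable primitive array, so the validity buffer pointer is left null.

```rust
use std::ffi::c_void;
use std::ptr;

/// Field-for-field mirror of `struct ArrowArray` from the Arrow C data interface.
#[repr(C)]
pub struct ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut ArrowArray,
    pub dictionary: *mut ArrowArray,
    pub release: Option<unsafe extern "C" fn(*mut ArrowArray)>,
    pub private_data: *mut c_void,
}

/// Hypothetical owner kept alive behind `private_data` until `release` runs.
struct Owner {
    values: Vec<i64>,
    /// buffers[0] = validity bitmap (null here), buffers[1] = values.
    buffers: [*const c_void; 2],
}

/// Release callback: frees the owner and marks the struct as released.
unsafe extern "C" fn release_i64(array: *mut ArrowArray) {
    if array.is_null() {
        return;
    }
    unsafe {
        let array = &mut *array;
        if array.release.is_none() {
            return; // already released
        }
        drop(Box::from_raw(array.private_data as *mut Owner));
        array.release = None;
    }
}

/// Export a Vec<i64> without copying its contents (hypothetical helper).
pub fn export_i64(values: Vec<i64>) -> ArrowArray {
    let raw = Box::into_raw(Box::new(Owner {
        values,
        buffers: [ptr::null(); 2],
    }));
    // SAFETY: `raw` comes from a freshly leaked Box and is uniquely owned here.
    let owner = unsafe { &mut *raw };
    // Moving the Vec into the Box did not move its heap allocation,
    // so this pointer still addresses the caller's original data.
    owner.buffers[1] = owner.values.as_ptr() as *const c_void;
    ArrowArray {
        length: owner.values.len() as i64,
        null_count: 0,
        offset: 0,
        n_buffers: 2,
        n_children: 0,
        buffers: owner.buffers.as_mut_ptr(),
        children: ptr::null_mut(),
        dictionary: ptr::null_mut(),
        release: Some(release_i64),
        private_data: raw as *mut c_void,
    }
}
```

A consumer on the other side of the boundary reads `buffers[1]` directly and calls `release` when it is done. A complete exchange would also pass an `ArrowSchema` describing the data type, which this sketch omits.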

Are there any guarantees that the arrow2/polars-arrow code will stay (somewhat) intact? (i.e. can we s/arrow2/polars_arrow/g and hope that this code will survive for non-zero amount of time?)

No guarantees, but we want to keep all the builders for the data types we support in polars. The compute and IO may be removed. What do you use mostly?

aldanor · 2024-01-03T13:32:22Z

@ritchie46 Totally forgot to reply to this one:

we want to keep all the builders for the data types we support in polars. The compute and IO may be removed. What do you use mostly?

Here's a sample use case - you have a custom parallelized chunk reader written in arrow2, you end up with a bunch of chunks and you want to create a polars dataframe out of them. So you can (a) rely on all the low-level tools available for io in arrow2 but (b) interface with the outer world via polars. This used to be possible but I believe now it's not (see my code snippet posted above).

I guess, to formalize this question: right now, for most of the low-level arrow2 code, if you simply replace s/arrow2/polars_arrow/g, the code will still work (i.e. array::*, datatypes::* and io::* are mostly untouched). Can one expect that at least array/datatypes/io functionality will remain more-or-less intact in polars-arrow?

ritchie46 · 2024-01-06T11:05:57Z

Yes, you can expect that, though parquet has moved to the polars-parquet crate. Its usage is the same. I think it is actually better, as we have already seen that we maintain and develop our version of arrow much more actively since the fork.

ritchie46 requested a review from orlp as a code owner September 18, 2023 12:54

ritchie46 added 5 commits September 19, 2023 08:07

init

02c8573

exclude csv

0f4b4dc

prune

9b1fdb0

prune

fedd4a8

clippy

50ba7b5

ritchie46 force-pushed the nano-arrow branch from c5c697c to 50ba7b5 Compare September 19, 2023 06:26

ritchie46 requested review from stinodego and alexander-beedie as code owners September 19, 2023 06:26

ritchie46 added 7 commits September 19, 2023 08:49

move json-write to polars-json

12df3b8

typos

5aee649

prune substring

d920bd3

prune like

0087286

prune regex_match

5d265e1

prune small types

cff11f1

lockfile

ce43f4e

ritchie46 requested a review from universalmind303 as a code owner September 19, 2023 08:02

lint

5c18f9e

ritchie46 changed the title ~~WIP: nano-arrow~~ nano-arrow Sep 19, 2023

ritchie46 merged commit 122b2ed into main Sep 19, 2023
18 checks passed

ritchie46 deleted the nano-arrow branch September 19, 2023 08:41

ion-elgreco mentioned this pull request Sep 20, 2023

feat(rust): utf8 to temporal casting #10517

Closed

cmdlineluser mentioned this pull request Oct 2, 2023

Implementation of .str.normalize Method for String (Unicode) Normalization #11455

Open

kylebarron mentioned this pull request Oct 5, 2023

Switch to implementation based on arrow-rs geoarrow/geoarrow-rs#208

Closed

mnpw mentioned this pull request Oct 9, 2023

Add notice on mainenance bandwidth to readme jorgecarleitao/arrow2#1476

Closed

cmdlineluser mentioned this pull request Nov 24, 2023

JSON Column With all Null values is dropped #7858

Closed

2 tasks

tustvold mentioned this pull request Jan 16, 2024

[Proposal] Combination of arrow-rs and arrow2, deprecation of arrow2 repository jorgecarleitao/arrow2#1429

Open
