feat(rust): Scan and read datasets from remote object stores (parquet only) #11256
Conversation
Signed-off-by: Chitral Verma <[email protected]>
@ritchie46 big PR for that old discussion incoming, finally. I'm building a prototype alongside the existing code, starting with the scan-side changes for parquet, and will proceed to the eager side after this. I have 2 questions about this -
@chitralverma I don't exactly follow what this will add differently from the current globbing? I just exposed that to python this week and was thinking of supporting
Yes, the globbing is just an add-on, but this PR is more about refactoring and aggregating all the `file_format/` async stuff in one place. This refactoring will also directly feed into the pluggable/user-defined data sources idea, and it makes some way for partitioned datasets and their pruning across all file formats once things are standardized. Some things that were sequential before (like schema inference) are parallel now. The second major change is the ability to work with more backends, which are provided by opendal, so that's about the versatility. Finally, the changes in this PR are just for parquet, so functionally everything is more or less the same. But if and when this gets merged, this can easily be extended to other formats like csv, avro, ipc, json, etc.
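As a rough illustration of the kind of format abstraction described above, here is a minimal sketch in plain Rust. The trait name `FileFormat` comes from the PR description, but the method names, signatures, and the toy CSV implementation are hypothetical, not the PR's actual code:

```rust
use std::collections::HashMap;
use std::io::Read;

// Hypothetical sketch of a format-agnostic abstraction: one trait that
// knows how to read a file's content and infer its metadata, so that
// per-format logic can be aggregated behind a single interface.
trait FileFormat {
    /// Name of the format ("parquet", "csv", ...).
    fn name(&self) -> &'static str;
    /// Infer a schema (column name -> dtype name) from the file's bytes.
    fn infer_schema(&self, reader: &mut dyn Read) -> std::io::Result<HashMap<String, String>>;
}

struct CsvFormat;

impl FileFormat for CsvFormat {
    fn name(&self) -> &'static str {
        "csv"
    }
    fn infer_schema(&self, reader: &mut dyn Read) -> std::io::Result<HashMap<String, String>> {
        // Naive inference for illustration: read the header row and
        // call every column a string.
        let mut buf = String::new();
        reader.read_to_string(&mut buf)?;
        let header = buf.lines().next().unwrap_or("");
        Ok(header
            .split(',')
            .map(|c| (c.trim().to_string(), "str".to_string()))
            .collect())
    }
}

fn main() -> std::io::Result<()> {
    let fmt = CsvFormat;
    let mut data: &[u8] = b"a,b\n1,2\n";
    let schema = fmt.infer_schema(&mut data)?;
    assert_eq!(fmt.name(), "csv");
    assert_eq!(schema.get("a"), Some(&"str".to_string()));
    Ok(())
}
```

With something like this in place, a scan over any backend only needs a byte stream plus a `FileFormat` implementation, which is what makes extending to csv, avro, ipc, json, etc. mostly a matter of adding implementations.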
I'm sorry to hear about this, but understand if that is the direction you wish to take. Do let me know if there is any functionality that would help sway you back to the object_store fold 😅
FWIW the readers in DataFusion just layer on DataFusion-specific functionality like predicate pushdown, schema adaptation, etc.; the core IO exists in arrow-rs/parquet and would be usable by polars.
I feel something has been lost in translation here. object_store doesn't provide these because object stores themselves don't provide them. A round trip to an object store is on the order of 100-200ms, so even if you have pre-fetching heuristics and a very predictable read pattern (which isn't really true for parquet), you're going to pay a very high price for this API. The approach taken by arrow-rs/DataFusion is instead to learn the lessons from the Hadoop ecosystem (apache/datafusion#2205 (comment)) and start with a vectored IO abstraction from the get-go. That being said, we could possibly add an AsyncRead + AsyncSeek utility to object_store to facilitate integration with the arrow2 IO readers, I'll have a play. Edit: see apache/arrow-rs#4857
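To make the latency argument above concrete, here is a toy sketch of range coalescing, the core trick behind a vectored IO abstraction: given the many small byte ranges a parquet reader wants (column chunks, footer, page indexes), merge any that are close together so that each expensive round trip to the store fetches one larger contiguous chunk. This is illustrative only; arrow-rs and object_store implement a more complete version of the idea.

```rust
/// Merge byte ranges whose gaps are below `max_gap`, so that N small
/// reads collapse into a few larger GET requests. Toy illustration of
/// vectored IO; not code from arrow-rs or object_store.
fn coalesce_ranges(ranges: &[(u64, u64)], max_gap: u64) -> Vec<(u64, u64)> {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| r.0);
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (start, end) in sorted {
        match out.last_mut() {
            // Extend the previous range if the gap is small enough.
            Some(last) if start <= last.1 + max_gap => last.1 = last.1.max(end),
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    // Three column chunks close together become one request at ~100-200ms,
    // instead of three; the distant range stays a separate request.
    let wanted = [(0, 100), (120, 200), (210, 300), (10_000, 10_050)];
    let fetches = coalesce_ranges(&wanted, 64);
    assert_eq!(fetches, vec![(0, 300), (10_000, 10_050)]);
}
```

An `AsyncRead + AsyncSeek` style API cannot do this merging, because it only ever sees one seek-then-read at a time, which is why a vectored interface pays off so heavily against high-latency stores.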
Thanks @tustvold, I'll check it out. @ritchie46 so do these changes still make sense to you, or shall I drop this idea? Because while this PR is just for parquet, a similar pattern will be used for other formats later.
We have this feature available now in Polars, so I'm closing this PR.
[I'll write a better description once this is out of the experiments phase]
Aims to replace the current architecture for scanning and reading all file-based datasets (parquet only in this PR).

Targeting the following changes (also from the discord chats):

- A `FileFormat` abstraction that knows how to read a file's content and its metadata (schema inference etc.)
- `polars-io` works via the above abstractions.
- An opendal `operator` to connect to remote stores.

Notes for review:

- `opendal` is a part of the Apache org, backed by an active community. It provides a lot of backends out of the box and looks like a great fit for polars.
- `object_store` does not have arrow2-compatible async readers (see more here); opendal provides this out of the box.
- `object_store` has support for fewer backends and works better with the `arrow-rs` crate instead. This would involve writing async readers for each format, as done by `datafusion`.
- `object_store` has not been removed, so if this is required later it can still be used.
- Feature gating has not been done yet for the same reason; eventually there could be a `cloud-datasets` feature or something which gates all the async stuff like tokio etc.
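One of the changes mentioned above, schema inference running in parallel instead of sequentially, can be sketched in plain Rust. The helper names are hypothetical and the per-file work is a stand-in pure function (in the real PR it would read parquet footers over the opendal operator):

```rust
use std::thread;

// Stand-in for per-file schema inference; returns (path, column count).
// Hypothetical: the real work would fetch and parse parquet metadata.
fn infer_one(path: &str) -> (String, usize) {
    (path.to_string(), path.len())
}

// Infer schemas for all files concurrently instead of one after another,
// so total wall time is roughly the slowest file, not the sum of all files.
fn infer_all(paths: Vec<String>) -> Vec<(String, usize)> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|p| thread::spawn(move || infer_one(&p)))
        .collect();
    // Joining in spawn order keeps results aligned with the input order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let out = infer_all(vec!["a.parquet".into(), "bb.parquet".into()]);
    assert_eq!(out.len(), 2);
    assert_eq!(out[0].0, "a.parquet");
}
```

Against a remote store where each metadata read costs a 100-200ms round trip, parallelizing this step is what makes scanning many-file datasets tolerable.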