You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categorical aka arrow::DictionaryArray). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path -- in pandas 2.0 the overhead should drop to 0).
The simplest thing would be to fork the codebase into a libarrow_csv shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo, and make that a dependency of the pyarrow library, similar to how we've already build libparquet_arrow inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?
The text was updated successfully, but these errors were encountered:
Hi @wesm, This sounds like a very interesting idea. The paratext reader is still missing a lot of functionality that's available in pandas.read_csv so I imagine it will take some work to flesh out the feature matrix before deprecating read_csv. Full DateTime support and reading arbitrary objects will be a lot of work to get the details right. The chunking features and use_cols in Pandas should be much easier. Calling Python functions in multi-threaded code is deadly so a read_csv feature that takes in a Pure Python function will be problematic. Let's discuss further in person.
@deads at some point in the next 6 months, I would like to use the paratext codebase to emit native Arrow C++ array objects (and native categorical aka
arrow::DictionaryArray
). Eventually we can deprecate the existing CSV reader in pandas and make the paratext+Arrow-powered CSV reader the next-gen CSV reader for pandas (since I've already spent a lot of time optimizing the Arrow->pandas code path -- in pandas 2.0 the overhead should drop to 0).The simplest thing would be to fork the codebase into a
libarrow_csv
shared library that lives in the Arrow codebase, since the code might diverge (and there will be overlapping concerns where code sharing might benefit, like on-the-fly dictionary encoding). Another option is to add a libparatext_arrow library within this repo, and make that a dependency of the pyarrow library, similar to how we've already buildlibparquet_arrow
inside parquet-cpp (https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow). Thoughts?The text was updated successfully, but these errors were encountered: