Basic Arrow.jl-based collect and createDataFrame #115
Functions `collect_arrow`, `collect_tuples`, and `collect_df` are provided, which all use Arrow.jl and Spark's Arrow support to transfer data from Spark to Julia.
`collect_arrow` returns the raw Arrow.jl table, `collect_df` returns a DataFrame from DataFrames.jl, and `collect_tuples` returns a simple Vector of named tuples.
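To make the three result shapes concrete, here is a minimal sketch that does not involve Spark at all. It assumes Arrow.jl, DataFrames.jl and Tables.jl are installed, and it uses an in-memory Arrow stream as a stand-in for the data Spark would produce; the variable names are illustrative and not taken from this PR.

```julia
using Arrow, DataFrames, Tables

# Stand-in for a result set that Spark would serialize as Arrow (assumption: no Spark here).
source = (id = [1, 2, 3], name = ["a", "b", "c"])

# Serialize to Arrow bytes in memory, then read them back as an Arrow.jl table.
io = IOBuffer()
Arrow.write(io, source)
arrow_table = Arrow.Table(take!(io))    # raw Arrow.jl table, the shape described for collect_arrow

# The same data as a DataFrame, the shape described for collect_df.
df = DataFrame(arrow_table)

# The same data as a Vector of NamedTuples, the shape described for collect_tuples.
rows = Tables.rowtable(arrow_table)
```

All three objects hold the same rows; they differ only in the container a caller ends up working with.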
`createDataFrame` now has overloads which accept a DataFrame or an abstract Table.

This version creates a temporary file for each transfer, but I actually think it's preferable in many ways to socket-based transfer:
datasets
without additional copying on Julia side
However, if you think sockets would be preferable, I can change it (PySpark and SparkR use sockets).
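The file-based approach can be sketched roughly as follows. This is only an illustration under assumptions: `roundtrip_via_tempfile` is a hypothetical helper, and here Julia writes the file itself, whereas in the PR the Arrow data would come from Spark. The relevant point is that `Arrow.Table` is given a file path, which lets Arrow.jl memory-map the file instead of first reading all bytes into a Julia array.

```julia
using Arrow, Tables

# Hypothetical helper illustrating file-based transfer; not the code from this PR.
function roundtrip_via_tempfile(tbl)
    path, io = mktemp()      # temporary file standing in for the one used per transfer
    close(io)
    Arrow.write(path, tbl)   # in the real flow, the JVM side would write this file
    # Opening by path allows Arrow.jl to memory-map the file.
    return Arrow.Table(path), path
end

tbl, path = roundtrip_via_tempfile((x = [1.0, 2.0], y = ["a", "b"]))
cols = Tables.columntable(tbl)   # NamedTuple of columns backed by the Arrow data
```

A socket-based variant would instead read the Arrow stream from an `IO` object, which `Arrow.Table` also accepts, at the cost of buffering the bytes in Julia first.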
A few things are missing.
I included 2 versions for both collectToArrow and fromArrow, since I couldn't yet decide which is better.
I added DataFrames.jl to the dependencies, but it doesn't seem right to depend on this fairly non-trivial library just for `collect_df`. Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed and otherwise error out in `collect_df`.
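One possible way to approach the optional-dependency question is Requires.jl, which runs a block of code only once another package has been loaded. The sketch below is an assumption about how that could look, not what this PR does: the module name, `collect_df_sketch` and `DF_CONVERTER` are hypothetical, and note that Requires.jl reacts to DataFrames being loaded (`using DataFrames`), not merely installed.

```julia
module OptionalDFSketch   # hypothetical module name, for illustration only

using Requires

# Holds the DataFrame constructor once DataFrames.jl has been loaded; `nothing` until then.
const DF_CONVERTER = Ref{Any}(nothing)

# Hypothetical stand-in for collect_df: errors out unless DataFrames.jl is available.
function collect_df_sketch(tbl)
    DF_CONVERTER[] === nothing &&
        error("collect_df requires DataFrames.jl; run `using DataFrames` first")
    return DF_CONVERTER[](tbl)
end

function __init__()
    # The UUID identifies DataFrames.jl; the body runs only when DataFrames gets loaded.
    @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" begin
        DF_CONVERTER[] = DataFrames.DataFrame
    end
end

end # module
```

With this layout DataFrames.jl would not need to be a hard dependency. On Julia 1.9 and later, package extensions (weak dependencies declared in Project.toml) address the same need.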