Basic Arrow.jl-based collect and createDataFrame #115

exyi · 2022-09-10T18:00:51Z

Functions collect_arrow, collect_tuples, and collect_df are provided,
which all use Arrow.jl and Spark's Arrow support to transfer data
from Spark to Julia. collect_arrow returns the raw Arrow.jl table,
collect_df returns the DataFrame from DataFrames.jl, collect_tuples
returns a simple Vector of named tuples.

createDataFrame now has overloads which accept a DataFrame or abstract Table

This version create a temporary file for each transfer, but I actually think it's
preferable in many ways to socket based transfer:

Simpler :)
Arrow.jl will mmap the file, so it can in theory handle sligtly-larger-than-RAM
datasets
or, if you have /tmp in tmpfs (RAM-disk), it will just mmap the chunk of memory,
without additional copying on Julia side

However, if you think sockets would be preferable I can change it (PySpark and SparkR use sockets)

Few things are missing

I included 2 versions for both collectToArrow and fromArrow, since I couldn't yet decide which is better.
- One collectToArrow is basically the same one as I used last year. The other comes from looking at PySpark currently does it.
- Would be nice to benchmark them, I don't currently have meaningful data for this as I left the company when we used Spark
I added DataFrames.jl to the dependencies, but it doesn't seems right to depend on this fairly non-trivial library just for collect_df. Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, otherwise error-out in collect_df

Functions collect_arrow, collect_tuples, and collect_df are provided, which all use Arrow.jl and Spark's Arrow support to transfer data from Spark to Julia. collect_arrow returns the raw Arrow.jl table, collect_df returns the DataFrame from DataFrames.jl, collect_tuples returns a simple Vector of named tuples. createDataFrame now has overloads which accept a DataFrame or abstract Table This version create a temporary file for each transfer, but I actually think it's preferable in many ways to socket based transfer: * Simpler :) * Arrow.jl will mmap the file, so it can in theory handle sligtly-larger-than-RAM datasets * or, if you have /tmp in tmpfs (RAM-disk), it will just mmap the chunk of memory, without additional copying on Julia side This commit still includes 2 versions for both collectToArrow and fromArrow, since I couldn't yet decide which is better

dfdx · 2022-09-10T22:26:30Z

Thanks a lot! It's great to see Arrow getting back to Spark.jl!

This version create a temporary file for each transfer, but I actually think it's
preferable in many ways to socket based transfer:

I'm totally fine with files. If I remember correctly, PySpark uses (used?) both - sockets and files in different places or with different settings. Anyway, I realized that it's not necessarily a good idea to follow PySpark or SparkR design since many decisions in them were made specifically to that languages or due to certain conditions that may not apply to our case. So let's do whatever is the best for Julia.

Would be nice to benchmark them, I don't currently have meaningful data for this as I left the company when we used Spark

Can't we just generate, say, a big Parquet file?

Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, otherwise error-out in collect_df

Requires.jl was designed to do exactly this, though I haven't used it for years now and don't know its status.

exyi added 2 commits September 10, 2022 19:57

Add Base.propertynames to enable autocompletion of chainable objects

171658a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Basic Arrow.jl-based collect and createDataFrame #115

Basic Arrow.jl-based collect and createDataFrame #115

exyi commented Sep 10, 2022

dfdx commented Sep 10, 2022

Basic Arrow.jl-based collect and createDataFrame #115

Are you sure you want to change the base?

Basic Arrow.jl-based collect and createDataFrame #115

Conversation

exyi commented Sep 10, 2022

dfdx commented Sep 10, 2022