source pyspark-csv from a data frame or a parquet table #8

Open

Tagar opened this issue Feb 13, 2016 · 4 comments

Tagar commented Feb 13, 2016

We have a lot of use cases for pyspark-csv where the source data lives in parquet files.
Would it be possible for pyspark-csv to accept a parquet table or an arbitrary data frame as input?


Tagar commented Feb 13, 2016

We have source parquet tables with hundreds of string columns (some with up to 800), and it would be great to use something like pyspark-csv to convert them to a dataframe (or a parquet table) with correct data types.


Tagar commented Feb 16, 2016

@Siddhant

Siddhant commented

I'm not sure I understand your problem.
You should be able to read parquet data directly into Spark, no?
What's the point of CSV here?
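
For reference, a minimal sketch of what reading parquet directly looks like, assuming an existing SparkSession named `spark` and a hypothetical path:

```python
# Parquet files carry their own schema, so they load straight into a typed
# DataFrame with no CSV round-trip. `spark` is an existing SparkSession;
# the path below is hypothetical.
df = spark.read.parquet("/data/source_table.parquet")
df.printSchema()  # shows the column types stored in the parquet metadata
```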


Tagar commented Mar 20, 2017

Let's say we have a dataframe where all columns are strings (because of how we got that data). We'd like to run this dataframe through the same datatype inference code to create another dataframe with actual data types.
So my point is that the type inference logic should be separated/abstracted from the CSV parsing logic, so we can also run it on a dataframe of all-string columns and come up with the real data types that should be used, based on the actual data.
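
To illustrate the request, here is a minimal sketch of such a separated inference step, written against the public PySpark API rather than pyspark-csv's internals. The helper name `infer_types`, the candidate type list, and the sample size are all illustrative assumptions:

```python
# A sketch only: infer a type per all-string column by test-casting a sample,
# then cast the full dataframe. Not pyspark-csv's actual implementation.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

# Candidates are tried narrowest-first. Exact string-cast semantics
# (e.g. whether "2.5" casts to a bigint) vary slightly across Spark versions.
CANDIDATES = ["bigint", "double", "timestamp"]

def infer_types(df: DataFrame, sample_rows: int = 10000) -> DataFrame:
    sample = df.limit(sample_rows).cache()
    # One aggregation pass: for every (column, candidate) pair, count the
    # non-null values that the cast would turn into NULL (i.e. cast failures).
    checks = [
        F.sum(
            F.when(F.col(c).isNotNull() & F.col(c).cast(t).isNull(), 1)
             .otherwise(0)
        ).alias(f"{c}::{t}")
        for c in df.columns
        for t in CANDIDATES
    ]
    failures = sample.agg(*checks).first().asDict()
    sample.unpersist()
    # Cast each column to the first candidate that fits every sampled value;
    # columns that fit no candidate stay as strings.
    for c in df.columns:
        for t in CANDIDATES:
            if failures[f"{c}::{t}"] == 0:
                df = df.withColumn(c, F.col(c).cast(t))
                break
    return df
```

Calling `infer_types(all_strings_df)` would return a dataframe where each column gets the narrowest candidate type that every sampled value casts to cleanly, which is the separation of inference from CSV parsing described above.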
