source pyspark-csv from a data frame or a parquet table #8

Open

Tagar opened this issue Feb 13, 2016 · 4 comments

Tagar commented Feb 13, 2016

We have a lot of use cases for pyspark-csv where the source data lives in parquet files.
Would it be possible for pyspark-csv to accept a parquet table or an arbitrary data frame as input?


Tagar commented Feb 13, 2016

We have source parquet tables with hundreds of string columns (some with up to 800), and it would be great to use something like pyspark-csv to convert them to a dataframe (or a parquet table) with correct data types.


Tagar commented Feb 16, 2016

@Siddhant

Siddhant commented

I'm not sure I understand your problem.
You should be able to read parquet data directly into Spark, no?
What's the point of CSV here?
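
For reference, a minimal sketch of what reading parquet directly looks like, assuming an existing SparkSession named `spark` and a hypothetical path:

```python
# Parquet files carry their own schema, so they load straight into a typed
# DataFrame with no CSV round-trip. `spark` is an existing SparkSession;
# the path below is hypothetical.
df = spark.read.parquet("/data/source_table.parquet")
df.printSchema()  # shows the column types stored in the parquet metadata
```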


Tagar commented Mar 20, 2017

Let's say we have a dataframe where all columns are strings (because of how we got that data). We'd like to run this dataframe through the same datatype inference code to create another dataframe with actual data types.
So my point is that the type inference logic should be separated/abstracted from the CSV parsing logic, so we can also run it on a dataframe of all-string columns and come up with the real data types that should be used, based on the actual data.
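
To illustrate the request, here is a minimal sketch of such a separated inference step, written against the public PySpark API rather than pyspark-csv's internals. The helper name `infer_types`, the candidate type list, and the sample size are all illustrative assumptions:

```python
# A sketch only: infer a type per all-string column by test-casting a sample,
# then cast the full dataframe. Not pyspark-csv's actual implementation.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

# Candidates are tried narrowest-first. Exact string-cast semantics
# (e.g. whether "2.5" casts to a bigint) vary slightly across Spark versions.
CANDIDATES = ["bigint", "double", "timestamp"]

def infer_types(df: DataFrame, sample_rows: int = 10000) -> DataFrame:
    sample = df.limit(sample_rows).cache()
    # One aggregation pass: for every (column, candidate) pair, count the
    # non-null values that the cast would turn into NULL (i.e. cast failures).
    checks = [
        F.sum(
            F.when(F.col(c).isNotNull() & F.col(c).cast(t).isNull(), 1)
             .otherwise(0)
        ).alias(f"{c}::{t}")
        for c in df.columns
        for t in CANDIDATES
    ]
    failures = sample.agg(*checks).first().asDict()
    sample.unpersist()
    # Cast each column to the first candidate that fits every sampled value;
    # columns that fit no candidate stay as strings.
    for c in df.columns:
        for t in CANDIDATES:
            if failures[f"{c}::{t}"] == 0:
                df = df.withColumn(c, F.col(c).cast(t))
                break
    return df
```

Calling `infer_types(all_strings_df)` would return a dataframe where each column gets the narrowest candidate type that every sampled value casts to cleanly, which is the separation of inference from CSV parsing described above.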
