
Support compressed csv in scan_csv #7287

Closed

corneliusroemer opened this issue Mar 1, 2023 · 5 comments · Fixed by #17841
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments


corneliusroemer commented Mar 1, 2023

Problem description

It would be nice if Polars could load compressed CSVs out of the box, e.g. a zstd-compressed CSV.

I'm not sure what the best workaround is. xopen doesn't seem to work:

import polars as pl
import xopen

with xopen.xopen("metadata_germany.tsv.zst", "rt") as f:
    pl.scan_csv(f, has_header=True, sep="\t").head(10).collect()

raises:

TypeError: argument 'path': 'TextIOWrapper' object cannot be converted to 'PyString'

Related: #3166
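A minimal workaround sketch, assuming the zstandard package is installed and using the file name from the report (the separator argument is named sep in older Polars releases): decompress the whole file into memory and hand the bytes to pl.read_csv. This is eager rather than lazy, so it only helps when the uncompressed data fits in memory.

import polars as pl
import zstandard

# Decompress the entire .zst file into memory, then parse the raw bytes.
# Eager only: the full uncompressed data is materialized before parsing.
with open("metadata_germany.tsv.zst", "rb") as f:
    data = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_csv(data, has_header=True, separator="\t")
print(df.head(10))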

corneliusroemer added the enhancement (New feature or an improvement of an existing feature) label on Mar 1, 2023
corneliusroemer (Author) commented

Now I'd also like zstd support for NDJSON; it would be great to be able to read from compressed files.
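The same in-memory workaround presumably applies to NDJSON; a hedged sketch, assuming the zstandard package and a hypothetical file name events.ndjson.zst:

import polars as pl
import zstandard

# Decompress the newline-delimited JSON into memory and parse the bytes.
with open("events.ndjson.zst", "rb") as f:
    data = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_ndjson(data)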


natir commented May 23, 2023

A Rust alternative to xopen is niffler.

Sorry for the self-promotion, but I also need this feature; maybe niffler could help Polars.

corneliusroemer (Author) commented Jun 6, 2023

Reading in "rb" (binary) mode may work: in the case of read_csv, I managed to get it to read from a zstd-compressed file with xopen, see https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars

The downside is that the entire uncompressed file/stream is read into memory before parsing, so this doesn't work when the uncompressed file is of a similar size to the machine's memory.
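A minimal sketch of that binary-mode approach, assuming the file name from the report; xopen picks the decompressor from the file extension, and read_csv accepts a binary file-like object (in older Polars releases the separator argument is sep):

import polars as pl
import xopen

# Open in binary mode ("rb"); xopen handles the zstd decompression
# and read_csv consumes the resulting byte stream eagerly.
with xopen.xopen("metadata_germany.tsv.zst", "rb") as f:
    df = pl.read_csv(f, has_header=True, separator="\t")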

ghuls (Collaborator) commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to Parquet and then use pl.scan_parquet on them:
#9283 (comment)
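A hedged sketch of that two-step workaround; the output file name is hypothetical, and the exact parquet-fromcsv invocation for compressed input is in the linked comment rather than reproduced here:

import polars as pl

# Step 1 (outside Python): convert the compressed CSV/TSV to Parquet with
# parquet-fromcsv, following the command shown in the linked comment.
# Step 2: scan the resulting Parquet file lazily.
lf = pl.scan_parquet("metadata_germany.parquet")
print(lf.head(10).collect())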

nameexhaustion (Collaborator) commented

Closed as completed via #17841
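If that PR enables transparent decompression, as its linkage here suggests, scanning a compressed file would presumably reduce to passing the compressed path directly; a hedged sketch, not verified against the merged change:

import polars as pl

# Hypothetical usage once compressed-CSV support is in place:
# pass the .zst path straight to scan_csv.
lf = pl.scan_csv("metadata_germany.tsv.zst", has_header=True, separator="\t")
print(lf.head(10).collect())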

c-peters added the accepted (Ready for implementation) label on Jul 28, 2024
c-peters added this to Backlog on Jul 28, 2024
c-peters moved this to Done in Backlog on Jul 28, 2024