
Support compressed csv in scan_csv #7287

Closed

corneliusroemer opened this issue Mar 1, 2023 · 5 comments · Fixed by #17841
Labels: accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature)

Comments


corneliusroemer commented Mar 1, 2023

Problem description

It would be nice if Polars could load compressed CSVs out of the box, e.g. a zstd-compressed CSV.

I'm not sure what the best workaround is. xopen doesn't seem to work:

import polars as pl
import xopen

with xopen.xopen("metadata_germany.tsv.zst", "rt") as f:
    pl.scan_csv(f, has_header=True, sep="\t").head(10).collect()

raises:

TypeError: argument 'path': 'TextIOWrapper' object cannot be converted to 'PyString'

Related: #3166
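A minimal workaround sketch, assuming the zstandard package is installed and using the file name from the report (the separator argument is named sep in older Polars releases): decompress the whole file into memory and hand the bytes to pl.read_csv. This is eager rather than lazy, so it only helps when the uncompressed data fits in memory.

import polars as pl
import zstandard

# Decompress the entire .zst file into memory, then parse the raw bytes.
# Eager only: the full uncompressed data is materialized before parsing.
with open("metadata_germany.tsv.zst", "rb") as f:
    data = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_csv(data, has_header=True, separator="\t")
print(df.head(10))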

corneliusroemer added the enhancement (New feature or an improvement of an existing feature) label on Mar 1, 2023
corneliusroemer (Author) commented

Now I'd also like zstd support for NDJSON; it would be great to be able to read from compressed files.
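The same in-memory workaround presumably applies to NDJSON; a hedged sketch, assuming the zstandard package and a hypothetical file name events.ndjson.zst:

import polars as pl
import zstandard

# Decompress the newline-delimited JSON into memory and parse the bytes.
with open("events.ndjson.zst", "rb") as f:
    data = zstandard.ZstdDecompressor().stream_reader(f).read()

df = pl.read_ndjson(data)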


natir commented May 23, 2023

A Rust alternative to xopen is niffler.

Sorry for the self-promotion, but I also need this feature; maybe niffler could help Polars.

corneliusroemer (Author) commented Jun 6, 2023

Reading in "rb" (binary) mode may work: in the case of read_csv, I managed to get it to read from a zstd-compressed file with xopen, see https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars

The downside is that the entire uncompressed file/stream is read into memory before parsing, so this doesn't work when the uncompressed file is of a similar size to the machine's memory.
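A minimal sketch of that binary-mode approach, assuming the file name from the report; xopen picks the decompressor from the file extension, and read_csv accepts a binary file-like object (in older Polars releases the separator argument is sep):

import polars as pl
import xopen

# Open in binary mode ("rb"); xopen handles the zstd decompression
# and read_csv consumes the resulting byte stream eagerly.
with xopen.xopen("metadata_germany.tsv.zst", "rb") as f:
    df = pl.read_csv(f, has_header=True, separator="\t")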

ghuls (Collaborator) commented Jun 16, 2023

You can use parquet-fromcsv in the meantime to convert compressed CSV/TSV files to Parquet and then use pl.scan_parquet on them:
#9283 (comment)
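A hedged sketch of that two-step workaround; the output file name is hypothetical, and the exact parquet-fromcsv invocation for compressed input is in the linked comment rather than reproduced here:

import polars as pl

# Step 1 (outside Python): convert the compressed CSV/TSV to Parquet with
# parquet-fromcsv, following the command shown in the linked comment.
# Step 2: scan the resulting Parquet file lazily.
lf = pl.scan_parquet("metadata_germany.parquet")
print(lf.head(10).collect())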

nameexhaustion (Collaborator) commented

Closed as completed via #17841
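If that PR enables transparent decompression, as its linkage here suggests, scanning a compressed file would presumably reduce to passing the compressed path directly; a hedged sketch, not verified against the merged change:

import polars as pl

# Hypothetical usage once compressed-CSV support is in place:
# pass the .zst path straight to scan_csv.
lf = pl.scan_csv("metadata_germany.tsv.zst", has_header=True, separator="\t")
print(lf.head(10).collect())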

c-peters added the accepted (Ready for implementation) label on Jul 28, 2024
c-peters added this to Backlog on Jul 28, 2024
c-peters moved this to Done in Backlog on Jul 28, 2024