
Support reading of zstd compressed csv files #9283

Open
corneliusroemer opened this issue Jun 7, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@corneliusroemer

corneliusroemer commented Jun 7, 2023

Problem description

I wish I could read zstd compressed csv files using read_csv and ideally also scan_csv. See https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars/76420788#76420788
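One workaround in the meantime is to decompress in Python first and hand the bytes to `pl.read_csv`; a minimal sketch, assuming the third-party `zstandard` package is installed and a hypothetical file name `data.csv.zst`:

import io
import zstandard
import polars as pl

# Decompress the whole zstd stream into memory, then parse it as CSV.
# (Assumes the decompressed data fits in memory.)
with open("data.csv.zst", "rb") as f:
    data = zstandard.ZstdDecompressor().stream_reader(f).read()
df = pl.read_csv(io.BytesIO(data))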

Possibly related:

@corneliusroemer corneliusroemer added the enhancement New feature or an improvement of an existing feature label Jun 7, 2023
@ghuls
Collaborator

ghuls commented Jun 16, 2023

For now you can use `parquet-fromcsv` from arrow-rs to convert CSV/TSV files to parquet.

Support for compressed input files (`uncompressed`, `snappy`, `gzip`, `brotli`, `lz4`, `zstd`) was added only relatively recently: apache/arrow-rs#3721

`parquet-fromcsv` is also useful in general for reading compressed CSV/TSV files and converting them to parquet, so you can use `pl.scan_parquet` on very big files.

# Clone the Rust arrow repo and enter it.
git clone https://github.com/apache/arrow-rs
cd arrow-rs

# Build the Arrow CLI tools.
cargo build --release --features=cli

# Current list of Arrow CLI tools:
❯ file target/release/* | grep ELF
target/release/arrow-file-to-stream:              ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-json-integration-test:       ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-stream-to-file:              ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-client:    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-server:    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/gen:                               ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/libparquet_derive.so:              ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped, too many notes (256)
target/release/parquet-concat:                    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-fromcsv:                   ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-index:                     ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-layout:                    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-read:                      ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rewrite:                   ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rowcount:                  ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-schema:                    ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-show-bloom-filter:         ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)

# Use parquet-fromcsv to convert CSV/TSV file to parquet, which you then can use with Polars.
$ target/release/parquet-fromcsv --help
Binary to convert csv to Parquet

Usage: parquet-fromcsv [OPTIONS] --schema <SCHEMA> --input-file <INPUT_FILE> --output-file <OUTPUT_FILE>

Options:
  -s, --schema <SCHEMA>
          message schema for output Parquet

  -i, --input-file <INPUT_FILE>
          input CSV file

  -o, --output-file <OUTPUT_FILE>
          output Parquet file

  -f, --input-format <INPUT_FORMAT>
          input file format
          
          [default: csv]
          [possible values: csv, tsv]

  -b, --batch-size <BATCH_SIZE>
          batch size
          
          [env: PARQUET_FROM_CSV_BATCHSIZE=]
          [default: 1000]

  -h, --has-header
          has header

  -d, --delimiter <DELIMITER>
          field delimiter
          
          default value: when input_format==CSV: ',' when input_format==TSV: 'TAB'

  -r, --record-terminator <RECORD_TERMINATOR>
          record terminator
          
          [possible values: lf, crlf, cr]

  -e, --escape-char <ESCAPE_CHAR>
          escape character

  -q, --quote-char <QUOTE_CHAR>
          quote character

  -D, --double-quote <DOUBLE_QUOTE>
          double quote
          
          [possible values: true, false]

  -C, --csv-compression <CSV_COMPRESSION>
          compression mode of csv
          
          [default: UNCOMPRESSED]

  -c, --parquet-compression <PARQUET_COMPRESSION>
          compression mode of parquet
          
          [default: SNAPPY]

  -w, --writer-version <WRITER_VERSION>
          writer version

  -m, --max-row-group-size <MAX_ROW_GROUP_SIZE>
          max row group size

      --enable-bloom-filter <ENABLE_BLOOM_FILTER>
          whether to enable bloom filter writing
          
          [possible values: true, false]

      --help
          display usage help

  -V, --version
          Print version

# Example: Reading gzipped TSV file and converting to parquet.
$ parquet-fromcsv \
    --schema fragments.schema \
    --input-format tsv \
    --csv-compression gzip \
    --parquet-compression zstd \
    --input-file fragments.raw.tsv.gz \
    --output-file fragments.parquet

# To get the schema file:
#   - get a small uncompressed part of the compressed CSV/TSV file
#   - read that part with Polars and write it to parquet
#   - get the schema from the parquet file with parquet-schema
$ zcat fragments.raw.tsv.gz | head -n 10000 > /tmp/fragments.head10000.tsv

# In Python, with Polars imported as pl:
pl.read_csv("/tmp/fragments.head10000.tsv", separator="\t", has_header=False).write_parquet("/tmp/fragments.head10000.parquet")

$ parquet-schema /tmp/fragments.head10000.parquet | grep -A 1000000 '^message' > fragments.schema

$ cat fragments.schema
message root {
  OPTIONAL BYTE_ARRAY column_1 (STRING);
  OPTIONAL INT64 column_2;
  OPTIONAL INT64 column_3;
  OPTIONAL BYTE_ARRAY column_4 (STRING);
  OPTIONAL INT64 column_5;
}
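With a schema like the one above (no header, so Polars-style `column_1` … `column_5` names), the converted file can then be scanned lazily; a small sketch, assuming the `fragments.parquet` output from the earlier example:

import polars as pl

# Lazily scan the converted parquet; only the selected columns and needed
# row groups are actually read, so this also works on very big files.
lf = pl.scan_parquet("fragments.parquet")
print(lf.select(["column_1", "column_2"]).head(10).collect())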

@AndreaBarbon

Very useful, thanks!

Can we use this on a batch of files?

PS: Note that there's a typo in the number of zeros: it should be consistently 1000 or 10000.

@ghuls
Collaborator

ghuls commented Jul 4, 2024

> Can we use this on a batch of files?

With GNU parallel it is quite easy to run a similar command with different arguments.

# Make a list of all files you want to process (`ls -1` or `cat file_list`).
# Run 4 conversions at the same time (`-j 4`), where `{}` represents one input
# line (a `tsv.gz` file in this case) and `{..}` is the same line minus the
# `.tsv.gz` part at the end (extension stripped twice).
ls -1 *.tsv.gz \
  | parallel -j 4 --plus \
        parquet-fromcsv \
            --schema fragments.schema \
            --input-format tsv \
            --csv-compression gzip \
            --parquet-compression zstd \
            --input-file {} \
            --output-file {..}.parquet
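If GNU parallel is not available, the same fan-out can be sketched in Python with only the standard library; this assumes `parquet-fromcsv` is on `PATH` and mirrors the command above:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert(path: Path) -> None:
    # fragments.raw.tsv.gz -> fragments.raw.parquet (strip both extensions).
    out = path.with_suffix("").with_suffix(".parquet")
    subprocess.run(
        [
            "parquet-fromcsv",
            "--schema", "fragments.schema",
            "--input-format", "tsv",
            "--csv-compression", "gzip",
            "--parquet-compression", "zstd",
            "--input-file", str(path),
            "--output-file", str(out),
        ],
        check=True,
    )

# Run 4 conversions at the same time, like `parallel -j 4`.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(convert, Path(".").glob("*.tsv.gz")))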

> PS: Note that there's a typo in the number of zeros: it should be consistently 1000 or 10000.

Fixed
