Support reading of zstd compressed csv files #9283
For now you can use `parquet-fromcsv` from the Rust Arrow CLI tools (arrow-rs), which supports compressed CSV/TSV input (via `--csv-compression`, see the help output below):
# Clone Rust arrow repo.
git clone https://github.com/apache/arrow-rs
# Build arrow CLI tools.
cargo build --release --features=cli
# Current list of Arrow CLI tools:
❯ file target/release/*|grep ELF
target/release/arrow-file-to-stream: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-json-integration-test: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/arrow-stream-to-file: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-client: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/flight-test-integration-server: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/gen: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/libparquet_derive.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, with debug_info, not stripped, too many notes (256)
target/release/parquet-concat: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-fromcsv: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-index: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-layout: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-read: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rewrite: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-rowcount: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-schema: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
target/release/parquet-show-bloom-filter: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, with debug_info, not stripped, too many notes (256)
# Use parquet-fromcsv to convert CSV/TSV file to parquet, which you then can use with Polars.
$ target/release/parquet-fromcsv --help
Binary to convert csv to Parquet
Usage: parquet-fromcsv [OPTIONS] --schema <SCHEMA> --input-file <INPUT_FILE> --output-file <OUTPUT_FILE>
Options:
-s, --schema <SCHEMA>
message schema for output Parquet
-i, --input-file <INPUT_FILE>
input CSV file
-o, --output-file <OUTPUT_FILE>
output Parquet file
-f, --input-format <INPUT_FORMAT>
input file format
[default: csv]
[possible values: csv, tsv]
-b, --batch-size <BATCH_SIZE>
batch size
[env: PARQUET_FROM_CSV_BATCHSIZE=]
[default: 1000]
-h, --has-header
has header
-d, --delimiter <DELIMITER>
field delimiter
default value: when input_format==CSV: ',' when input_format==TSV: 'TAB'
-r, --record-terminator <RECORD_TERMINATOR>
record terminator
[possible values: lf, crlf, cr]
-e, --escape-char <ESCAPE_CHAR>
escape character
-q, --quote-char <QUOTE_CHAR>
quote character
-D, --double-quote <DOUBLE_QUOTE>
double quote
[possible values: true, false]
-C, --csv-compression <CSV_COMPRESSION>
compression mode of csv
[default: UNCOMPRESSED]
-c, --parquet-compression <PARQUET_COMPRESSION>
compression mode of parquet
[default: SNAPPY]
-w, --writer-version <WRITER_VERSION>
writer version
-m, --max-row-group-size <MAX_ROW_GROUP_SIZE>
max row group size
--enable-bloom-filter <ENABLE_BLOOM_FILTER>
whether to enable bloom filter writing
[possible values: true, false]
--help
display usage help
-V, --version
Print version
# Example: Reading gzipped TSV file and converting to parquet.
$ parquet-fromcsv \
--schema fragments.schema \
--input-format tsv \
--csv-compression gzip \
--parquet-compression zstd \
--input-file fragments.raw.tsv.gz \
--output-file fragments.parquet
# To get the schema file:
# - get a small uncompressed part of the compressed CSV/TSV file
# - Read the file with Polars and write to parquet.
# - Get schema from parquet file with parquet-schema
$ zcat fragments.raw.tsv.gz | head -n 10000 > /tmp/fragments.head10000.tsv
pl.read_csv("/tmp/fragments.head10000.tsv", separator="\t", has_header=False).write_parquet("/tmp/fragments.head10000.parquet")
$ parquet-schema /tmp/fragments.head10000.parquet | grep -A 1000000 '^message' > fragments.schema
$ cat fragments.schema
message root {
OPTIONAL BYTE_ARRAY column_1 (STRING);
OPTIONAL INT64 column_2;
OPTIONAL INT64 column_3;
OPTIONAL BYTE_ARRAY column_4 (STRING);
OPTIONAL INT64 column_5;
}
To address pola-rs#9283 with regard to `read_csv`.
Very useful, thanks! Can we use this on a batch of files? PS: note that there's a typo in the number of zeros (`head1000` vs `head10000`).
With GNU parallel it is quite easy to run a similar command with different arguments.
# Make a list of all files you want to process (`ls -1` or `cat file_list`).
# Run 4 conversions at the same time (`-j 4`), where `{}` represents a certain line (a `tsv.gz` file in this case) and `{..}` the same minus the `.tsv.gz` part at the end (strip extension twice).
ls -1 *.tsv.gz \
| parallel -j 4 --plus \
parquet-fromcsv \
--schema fragments.schema \
--input-format tsv \
--csv-compression gzip \
--parquet-compression zstd \
--input-file {} \
--output-file {..}.parquet
Fixed.
Problem description
I wish I could read zstd compressed csv files using
read_csv
and ideally also
scan_csv
. See https://stackoverflow.com/questions/76417610/how-to-read-csv-a-zstd-compressed-file-using-python-polars/76420788#76420788
Possibly related:
scan_csv #7287
scan_csv #4950