-
Hey, I did a test with the provided file on DuckDB and using PyArrow, and I agree the file is corrupted.

DuckDB:

```
dd$ ./duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from read_parquet('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2');
Invalid Error: don't know what type:
```

PyArrow:

```
~$ python3
Python 3.12.7 (main, Oct 1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_table('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Unexpected end of stream: Page was smaller (236) than expected (310)
>>> import pyarrow
>>> pyarrow.__version__
'17.0.0'
```

~~Feel free to raise an issue if you would prefer the error handling to take care of this instead of panicking 👍~~
EDIT: actually I just tested off the latest branch of arrow-rs:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() {
    let file =
        std::fs::File::open("/home/jeffrey/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2")
            .unwrap();
    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)
        .unwrap()
        .build()
        .unwrap();
    let mut batches = Vec::new();
    for batch in parquet_reader {
        batches.push(batch.unwrap());
    }
}
```

Running this gives an error (it shows up as a panic here only because the example calls `unwrap()`), not an internal panic inside the reader:

```
arrow-rs$ cargo run --example read_parquet
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/examples/read_parquet`
thread 'main' panicked at parquet/./examples/read_parquet.rs:12:28:
called `Result::unwrap()` on an `Err` value: ParquetError("EOF: Invalid page header")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
arrow-rs$
```

So it seems this case is already handled properly: the reader returns a `ParquetError` rather than panicking internally.
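For anyone hitting similar errors: a cheap first step before blaming a reader is to sanity-check the file's outer structure yourself. The sketch below is my own helper (not part of pyarrow, DuckDB, or arrow-rs; the name `quick_parquet_check` is made up) and only verifies the leading/trailing `PAR1` magic and the footer-length field, so it catches gross truncation but not corruption inside individual data pages, which is what the errors above point to.

```python
import struct

MAGIC = b"PAR1"

def quick_parquet_check(path):
    """Cheap structural sanity check for a Parquet file.

    A well-formed Parquet file starts and ends with the 4-byte magic
    'PAR1', and the 4 bytes immediately before the trailing magic hold
    the little-endian length of the footer metadata. This does NOT
    validate page contents, only the outer framing.
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 12:
        return False, "file too small to be Parquet"
    if data[:4] != MAGIC:
        return False, "missing leading PAR1 magic"
    if data[-4:] != MAGIC:
        return False, "missing trailing PAR1 magic (truncated file?)"
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # Footer + its length field + trailing magic must fit after the leading magic.
    if footer_len + 8 > len(data) - 4:
        return False, f"footer length {footer_len} exceeds file size"
    return True, "outer framing looks plausible"
```

Note that a file like the one in this thread can pass this check: the footer was evidently intact (the schema was parsed), while a data page in the middle was shorter than its header claimed, which only a full scan by a reader will detect.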
-
I'm using serde_arrow for reading and writing Parquet files. Writing succeeds without issues, but when reading, I encounter the following error in some files:
Writing code:

Error msg:

Additional information:

Questions:
- Is this error caused by file corruption?

The file:
222-2024-12-30-sleep_feature.zstd.parquet.2.zip