-
Hey, I did a test with the provided file on DuckDB and using PyArrow, and I agree the file is corrupted.

DuckDB:

```
dd$ ./duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select * from read_parquet('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2');
Invalid Error: don't know what type:
```

PyArrow:

```
~$ python3
Python 3.12.7 (main, Oct 1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.parquet as pq
>>> pq.read_table('~/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1843, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1485, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Unexpected end of stream: Page was smaller (236) than expected (310)
>>> import pyarrow
>>> pyarrow.__version__
'17.0.0'
```

~~Feel free to raise an issue if you would prefer the error handling to take care of this instead of panicking 👍~~
EDIT: actually I just tested off the latest branch of arrow-rs:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() {
    let file =
        std::fs::File::open("/home/jeffrey/Downloads/222-2024-12-30-sleep_feature.zstd.parquet.2")
            .unwrap();
    let parquet_reader = ParquetRecordBatchReaderBuilder::try_new(file)
        .unwrap()
        .build()
        .unwrap();
    let mut batches = Vec::new();
    for batch in parquet_reader {
        batches.push(batch.unwrap());
    }
}
```

Running this gives an error (it shows up as a panic here only because the example calls `unwrap()`), not an internal panic inside the reader:

```
arrow-rs$ cargo run --example read_parquet
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/examples/read_parquet`
thread 'main' panicked at parquet/./examples/read_parquet.rs:12:28:
called `Result::unwrap()` on an `Err` value: ParquetError("EOF: Invalid page header")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
arrow-rs$
```

So it seems this case is already handled properly: the reader returns a `ParquetError` rather than panicking internally.
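For anyone hitting similar errors: a cheap first step before blaming a reader is to sanity-check the file's outer structure yourself. The sketch below is my own helper (not part of pyarrow, DuckDB, or arrow-rs; the name `quick_parquet_check` is made up) and only verifies the leading/trailing `PAR1` magic and the footer-length field, so it catches gross truncation but not corruption inside individual data pages, which is what the errors above point to.

```python
import struct

MAGIC = b"PAR1"

def quick_parquet_check(path):
    """Cheap structural sanity check for a Parquet file.

    A well-formed Parquet file starts and ends with the 4-byte magic
    'PAR1', and the 4 bytes immediately before the trailing magic hold
    the little-endian length of the footer metadata. This does NOT
    validate page contents, only the outer framing.
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 12:
        return False, "file too small to be Parquet"
    if data[:4] != MAGIC:
        return False, "missing leading PAR1 magic"
    if data[-4:] != MAGIC:
        return False, "missing trailing PAR1 magic (truncated file?)"
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # Footer + its length field + trailing magic must fit after the leading magic.
    if footer_len + 8 > len(data) - 4:
        return False, f"footer length {footer_len} exceeds file size"
    return True, "outer framing looks plausible"
```

Note that a file like the one in this thread can pass this check: the footer was evidently intact (the schema was parsed), while a data page in the middle was shorter than its header claimed, which only a full scan by a reader will detect.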
-
I'm using serde_arrow for reading and writing Parquet files. Writing succeeds without issues, but when reading, I encounter the following error in some files:
Writing code:

Error msg:

Additional information:

Questions:
- Is this error caused by file corruption?

The file:
222-2024-12-30-sleep_feature.zstd.parquet.2.zip