`File out of specification: The max_value of statistics MUST be plain encoded` when writing nested parquet with Rust engine #17948

theelderbeever · 2024-07-30T21:34:34Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame(
    [
        {
            "struct": {
                "struct": {
                    "struct": {"a": None},
                    # Some field following the struct field is necessary. Type seems irrelevant.
                    "str": "hello",
                    # "i64": 123456789,
                    # "bool": False,
                },
            }
        },
    ]
).write_parquet("womp.parquet")

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
/var/folders/72/8wtgnd0963gfwzpb16q229340000gn/T/ipykernel_32301/1738335881.py in ?()
     10                 },
     11             }
     12         },
     13     ]
---> 14 ).write_parquet("discount.parquet")

~/.pyenv/versions/3.11.8/envs/billing-platform-pipelines/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, file, compression, compression_level, statistics, row_group_size, data_page_size, use_pyarrow, pyarrow_options, partition_by, partition_chunk_size_bytes)
   3626 
   3627             if isinstance(partition_by, str):
   3628                 partition_by = [partition_by]
   3629 
-> 3630             self._df.write_parquet(
   3631                 file,
   3632                 compression,
   3633                 compression_level,

ComputeError: parquet: File out of specification: The max_value of statistics MUST be plain encoded

Issue description

Polars' default (Rust) parquet writer fails with a metadata statistics error when writing this nested dataframe. The error does not occur with use_pyarrow=True.

Expected behavior

Both of Polars' parquet writers should be able to write the same dataframe.

Installed versions

--------Version info---------
Polars:               1.3.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.8 (main, Apr 27 2024, 07:50:56) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  1.1.0
cloudpickle:          2.2.1
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             2.5.3
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>


theelderbeever · 2024-07-30T21:47:05Z

Additionally, the following structure produces one of two different errors, depending on whether the "has_more" line is present or commented out.

"has_more": False,

PanicException: the offset of the new Buffer cannot exceed the existing length

# "has_more": False,

ComputeError: parquet: File out of specification: The max_value of statistics MUST be plain encoded

pl.DataFrame(
    [
        {
            "items": {
                "data": [
                    {
                        "plan": {
                            "tiers": [
                                {
                                    "up_to": None,
                                }
                            ],
                            "tiers_mode": "volume",
                        },
                    },
                    {
                        "plan": {
                            "tiers": [
                                {
                                    "up_to": None,
                                }
                            ],
                            "tiers_mode": "volume",
                        },
                    },
                ],
                "has_more": False, # comment this line to get a buffer size error
            }
        }
    ]
).write_parquet("items.parquet")

fzyzcjy · 2024-08-01T13:06:05Z

Having the same error here, with reproduction:

import polars as pl

print(pl.__version__)

df = pl.DataFrame([
    {
        'a': {
            'b': [{'c': 'x'}],
            'd': 10
        }
    }
])
print(df.dtypes)
df.write_parquet('/tmp/a.parquet')

theelderbeever added the bug, needs triage, and python labels Jul 30, 2024

coastalwhite linked a pull request Jul 31, 2024 that will close this issue

fix: Several parquet reader/writer regressions #17941

Merged

ritchie46 closed this as completed in #17941 Jul 31, 2024

c-peters added the accepted label Aug 5, 2024

c-peters assigned coastalwhite Aug 5, 2024

github-project-automation bot added this to Backlog Aug 5, 2024

github-project-automation bot moved this to Ready in Backlog Aug 5, 2024

c-peters moved this from Ready to Done in Backlog Aug 5, 2024
