Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed writing column statistics when there is only one row that is null #1245

Merged

Conversation

norberttech
Copy link
Member

Change Log

Added

Fixed

  • Fixed writing column statistics when there is only one row that is null

Changed

Removed

Deprecated

Security


Description

This issue was reported on our Discord channel by user guiguiolol.
Code to reproduce this issue:

<?php

use Flow\Parquet\ParquetFile\Schema;
use Flow\Parquet\ParquetFile\Schema\FlatColumn;
use Flow\Parquet\Writer;

require_once __DIR__ . '/../../../vendor/autoload.php';

$writer = new Writer();

$schema = Schema::with(
    FlatColumn::int32('id'),
);

if (file_exists(__DIR__ . '/test.parquet')) {
    unlink(__DIR__ . '/test.parquet');
}

$writer->write(
    __DIR__ . '/test.parquet',
    $schema,
    [
        [
            'id' => null,
        ]
    ]
);

Parquet files created like this can't be read by other software (confirmed through duckdb).

When min/max buffers are empty we can safely make them null (that's what pyarrow is doing).
Below python code to reproduce the same scenario:

import pyarrow as pa
import pyarrow.parquet as pq
import os

# Define the directory and file path
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = os.path.join(dir_path, 'test.parquet')

# Define the schema with an int32 'id' column that is nullable
schema = pa.schema([
    pa.field('id', pa.int32(), nullable=True)
])

# Prepare the data with 'id' set to None (null)
data = {
    'id': [None]
}

# Create a PyArrow Table with the data and schema
table = pa.Table.from_pydict(data, schema=schema)

# Remove the Parquet file if it already exists
if os.path.exists(file_path):
    os.remove(file_path)

# Write the table to a Parquet file
pq.write_table(table, file_path)

Copy link
Contributor

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.602mb +0.00%  | 514.006ms -0.10% | ±1.00% +141.27% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.690mb +0.00%  | 1.086s +0.38%    | ±0.50% -46.68%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 29.145mb +0.00% | 431.769ms -3.16% | ±0.71% +5.71%   |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.330mb +0.00%  | 33.308ms -1.47%  | ±0.11% -74.24%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.312mb +0.00%  | 641.925ms -1.88% | ±0.42% -70.74%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.606mb +0.00% | 60.244ms -1.16% | ±0.27% -66.12% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.799mb +0.00%  | 139.343ms -0.81% | ±0.79% +194.33% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 90.382mb +0.00%  | 114.931ms -1.17% | ±0.19% -70.25%  |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 124.436mb +0.00% | 1.226s -3.35%    | ±0.05% -89.45%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.521mb +0.00%  | 43.340ms -4.19%  | ±0.66% +26.13%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 87.355mb +0.00%  | 3.373ms -15.35%  | ±1.29% +52.64%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.958mb +0.00% | 189.810ms -2.10% | ±0.50% -36.04%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.678mb +0.00%  | 19.022ms -2.04%  | ±0.54% -2.95%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.595mb +0.00%  | 1.620ms -17.63%  | ±1.12% -41.44%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.595mb +0.00%  | 1.634ms -17.02%  | ±0.40% -44.52%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.707mb +0.00%  | 2.783ms -14.72%  | ±0.41% -88.21%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 86.236mb +0.00%  | 15.051ms -4.42%  | ±0.26% -78.27%  |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 86.236mb +0.00%  | 15.404ms -0.39%  | ±0.09% -91.81%  |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 84.139mb +0.00%  | 1.706μs -14.95%  | ±2.72% +17.31%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 84.139mb +0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 93.490mb +0.00%  | 12.635ms -11.37% | ±0.66% -55.35%  |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.861mb +0.00% | 62.015ms -2.36%  | ±1.07% -16.90%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.755mb +0.00%  | 1.394ms -20.73%  | ±2.42% -8.61%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 90.107mb +0.00%  | 63.424ms -3.03%  | ±1.15% +152.38% |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.857mb +0.00%  | 4.141ms -12.77%  | ±0.12% -94.53%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 84.289mb +0.00%  | 39.993ms -6.52%  | ±1.32% -26.65%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 84.289mb +0.00%  | 41.013ms -3.62%  | ±0.78% -64.44%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 84.289mb +0.00%  | 40.439ms -1.79%  | ±0.72% -63.98%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.581mb +0.00%  | 7.419ms +0.72%   | ±0.61% -52.51%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 84.139mb +0.00%  | 29.002ms -2.79%  | ±0.95% -33.90%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 84.139mb +0.00%  | 13.312μs -4.51%  | ±0.71% -60.88%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 84.139mb +0.00%  | 15.879μs -3.18%  | ±0.78% +0.00%   |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.959mb +0.00% | 192.652ms -0.98% | ±2.63% +212.27% |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 53.181mb +0.00%  | 396.449ms -1.27% | ±0.97% +449.97% |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 13.447mb +0.00%  | 81.240ms +0.22%  | ±0.55% -77.45%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 107.443mb +0.00% | 480.736ms -0.43% | ±0.74% -20.09%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.801mb +0.00%  | 242.706ms +0.25% | ±1.00% -36.83%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.639mb +0.00%  | 52.619ms -0.82%  | ±0.29% -79.54%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@norberttech norberttech merged commit 4602665 into flow-php:1.x Oct 11, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant