
[C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360

Draft · wants to merge 6 commits into main from the content-defined-chunking branch

Conversation

@kszucs (Member) commented Jan 27, 2025

Rationale for this change

I have been working on improving Parquet's deduplication efficiency for content-addressable storage systems. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited to uncompressed row-major formats. Even so, thanks to Parquet's unique features I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column.
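As background on the approach: a gearhash-based content-defined chunker keeps a rolling hash that is updated one byte at a time from a 256-entry lookup table, and declares a chunk boundary whenever the bits selected by a mask are all zero, so boundaries depend only on local content and tend to survive insertions and deletions. A minimal self-contained sketch (the gear table, mask, and names here are illustrative, not the PR's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative gearhash chunker: hash_ = (hash_ << 1) + gear_[byte], and a
// chunk boundary is declared whenever the bits selected by mask_ are zero.
class GearChunker {
 public:
  explicit GearChunker(uint64_t mask) : mask_(mask) {
    // Deterministic pseudo-random gear table (xorshift64; illustrative only).
    uint64_t seed = 0x9E3779B97F4A7C15ULL;
    for (uint64_t& g : gear_) {
      seed ^= seed << 13;
      seed ^= seed >> 7;
      seed ^= seed << 17;
      g = seed;
    }
  }

  // Feed one byte; true means a chunk boundary falls after this byte.
  bool Roll(uint8_t byte) {
    hash_ = (hash_ << 1) + gear_[byte];
    return (hash_ & mask_) == 0;
  }

 private:
  uint64_t hash_ = 0;
  uint64_t mask_;
  uint64_t gear_[256];
};

// Split a buffer into content-defined chunk lengths.
std::vector<size_t> Chunk(const std::vector<uint8_t>& data, uint64_t mask) {
  GearChunker chunker(mask);
  std::vector<size_t> lengths;
  size_t start = 0;
  for (size_t i = 0; i < data.size(); ++i) {
    if (chunker.Roll(data[i])) {
      lengths.push_back(i + 1 - start);
      start = i + 1;
    }
  }
  if (start < data.size()) {
    lengths.push_back(data.size() - start);  // trailing partial chunk
  }
  return lengths;
}
```

With a mask of k set bits, a boundary fires with probability 2^-k per byte, so the mask choice controls the average chunk size.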

A purpose-built evaluation tool is available at https://github.com/kszucs/de

Deduplication efficiency for all revisions of openfoodfacts/product-database/food.parquet

❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/openfoodfacts/product-database.git ~/Datasets/product-database
❯ de revisions -d /tmp/food ~/Datasets/product-database/food.parquet
food.parquet has 32 revisions
Checking out 2e19b51
Checking out 1f84d31
Checking out d31d108
Checking out 9cd809c
Checking out 41e5f38
Checking out 9a30ddd
...
❯ de stats /tmp/food 
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃            ┃            ┃            ┃             ┃              ┃   Compressed ┃
┃            ┃            ┃            ┃  Compressed ┃ Deduplicated ┃ Deduplicated ┃
┃ Title      ┃ Total Size ┃ Chunk Size ┃  Chunk Size ┃        Ratio ┃        Ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Parquet    │  182.6 GiB │  148.0 GiB │   140.5 GiB │          81% │          77% │
│ CDC Snappy │  178.3 GiB │   75.5 GiB │    73.0 GiB │          42% │          41% │
│ CDC ZSTD   │  109.6 GiB │   55.9 GiB │    55.6 GiB │          51% │          51% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┴──────────────┘
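For reference, deduplication ratios of this shape can be estimated by chunking every revision, identifying each chunk by its content, and counting only first occurrences; the exact method used by the `de` tool may differ. A toy sketch, with chunk payloads standing in for content hashes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Toy deduplication estimate across file revisions: each revision is a list
// of chunks; a chunk's bytes count as stored only the first time they are
// seen. The ratio of unique bytes to total bytes corresponds to the style of
// "Deduplicated Ratio" reported above (lower means better deduplication).
double DedupRatio(const std::vector<std::vector<std::string>>& revisions) {
  std::unordered_set<std::string> seen;  // payload stands in for a content hash
  size_t total = 0;
  size_t unique = 0;
  for (const auto& revision : revisions) {
    for (const auto& chunk : revision) {
      total += chunk.size();
      if (seen.insert(chunk).second) {
        unique += chunk.size();  // first occurrence: bytes must be stored
      }
    }
  }
  return total == 0 ? 0.0 : static_cast<double>(unique) / static_cast<double>(total);
}
```

This also explains the first data row: 148.0 GiB of unique chunk bytes out of 182.6 GiB total gives the 81% ratio shown for plain Parquet.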

What changes are included in this PR?

Are these changes tested?

Not yet.

Are there any user-facing changes?

There is a new Parquet writer property called content_defined_chunking, which is subject to renaming.
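Purely as an illustration of what opting in might look like from C++ once the property stabilizes (the builder method below is hypothetical; only the property name content_defined_chunking comes from this PR, and it is explicitly subject to renaming):

```cpp
// Hypothetical sketch only -- not the final API.
auto props = parquet::WriterProperties::Builder()
                 .enable_content_defined_chunking()  // assumed builder method
                 ->build();
```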


Review comment on the chunker's Roll method:

    bool Roll(const T value) {
      constexpr size_t BYTE_WIDTH = sizeof(T);
      chunk_size_ += BYTE_WIDTH;
      // if (chunk_size_ < min_len_) {

@kszucs (Member, Author): Skipping bytes until the min size is reached speeds up the boundary detection quite a lot.
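The skip can be sketched as follows (a self-contained illustration of the idea; the names mirror the snippet above, but the toy multiplicative hash step stands in for the real gearhash table, and the forced cut at max_len_ is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the skip described above: while the running chunk is shorter
// than min_len_, Roll() only advances the size counter and skips the
// per-byte hashing and mask test entirely, which is where the speed-up
// comes from. Names and defaults are illustrative.
class Chunker {
 public:
  Chunker(uint64_t mask, int64_t min_len, int64_t max_len)
      : mask_(mask), min_len_(min_len), max_len_(max_len) {}

  template <typename T>
  bool Roll(const T value) {
    constexpr size_t BYTE_WIDTH = sizeof(T);
    chunk_size_ += BYTE_WIDTH;
    if (chunk_size_ < min_len_) {
      return false;  // fast path: no boundary is allowed below the minimum
    }
    const auto* bytes = reinterpret_cast<const uint8_t*>(&value);
    bool boundary = false;
    for (size_t i = 0; i < BYTE_WIDTH; ++i) {
      hash_ = (hash_ << 1) + (bytes[i] + 1) * 0x9E3779B97F4A7C15ULL;  // toy gear step
      boundary = boundary || (hash_ & mask_) == 0;
    }
    if (chunk_size_ >= max_len_) {
      boundary = true;  // force a cut so chunks never exceed the maximum
    }
    if (boundary) {
      chunk_size_ = 0;
      hash_ = 0;
    }
    return boundary;
  }

 private:
  uint64_t hash_ = 0;
  uint64_t mask_;
  int64_t min_len_;
  int64_t max_len_;
  int64_t chunk_size_ = 0;
};
```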

Review comment on the chunker's default parameters:

    const uint64_t MASK = 0xffff00000000000;
    // const int MIN_LEN = 65536 / 8;
    // const int MAX_LEN = 65536 * 2;
    const int64_t MIN_LEN = 256 * 1024;

@kszucs (Member, Author): These default values are subject to change, especially because the default maximum page size is 1 MiB.
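As a sanity check on such constants: a boundary can only fire when all bits selected by the mask are zero, so for a roughly uniform hash the expected distance between boundary candidates is 2^popcount(MASK) bytes (before any minimum-size skip is applied). The mask above has 16 set bits, i.e. about 64 KiB between candidates. A small helper makes the relationship explicit (illustrative, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits; a uniform hash satisfies (hash & mask) == 0 with
// probability 2^-PopCount(mask) at each position.
int PopCount(uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}

// Expected gap between boundary candidates for a given mask.
uint64_t ExpectedBoundaryGap(uint64_t mask) {
  return uint64_t{1} << PopCount(mask);
}
```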

@mapleFU (Member) left a comment:

Is CDC a part of the Parquet spec? Or is it a PoC?

@kszucs (Member, Author) commented Jan 28, 2025

> Is CDC a part of the Parquet spec? Or is it a PoC?

It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.

@kszucs force-pushed the content-defined-chunking branch from d1076a8 to c7a0b3a on January 30, 2025 at 18:33