
[C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360

Draft · wants to merge 6 commits into main from the content-defined-chunking branch

Conversation

@kszucs (Member) commented Jan 27, 2025

Rationale for this change

I have been working on improving Parquet's deduplication efficiency for content-addressable storage systems. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited to uncompressed row-major formats. Even so, thanks to Parquet's unique features I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column.
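As background on the approach: a gearhash-based content-defined chunker keeps a rolling hash that is updated one byte at a time from a 256-entry lookup table, and declares a chunk boundary whenever the bits selected by a mask are all zero, so boundaries depend only on local content and tend to survive insertions and deletions. A minimal self-contained sketch (the gear table, mask, and names here are illustrative, not the PR's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative gearhash chunker: hash_ = (hash_ << 1) + gear_[byte], and a
// chunk boundary is declared whenever the bits selected by mask_ are zero.
class GearChunker {
 public:
  explicit GearChunker(uint64_t mask) : mask_(mask) {
    // Deterministic pseudo-random gear table (xorshift64; illustrative only).
    uint64_t seed = 0x9E3779B97F4A7C15ULL;
    for (uint64_t& g : gear_) {
      seed ^= seed << 13;
      seed ^= seed >> 7;
      seed ^= seed << 17;
      g = seed;
    }
  }

  // Feed one byte; true means a chunk boundary falls after this byte.
  bool Roll(uint8_t byte) {
    hash_ = (hash_ << 1) + gear_[byte];
    return (hash_ & mask_) == 0;
  }

 private:
  uint64_t hash_ = 0;
  uint64_t mask_;
  uint64_t gear_[256];
};

// Split a buffer into content-defined chunk lengths.
std::vector<size_t> Chunk(const std::vector<uint8_t>& data, uint64_t mask) {
  GearChunker chunker(mask);
  std::vector<size_t> lengths;
  size_t start = 0;
  for (size_t i = 0; i < data.size(); ++i) {
    if (chunker.Roll(data[i])) {
      lengths.push_back(i + 1 - start);
      start = i + 1;
    }
  }
  if (start < data.size()) {
    lengths.push_back(data.size() - start);  // trailing partial chunk
  }
  return lengths;
}
```

With a mask of k set bits, a boundary fires with probability 2^-k per byte, so the mask choice controls the average chunk size.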

A purpose-built evaluation tool is available at https://github.com/kszucs/de

Deduplication efficiency for all revisions of openfoodfacts/product-database/food.parquet

❯ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/openfoodfacts/product-database.git ~/Datasets/product-database
❯ de revisions -d /tmp/food ~/Datasets/product-database/food.parquet
food.parquet has 32 revisions
Checking out 2e19b51
Checking out 1f84d31
Checking out d31d108
Checking out 9cd809c
Checking out 41e5f38
Checking out 9a30ddd
...
❯ de stats /tmp/food 
Estimating deduplication for Parquet
Estimating deduplication for CDC Snappy
Estimating deduplication for CDC ZSTD
┏━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃            ┃            ┃            ┃             ┃              ┃   Compressed ┃
┃            ┃            ┃            ┃  Compressed ┃ Deduplicated ┃ Deduplicated ┃
┃ Title      ┃ Total Size ┃ Chunk Size ┃  Chunk Size ┃        Ratio ┃        Ratio ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Parquet    │  182.6 GiB │  148.0 GiB │   140.5 GiB │          81% │          77% │
│ CDC Snappy │  178.3 GiB │   75.5 GiB │    73.0 GiB │          42% │          41% │
│ CDC ZSTD   │  109.6 GiB │   55.9 GiB │    55.6 GiB │          51% │          51% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┴──────────────┘
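For reference, deduplication ratios of this shape can be estimated by chunking every revision, identifying each chunk by its content, and counting only first occurrences; the exact method used by the `de` tool may differ. A toy sketch, with chunk payloads standing in for content hashes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Toy deduplication estimate across file revisions: each revision is a list
// of chunks; a chunk's bytes count as stored only the first time they are
// seen. The ratio of unique bytes to total bytes corresponds to the style of
// "Deduplicated Ratio" reported above (lower means better deduplication).
double DedupRatio(const std::vector<std::vector<std::string>>& revisions) {
  std::unordered_set<std::string> seen;  // payload stands in for a content hash
  size_t total = 0;
  size_t unique = 0;
  for (const auto& revision : revisions) {
    for (const auto& chunk : revision) {
      total += chunk.size();
      if (seen.insert(chunk).second) {
        unique += chunk.size();  // first occurrence: bytes must be stored
      }
    }
  }
  return total == 0 ? 0.0 : static_cast<double>(unique) / static_cast<double>(total);
}
```

This also explains the first data row: 148.0 GiB of unique chunk bytes out of 182.6 GiB total gives the 81% ratio shown for plain Parquet.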

What changes are included in this PR?

Are these changes tested?

Not yet.

Are there any user-facing changes?

There is a new Parquet writer property called content_defined_chunking, which is subject to renaming.
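Purely as an illustration of what opting in might look like from C++ once the property stabilizes (the builder method below is hypothetical; only the property name content_defined_chunking comes from this PR, and it is explicitly subject to renaming):

```cpp
// Hypothetical sketch only -- not the final API.
auto props = parquet::WriterProperties::Builder()
                 .enable_content_defined_chunking()  // assumed builder method
                 ->build();
```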


Review comment on the chunker's Roll method:

    bool Roll(const T value) {
      constexpr size_t BYTE_WIDTH = sizeof(T);
      chunk_size_ += BYTE_WIDTH;
      // if (chunk_size_ < min_len_) {

@kszucs (Member, Author): Skipping bytes until the min size is reached speeds up the boundary detection quite a lot.
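The skip can be sketched as follows (a self-contained illustration of the idea; the names mirror the snippet above, but the toy multiplicative hash step stands in for the real gearhash table, and the forced cut at max_len_ is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the skip described above: while the running chunk is shorter
// than min_len_, Roll() only advances the size counter and skips the
// per-byte hashing and mask test entirely, which is where the speed-up
// comes from. Names and defaults are illustrative.
class Chunker {
 public:
  Chunker(uint64_t mask, int64_t min_len, int64_t max_len)
      : mask_(mask), min_len_(min_len), max_len_(max_len) {}

  template <typename T>
  bool Roll(const T value) {
    constexpr size_t BYTE_WIDTH = sizeof(T);
    chunk_size_ += BYTE_WIDTH;
    if (chunk_size_ < min_len_) {
      return false;  // fast path: no boundary is allowed below the minimum
    }
    const auto* bytes = reinterpret_cast<const uint8_t*>(&value);
    bool boundary = false;
    for (size_t i = 0; i < BYTE_WIDTH; ++i) {
      hash_ = (hash_ << 1) + (bytes[i] + 1) * 0x9E3779B97F4A7C15ULL;  // toy gear step
      boundary = boundary || (hash_ & mask_) == 0;
    }
    if (chunk_size_ >= max_len_) {
      boundary = true;  // force a cut so chunks never exceed the maximum
    }
    if (boundary) {
      chunk_size_ = 0;
      hash_ = 0;
    }
    return boundary;
  }

 private:
  uint64_t hash_ = 0;
  uint64_t mask_;
  int64_t min_len_;
  int64_t max_len_;
  int64_t chunk_size_ = 0;
};
```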

Review comment on the chunker's default parameters:

    const uint64_t MASK = 0xffff00000000000;
    // const int MIN_LEN = 65536 / 8;
    // const int MAX_LEN = 65536 * 2;
    const int64_t MIN_LEN = 256 * 1024;

@kszucs (Member, Author): These default values are subject to change, especially because the default maximum page size is 1 MiB.
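As a sanity check on such constants: a boundary can only fire when all bits selected by the mask are zero, so for a roughly uniform hash the expected distance between boundary candidates is 2^popcount(MASK) bytes (before any minimum-size skip is applied). The mask above has 16 set bits, i.e. about 64 KiB between candidates. A small helper makes the relationship explicit (illustrative, not from the PR):

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits; a uniform hash satisfies (hash & mask) == 0 with
// probability 2^-PopCount(mask) at each position.
int PopCount(uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}

// Expected gap between boundary candidates for a given mask.
uint64_t ExpectedBoundaryGap(uint64_t mask) {
  return uint64_t{1} << PopCount(mask);
}
```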

@mapleFU (Member) left a comment:

Is CDC a part of the Parquet spec? Or is it a PoC?

@kszucs (Member, Author) commented Jan 28, 2025

> Is CDC a part of the Parquet spec? Or is it a PoC?

It is not. You can think of it as an implementation-specific feature, similar to the existing options that control how record batches and pages are split.

@kszucs force-pushed the content-defined-chunking branch from d1076a8 to c7a0b3a on January 30, 2025 at 18:33