[C++][Python][Parquet] Implement Content-Defined Chunking for the Parquet writer #45360
cpp/src/parquet/column_chunker.h
```cpp
bool Roll(const T value) {
  constexpr size_t BYTE_WIDTH = sizeof(T);
  chunk_size_ += BYTE_WIDTH;
  // if (chunk_size_ < min_len_) {
```
Skipping bytes until the min size is reached speeds up the boundary detection quite a lot.
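To illustrate the optimization, here is a minimal sketch, not the PR's actual implementation; the member names mirror the snippet above, but the gear table and constants are placeholders:

```cpp
#include <cstdint>

// 256 pseudo-random 64-bit constants; definition omitted in this sketch.
extern const uint64_t kGearTable[256];

// Minimal gearhash roller with the min-size skip described above.
struct GearChunker {
  uint64_t hash_ = 0;
  uint64_t chunk_size_ = 0;
  uint64_t min_len_ = 0;
  uint64_t mask_ = 0;

  // Returns true when a chunk boundary is found at this byte.
  bool Roll(uint8_t byte) {
    ++chunk_size_;
    if (chunk_size_ < min_len_) {
      // Fast path: no boundary can be emitted before min_len_, so the table
      // lookup and mask test are skipped entirely for these bytes.
      return false;
    }
    hash_ = (hash_ << 1) + kGearTable[byte];
    return (hash_ & mask_) == 0;
  }

  void Reset() { hash_ = 0; chunk_size_ = 0; }
};
```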
```cpp
const uint64_t MASK = 0xffff00000000000;
// const int MIN_LEN = 65536 / 8;
// const int MAX_LEN = 65536 * 2;
const int64_t MIN_LEN = 256 * 1024;
```
These default values are subject to change, especially since the default maximum page size is 1 MiB.
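For context (not stated in the thread, but typical for gearhash-based CDC): if a boundary is declared when `(hash & MASK) == 0` and the hash bits are roughly uniform, a mask with k set bits yields an average of about 2^k bytes between boundaries, on top of the minimum. A quick sketch under that assumption:

```cpp
#include <bit>
#include <cstdint>

// Rough average chunk size implied by a gearhash mask, assuming boundaries
// fire when (hash & mask) == 0 and hash bits are uniformly distributed.
// A mask like 0xffff00000000000 has 16 set bits, i.e. ~64 KiB between
// boundaries on average, added on top of the configured minimum length.
uint64_t ExpectedAvgChunkSize(uint64_t mask, uint64_t min_len) {
  const int set_bits = std::popcount(mask);  // C++20
  return min_len + (uint64_t{1} << set_bits);
}
```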
Is CDC a part of the Parquet spec, or is it a PoC?
It is not. You can think of it as an implementation-specific feature, similar to the existing options for specifying how record batches and pages are split.
Rationale for this change
I have been working on improving Parquet's deduplication efficiency for content-addressable storage systems. These systems generally use some kind of CDC (content-defined chunking) algorithm, which is better suited to uncompressed row-major formats. However, thanks to Parquet's unique features, I was able to reach good deduplication results by chunking data pages consistently, maintaining a gearhash-based chunker for each column.
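As a rough sketch of the idea (my reading, not the PR's actual writer code): each column writer feeds values through its own chunker and closes the current data page at every content-defined boundary, so unchanged value runs tend to produce identical pages across file revisions. `BufferValue` and `FlushCurrentPage` are hypothetical stand-ins for the writer's real buffering and page-flush logic, and `GearChunker` is the toy roller sketched in the review discussion above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the column writer's real buffering / flushing.
void BufferValue(int64_t value);
void FlushCurrentPage();

// Feed a batch of values through a per-column chunker and close the current
// data page at every content-defined boundary.
void WriteBatch(GearChunker& chunker, const std::vector<int64_t>& values) {
  for (int64_t v : values) {
    const auto* bytes = reinterpret_cast<const uint8_t*>(&v);
    bool boundary = false;
    for (size_t i = 0; i < sizeof(v); ++i) {
      boundary |= chunker.Roll(bytes[i]);
    }
    BufferValue(v);
    if (boundary) {
      FlushCurrentPage();
      chunker.Reset();  // start a fresh chunk for the next page
    }
  }
}
```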
A purpose-built evaluation tool is available at https://github.com/kszucs/de
(Figure: Deduplication efficiency for all revisions of openfoodfacts/product-database/food.parquet)
What changes are included in this PR?
Are these changes tested?
Not yet.
Are there any user-facing changes?
There is a new Parquet writer property called `content_defined_chunking`, which is subject to renaming.
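A hypothetical usage sketch for the C++ writer properties; the PR only states that a property named `content_defined_chunking` exists, so the builder method name below is a guess and may not match the final API:

```cpp
#include <memory>

#include "parquet/properties.h"

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  // enable_content_defined_chunking() is a hypothetical method name;
  // the actual property spelling is subject to change per the PR notes.
  return builder.enable_content_defined_chunking()->build();
}
```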