
[log] Arrow Compression support estimate compression ratio in client #330

Merged
merged 2 commits into alibaba:main on Feb 1, 2025

Conversation

@swuferhong (Collaborator) commented Jan 21, 2025

Purpose

#326

We introduced arrow compression in #187. However, the client currently calculates the batch size from the data before compression, which causes a discrepancy between the actual batch size sent to the server and the batch size configured by the user, and in turn hurts compression efficiency. Therefore, this PR introduces an estimator that continuously updates the latest compression ratio in the client, making the compressed batch size closer to the user's specified value.
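As an illustration of the idea, here is a minimal sketch of such an estimator; the class and method names are hypothetical, not the actual API added by this PR:

/**
 * Minimal sketch of a client-side compression ratio estimator (hypothetical
 * names, not this PR's API). The writer buffers roughly
 * batchSize / estimatedRatio bytes of uncompressed data so the compressed
 * batch lands close to the configured size, and refines the ratio from each
 * observed compression result.
 */
public class CompressionRatioSketch {
    // ratio = compressedSize / uncompressedSize; start at 1.0 (no compression).
    private volatile float estimatedRatio = 1.0f;

    /** How many uncompressed bytes to buffer for a target compressed batch size. */
    public int uncompressedBudget(int targetBatchSize) {
        return (int) (targetBatchSize / estimatedRatio);
    }

    /** Update the estimate from an observed compression result. */
    public void observe(long uncompressedBytes, long compressedBytes) {
        float observed = (float) compressedBytes / uncompressedBytes;
        // Exponentially weighted moving average so the estimate tracks recent batches.
        estimatedRatio = 0.9f * estimatedRatio + 0.1f * observed;
    }
}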

Tests

API and Format

Documentation

DATA1_ROW_TYPE,
new Object[] {
    recordCount,
    "a a a"

Member commented:

Use guava Strings.repeat("a", 50) instead of counting characters manually, to improve code readability.
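For example, with guava (the row construction follows the snippet above; the wrapper class is just for illustration):

import com.google.common.base.Strings;

class RepeatExample {
    static Object[] row(int recordCount) {
        // The intended length is explicit instead of a hand-counted literal.
        return new Object[] {recordCount, Strings.repeat("a", 50)};
    }
}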

/**
 * The factor which is used to estimate the compression ratio of the serialized {@link
 * ArrowRecordBatch}.
 */
private static final float COMPRESSION_RATE_ESTIMATION_FACTOR = 1.05f;

Member commented:

I believe the purpose of this factor is to prevent memory shortages when the compression ratio deteriorates for a particular batch, by adding a 5% margin for that special case. However, we already have BUFFER_USAGE_RATIO, which achieves the same goal, so there is no need to add another factor.
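For context, a guess at how the factor is applied (estimatedBytesWritten appears in the diff below; this body and the surrounding fields are assumptions, not the PR's code):

// Hypothetical sketch: pad the estimated compressed size by 5% so a batch
// that compresses slightly worse than the running estimate still fits its
// pre-allocated buffer.
class EstimationSketch {
    private static final float COMPRESSION_RATE_ESTIMATION_FACTOR = 1.05f;
    private volatile float estimatedCompressionRatio = 1.0f;

    int estimatedBytesWritten(int bodyLength) {
        return (int) (bodyLength * estimatedCompressionRatio * COMPRESSION_RATE_ESTIMATION_FACTOR);
    }
}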

- int currentSize = metadataLength + bodyLength;
+ int estimatedBodyLength = estimatedBytesWritten(bodyLength);
+ int currentSize = metadataLength + estimatedBodyLength;

Member commented:

We also need to update sizeInBytes() to return the estimated size; otherwise, RecordAccumulator#drain can't respect the maxSize parameter.
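A sketch of that follow-up (names follow the diff above; the exact signature is an assumption, not the PR's actual code):

// Report the estimated on-wire size so RecordAccumulator#drain compares
// batches against its maxSize parameter using the same compression estimate.
public int sizeInBytes() {
    return metadataLength + estimatedBytesWritten(bodyLength);
}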

*/
public static final float COMPRESSION_RATIO_DETERIORATE_STEP = 0.05f;

private final Map<Long, Map<String, Float>> compressionRatio;

Member commented:

A table only has one compression type and compression level, so we only need to maintain one compression ratio for each table.
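A sketch of the suggested simplification, keyed by tableId alone (names are illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** One ratio per table, since a table has a single compression type and level. */
class PerTableRatioSketch {
    private static final float INITIAL_RATIO = 1.0f;

    // Keyed by tableId only; no nested per-compression map is needed.
    private final Map<Long, Float> compressionRatio = new ConcurrentHashMap<>();

    float estimation(long tableId) {
        return compressionRatio.getOrDefault(tableId, INITIAL_RATIO);
    }

    void updateEstimation(long tableId, float observedRatio) {
        compressionRatio.put(tableId, observedRatio);
    }
}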


// reset the compression ratio.
this.estimatedCompressionRatio =
        compressionRatioEstimator.estimation(tableId, compressionInfo);

Member commented:

This should be reset in the reset() method.
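A sketch of where that would live, using the call from the snippet above (the rest of reset() is assumed):

public void reset() {
    // ... existing cleanup of builder state ...

    // Re-read the latest ratio on every reset so a reused batch does not
    // carry a stale compression estimate.
    this.estimatedCompressionRatio =
            compressionRatioEstimator.estimation(tableId, compressionInfo);
}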

wuchong added a commit to swuferhong/fluss that referenced this pull request Jan 31, 2025
wuchong added a commit to swuferhong/fluss that referenced this pull request Jan 31, 2025
@wuchong wuchong merged commit f96c392 into alibaba:main Feb 1, 2025
2 checks passed
Successfully merging this pull request may close these issues.

[Feature] Arrow Compression support compression ratio estimator in client to estimate the real batch size