[log] Arrow compression: support estimating the compression ratio in the client #330
Conversation
DATA1_ROW_TYPE,
new Object[] {
    recordCount,
    "a a a"
Use Guava's Strings.repeat("a", 50) instead of counting characters manually, to improve code readability.
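For illustration, a minimal self-contained version of the suggestion (the class, the recordCount value, and the print are made up; only Strings.repeat("a", 50) comes from the comment):

```java
import com.google.common.base.Strings;

public class RepeatExample {
    public static void main(String[] args) {
        int recordCount = 10; // illustrative value, not from the test
        // Strings.repeat makes the intended length explicit instead of a
        // hand-counted 50-character literal.
        Object[] row = new Object[] {recordCount, Strings.repeat("a", 50)};
        System.out.println(((String) row[1]).length()); // prints 50
    }
}
```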
 * The factor which is used to estimate the compression ratio of the serialized {@link
 * ArrowRecordBatch}.
 */
private static final float COMPRESSION_RATE_ESTIMATION_FACTOR = 1.05f;
I believe the purpose of this factor is to prevent memory shortages when the compression rate drops for a particular batch, by adding a 5% margin for that special case. However, we already have BUFFER_USAGE_RATIO, which achieves the same goal, so there is no need to add another factor.
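To illustrate the point, a minimal sketch with a made-up value for BUFFER_USAGE_RATIO (the real constant lives elsewhere in the codebase): a single headroom ratio already caps buffer usage, so stacking a second estimation factor on top would reserve the same margin twice.

```java
public class BufferSizingSketch {
    // Assumed stand-in for the project's existing BUFFER_USAGE_RATIO.
    static final float BUFFER_USAGE_RATIO = 0.96f;

    // One safety margin caps how much of the buffer may be filled; no
    // extra COMPRESSION_RATE_ESTIMATION_FACTOR is multiplied in on top.
    static int usableBytes(int bufferCapacity) {
        return (int) (bufferCapacity * BUFFER_USAGE_RATIO);
    }

    public static void main(String[] args) {
        System.out.println(usableBytes(64 * 1024)); // 62914 of 65536 bytes
    }
}
```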
- int currentSize = metadataLength + bodyLength;
+ int estimatedBodyLength = estimatedBytesWritten(bodyLength);
+ int currentSize = metadataLength + estimatedBodyLength;
We also need to update sizeInBytes() to return the estimated size; otherwise RecordAccumulator#drain can't respect the maxSize parameter.
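A minimal sketch of the fix being asked for, under assumed fields; only estimatedBytesWritten, sizeInBytes(), RecordAccumulator#drain, and maxSize come from the diff and this comment:

```java
public class ArrowBatchSizeSketch {
    private final int metadataLength;
    private final int bodyLength;
    private final float estimatedCompressionRatio;

    ArrowBatchSizeSketch(int metadataLength, int bodyLength, float ratio) {
        this.metadataLength = metadataLength;
        this.bodyLength = bodyLength;
        this.estimatedCompressionRatio = ratio;
    }

    // Scale the uncompressed body length by the current ratio estimate.
    private int estimatedBytesWritten(int uncompressedBodyLength) {
        return (int) (uncompressedBodyLength * estimatedCompressionRatio);
    }

    // Report the same estimate used for batch sizing, so that
    // RecordAccumulator#drain can respect its maxSize parameter.
    public int sizeInBytes() {
        return metadataLength + estimatedBytesWritten(bodyLength);
    }

    public static void main(String[] args) {
        // 128 + (int) (10_000 * 0.4f) = 4128
        System.out.println(new ArrowBatchSizeSketch(128, 10_000, 0.4f).sizeInBytes());
    }
}
```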
 */
public static final float COMPRESSION_RATIO_DETERIORATE_STEP = 0.05f;

private final Map<Long, Map<String, Float>> compressionRatio;
A table only has one compression type and compression level, so we only need to maintain one compression ratio per table.
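Roughly, the simplification would collapse the inner map (class and method names below are illustrative, not the project's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CompressionRatioStoreSketch {
    private static final float DEFAULT_RATIO = 1.0f;

    // tableId -> latest ratio, instead of tableId -> compression key -> ratio,
    // since each table has exactly one compression type and level.
    private final Map<Long, Float> compressionRatio = new ConcurrentHashMap<>();

    public float estimation(long tableId) {
        return compressionRatio.getOrDefault(tableId, DEFAULT_RATIO);
    }

    public void update(long tableId, float observedRatio) {
        compressionRatio.put(tableId, observedRatio);
    }
}
```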
// reset the compression ratio.
this.estimatedCompressionRatio =
        compressionRatioEstimator.estimation(tableId, compressionInfo);
This should be reset in the reset() method.
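A small sketch of where the refresh would live; the fields and latestEstimation() are simplified stand-ins for the diff's compressionRatioEstimator.estimation(tableId, compressionInfo):

```java
public class WriteBatchResetSketch {
    private int recordCount;
    private float estimatedCompressionRatio;

    // Stand-in for compressionRatioEstimator.estimation(tableId, compressionInfo).
    private float latestEstimation() {
        return 0.4f; // illustrative value
    }

    // reset() is where per-batch state is already cleared for reuse, so
    // refreshing the ratio here keeps every reused batch up to date.
    public void reset() {
        recordCount = 0;
        estimatedCompressionRatio = latestEstimation();
    }
}
```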
Purpose
#326
We introduced Arrow compression in #187. However, the client currently calculates the batch size using the data before compression, which causes a discrepancy between the actual batch size sent to the server and the batch size configured by the user, and in turn hurts compression efficiency. Therefore, in this PR we introduce an estimator that continuously updates the latest compression ratio in the client, making the compressed batch size closer to the user's specified value.
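One plausible shape for such an estimator is sketched below; only COMPRESSION_RATIO_DETERIORATE_STEP = 0.05f appears in the diff, while the improving step, bounds, and class layout are assumptions. The idea is to react quickly when a batch compresses worse than estimated and to converge slowly when it compresses better, so the compressed batch size tracks the configured batch size.

```java
public class CompressionRatioEstimatorSketch {
    // From the diff: applied when compression turns out worse than estimated.
    public static final float COMPRESSION_RATIO_DETERIORATE_STEP = 0.05f;
    // Assumed slow convergence step for when compression improves.
    public static final float COMPRESSION_RATIO_IMPROVING_STEP = 0.005f;
    private static final float MAX_RATIO = 1.0f;

    // Start pessimistic (no compression) until real batches are observed.
    private volatile float ratio = MAX_RATIO;

    public float estimation() {
        return ratio;
    }

    // observedRatio = compressedSize / uncompressedSize of the last batch.
    public synchronized void update(float observedRatio) {
        if (observedRatio > ratio) {
            // Worse than expected: bump the estimate up fast so the next
            // batch is not sized over the configured limit.
            ratio = Math.min(MAX_RATIO, ratio + COMPRESSION_RATIO_DETERIORATE_STEP);
        } else {
            // Better than expected: creep down toward the observation.
            ratio = Math.max(observedRatio, ratio - COMPRESSION_RATIO_IMPROVING_STEP);
        }
    }
}
```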
Tests
API and Format
Documentation