
[log] Arrow Compression support estimate compression ratio in client #330

Merged
merged 2 commits into alibaba:main on Feb 1, 2025

Conversation

@swuferhong (Collaborator) commented Jan 21, 2025

Purpose

#326

We introduced arrow compression in #187. However, the client currently calculates the batch size from the data before compression, which causes a discrepancy between the actual batch size sent to the server and the batch size configured by the user, and in turn hurts compression efficiency. Therefore, this PR introduces an estimator that continuously updates the latest compression ratio in the client, making the compressed batch size closer to the user's specified value.
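As an illustration of the idea, here is a minimal sketch of such an estimator; the class and method names are hypothetical, not the actual API added by this PR:

/**
 * Minimal sketch of a client-side compression ratio estimator (hypothetical
 * names, not this PR's API). The writer buffers roughly
 * batchSize / estimatedRatio bytes of uncompressed data so the compressed
 * batch lands close to the configured size, and refines the ratio from each
 * observed compression result.
 */
public class CompressionRatioSketch {
    // ratio = compressedSize / uncompressedSize; start at 1.0 (no compression).
    private volatile float estimatedRatio = 1.0f;

    /** How many uncompressed bytes to buffer for a target compressed batch size. */
    public int uncompressedBudget(int targetBatchSize) {
        return (int) (targetBatchSize / estimatedRatio);
    }

    /** Update the estimate from an observed compression result. */
    public void observe(long uncompressedBytes, long compressedBytes) {
        float observed = (float) compressedBytes / uncompressedBytes;
        // Exponentially weighted moving average so the estimate tracks recent batches.
        estimatedRatio = 0.9f * estimatedRatio + 0.1f * observed;
    }
}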

Tests

API and Format

Documentation

DATA1_ROW_TYPE,
new Object[] {
    recordCount,
    "a a a"

Member commented:

Use guava Strings.repeat("a", 50) instead of counting characters manually, to improve code readability.
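For example, with guava (the row construction follows the snippet above; the wrapper class is just for illustration):

import com.google.common.base.Strings;

class RepeatExample {
    static Object[] row(int recordCount) {
        // The intended length is explicit instead of a hand-counted literal.
        return new Object[] {recordCount, Strings.repeat("a", 50)};
    }
}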

/**
 * The factor which is used to estimate the compression ratio of the serialized {@link
 * ArrowRecordBatch}.
 */
private static final float COMPRESSION_RATE_ESTIMATION_FACTOR = 1.05f;

Member commented:

I believe the purpose of this factor is to prevent memory shortages when the compression ratio deteriorates for a particular batch, by adding a 5% margin for that special case. However, we already have BUFFER_USAGE_RATIO, which achieves the same goal, so there is no need to add another factor.
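For context, a guess at how the factor is applied (estimatedBytesWritten appears in the diff below; this body and the surrounding fields are assumptions, not the PR's code):

// Hypothetical sketch: pad the estimated compressed size by 5% so a batch
// that compresses slightly worse than the running estimate still fits its
// pre-allocated buffer.
class EstimationSketch {
    private static final float COMPRESSION_RATE_ESTIMATION_FACTOR = 1.05f;
    private volatile float estimatedCompressionRatio = 1.0f;

    int estimatedBytesWritten(int bodyLength) {
        return (int) (bodyLength * estimatedCompressionRatio * COMPRESSION_RATE_ESTIMATION_FACTOR);
    }
}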

- int currentSize = metadataLength + bodyLength;
+ int estimatedBodyLength = estimatedBytesWritten(bodyLength);
+ int currentSize = metadataLength + estimatedBodyLength;

Member commented:

We also need to update sizeInBytes() to return the estimated size; otherwise, RecordAccumulator#drain can't respect the maxSize parameter.
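A sketch of that follow-up (names follow the diff above; the exact signature is an assumption, not the PR's actual code):

// Report the estimated on-wire size so RecordAccumulator#drain compares
// batches against its maxSize parameter using the same compression estimate.
public int sizeInBytes() {
    return metadataLength + estimatedBytesWritten(bodyLength);
}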

*/
public static final float COMPRESSION_RATIO_DETERIORATE_STEP = 0.05f;

private final Map<Long, Map<String, Float>> compressionRatio;

Member commented:

A table only has one compression type and compression level, so we only need to maintain one compression ratio for each table.
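A sketch of the suggested simplification, keyed by tableId alone (names are illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** One ratio per table, since a table has a single compression type and level. */
class PerTableRatioSketch {
    private static final float INITIAL_RATIO = 1.0f;

    // Keyed by tableId only; no nested per-compression map is needed.
    private final Map<Long, Float> compressionRatio = new ConcurrentHashMap<>();

    float estimation(long tableId) {
        return compressionRatio.getOrDefault(tableId, INITIAL_RATIO);
    }

    void updateEstimation(long tableId, float observedRatio) {
        compressionRatio.put(tableId, observedRatio);
    }
}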


// reset the compression ratio.
this.estimatedCompressionRatio =
        compressionRatioEstimator.estimation(tableId, compressionInfo);

Member commented:

This should be reset in the reset() method.
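A sketch of where that would live, using the call from the snippet above (the rest of reset() is assumed):

public void reset() {
    // ... existing cleanup of builder state ...

    // Re-read the latest ratio on every reset so a reused batch does not
    // carry a stale compression estimate.
    this.estimatedCompressionRatio =
            compressionRatioEstimator.estimation(tableId, compressionInfo);
}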

wuchong added a commit to swuferhong/fluss that referenced this pull request Jan 31, 2025
wuchong added a commit to swuferhong/fluss that referenced this pull request Jan 31, 2025
@wuchong wuchong merged commit f96c392 into alibaba:main Feb 1, 2025
2 checks passed
Successfully merging this pull request may close these issues.

[Feature] Arrow Compression support compression ratio estimator in client to estimate the real batch size