[Spark] Implement new compaction strategy #585
New compaction strategy:
The new compaction process is divided into two steps:
Each compaction run first compacts the incremental files and writes the result into the compacted directory; it then compacts the files already in the compacted directory. Whether the files in the compacted directory are compacted again is controlled by two conditions: 1. the directory's file count exceeds COMPACTION_LEVEL_FILE_NUM_LIMIT (20); 2. a file's size exceeds COMPACTION_LEVEL_MAX_FILE_SIZE (5 GB).
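A minimal sketch of the two threshold checks described above. The constant names come from the PR text; the class and method names are hypothetical, and the 5 GB value is interpreted here as 5 GiB in bytes:

```java
public class CompactionTriggers {
    // Constants named in the PR description; values as stated there.
    static final int COMPACTION_LEVEL_FILE_NUM_LIMIT = 20;
    static final long COMPACTION_LEVEL_MAX_FILE_SIZE = 5L << 30; // 5 GB (as GiB in bytes)

    // Compaction of the compacted directory is triggered once it holds
    // more than COMPACTION_LEVEL_FILE_NUM_LIMIT files.
    static boolean fileCountExceeded(int fileCount) {
        return fileCount > COMPACTION_LEVEL_FILE_NUM_LIMIT;
    }

    // A file above COMPACTION_LEVEL_MAX_FILE_SIZE is considered oversized;
    // per the text, two consecutive oversized files are not merged again.
    static boolean oversized(long fileSizeBytes) {
        return fileSizeBytes > COMPACTION_LEVEL_MAX_FILE_SIZE;
    }
}
```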
When incremental files are compacted into the compacted directory, compaction of that directory is triggered if it then contains more than 20 files. During this pass, files are merged in chronological order from front to back, checking each pair of consecutive files: if both files are larger than 5 GB, they are left as they are and not compacted again; otherwise the merge operation is triggered.
In this way, as merges accumulate, the front file keeps growing, and once it exceeds the size threshold (5 GB) it stops participating in merges. This strikes a balance between minimizing IO and keeping compaction execution simple, which should meet the compaction performance requirements.
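The front-to-back merge pass described above can be sketched as follows. This is a simplified model under stated assumptions, not the PR's actual implementation: file sizes are given in GB, the class and method names are invented, and a pair is merged whenever at least one of the two files is under the 5 GB threshold:

```java
import java.util.ArrayList;
import java.util.List;

public class LevelCompactionPlanner {
    static final long MAX_FILE_SIZE_GB = 5; // threshold from the PR text

    // Simulate one merge pass over files in chronological order (sizes in GB),
    // returning the resulting file sizes after compaction.
    static List<Long> planMerges(long[] sizesGb) {
        List<Long> result = new ArrayList<>();
        long current = -1;
        for (long size : sizesGb) {
            if (current < 0) {
                current = size; // the front file becomes the accumulator
            } else if (current > MAX_FILE_SIZE_GB && size > MAX_FILE_SIZE_GB) {
                // both consecutive files exceed the threshold: no further merge,
                // the next file becomes the new accumulator
                result.add(current);
                current = size;
            } else {
                current += size; // merge the next file into the front file
            }
        }
        if (current >= 0) {
            result.add(current);
        }
        return result;
    }
}
```

For example, three 2 GB files merge into a single 6 GB file, which then exceeds the threshold and stops growing once its neighbour is also oversized.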