[Spark] Implement new compaction strategy #585

Open
wants to merge 6 commits into base: main

Conversation

F-PHantam
Contributor

New compaction strategy:

The new compaction is performed in two steps:

  1. Compaction of incremental files;
  2. Compaction of files in the compacted directory.

Each compaction operation first compacts the incremental files and writes the result into the compacted directory; it then compacts the files within the compacted directory. Whether files in the compacted directory are compacted again is controlled by two conditions: 1. the number of files exceeds COMPACTION_LEVEL_FILE_NUM_LIMIT (20); 2. file sizes are compared against COMPACTION_LEVEL_MAX_FILE_SIZE (5 GB).
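Below is a minimal sketch of these two thresholds and the directory-level trigger check, assuming the defaults quoted above (20 files, 5 GB); the object and method names are illustrative only and are not the actual LakeSoul configuration API.

```scala
// Illustrative sketch only: constant and method names are hypothetical,
// mirroring COMPACTION_LEVEL_FILE_NUM_LIMIT and COMPACTION_LEVEL_MAX_FILE_SIZE
// from the description, not the real LakeSoul config keys.
object CompactionThresholds {
  val CompactionLevelFileNumLimit: Int = 20                       // COMPACTION_LEVEL_FILE_NUM_LIMIT
  val CompactionLevelMaxFileSize: Long = 5L * 1024 * 1024 * 1024  // COMPACTION_LEVEL_MAX_FILE_SIZE (5 GB)

  /** Directory-level compaction runs when the compacted directory holds
    * more files than the configured limit. */
  def shouldCompactDirectory(fileCount: Int): Boolean =
    fileCount > CompactionLevelFileNumLimit

  /** A file larger than the size limit is considered "full" and is not
    * merged again with another full file. */
  def isFullFile(sizeBytes: Long): Boolean =
    sizeBytes > CompactionLevelMaxFileSize
}
```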

When incremental files are compacted into the compacted directory, directory-level compaction is triggered if the number of files in the compacted directory exceeds 20. During directory compaction, files are merged in chronological order from oldest to newest, checking whether two consecutive files are both larger than 5 GB. If two consecutive files are both larger than 5 GB, they are not compacted again; if two consecutive files are smaller than 5 GB, they are merged.

In this way, when two files are merged, the first one keeps growing, and once it exceeds a certain size (5 GB) it no longer participates in merges. This strikes a balance between minimizing IO and keeping compaction execution simple, which should meet the compaction performance requirements.
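As a rough illustration of the merge rule, here is a sketch that groups the files of the compacted directory into merge units, assuming a simple (path, size) view of the files; `CompactedFile` and `planMerges` are hypothetical names, not part of the LakeSoul codebase.

```scala
// Hypothetical sketch of the pairwise merge rule described above.
final case class CompactedFile(path: String, sizeBytes: Long)

/** Walk files in chronological order (oldest first) and group them into merge
  * units. When the current unit and the next file are both larger than
  * `maxFileSize`, the next file starts a new unit (neither is compacted again);
  * otherwise the next file is folded into the current unit, so the first file
  * of each unit keeps growing until it exceeds the 5 GB limit. */
def planMerges(files: Seq[CompactedFile],
               maxFileSize: Long = 5L * 1024 * 1024 * 1024): List[List[CompactedFile]] = {
  files.foldLeft(List.empty[List[CompactedFile]]) {
    case (Nil, file) => List(List(file))
    case (current :: done, file) =>
      val currentSize = current.map(_.sizeBytes).sum
      if (currentSize > maxFileSize && file.sizeBytes > maxFileSize)
        List(file) :: current :: done      // both sides exceed 5 GB: do not merge
      else
        (current :+ file) :: done          // otherwise merge into the current unit
  }.reverse                                // restore chronological order of units
}
```

Each inner list is one merge unit: units holding a single file are left untouched, units holding several files are rewritten as one compacted file.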

@F-PHantam F-PHantam added the spark spark support into lakesoul label Feb 26, 2025
@xuchen-plus xuchen-plus changed the title [Spark] New compression implement [Spark] Implement new compaction strategy Feb 26, 2025
…ul-io#584)

* fix_flink_write_file_bucket_bug

Signed-off-by: fphantam <[email protected]>

* fix invalid hash_bucket_num

Signed-off-by: fphantam <[email protected]>

* set lakesoul.sink.dynamic_bucketing default value is true

Signed-off-by: fphantam <[email protected]>

* fix PartitioningAsyncWriter with aux_sort_cols

Signed-off-by: fphantam <[email protected]>

---------

Signed-off-by: fphantam <[email protected]>
Signed-off-by: fphantam <[email protected]>
Signed-off-by: fphantam <[email protected]>
Labels
spark spark support into lakesoul
1 participant