[Spark] Implement new compaction strategy #585
New compaction strategy:
The new compaction process is divided into two steps:
Each compaction run first compacts the incremental files and writes the result into the compacted directory; it then compacts the files already in the compacted directory. Whether the files in the compacted directory are compacted again is controlled by two conditions: 1. the directory's file count exceeds COMPACTION_LEVEL_FILE_NUM_LIMIT (20); 2. a file's size exceeds COMPACTION_LEVEL_MAX_FILE_SIZE (5 GB).
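A minimal sketch of the two threshold checks described above. The constant names come from the PR text; the class and method names are hypothetical, and the 5 GB value is interpreted here as 5 GiB in bytes:

```java
public class CompactionTriggers {
    // Constants named in the PR description; values as stated there.
    static final int COMPACTION_LEVEL_FILE_NUM_LIMIT = 20;
    static final long COMPACTION_LEVEL_MAX_FILE_SIZE = 5L << 30; // 5 GB (as GiB in bytes)

    // Compaction of the compacted directory is triggered once it holds
    // more than COMPACTION_LEVEL_FILE_NUM_LIMIT files.
    static boolean fileCountExceeded(int fileCount) {
        return fileCount > COMPACTION_LEVEL_FILE_NUM_LIMIT;
    }

    // A file above COMPACTION_LEVEL_MAX_FILE_SIZE is considered oversized;
    // per the text, two consecutive oversized files are not merged again.
    static boolean oversized(long fileSizeBytes) {
        return fileSizeBytes > COMPACTION_LEVEL_MAX_FILE_SIZE;
    }
}
```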
When incremental files are compacted into the compacted directory, compaction of that directory is triggered if it then contains more than 20 files. During this pass, files are merged in chronological order from front to back, checking each pair of consecutive files: if both files are larger than 5 GB, they are left as they are and not compacted again; otherwise the merge operation is triggered.
In this way, as merges accumulate, the front file keeps growing, and once it exceeds the size threshold (5 GB) it stops participating in merges. This strikes a balance between minimizing IO and keeping compaction execution simple, which should meet the compaction performance requirements.
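The front-to-back merge pass described above can be sketched as follows. This is a simplified model under stated assumptions, not the PR's actual implementation: file sizes are given in GB, the class and method names are invented, and a pair is merged whenever at least one of the two files is under the 5 GB threshold:

```java
import java.util.ArrayList;
import java.util.List;

public class LevelCompactionPlanner {
    static final long MAX_FILE_SIZE_GB = 5; // threshold from the PR text

    // Simulate one merge pass over files in chronological order (sizes in GB),
    // returning the resulting file sizes after compaction.
    static List<Long> planMerges(long[] sizesGb) {
        List<Long> result = new ArrayList<>();
        long current = -1;
        for (long size : sizesGb) {
            if (current < 0) {
                current = size; // the front file becomes the accumulator
            } else if (current > MAX_FILE_SIZE_GB && size > MAX_FILE_SIZE_GB) {
                // both consecutive files exceed the threshold: no further merge,
                // the next file becomes the new accumulator
                result.add(current);
                current = size;
            } else {
                current += size; // merge the next file into the front file
            }
        }
        if (current >= 0) {
            result.add(current);
        }
        return result;
    }
}
```

For example, three 2 GB files merge into a single 6 GB file, which then exceeds the threshold and stops growing once its neighbour is also oversized.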