[HUDI-7696] Refactoring functions in HoodieTableMetadataUtil #11823
@bibhu107 Thanks for your contribution. After going through the code, I think unifying the two methods is neither easy nor desirable. However, some common logic can be extracted. For example, if we introduce a method below that can be reused in the two methods, then the logic for collecting, grouping, and aggregating column metadata is extracted out. Let me know what you think.
private static Stream<HoodieRecord> collectAndProcessColumnMetadata(
    List<List<HoodieColumnRangeMetadata<Comparable>>> fileColumnMetadata,
    String partitionPath) {
  // Step 2: Flatten and Group by Column Name
  Map<String, List<HoodieColumnRangeMetadata<Comparable>>> columnMetadataMap = fileColumnMetadata.stream()
      .flatMap(List::stream)
      .collect(Collectors.groupingBy(HoodieColumnRangeMetadata::getColumnName, Collectors.toList())); // Group by column name
  // Step 3: Aggregate Column Ranges
  Stream<HoodieColumnRangeMetadata<Comparable>> partitionStatsRangeMetadata = columnMetadataMap.entrySet().stream()
      .map(entry -> FileFormatUtils.getColumnRangeInPartition(partitionPath, entry.getValue()));
  // Create Partition Stats Records
  return HoodieMetadataPayload.createPartitionStatsRecords(partitionPath, partitionStatsRangeMetadata.collect(Collectors.toList()), false).stream();
}
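To illustrate the flatten, group, and aggregate steps in isolation, here is a minimal self-contained sketch. It deliberately does not use any Hudi classes: `ColumnRangeSketch`, `ColumnRange`, and `aggregateByColumn` are hypothetical names introduced only for this example, and integer min/max values stand in for the generic `Comparable` ranges. It also merges ranges directly in the collector rather than building a list per column first, so it is an approximation of the idea, not the helper proposed above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ColumnRangeSketch {

  // Hypothetical stand-in for a per-file column range; not a Hudi class.
  static final class ColumnRange {
    final String columnName;
    final int min;
    final int max;
    final long nullCount;

    ColumnRange(String columnName, int min, int max, long nullCount) {
      this.columnName = columnName;
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }

    // Combine two ranges of the same column into one range covering both.
    ColumnRange merge(ColumnRange other) {
      return new ColumnRange(
          columnName,
          Math.min(min, other.min),
          Math.max(max, other.max),
          nullCount + other.nullCount);
    }
  }

  // Flatten the per-file lists, key by column name, and merge ranges that share a name.
  static Map<String, ColumnRange> aggregateByColumn(List<List<ColumnRange>> perFileRanges) {
    return perFileRanges.stream()
        .flatMap(List::stream)
        .collect(Collectors.toMap(r -> r.columnName, r -> r, ColumnRange::merge, TreeMap::new));
  }

  public static void main(String[] args) {
    // Two files in the same partition, each reporting ranges for some columns.
    List<List<ColumnRange>> perFile = Arrays.asList(
        Arrays.asList(new ColumnRange("price", 10, 50, 1), new ColumnRange("qty", 1, 5, 0)),
        Arrays.asList(new ColumnRange("price", 5, 40, 2)));

    aggregateByColumn(perFile).forEach((name, r) ->
        System.out.println(name + ": min=" + r.min + ", max=" + r.max + ", nulls=" + r.nullCount));
  }
}
```

Because the aggregation only needs a list of per-file ranges, either caller (one starting from file listings, one starting from commit metadata) could build that list in its own way and then share this step.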
Makes sense @codope. Have made necessary changes.
Change Logs
Two functions in HoodieTableMetadataUtil, convertFilesToPartitionStatsRecords and convertMetadataToPartitionStatsRecords, are being evaluated for unification. These functions perform similar tasks of creating statistics for records in a partition. The proposal aims to combine them into a single function, improving code readability and maintainability.
Reference: #10352 (comment)
Impact
This change primarily affects the internal implementation of HoodieTableMetadataUtil and is not expected to impact the public API directly. There are other similar function pairs, listed below:
convertFilesToColumnStatsRecords and convertMetadataToColumnStatsRecords
convertFilesToBloomFilterRecords and convertMetadataToBloomFilterRecords
Unification challenge:
To merge these function pairs, we would need to transform HoodieCommitMetadata into a format compatible with the unified function. This transformation process could potentially introduce inefficiencies.
Risk level
Low
Documentation Update
N/A
As this is an internal code refactoring, no user-facing documentation updates are required.
Contributor's checklist
cc - @codope