Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-7696] Refactoring functions in HoodieTableMetadataUtil #11823

Merged
merged 5 commits into from
Aug 30, 2024

Conversation

bibhu107
Copy link
Contributor

Change Logs

Two functions in HoodieTableMetadataUtil, convertFilesToPartitionStatsRecords and convertMetadataToPartitionStatsRecords, are being evaluated for unification.
These functions perform similar tasks of creating statistics for records in a partition. The proposal aims to combine them into a single function, improving code readability and maintainability.

Reference: #10352 (comment)

Impact

This change will primarily affect the internal implementation of HoodieTableMetadataUtil. While it's not expected to impact the public API directly.

There are simillar other functions like mentioned below :

  • convertFilesToColumnStatsRecords and convertMetadataToColumnStatsRecords
  • convertFilesToBloomFilterRecords and convertMetadataToBloomFilterRecords

Unification challenge:
To merge these function pairs, we would need to transform HoodieCommitMetadata into a format compatible with the unified function. This transformation process could potentially introduce inefficiencies.

Risk level

Low

Documentation Update

N/A

As this is an internal code refactoring, no user-facing documentation updates are required.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

cc - @codope

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Aug 25, 2024
Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bibhu107 Thanks for your contribution. After going through the code, I think unifying the two methods is neither easy nor desirable. However, some common logic can be extracted. For example, if we introduce a method below that can be reused in the two methods, then the logic for collecting, grouping, and aggregating column metadata is extracted out. Let me know what you think.

private static Stream<HoodieRecord> collectAndProcessColumnMetadata(
    List<List<HoodieColumnRangeMetadata<Comparable>>> fileColumnMetadata,
    String partitionPath) {
  
  // Step 2: Flatten and Group by Column Name
  Map<String, List<HoodieColumnRangeMetadata<Comparable>>> columnMetadataMap = fileColumnMetadata.stream()
      .flatMap(List::stream)
      .collect(Collectors.groupingBy(HoodieColumnRangeMetadata::getColumnName, Collectors.toList())); // Group by column name
  
  // Step 3: Aggregate Column Ranges
  Stream<HoodieColumnRangeMetadata<Comparable>> partitionStatsRangeMetadata = columnMetadataMap.entrySet().stream()
      .map(entry -> FileFormatUtils.getColumnRangeInPartition(partitionPath, entry.getValue()));
  
  // Create Partition Stats Records
  return HoodieMetadataPayload.createPartitionStatsRecords(partitionPath, partitionStatsRangeMetadata.collect(Collectors.toList()), false).stream();
}

@codope codope self-assigned this Aug 28, 2024
@bibhu107 bibhu107 changed the title [HUDI-7696][Draft] Refactoring functions in HoodieTableMetadataUtil [HUDI-7696] Refactoring functions in HoodieTableMetadataUtil Aug 28, 2024
@bibhu107
Copy link
Contributor Author

Makes sense @codope. Have made necessary changes.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit b15a4d6 into apache:master Aug 30, 2024
44 checks passed
@bibhu107 bibhu107 deleted the partitionStats branch August 30, 2024 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:S PR with lines of changes in (10, 100]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants