[HUDI-7144] Build storage partition stats index and use it for data skipping #10352

codope · 2023-12-18T11:27:38Z

Change Logs

Build storage partition stats index and use it for data skipping. Main changes are as follows:

Index is saved as another partition in the metadata table.
Each index entry is a key-value, where key is the hash(columnName).concat(hash(partitionName)), and value is the stats.
New configs in HoodieMetadataConfig and the writer changes are in HoodieBackedTableMetadataWriter with some util methods in HoodieTableMetadataUtil
On the read path, main changes are in HoodieFileIndex. First, the partition pruning happens as usual, then depending on data filters, data can be skipped further if partition stats index is available.

Impact

Stats aggregated by storage partition. Efficient data skipping. Meta sync need not sync the partition metadata. Queries will use the index while planning in the driver.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
ticket number here and follow the instruction to make
changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

vinothchandar

Lets get this turned on by default?

hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java

vinothchandar

Left some comments on cases that I think we should cover out of the gate and have it all passing tests.

Summarizing by understanding here, to get on the same page. When the partition stats index is turned on, we essentially track merged statistics across all files within a storage partition, for each of the partition columns. on the query side, we try to see if any of the storage partitions can be simply pruned out based on columns in the query and the new stats.

So, there should be no semantic differences between this and what hive-style partitioning will achieve. i.e queries written on one of the partition columns and a combination of them etc. Love to see complex test cases like those and confirm this is indeed good

I can do line-by-line detailed review once these are ressolved.

hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

hudi-common/src/main/java/org/apache/hudi/metadata/MetadataPartitionType.java

...datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionStatsIndexSupport.scala

...ource/hudi-spark/src/test/scala/org/apache/hudi/functional/PartitionStatsIndexTestBase.scala

...atasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala

...ource/hudi-spark/src/test/scala/org/apache/hudi/functional/PartitionStatsIndexTestBase.scala

vinothchandar · 2024-01-16T19:23:54Z

I made one skim of the changes. Can you reply on my previous review comments? esp on tests.

codope · 2024-01-18T01:41:16Z

I made one skim of the changes. Can you reply on my previous review comments? esp on tests.

@vinothchandar Thanks for the review. I have addressed all your comments. The test was passing two commits ago. I am looking into the failures. But, PR is ready to review again.

vinothchandar

Left some comments. Will need to step through to understand how partition columns are mapped automatically on the writer side into target column stats index cols.

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

vinothchandar · 2024-01-20T22:13:44Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

@@ -1901,4 +1910,162 @@ private static Path filePath(String basePath, String partition, String filename)
      return new Path(basePath, partition + Path.SEPARATOR + filename);
    }
  }
+
+  public static HoodieData<HoodieRecord> convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,


I see that you are using recordsGenParams.getTargetColumnsForColumnStatsIndex(); ultimately to generate the stats. do you just add the partition fields into the target columns? to make partition stats aggregation happen automatically.

no, we need to specify the partition field in target columns config.

We can avoid this extra step though.

Can we do this automatically? As a user, if I am already doing partitionBy already, then its fair expectation that the index is built automatically and partition pruning works without any extra configs

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

vinothchandar

Please help me understand if I am missing sth around tests

...atasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala

vinothchandar

Bunch more comments.

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

...datasource/hudi-spark-common/src/main/scala/org/apache/hudi/PartitionStatsIndexSupport.scala

...atasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala

vinothchandar · 2024-02-13T00:00:55Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -346,6 +352,12 @@ case class HoodieFileIndex(spark: SparkSession,
      Option.empty
    } else if (recordKeys.nonEmpty) {
      Option.apply(recordLevelIndex.getCandidateFiles(getAllFiles(), recordKeys))
+    } else if (recordKeys.nonEmpty && partitionStatsIndex.isIndexAvailable && !queryFilters.isEmpty) {
+      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+      val shouldReadInMemory = partitionStatsIndex.shouldReadInMemory(this, queryReferencedColumns)


Can you please point me to the piece of code/line numbers where we identify a referenced column as a partitioned column? (on read side)
and how .partitionBy("a,b,c") turns into partition stats on the metadata write side?

That part is a still unclear to me.

First thing I want to clarify is that partition stats are collected only when column stats is enabled, and for only those columns for which column stats is enabled.

Write path: When user will do .partitionBy("a,b,c"), then the logic is similar to column stats. We use the commit metadata and convert that to partition stats. This happens in HoodieTableMetadataUtil.convertMetadataToPartitionStatsRecords. The difference from column stats is that the stats are aggregated by partition value in BaseFileUtils.getColumnRangeInPartition.

Read path: queryReferencedColumns here contain data filters. Partition pruning based on partition filters has already happened one level above.

vinothchandar

Please clarify the latest comments.

vinothchandar · 2024-02-22T01:04:33Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

+    return engineContext.parallelize(partitionInfoList, parallelism).flatMap(partitionFiles -> {
+      final String partitionName = partitionFiles.getRelativePath();
+      Stream<HoodieColumnRangeMetadata<Comparable>> partitionStatsRangeMetadata = partitionFiles.getFileNameToSizeMap().keySet().stream()
+          .map(fileName -> getFileStatsRangeMetadata(partitionName, partitionName + "/" + fileName, dataTableMetaClient, columnsToIndex, false))


are we reading all the column stats again to generate partition stats? I think we should avoid doing this extra work and piggy back off the column stats index values?

This is called one time at the time of intialization of partiiton stats. Nevertheless, we can avoid extra work with minimal branching.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

yihua · 2024-04-18T05:02:38Z

...ient/hudi-spark-client/src/test/java/org/apache/hudi/table/upgrade/TestUpgradeDowngrade.java

@@ -557,7 +558,8 @@ public void testDowngradeSixToFiveShouldDeleteRecordIndexPartition() throws Exce
        PARTITION_NAME_COLUMN_STATS,
        PARTITION_NAME_BLOOM_FILTERS,
        PARTITION_NAME_RECORD_INDEX,
-        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX
+        PARTITION_NAME_FUNCTIONAL_INDEX_PREFIX,


Should this list be generated from the production code, i.e., list of supported MDT partitions? Also, do we need to upgrade the table version? I think master branch is still on table version 6, the same as 0.14.0 release.

I think we will eventually need to upgrade the table version for 1.0. But, this PR does not do any incompatible schema changes, so we should be good for now. HUDI-7665 to track.

...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

yihua · 2024-04-18T05:09:22Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

  private static final int METADATA_TYPE_RECORD_INDEX = 5;
+  private static final int METADATA_TYPE_PARTITION_STATS = 6;


Maybe we should add enum and guarantee the ordering, and automatically assign the type ID?

Hmm.. yes we could. Just to be clear, you're suggesting no schema change right? As in we will still keep the int value in the records, and not enum.

Right, no schema change; just code restructuring maintaining the same ID, i.e., add getId in MetadataPartitionType. You can refactor this part in a separate PR.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

vinothchandar

Submitting a pending review, that I. had hanging around.

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

vinothchandar · 2024-04-15T17:05:41Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

@@ -1901,4 +1910,162 @@ private static Path filePath(String basePath, String partition, String filename)
      return new Path(basePath, partition + Path.SEPARATOR + filename);
    }
  }
+
+  public static HoodieData<HoodieRecord> convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,


Can we do this automatically? As a user, if I am already doing partitionBy already, then its fair expectation that the index is built automatically and partition pruning works without any extra configs

...rce/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java

yihua · 2024-04-30T04:50:43Z

...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

+          case PARTITION_STATS:
+            if (dataWriteConfig.getColumnsEnabledForColumnStatsIndex().isEmpty()) {
+              LOG.warn("Skipping partition stats index initialization as target columns are not set");
+              continue;
+            }
+            fileGroupCountAndRecordsPair = initializePartitionStatsIndex(partitionInfoList);
+            break;


Not required for this PR. IMO all such switch-case logic should be included in the MetadataPartitionType enum for easier extensibility.

Then, HoodieBackedTableMetadataWriter can also be simplified.

Agree. Will do it in HUDI-7691

hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

yihua · 2024-04-30T05:05:27Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

  private static final int METADATA_TYPE_RECORD_INDEX = 5;
+  private static final int METADATA_TYPE_PARTITION_STATS = 6;


Right, no schema change; just code restructuring maintaining the same ID, i.e., add getId in MetadataPartitionType. You can refactor this part in a separate PR.

yihua · 2024-04-30T05:12:32Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

+    if (enabledPartitionTypes.contains(MetadataPartitionType.PARTITION_STATS)) {
+      final HoodieData<HoodieRecord> partitionStatsRDD = convertMetadataToPartitionStatsRecords(commitMetadata, context, dataMetaClient, metadataConfig);
+      partitionToRecordsMap.put(MetadataPartitionType.PARTITION_STATS, partitionStatsRDD);
+    }


A side topic, does this compose a sub-graph in DAG without triggering the execution? Ideally, all types of metadata should be computed in parallel leveraging the parallelism in Spark, instead of being computed type by type sequentially.

No. That's a very good point! Typically, we are enabling indexes one at a time. But, I agree with you. I will work on it in HUDI-7690. Btw, this limitation is only for initialization. In case of update, we don't compute sequuentially.

yihua · 2024-04-30T05:21:35Z

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

@@ -1872,4 +1883,175 @@ public HoodieRecord next() {
      }
    };
  }
+
+  public static HoodieData<HoodieRecord> convertFilesToPartitionStatsRecords(HoodieEngineContext engineContext,
+                                                                             List<DirectoryInfo> partitionInfoList,


Does this contain log files in MOR tables?

No, the assumptions of colstats apply for parition stats too. Log files are considered in subsequent updates with the write status. So, just like colstats, we will have to compact the table before enabling partition stats.

I see. Is this for initializing the partition stats only? Do we have a guard to throw an exception if the colstats or partition stats partition is initialized on the latest file slices with log files?

Yes, this is just for initialization. We just log a warning and continue for other filegroups. This is what colstats is also doing. However, given that partition stats will be enabled by default, I think it makes sense to throw exception. Will do it.

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

yihua · 2024-04-30T05:31:14Z

hudi-common/src/main/java/org/apache/hudi/common/util/BaseFileUtils.java

+        .reduce(BaseFileUtils::mergeRanges).get();
+  }
+
+  private static <T extends Comparable<T>> HoodieColumnRangeMetadata<T> mergeRanges(HoodieColumnRangeMetadata<T> one,


Do we merge column ranges between the base and log files (e.g., log file contains updates and deletes)? Is that covered in this PR or by existing column stats logic (how does that work)?

Yes, it's covered by existing colstats logic. For log files, we get the column ranges based on write stats and then merge as usual.

Got it. I'm trying to understand how col stats are merged between the base and log files, e.g., for custom payload, because we may not be able to simply take the minimum of all minimum values or the maximum of all maximum values from the base and log files, as there can be deletes and there can be event-time based merging. We can discuss this in a separate thread.

We follow the same approach as for colstats -

hudi/hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java

Line 372 in f99b181

public HoodieMetadataPayload preCombine(HoodieMetadataPayload previousRecord) {

If the new record is for delete, then we ignore the previous one. If previous record is for delete, then we take the newer one, otherwise just merge the stats. I don't think we do event-time based merging for MDT.

...rce/hudi-spark/src/test/java/org/apache/hudi/functional/TestDataSkippingWithMORColstats.java

...atasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala

…kipping Merge datasource and sql tests Turn on partition stats by default Address some comments and tests fix mit rebase and partitionBy test Use null for file name when building partition stats Fix the merging of file stats and add test Exit early if no columns to index Add test using hudi_metadata tvf check partition pruning and address other comments fix after rebase Encode partitioning for hive style partitioned table Address comments

hudi-bot · 2024-04-30T16:51:38Z

CI report:

f63dbe1 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala

yihua

+1 you can land the PR once all clarifying questions are answered.

KnightChess

@codope hello, I found partition stats will not effort, the current implementation does not seem to achieve the effect of partition filtering.

first

in this picture, I change the ut filter to trigger partition stats index.

partition_stats will not save fileName, so if reuse CSI logical, it will throw null point in group by key

second

and have a question, I am not sure this pr is use to partition purge like physical partition col, which mean use other field min/max to get which physical partitions to list fileSlice. or filter fileName like CSI, RLI.
thanks

KnightChess · 2024-06-05T08:36:27Z

creaete a issue to track it https://issues.apache.org/jira/browse/HUDI-7829

codope mentioned this pull request Dec 18, 2023

[HUDI-7144] Infer partition logically and use index to prune #10274

Closed

4 tasks

codope requested a review from vinothchandar December 18, 2023 16:33

codope assigned vinothchandar Dec 20, 2023

codope force-pushed the HUDI-7144-storage-partition-index branch 6 times, most recently from 8cb19dc to b5d7c62 Compare January 8, 2024 06:53

apache deleted a comment from hudi-bot Jan 8, 2024

yihua self-assigned this Jan 8, 2024

vinothchandar requested changes Jan 8, 2024

View reviewed changes

hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java Show resolved Hide resolved

vinothchandar requested changes Jan 8, 2024

View reviewed changes

codope force-pushed the HUDI-7144-storage-partition-index branch from 7d6558b to 6e55ec7 Compare January 10, 2024 15:51

codope added the release-1.0.0 label Jan 11, 2024

codope force-pushed the HUDI-7144-storage-partition-index branch from 6e55ec7 to 6a4ed24 Compare January 17, 2024 17:26

codope force-pushed the HUDI-7144-storage-partition-index branch from 6a4ed24 to f8c89bc Compare January 18, 2024 03:01

vinothchandar force-pushed the HUDI-7144-storage-partition-index branch from f8c89bc to e72c465 Compare January 19, 2024 02:36

vinothchandar reviewed Jan 20, 2024

View reviewed changes

vinothchandar requested changes Feb 12, 2024

View reviewed changes

...atasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala Outdated Show resolved Hide resolved

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala Show resolved Hide resolved

vinothchandar requested changes Feb 13, 2024

View reviewed changes

vinothchandar force-pushed the HUDI-7144-storage-partition-index branch from e72c465 to d0faf33 Compare February 22, 2024 00:54

vinothchandar requested changes Feb 22, 2024

View reviewed changes

codope force-pushed the HUDI-7144-storage-partition-index branch from a592ad1 to 933b215 Compare February 26, 2024 15:54

github-actions bot added the size:XL PR with lines of changes > 1000 label Feb 26, 2024

codope force-pushed the HUDI-7144-storage-partition-index branch 2 times, most recently from 3ce4c6f to 8510ace Compare March 11, 2024 06:42

codope force-pushed the HUDI-7144-storage-partition-index branch from 8510ace to 1430466 Compare March 13, 2024 13:24

apache deleted a comment from hudi-bot Mar 15, 2024

codope force-pushed the HUDI-7144-storage-partition-index branch 3 times, most recently from 53e141b to 77a9f8e Compare April 11, 2024 11:27

codope force-pushed the HUDI-7144-storage-partition-index branch from 77a9f8e to ab97c0d Compare April 16, 2024 10:37

yihua reviewed Apr 18, 2024

View reviewed changes

hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java Outdated Show resolved Hide resolved

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala Outdated Show resolved Hide resolved

codope force-pushed the HUDI-7144-storage-partition-index branch from ab97c0d to 4a813e8 Compare April 25, 2024 06:32

apache deleted a comment from hudi-bot Apr 25, 2024

vinothchandar reviewed Apr 26, 2024

View reviewed changes

codope force-pushed the HUDI-7144-storage-partition-index branch from 9c26ef9 to 54d5734 Compare April 29, 2024 17:20

yihua reviewed Apr 30, 2024

View reviewed changes

codope added 5 commits April 30, 2024 19:52

Address some minor comments

90760ba

Fix CI failure

b357459

add more tests around payload merging and column range metadata

ef0db51

Make tests clear and add more validations and address few more comments

2ba44a1

codope force-pushed the HUDI-7144-storage-partition-index branch from 54d5734 to 2ba44a1 Compare April 30, 2024 14:22

Fix after rebase

f63dbe1

yihua reviewed Apr 30, 2024

View reviewed changes

...ce/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndexWithSql.scala Show resolved Hide resolved

yihua approved these changes Apr 30, 2024

View reviewed changes

vinothchandar approved these changes May 1, 2024

View reviewed changes

codope merged commit f553ba2 into apache:master May 1, 2024
40 checks passed

KnightChess reviewed Jun 5, 2024

View reviewed changes

bibhu107 mentioned this pull request Aug 25, 2024

[HUDI-7696] Refactoring functions in HoodieTableMetadataUtil #11823

Merged

4 tasks

		private static final int METADATA_TYPE_RECORD_INDEX = 5;
		private static final int METADATA_TYPE_PARTITION_STATS = 6;

[HUDI-7144] Build storage partition stats index and use it for data skipping #10352

[HUDI-7144] Build storage partition stats index and use it for data skipping #10352

Conversation

codope commented Dec 18, 2023 • edited Loading

Change Logs

Impact

Risk level (write none, low medium or high below)

Documentation Update

Contributor's checklist

vinothchandar left a comment

Choose a reason for hiding this comment

vinothchandar left a comment • edited Loading

Choose a reason for hiding this comment

vinothchandar commented Jan 16, 2024

codope commented Jan 18, 2024 • edited Loading

vinothchandar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vinothchandar left a comment

Choose a reason for hiding this comment

vinothchandar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vinothchandar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codope Feb 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vinothchandar left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hudi-bot commented Apr 30, 2024

CI report:

yihua left a comment

Choose a reason for hiding this comment

KnightChess left a comment • edited Loading

Choose a reason for hiding this comment

KnightChess commented Jun 5, 2024

codope commented Dec 18, 2023 •

edited

Loading

vinothchandar left a comment •

edited

Loading

codope commented Jan 18, 2024 •

edited

Loading

codope Feb 22, 2024 •

edited

Loading

KnightChess left a comment •

edited

Loading