
feat: add experimental remote HDFS support for native DataFusion reader #1359

Draft · wants to merge 5 commits into base: main
Conversation

@comphead (Contributor) commented Jan 31, 2025

Which issue does this PR close?

Closes #1337.
Depends on #1368

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Manually, by starting a remote HDFS cluster and running:

  test("get_struct_field with DataFusion ParquetExec - simple case - remote HDFS") {

    Seq("parquet").foreach { v1List =>
      withSQLConf(
        SQLConf.USE_V1_SOURCE_LIST.key -> v1List,
        CometConf.COMET_ENABLED.key -> "true",
        CometConf.COMET_EXPLAIN_FALLBACK_ENABLED.key -> "false",
        CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
        "spark.hadoop.fs.defaultFS" -> "hdfs://namenode:9000",
        "spark.hadoop.dfs.client.use.datanode.hostname" -> "true",
        "dfs.client.use.datanode.hostname" -> "true") {

        val df = spark.read
          .parquet("hdfs://namenode:9000/user/test4")
          .select("id", "first_name", "personal_info")
        df.printSchema()
        df.explain("formatted")
        df.show(false)
        // checkSparkAnswerAndOperator(df.select("nested1.id"))
      }
    }
  }

@@ -77,6 +77,7 @@ datafusion-comet-proto = { workspace = true }
object_store = { workspace = true }
url = { workspace = true }
chrono = { workspace = true }
datafusion-objectstore-hdfs = { git = "https://github.com/comphead/datafusion-objectstore-hdfs", branch = "master", optional = true }
Contributor Author (@comphead):
@andygrove I'm keeping the updated HDFS object storage in a personal repo for now, let me know if there are any concerns.

Contributor:
Is there an expected timeline for when we can move to an official release? In the meantime, since we have pointed to a personal repo in the past, it is reasonable to do so for this as well (especially since this is already behind some configuration flags).

Contributor Author (@comphead):
Will be addressed in #1368

@@ -1220,7 +1217,7 @@ impl PhysicalPlanner {
// TODO: I think we can remove partition_count in the future, but leave for testing.
assert_eq!(file_groups.len(), partition_count);

-let object_store_url = ObjectStoreUrl::local_filesystem();
+let object_store_url = ObjectStoreUrl::parse("hdfs://namenode:9000").unwrap();
Contributor Author (@comphead):
This will be addressed in #1360.

Contributor:
The URL should be available as part of the file path passed in (see line 1178 above).
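One possible shape for that suggestion, sketched here only as an illustration (the helper name is made up, and the import path may differ):

use datafusion::datasource::object_store::ObjectStoreUrl;
use url::{Position, Url};

// Derive the object store URL (scheme + authority) from the scan's file path
// instead of hardcoding "hdfs://namenode:9000".
fn object_store_url_from_path(file_path: &str) -> ObjectStoreUrl {
    match Url::parse(file_path) {
        // e.g. "hdfs://namenode:9000/user/test4/part-0.parquet" -> "hdfs://namenode:9000"
        Ok(url) if url.has_host() => {
            ObjectStoreUrl::parse(&url[..Position::BeforePath]).expect("valid base url")
        }
        // Bare paths such as "/tmp/data.parquet" fall back to the local filesystem.
        _ => ObjectStoreUrl::local_filesystem(),
    }
}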

Contributor Author (@comphead):
Thanks @parthchandra, it is already fixed.

session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
// TODO: read the namenode configuration from file schema or from spark.defaultFS
let url = Url::try_from("hdfs://namenode:9000").unwrap();
Contributor Author (@comphead):
This will be addressed in #1360.

@@ -1861,6 +1864,40 @@ fn trim_end(s: &str) -> &str {
}
}

#[cfg(not(feature = "hdfs"))]
Contributor Author (@comphead):
The `hdfs` Cargo feature enables conditional compilation when HDFS support is needed.

pub(crate) fn register_object_store(
session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
let object_store = object_store::local::LocalFileSystem::new();
Contributor:
It doesn't have to be only a local file system.

@comphead (Contributor Author) commented Jan 31, 2025:

It depends on which feature is enabled for Comet. LocalFileSystem is the default if no specific feature is selected; the annotation on this method is

#[cfg(not(feature = "hdfs"))]

This allows plugging in other features like S3, etc.

This particular method covers the case where no remote feature is selected, i.e. the local filesystem. If a feature is selected, conditional compilation registers an object store for that feature, such as HDFS or S3 (see the sketch below).
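A minimal sketch of that feature gating, under assumptions: the return type is simplified, the namenode URL is hard-coded as in this draft, and the HadoopFileSystem constructor call reflects a reading of the datafusion-objectstore-hdfs API rather than code from this PR.

use std::sync::Arc;
use datafusion::prelude::SessionContext;
use url::Url;

// Default build: no remote feature selected, register only the local filesystem.
#[cfg(not(feature = "hdfs"))]
pub(crate) fn register_object_store(session_context: Arc<SessionContext>) {
    let object_store = object_store::local::LocalFileSystem::new();
    let url = Url::parse("file:///").unwrap();
    session_context
        .runtime_env()
        .register_object_store(&url, Arc::new(object_store));
}

// `--features hdfs` build: register an HDFS-backed store instead.
#[cfg(feature = "hdfs")]
pub(crate) fn register_object_store(session_context: Arc<SessionContext>) {
    use datafusion_objectstore_hdfs::object_store::hdfs::HadoopFileSystem;
    // Assumed constructor; the namenode URL is hard-coded here just as in this draft.
    let object_store = HadoopFileSystem::new("hdfs://namenode:9000").expect("HDFS object store");
    let url = Url::parse("hdfs://namenode:9000").unwrap();
    session_context
        .runtime_env()
        .register_object_store(&url, Arc::new(object_store));
}

Only one of the two definitions is compiled for a given build, so callers always see a single register_object_store.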


@codecov-commenter commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 39.16%. Comparing base (f09f8af) to head (a2130e8).
Report is 19 commits behind head on main.

Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1359       +/-   ##
=============================================
- Coverage     56.12%   39.16%   -16.96%     
- Complexity      976     2065     +1089     
=============================================
  Files           119      262      +143     
  Lines         11743    60323    +48580     
  Branches       2251    12836    +10585     
=============================================
+ Hits           6591    23627    +17036     
- Misses         4012    32223    +28211     
- Partials       1140     4473     +3333     


@comphead (Contributor Author) commented Feb 2, 2025

@andygrove @parthchandra @mbutrovich @kazuyukitanimura can I have a review please?

@parthchandra (Contributor) left a comment:
Is there a way to 'mock' hdfs and write a unit test? I suppose using the hdfs support to read a local file should do just as well.
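A rough shape such a unit test could take, sketched under assumptions: the parquet path is hypothetical, and the default local filesystem store stands in for a mocked HDFS, which is enough to exercise the registration plumbing.

use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::test]
async fn scan_reads_through_registered_object_store() {
    let ctx = SessionContext::new();
    // In an `hdfs` build the HDFS store would be registered here; in this sketch
    // the default local filesystem store is used instead.
    let df = ctx
        .read_parquet("tests/data/sample.parquet", ParquetReadOptions::default())
        .await
        .expect("parquet file readable through the registered store");
    assert!(df.count().await.expect("count") > 0);
}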

Makefile Outdated
@@ -95,7 +98,7 @@ release-linux: clean
cd native && RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit" cargo build --release
./mvnw install -Prelease -DskipTests $(PROFILES)
release:
cd native && RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
cd native && RUSTFLAGS="$(RUSTFLAGS) -Ctarget-cpu=native" && RUSTFLAGS=$$RUSTFLAGS cargo build --release $(FEATURES_ARG)
Contributor:
$$RUSTFLAGS? Never seen this pattern before. What does this do?

@comphead (Contributor Author) commented Feb 3, 2025:
Makefile syntax is slightly different from bash, from what I learned: to access an environment variable created on the fly, you need to reference it with $$.

It gets created on the fly to concatenate the release-specific RUSTFLAGS with any flags the user sets.

Contributor:
Thanks. Learnt something new today :)

.register_object_store(&url, Arc::new(object_store));
// By default, the local FS object store is registered;
// if the `hdfs` feature is enabled, the HDFS object store is registered instead
let object_store_url = register_object_store(Arc::clone(&self.session_ctx))?;
Contributor:
Should we update this function (get_file_path) as well?
It's currently used by NATIVE_ICEBERG_COMPAT but the goal is to unify it with COMET_DATAFUSION.

Contributor Author (@comphead):
That's a good point. To verify it we probably need to read Iceberg from HDFS, which can be done in #1367.

Contributor:
We don't need to wait for actual Iceberg integration. CometScan will use COMPAT_ICEBERG if the configuration is set (that's how we are able to run the unit tests).

@comphead (Contributor Author) commented Feb 3, 2025

Thanks @parthchandra, the integration test will be addressed as part of #1367. We also need to consider whether it should be a separate flow in CI.

@comphead comphead marked this pull request as draft February 5, 2025 00:24
pub(crate) fn register_object_store(
session_context: Arc<SessionContext>,
) -> Result<ObjectStoreUrl, ExecutionError> {
// TODO: read the namenode configuration from file schema or from spark.defaultFS
Member (@wForget):
Do we need to register object store from native_scan.file_partitions?

Contributor Author (@comphead):
Thanks @wForget, I'm not sure I'm getting it. Do you mean the better place to register the object store would be inside the file_partitions iterator loop?

Member (@wForget):
> Do you mean the better place to register the object store would be inside the file_partitions iterator loop?

Yes. Is it possible that native scan paths correspond to multiple object stores, or are different from spark.defaultFS?

Contributor Author (@comphead):
For HDFS/S3 the default FS can be taken from the spark.hadoop.fs.defaultFS parameter.
Supporting multiple object stores is an interesting idea, however I'm not sure when it can be addressed.

Member (@wForget):
> For HDFS/S3 the default FS can be taken from the spark.hadoop.fs.defaultFS parameter.

Sometimes I also access other HDFS namespaces, like:

select * from `parquet`.`hdfs://other-ns:8020/warehouse/db/table`

Contributor Author (@comphead):
That is an interesting scenario, I'll add a separate test case for it.
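For illustration, one hedged sketch of how multiple namespaces could be supported: register one object store per distinct scheme and authority found in the scan paths, so a path on hdfs://other-ns:8020 resolves even when fs.defaultFS points elsewhere. register_stores_for_paths and its store_factory parameter are hypothetical, not part of this PR.

use std::collections::HashSet;
use std::sync::Arc;
use datafusion::prelude::SessionContext;
use object_store::ObjectStore;
use url::{Position, Url};

// Register one object store per distinct scheme+authority seen in the scan paths.
// `store_factory` stands in for whatever builds the concrete store (HDFS, S3, ...).
fn register_stores_for_paths(
    ctx: &SessionContext,
    file_paths: &[String],
    store_factory: impl Fn(&Url) -> Arc<dyn ObjectStore>,
) {
    let mut seen = HashSet::new();
    for path in file_paths {
        if let Ok(url) = Url::parse(path) {
            // e.g. "hdfs://other-ns:8020" from "hdfs://other-ns:8020/warehouse/db/table"
            let base = url[..Position::BeforePath].to_string();
            if seen.insert(base.clone()) {
                let base_url = Url::parse(&base).expect("valid base url");
                ctx.runtime_env()
                    .register_object_store(&base_url, store_factory(&base_url));
            }
        }
    }
}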


Successfully merging this pull request may close these issues.

Create optional HDFS feature for Comet
5 participants