[data][api] implement `HudiDataSource` #46273

xushiyan · 2024-06-26T03:32:48Z

Why are these changes needed?

Support reading a Hudi table into a Ray Dataset.

Related issue number

Closes #46272

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

python/ray/data/tests/test_hudi.py

python/requirements/ml/data-test-requirements.txt

python/ray/data/datasource/hudi_datasource.py

omatthew98

Took a first pass and left some comments, mostly nits. Overall looks good; let us know or re-request a review when it is ready for another look!

python/ray/data/_internal/datasource/hudi_datasource.py

omatthew98 · 2024-07-16T22:33:13Z

python/ray/data/_internal/datasource/hudi_datasource.py

+        return read_tasks
+
+    def estimate_inmemory_data_size(self) -> Optional[int]:
+        return None


Is there any estimate that can be returned here, perhaps using the size_bytes from above? It could be cached, similar to what is done here.

Agreed that we should provide an estimate here. However, with the current implementation, loading this info during init using HudiTable is not a lightweight operation, and the size in bytes is the storage size, which would need some translation to an in-memory size. I've added a TODO to support this info through the HudiTable API.

That sounds reasonable!
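For reference, a minimal sketch of the caching idea discussed in this thread. It is not the code in this PR: `SizeEstimatingSource` and `ASSUMED_ENCODING_RATIO` are hypothetical names, and the ratio used to translate storage bytes into in-memory bytes is an illustrative assumption rather than a value taken from Ray or Hudi.

```python
from functools import cached_property
from typing import List, Optional

# Assumed expansion factor from on-disk bytes to in-memory bytes.
# Illustrative only; not a value taken from Ray or Hudi.
ASSUMED_ENCODING_RATIO = 5


class SizeEstimatingSource:
    """Hypothetical stand-in for a datasource; not the actual HudiDatasource."""

    def __init__(self, file_sizes_bytes: List[int]):
        self._file_sizes_bytes = file_sizes_bytes

    @cached_property
    def _storage_size_bytes(self) -> Optional[int]:
        # Computed once on first access, so repeated estimates stay cheap.
        if not self._file_sizes_bytes:
            return None
        return sum(self._file_sizes_bytes)

    def estimate_inmemory_data_size(self) -> Optional[int]:
        storage = self._storage_size_bytes
        if storage is None:
            # No size information available, so no estimate is returned.
            return None
        return storage * ASSUMED_ENCODING_RATIO
```

Returning `None` when nothing is known keeps the behavior of the current stub, while the cached property avoids recomputing the total on every call.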

omatthew98 · 2024-07-16T22:40:42Z

python/ray/data/tests/test_hudi.py

+    0,
+)
+pytestmark = pytest.mark.skipif(
+    PYARROW_LE_8_0_0, reason="hudi only supported if pyarrow >= 8.0.0"


What happens currently if a user tries to use pyarrow < 8.0.0? Should we warn or raise an error in read_hudi_table or HudiDatasource if the user's pyarrow version is less than 8.0.0?

IMO, we should raise an exception when using either read_hudi_table or HudiDatasource if the arrow version is not supported

Thanks for pointing this out. I'll add a check and raise an exception from within the HudiTable API, since this is internal to Hudi's implementation. This validation logic will be available in the upcoming hudi python 0.2.0. We can use another PR to integrate 0.2.0 with incremental read support. Sounds good?

I think that sounds reasonable too. Just to make sure I understand: the idea is that when creating the HudiDatasource, we would get an exception from `from hudi import HudiTable` if the version of pyarrow is unsupported?

@omatthew98 that's correct.

python/ray/data/_internal/datasource/hudi_datasource.py

scottjlee · 2024-07-17T19:07:52Z

python/ray/data/tests/test_hudi.py

IMO, we should raise an exception when using either read_hudi_table or HudiDatasource if the arrow version is not supported

omatthew98

Thanks for responding to the questions; I had one more for my understanding. After that, if you add some more assertions to test_read_hudi_table, it LGTM!

omatthew98 · 2024-07-29T19:52:51Z

python/ray/data/_internal/datasource/hudi_datasource.py

That sounds reasonable!

omatthew98 · 2024-07-29T19:55:04Z

python/ray/data/tests/test_hudi.py

I think that sounds reasonable too. Just to make sure I understand: the idea is that when creating the HudiDatasource, we would get an exception from `from hudi import HudiTable` if the version of pyarrow is unsupported?

aslonnie · 2024-07-30T22:49:28Z

https://buildkite.com/ray-project/premerge/builds/28156#019105cf-e52c-4145-b903-e7707d33e1ea

xushiyan force-pushed the add-hudi-datasource branch 5 times, most recently from 7bc3894 to 97f9de1 Compare June 28, 2024 18:11

xushiyan marked this pull request as ready for review June 28, 2024 18:20

xushiyan requested review from ericl, scv119, c21, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners June 28, 2024 18:20

xushiyan commented Jun 28, 2024 on python/ray/data/tests/test_hudi.py (outdated)

xushiyan commented Jun 28, 2024 on python/requirements/ml/data-test-requirements.txt (outdated)

xushiyan commented Jun 28, 2024 on python/ray/data/datasource/hudi_datasource.py (outdated)

xushiyan force-pushed the add-hudi-datasource branch from 97f9de1 to d4e8af6 Compare June 28, 2024 18:58

anyscalesam added the data, @external-author-action-required, and triage labels Jun 29, 2024

xushiyan force-pushed the add-hudi-datasource branch 3 times, most recently from e2f6704 to 557b887 Compare July 14, 2024 18:17

scottjlee assigned scottjlee and omatthew98 Jul 16, 2024

omatthew98 reviewed Jul 16, 2024

scottjlee reviewed Jul 17, 2024

xushiyan force-pushed the add-hudi-datasource branch from 557b887 to 177caab Compare July 26, 2024 21:00

xushiyan force-pushed the add-hudi-datasource branch from 61073db to cb6a8d0 Compare July 29, 2024 18:07

omatthew98 reviewed Jul 29, 2024

xushiyan force-pushed the add-hudi-datasource branch from cb6a8d0 to dafcf46 Compare July 29, 2024 23:50

aslonnie added and removed the go label Jul 30, 2024

omatthew98 added the go label Aug 5, 2024

omatthew98 approved these changes Aug 5, 2024

scottjlee reviewed Aug 5, 2024

python/ray/data/_internal/datasource/hudi_datasource.py (outdated)

python/ray/data/tests/test_hudi.py (outdated)

alexeykudinkin self-requested a review August 9, 2024 17:48

xushiyan force-pushed the add-hudi-datasource branch 8 times, most recently from fe1c93a to b76c200 Compare August 19, 2024 05:47

xushiyan added 2 commits October 15, 2024 07:26

[data][api] implement HudiDataSource

583dcfb

@MicroCheck //python:ray/data/tests/test_hudi Signed-off-by: Shiyan Xu <[email protected]>

update hudi version

f5d4802

Signed-off-by: Shiyan Xu <[email protected]>

xushiyan force-pushed the add-hudi-datasource branch from 00c4a47 to f5d4802 Compare October 15, 2024 12:26
