-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data][api] implement HudiDataSource
#46273
base: master
Are you sure you want to change the base?
Conversation
7bc3894
to
97f9de1
Compare
97f9de1
to
d4e8af6
Compare
e2f6704
to
557b887
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a first pass and left some mostly nit comments. Overall looks good, let us know / re-request a review when it is ready for a re-review!
return read_tasks | ||
|
||
def estimate_inmemory_data_size(self) -> Optional[int]: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any estimate that can be provided / returned here? Perhaps using the size_bytes
from above? Maybe could cache that similar to what is done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree to provide estimates here. However due to current impl, loading this info during init using HudiTable is not a lightweight operation, plus the size bytes are storage size without some translation to in-memory size. i've added a todo here to support this info through HudiTable API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable!
0, | ||
) | ||
pytestmark = pytest.mark.skipif( | ||
PYARROW_LE_8_0_0, reason="hudi only supported if pyarrow >= 8.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens currently if a user tries to use pyarrow <8.0.0? Should we warn / error in the read_hudi_table
or HudiDatasource
if the user's pyarrow version is less than 8.0.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should raise an exception when using either read_hudi_table or HudiDatasource if the arrow version is not supported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pointing out. i'll add a check and raise exception from within HudiTable
API since this is internal to Hudi's implementation. This validation logic will be available in the upcoming hudi python 0.2.0. We can use another PR to integrate 0.2.0 with incremental read support, sounds good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds reasonable too. Just to make sure I understand the idea would be that in the creation of the HudiDatasource
we would then get an exception from the from hudi import HudiTable
if the version of pyarrow is unsupported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@omatthew98 that's correct.
0, | ||
) | ||
pytestmark = pytest.mark.skipif( | ||
PYARROW_LE_8_0_0, reason="hudi only supported if pyarrow >= 8.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should raise an exception when using either read_hudi_table or HudiDatasource if the arrow version is not supported
557b887
to
177caab
Compare
61073db
to
cb6a8d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for responding to the questions, had one more for my understanding. After that if you add some more assertions to the test_read_hudi_table
it lgtm!
return read_tasks | ||
|
||
def estimate_inmemory_data_size(self) -> Optional[int]: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable!
0, | ||
) | ||
pytestmark = pytest.mark.skipif( | ||
PYARROW_LE_8_0_0, reason="hudi only supported if pyarrow >= 8.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds reasonable too. Just to make sure I understand the idea would be that in the creation of the HudiDatasource
we would then get an exception from the from hudi import HudiTable
if the version of pyarrow is unsupported?
cb6a8d0
to
dafcf46
Compare
fe1c93a
to
b76c200
Compare
@MicroCheck //python:ray/data/tests/test_hudi Signed-off-by: Shiyan Xu <[email protected]>
Signed-off-by: Shiyan Xu <[email protected]>
00c4a47
to
f5d4802
Compare
Why are these changes needed?
Support read from Hudi table into Ray dataset.
Related issue number
Closes #46272
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.