-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAG] Fix missing states in channel pickling #46417
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ensure_registered_as_writer
and ensure_registered_as_reader
should be idempotent, so double registration should be fine. We should make the registration idempotent in this PR.
ray/python/ray/experimental/channel/common.py
Lines 178 to 188 in 9f9041d
def ensure_registered_as_writer(self): | |
""" | |
Check whether the process is a valid writer. This method must be idempotent. | |
""" | |
raise NotImplementedError | |
def ensure_registered_as_reader(self): | |
""" | |
Check whether the process is a valid reader. This method must be idempotent. | |
""" | |
raise NotImplementedError |
Could you break down the sequence of steps in the ADAG that causes this double registration? |
Yeah the issue is not these methods are not idempotent, but a state ( |
Please see the updated description. |
Signed-off-by: Rui Qiao <[email protected]>
3449b85
to
1758783
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: what's the reason why it didn't happen in a single node
Yeah Line 3715 in 9f9041d
|
By definition, an idempotent operation should have no difference no matter whether we call the function 1 time or many times. If the second call of |
The issue is that after serde the internal state of a channel changes. And it is arguable whether it is still the same object. I don't think any idempotent method can still guarantee to be idempotent when the object's internal state unexpectedly changes. This is to fix the exact state change bug between serde. If you have a different idea, can you elaborate the fix you have on mind? |
Sync with @ruisearch42 offline. The PR currently makes sense to me. The idempotency of |
Let me take a look before this is merged. |
Hmm, if a channel registers itself as the writer on some node A, and then the channel is passed to some unrelated node B, node B will not register itself as the writer. I think the right direction here would be to be the |
Thanks Jack.
That sounds like the right behavior.
hmm, not sure if this is the same or a different problem you are targeting, and not sure the solution is what we want. I will reach out to discuss offline. |
Why are these changes needed?
When an ADAG channel is pickled, it currently does not include
_writer_registered
flag. However, when the channel is deserialized and passed to another node, the channel may be double registered, causing runtime failures.Using the repro script of #46411 as an example:
ensure_registered_as_writer()
) happens when the driver callsdo_allocate_channel()
on the actor,_writer_registered
is set toTrue
_writer_registered
is False as the field is not pickleddo_exec_tasks()
(->_prep_task()
->output_writer.start()
->_output_channel.ensure_registered_as_writer()
) on the actor, the task's output channel is passed in from driver (with _writer_registered==
False`).ensure_registered_as_writer()
(if the reader is a remote node) eventually callsExperimentalRegisterMutableObjectReaderRemote()
(->HandleRegisterMutableObject()
) on the remote node, where it inserts an entry to a hash map keyed withwriter_object_id
. If there is already an entry with the same key, the check fails as shown in the following snippet:This PR fixes the issue by including these states in pickling. A new test
test_pp
is added to verify the fix.This PR also introduces
test_multi_node_dag
and moves several tests fromtest_accelerated_dag
since it got large.Related issue number
Closes #46411
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.