check if pod is already running before relaunch #1407
Conversation
# Even if _should_relaunch returns True by the state-machine check,
# we need another round of checks.
if should_relaunch:
Move this code into the method _should_relaunch and add a UT.
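A minimal, self-contained sketch of how that refactor and the requested unit test could look; the class name, constructor arguments, and pod phases below are illustrative assumptions, not the actual dlrover API:

class RelaunchDecider:
    # Illustrative only: combines the existing state-machine check with a
    # duplicate-pod check so the decision can be unit tested in isolation.
    def __init__(self, state_machine_check, list_same_label_pod_phases):
        # state_machine_check(node) -> bool: the existing relaunch decision
        # list_same_label_pod_phases(node) -> list of phases, e.g. ["Running"]
        self._state_machine_check = state_machine_check
        self._list_same_label_pod_phases = list_same_label_pod_phases

    def should_relaunch(self, node):
        if not self._state_machine_check(node):
            return False
        # Even if the state machine says "relaunch", skip it when a pod with
        # the same unique labels is already pending or running.
        phases = self._list_same_label_pod_phases(node)
        return not any(p in ("Pending", "Running") for p in phases)


def test_no_relaunch_when_duplicate_pod_is_running():
    decider = RelaunchDecider(
        state_machine_check=lambda node: True,
        list_same_label_pod_phases=lambda node: ["Running"],
    )
    assert decider.should_relaunch(node={"rank": 0}) is False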
ok
Does native k8s recover the pod from a startup failure?
need more investigation
selector = k8s_util.gen_k8s_label_selector_from_dict(
    self._get_pod_unique_labels(cur_node)
)
logger.info(
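For context, gen_k8s_label_selector_from_dict and _get_pod_unique_labels are taken from the diff above; with the official Kubernetes Python client, a lookup driven by such a label selector boils down to roughly the following sketch (the namespace and label keys are placeholders, not the real dlrover labels):

from kubernetes import client, config

def find_running_pods(namespace, unique_labels):
    # Build a selector string such as "app=my-job,rank-index=3" from a dict.
    selector = ",".join(f"{k}={v}" for k, v in unique_labels.items())
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    # Only pods that are already running count as duplicates.
    return [p.metadata.name for p in pods.items if p.status.phase == "Running"]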
This check has actually already been implemented (when generating the event before relaunch), so the issue probably isn't here.
From the log, the event type is MODIFIED and the node status is RUNNING, so the condition in _process_event is not matched.
A MODIFIED event with a deletion timestamp will be converted to a DELETED event:
dlrover/dlrover/python/master/watcher/k8s_watcher.py
Lines 229 to 235 in 5d070f4
# if the pod has a 'deletion_timestamp', set its status to deleted directly,
# because a failure of the deletion is unlikely to affect the node-status
# judgement
if metadata.deletion_timestamp:
    status = NodeStatus.DELETED
else:
    status = pod.status.phase
The issue is resolved by #1408.
What changes were proposed in this pull request?
Check whether the pod is duplicated before relaunching another one.
Why are the changes needed?
Some k8s plugins can recover a pod from a startup failure. When a pod is deleted, or modified out of a pending/running state, a pod with the same rank and id may already be running successfully, so we should not relaunch another pod with the same rank.
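Purely as an illustration of that rule (a sketch, not code from this PR; event types and pod phases follow the Kubernetes convention):

def need_relaunch(event_type, last_phase, same_rank_pod_phases):
    # The watched pod left a pending/running state. Per the watcher code quoted
    # in the conversation above, a MODIFIED event carrying a deletion timestamp
    # has already been normalized to DELETED before this decision is made.
    left_active_state = event_type == "DELETED" and last_phase in ("Pending", "Running")
    # Some k8s plugin may already have brought back a pod with the same rank and id.
    duplicate_alive = any(p in ("Pending", "Running") for p in same_rank_pod_phases)
    return left_active_state and not duplicate_alive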
Does this PR introduce any user-facing change?
No
How was this patch tested?
BVT UT