
check if pod is already running before relaunch #1407

Conversation

majieyue (Collaborator) opened this pull request:

What changes were proposed in this pull request?

Check whether a pod with the same rank is already running before relaunching another one.

Why are the changes needed?

Some k8s plugins can recover a pod from a startup failure. When a pod is deleted or modified from a pending/running state, a pod with the same rank and id may already be running successfully, so we should not relaunch another pod with the same rank.
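To make the idea concrete, a duplicate check before relaunch could look roughly like the sketch below. This is a minimal illustration using the official Kubernetes Python client, not the PR's actual code; the label keys and function shape are assumptions.

# Hedged sketch: check for an already-running pod with the same rank
# before relaunching. The label keys below are assumptions, not the
# project's actual labels.
from kubernetes import client, config

def pod_already_running(namespace: str, job_name: str, rank: int) -> bool:
    """Return True if a Running pod with the same job name and rank exists."""
    config.load_incluster_config()  # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    selector = f"elasticjob-name={job_name},rank-index={rank}"
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return any(p.status.phase == "Running" for p in pods.items)

The relaunch path would then skip creating a new pod whenever this returns True.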

Does this PR introduce any user-facing change?

No

How was this patch tested?

BVT UT


# Even if _should_relaunch returns True from the state-machine check,
# we need another round of checking.
if should_relaunch:
Reviewer (Collaborator) commented on the diff above:

Move this code into the method _should_relaunch and add a unit test.

majieyue (Collaborator, Author) replied:

ok
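A refactor along the lines of that suggestion might look like the sketch below; apart from _should_relaunch itself, the class, helper names, and test scaffolding are hypothetical, not the project's actual API.

# Hedged sketch of the suggested refactor; everything except the name
# _should_relaunch is hypothetical.
from unittest import mock

class PodManagerSketch:
    def _state_machine_allows_relaunch(self, cur_node, event):
        ...  # the existing state-machine check (elided)

    def _running_pod_exists(self, cur_node):
        ...  # query the API server with the pod's unique labels (elided)

    def _should_relaunch(self, cur_node, event) -> bool:
        # First round: the state-machine check.
        if not self._state_machine_allows_relaunch(cur_node, event):
            return False
        # Second round: skip the relaunch if a pod with the same rank
        # and unique labels is already running.
        return not self._running_pod_exists(cur_node)

def test_should_relaunch_skips_duplicate():
    manager = PodManagerSketch()
    with mock.patch.object(
        manager, "_state_machine_allows_relaunch", return_value=True
    ), mock.patch.object(manager, "_running_pod_exists", return_value=True):
        assert manager._should_relaunch(object(), object()) is False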

workingloong (Collaborator) left a comment:

Does native k8s recover the pod from a startup failure?

BalaBalaYi (Collaborator) left a comment:

need more investigation

selector = k8s_util.gen_k8s_label_selector_from_dict(
    self._get_pod_unique_labels(cur_node)
)
logger.info(
Reviewer (Collaborator) commented on the diff above:
This part has actually already been implemented (when generating the event before relaunch), so the issue probably isn't here.

majieyue (Collaborator, Author) replied on Dec 27, 2024:

From the log, the event type is MODIFIED and the node status is RUNNING, so the condition in _process_event is not matched.

BalaBalaYi (Collaborator) replied on Dec 27, 2024:

A MODIFIED event whose pod has a deletion timestamp is converted to a DELETED event:

# If the pod has a 'deletion_timestamp', set its status to DELETED
# directly: a deletion rarely fails, so treating the pod as deleted
# carries little risk of skewing the node status judgement.
if metadata.deletion_timestamp:
    status = NodeStatus.DELETED
else:
    status = pod.status.phase
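The same logic as a self-contained helper, for reference; the NodeStatus enum here is assumed for illustration and is not the project's actual module.

# Hedged sketch: map a watched pod to a node status, treating a pending
# deletion as deleted. The NodeStatus value below is an assumption.
from enum import Enum

class NodeStatus(str, Enum):
    DELETED = "Deleted"

def node_status_from_pod(pod) -> str:
    # A pod carrying a deletion_timestamp is treated as deleted even if
    # the watch event is MODIFIED and the phase is still Running.
    if pod.metadata.deletion_timestamp:
        return NodeStatus.DELETED
    return pod.status.phase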

BalaBalaYi (Collaborator) commented:

The issue is resolved by #1408.

BalaBalaYi closed this on Dec 30, 2024.