
check if pod is already running before relaunch #1407

Conversation

majieyue (Collaborator) opened this pull request:

What changes were proposed in this pull request?

Check whether a pod with the same rank is already running before relaunching another one.

Why are the changes needed?

Some k8s plugins can recover a pod from a startup failure. When a pod is deleted or modified from a pending/running state, a pod with the same rank and id may already be running successfully, so we should not relaunch another pod with the same rank.
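To make the idea concrete, a duplicate check before relaunch could look roughly like the sketch below. This is a minimal illustration using the official Kubernetes Python client, not the PR's actual code; the label keys and function shape are assumptions.

# Hedged sketch: check for an already-running pod with the same rank
# before relaunching. The label keys below are assumptions, not the
# project's actual labels.
from kubernetes import client, config

def pod_already_running(namespace: str, job_name: str, rank: int) -> bool:
    """Return True if a Running pod with the same job name and rank exists."""
    config.load_incluster_config()  # assumes this runs inside the cluster
    v1 = client.CoreV1Api()
    selector = f"elasticjob-name={job_name},rank-index={rank}"
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return any(p.status.phase == "Running" for p in pods.items)

The relaunch path would then skip creating a new pod whenever this returns True.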

Does this PR introduce any user-facing change?

No

How was this patch tested?

BVT UT


# Even if _should_relaunch returns True from the state-machine check,
# we need another round of checking.
if should_relaunch:
Reviewer (Collaborator) commented on the diff above:

Move this code into the method _should_relaunch and add a unit test.

majieyue (Collaborator, Author) replied:

ok
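A refactor along the lines of that suggestion might look like the sketch below; apart from _should_relaunch itself, the class, helper names, and test scaffolding are hypothetical, not the project's actual API.

# Hedged sketch of the suggested refactor; everything except the name
# _should_relaunch is hypothetical.
from unittest import mock

class PodManagerSketch:
    def _state_machine_allows_relaunch(self, cur_node, event):
        ...  # the existing state-machine check (elided)

    def _running_pod_exists(self, cur_node):
        ...  # query the API server with the pod's unique labels (elided)

    def _should_relaunch(self, cur_node, event) -> bool:
        # First round: the state-machine check.
        if not self._state_machine_allows_relaunch(cur_node, event):
            return False
        # Second round: skip the relaunch if a pod with the same rank
        # and unique labels is already running.
        return not self._running_pod_exists(cur_node)

def test_should_relaunch_skips_duplicate():
    manager = PodManagerSketch()
    with mock.patch.object(
        manager, "_state_machine_allows_relaunch", return_value=True
    ), mock.patch.object(manager, "_running_pod_exists", return_value=True):
        assert manager._should_relaunch(object(), object()) is False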

workingloong (Collaborator) left a comment:

Does native k8s recover the pod from a startup failure?

BalaBalaYi (Collaborator) left a comment:

need more investigation

selector = k8s_util.gen_k8s_label_selector_from_dict(
    self._get_pod_unique_labels(cur_node)
)
logger.info(
Reviewer (Collaborator) commented on the diff above:
This part has actually already been implemented (when generating the event before relaunch), so the issue probably isn't here.

majieyue (Collaborator, Author) replied on Dec 27, 2024:

From the log, the event type is MODIFIED and the node status is RUNNING, so the condition in _process_event is not matched.

BalaBalaYi (Collaborator) replied on Dec 27, 2024:

A MODIFIED event whose pod has a deletion timestamp is converted to a DELETED event:

# If the pod has a 'deletion_timestamp', set its status to DELETED
# directly: a deletion rarely fails, so treating the pod as deleted
# carries little risk of skewing the node status judgement.
if metadata.deletion_timestamp:
    status = NodeStatus.DELETED
else:
    status = pod.status.phase
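The same logic as a self-contained helper, for reference; the NodeStatus enum here is assumed for illustration and is not the project's actual module.

# Hedged sketch: map a watched pod to a node status, treating a pending
# deletion as deleted. The NodeStatus value below is an assumption.
from enum import Enum

class NodeStatus(str, Enum):
    DELETED = "Deleted"

def node_status_from_pod(pod) -> str:
    # A pod carrying a deletion_timestamp is treated as deleted even if
    # the watch event is MODIFIED and the phase is still Running.
    if pod.metadata.deletion_timestamp:
        return NodeStatus.DELETED
    return pod.status.phase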

BalaBalaYi (Collaborator) commented:

The issue is resolved by #1408.

BalaBalaYi closed this on Dec 30, 2024.