Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] REP 54: Add PodName to the HeadInfo #2266

Merged
merged 7 commits into from
Jul 25, 2024

Conversation

rueian
Copy link
Contributor

@rueian rueian commented Jul 23, 2024

Why are these changes needed?

Closes #2236

This PR adds the new PodName to the HeadInfo status. And make RayService use the field to fetch the Head Pod.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests

@rueian rueian changed the title Head info pod name [Feature] REP 54: Add PodName to the HeadInfo Jul 23, 2024
@@ -1257,7 +1257,7 @@ func (r *RayClusterReconciler) getHeadServiceIPAndName(ctx context.Context, inst
return "", "", fmt.Errorf("head service IP is empty. cluster name %s, filter labels %v", instance.Name, common.RayClusterHeadServiceListOptions(instance))
} else if runtimeServices.Items[0].Spec.ClusterIP == corev1.ClusterIPNone {
// We return Head Pod IP if the Head service is headless.
ip, err := r.getHeadPodIP(ctx, instance)
ip, _, err := r.getHeadPodIPAndName(ctx, instance)
Copy link
Contributor Author

@rueian rueian Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This getHeadServiceIPAndName function returns the service name so we omit the returned Head name here.

if err != nil {
return err
}
instance.Status.Head.PodIP = ip
instance.Status.Head.PodName = name
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we save the Head name into the HeadInfo status.

}

assert.Equal(t, tc.expectedIP, ip, "getHeadPodIP returned unexpected IP")
assert.Equal(t, tc.expectedIP, ip, "getHeadPodIPAndName returned unexpected IP")
assert.Equal(t, tc.expectedName, name, "getHeadPodIPAndName returned unexpected name")
Copy link
Contributor Author

@rueian rueian Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that the Head name is saved.

return nil, fmt.Errorf("Found %d head pods for RayCluster %s in the namespace %s", len(podList.Items), instance.Name, instance.Namespace)
if instance.Status.Head.PodName != "" {
pod := &corev1.Pod{}
if err := r.Get(ctx, types.NamespacedName{Namespace: instance.Namespace, Name: instance.Status.Head.PodName}, pod); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the design doc, let RayService use the new .Status.Head.PodName to get the head.

@rueian rueian force-pushed the head-info-pod-name branch from f19023e to 6f40873 Compare July 23, 2024 13:38
@rueian rueian marked this pull request as ready for review July 23, 2024 14:07
@kevin85421 kevin85421 self-assigned this Jul 23, 2024
@kevin85421 kevin85421 self-requested a review July 23, 2024 18:57
ray-operator/controllers/ray/raycluster_controller.go Outdated Show resolved Hide resolved
}
if len(runtimePods.Items) != 1 {
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels))
return "", nil
return "", "", nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit weird to return nil when there is more than one head Pod. Can you check whether we need to return a non-nil error instead?

Copy link
Contributor Author

@rueian rueian Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior was intentional to keep the status update successful. But since we delete extra heads in the reconcilePods step, I think we are okay with returning an error here.

@@ -1229,19 +1229,19 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra
}

// Best effort to obtain the ip of the head node.
func (r *RayClusterReconciler) getHeadPodIP(ctx context.Context, instance *rayv1.RayCluster) (string, error) {
func (r *RayClusterReconciler) getHeadPodIPAndName(ctx context.Context, instance *rayv1.RayCluster) (string, string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use getHeadPod which returns the head Pod and an err instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kevin85421, do you mean remove this function? We use this function to get the head IP and name and store them in the HeadInfo. It is actually this function that powers the getHeadPod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: we moved the getHeadPodIPAndName into common.GetRayClusterHeadPod.

ray-operator/controllers/ray/rayservice_controller.go Outdated Show resolved Hide resolved
@@ -1220,17 +1220,3 @@ func (r *RayServiceReconciler) isHeadPodRunningAndReady(ctx context.Context, ins
func isServeAppUnhealthyOrDeployedFailed(appStatus string) bool {
return appStatus == rayv1.ApplicationStatusEnum.UNHEALTHY || appStatus == rayv1.ApplicationStatusEnum.DEPLOY_FAILED
}

// TODO: Move this function to util.go and always use this function to retrieve the head Pod.
func (r *RayServiceReconciler) getHeadPod(ctx context.Context, instance *rayv1.RayCluster) (*corev1.Pod, error) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to common.GetRayClusterHeadPod.

@rueian rueian force-pushed the head-info-pod-name branch from cc3f369 to 21d466d Compare July 24, 2024 10:06
@rueian rueian requested a review from kevin85421 July 24, 2024 10:58
@@ -159,3 +163,14 @@ func RayJobRayClusterNamespacedName(rayJob *rayv1.RayJob) types.NamespacedName {
Namespace: rayJob.Namespace,
}
}

func GetRayClusterHeadPod(ctx context.Context, reader client.Reader, instance *rayv1.RayCluster) (*corev1.Pod, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use common.RayClusterHeadPodsAssociationOptions instead.

@@ -1309,13 +1313,14 @@ func (r *RayClusterReconciler) updateEndpoints(ctx context.Context, instance *ra
}

func (r *RayClusterReconciler) updateHeadInfo(ctx context.Context, instance *rayv1.RayCluster) error {
ip, err := r.getHeadPodIP(ctx, instance)
ip, name, err := r.getHeadPodIPAndName(ctx, instance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

headPod, err := common.GetRayClusterHeadPod(...)
...
instance.Status.Head.PodIP = headPod.Status.PodIP

@rueian rueian force-pushed the head-info-pod-name branch from 34d462f to 9ff0114 Compare July 24, 2024 17:19
ray-operator/controllers/ray/common/association.go Outdated Show resolved Hide resolved
return nil, err
}
if len(runtimePods.Items) == 0 {
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using fmt.Sprintf. We can use:

logger.Info("Found 0 head Pod", "filter labels", filterLabels)

instead. I removed instance.Name because I think all KubeRay logs currently include the information about the CR name.

return nil, nil
}
if len(runtimePods.Items) > 1 {
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

  1. remove fmt.Sprintf
  2. remove instance.Name

}
if len(runtimePods.Items) > 1 {
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels))
return nil, fmt.Errorf("found multiple heads. cluster name %s, filter labels %v", instance.Name, filterLabels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove "cluster name"

@@ -1257,11 +1241,14 @@ func (r *RayClusterReconciler) getHeadServiceIPAndName(ctx context.Context, inst
return "", "", fmt.Errorf("head service IP is empty. cluster name %s, filter labels %v", instance.Name, common.RayClusterHeadServiceListOptions(instance))
} else if runtimeServices.Items[0].Spec.ClusterIP == corev1.ClusterIPNone {
// We return Head Pod IP if the Head service is headless.
ip, err := r.getHeadPodIP(ctx, instance)
head, err := common.GetRayClusterHeadPod(ctx, r, instance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use headPod instead to keep the naming consistent with the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching all these.

@rueian rueian requested a review from kevin85421 July 25, 2024 00:56
@kevin85421 kevin85421 merged commit 78b9828 into ray-project:master Jul 25, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] REP 54: Add PodName to the HeadInfo
3 participants