-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] REP 54: Add PodName to the HeadInfo #2266
Conversation
Signed-off-by: Rueian <[email protected]>
@@ -1257,7 +1257,7 @@ func (r *RayClusterReconciler) getHeadServiceIPAndName(ctx context.Context, inst | |||
return "", "", fmt.Errorf("head service IP is empty. cluster name %s, filter labels %v", instance.Name, common.RayClusterHeadServiceListOptions(instance)) | |||
} else if runtimeServices.Items[0].Spec.ClusterIP == corev1.ClusterIPNone { | |||
// We return Head Pod IP if the Head service is headless. | |||
ip, err := r.getHeadPodIP(ctx, instance) | |||
ip, _, err := r.getHeadPodIPAndName(ctx, instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This getHeadServiceIPAndName
function returns the service name so we omit the returned Head name here.
if err != nil { | ||
return err | ||
} | ||
instance.Status.Head.PodIP = ip | ||
instance.Status.Head.PodName = name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we save the Head name into the HeadInfo status.
} | ||
|
||
assert.Equal(t, tc.expectedIP, ip, "getHeadPodIP returned unexpected IP") | ||
assert.Equal(t, tc.expectedIP, ip, "getHeadPodIPAndName returned unexpected IP") | ||
assert.Equal(t, tc.expectedName, name, "getHeadPodIPAndName returned unexpected name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure that the Head name is saved.
return nil, fmt.Errorf("Found %d head pods for RayCluster %s in the namespace %s", len(podList.Items), instance.Name, instance.Namespace) | ||
if instance.Status.Head.PodName != "" { | ||
pod := &corev1.Pod{} | ||
if err := r.Get(ctx, types.NamespacedName{Namespace: instance.Namespace, Name: instance.Status.Head.PodName}, pod); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the design doc, let RayService use the new .Status.Head.PodName
to get the head.
Signed-off-by: Rueian <[email protected]>
f19023e
to
6f40873
Compare
Signed-off-by: Rueian <[email protected]>
Signed-off-by: Rueian <[email protected]>
} | ||
if len(runtimePods.Items) != 1 { | ||
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels)) | ||
return "", nil | ||
return "", "", nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit weird to return nil
when there is more than one head Pod. Can you check whether we need to return a non-nil error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior was intentional to keep the status update successful. But since we delete extra heads in the reconcilePods
step, I think we are okay with returning an error here.
@@ -1229,19 +1229,19 @@ func (r *RayClusterReconciler) calculateStatus(ctx context.Context, instance *ra | |||
} | |||
|
|||
// Best effort to obtain the ip of the head node. | |||
func (r *RayClusterReconciler) getHeadPodIP(ctx context.Context, instance *rayv1.RayCluster) (string, error) { | |||
func (r *RayClusterReconciler) getHeadPodIPAndName(ctx context.Context, instance *rayv1.RayCluster) (string, string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use getHeadPod
which returns the head Pod and an err instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kevin85421, do you mean remove this function? We use this function to get the head IP and name and store them in the HeadInfo
. It is actually this function that powers the getHeadPod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: we moved the getHeadPodIPAndName
into common.GetRayClusterHeadPod
.
a90e3b3
to
cc3f369
Compare
@@ -1220,17 +1220,3 @@ func (r *RayServiceReconciler) isHeadPodRunningAndReady(ctx context.Context, ins | |||
func isServeAppUnhealthyOrDeployedFailed(appStatus string) bool { | |||
return appStatus == rayv1.ApplicationStatusEnum.UNHEALTHY || appStatus == rayv1.ApplicationStatusEnum.DEPLOY_FAILED | |||
} | |||
|
|||
// TODO: Move this function to util.go and always use this function to retrieve the head Pod. | |||
func (r *RayServiceReconciler) getHeadPod(ctx context.Context, instance *rayv1.RayCluster) (*corev1.Pod, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to common.GetRayClusterHeadPod
.
Signed-off-by: Rueian <[email protected]>
cc3f369
to
21d466d
Compare
@@ -159,3 +163,14 @@ func RayJobRayClusterNamespacedName(rayJob *rayv1.RayJob) types.NamespacedName { | |||
Namespace: rayJob.Namespace, | |||
} | |||
} | |||
|
|||
func GetRayClusterHeadPod(ctx context.Context, reader client.Reader, instance *rayv1.RayCluster) (*corev1.Pod, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use common.RayClusterHeadPodsAssociationOptions
instead.
@@ -1309,13 +1313,14 @@ func (r *RayClusterReconciler) updateEndpoints(ctx context.Context, instance *ra | |||
} | |||
|
|||
func (r *RayClusterReconciler) updateHeadInfo(ctx context.Context, instance *rayv1.RayCluster) error { | |||
ip, err := r.getHeadPodIP(ctx, instance) | |||
ip, name, err := r.getHeadPodIPAndName(ctx, instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
headPod, err := common.GetRayClusterHeadPod(...)
...
instance.Status.Head.PodIP = headPod.Status.PodIP
…dPodsAssociationOptions Signed-off-by: Rueian <[email protected]>
34d462f
to
9ff0114
Compare
return nil, err | ||
} | ||
if len(runtimePods.Items) == 0 { | ||
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using fmt.Sprintf
. We can use:
logger.Info("Found 0 head Pod", "filter labels", filterLabels)
instead. I removed instance.Name
because I think all KubeRay logs currently include the information about the CR name.
return nil, nil | ||
} | ||
if len(runtimePods.Items) > 1 { | ||
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
- remove
fmt.Sprintf
- remove
instance.Name
} | ||
if len(runtimePods.Items) > 1 { | ||
logger.Info(fmt.Sprintf("Found %d head pods. cluster name %s, filter labels %v", len(runtimePods.Items), instance.Name, filterLabels)) | ||
return nil, fmt.Errorf("found multiple heads. cluster name %s, filter labels %v", instance.Name, filterLabels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove "cluster name"
@@ -1257,11 +1241,14 @@ func (r *RayClusterReconciler) getHeadServiceIPAndName(ctx context.Context, inst | |||
return "", "", fmt.Errorf("head service IP is empty. cluster name %s, filter labels %v", instance.Name, common.RayClusterHeadServiceListOptions(instance)) | |||
} else if runtimeServices.Items[0].Spec.ClusterIP == corev1.ClusterIPNone { | |||
// We return Head Pod IP if the Head service is headless. | |||
ip, err := r.getHeadPodIP(ctx, instance) | |||
head, err := common.GetRayClusterHeadPod(ctx, r, instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use headPod
instead to keep the naming consistent with the codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching all these.
…dPodsAssociationOptions Signed-off-by: Rueian <[email protected]>
Why are these changes needed?
Closes #2236
This PR adds the new
PodName
to theHeadInfo
status. And make RayService use the field to fetch the Head Pod.Checks