
wait_for_resource func doesn't validate if more than one pod takes time to be initialized #68

Open
vprashar2929 opened this issue Mar 31, 2024 · 2 comments

Comments

@vprashar2929
Contributor

When checking pod status during a Prometheus deployment, the wait_for_resource func currently checks whether pods are in the Ready state and, based on the current state, waits until the max retries are exhausted. In situations where some pods in a given namespace don't come up immediately, or are still in the init state while another pod is already Ready, the check skips validating the status of the remaining pods.

A sample run log from the kepler-operator CI:

2024-03-28T10:34:11.7436108Z 
2024-03-28T10:34:11.7437530Z   🔆🔆🔆  Waiting for pods in monitoring to be ready  🔆🔆🔆
2024-03-28T10:34:11.7451616Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2024-03-28T10:34:11.7467141Z 10:34:11 🔔 INFO : Waiting for pods to be in Ready state
2024-03-28T10:34:11.7467532Z 
2024-03-28T10:34:13.7202430Z pod/prometheus-operator-86c875f999-zgpkm condition met
2024-03-28T10:34:13.7770338Z pod/prometheus-operator-86c875f999-zgpkm condition met
2024-03-28T10:34:13.7771429Z error: condition not met for pods/prometheus-k8s-0
2024-03-28T10:34:13.7790130Z     ❌ pods --all -n monitoring failed to be in Ready state
2024-03-28T10:34:13.7790728Z 
2024-03-28T10:34:13.7791175Z     ❌ Pods below failed to run
2024-03-28T10:34:13.7791565Z 
2024-03-28T10:34:13.8257085Z NAME               READY   STATUS     RESTARTS   AGE
2024-03-28T10:34:13.8258340Z prometheus-k8s-0   0/2     Init:0/1   0          1s
2024-03-28T10:34:13.8344266Z fail to setup local-dev-cluster
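
A minimal sketch of the kind of check the issue asks for, assuming wait_for_resource roughly wraps `kubectl wait` in a retry loop; the function and variable names below are illustrative, not the actual lib/utils.sh code:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real wait_for_resource in lib/utils.sh differs.
# Re-checks *all* pods in the namespace on every retry, so a pod that is still
# in Init (like prometheus-k8s-0 above) is not skipped once another pod is Ready.
wait_for_all_pods_ready() {
    local namespace="$1"
    local max_retries="${2:-20}"
    local interval="${3:-10}"
    local i

    for ((i = 1; i <= max_retries; i++)); do
        # --timeout=0 makes kubectl check once and return immediately
        if kubectl wait --for=condition=Ready pods --all \
            -n "$namespace" --timeout=0 >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
    done

    echo "pods in $namespace failed to become Ready" >&2
    kubectl get pods -n "$namespace" >&2
    return 1
}
```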
@SamYuan1990
Contributor

Add a timeout, or add a retry?

@SamYuan1990
Contributor

I suppose it is a bug, as https://github.com/sustainable-computing-io/local-dev-cluster/blob/main/lib/utils.sh#L85-L86 @vprashar2929. I suppose https://github.com/sustainable-computing-io/kepler-model-server/blob/main/hack/k8s_helper.sh#L37, or just using the default timeout from kubectl, is another option for us? Would you like to open a PR for a fix?
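
For reference, the simpler variant this comment suggests, a single `kubectl wait` call with an explicit (or default) timeout instead of a hand-rolled retry loop, would look roughly like this; the 300s value is only an example, not what either repo actually uses:

```bash
# Sketch of the alternative: let `kubectl wait` own the waiting/timeout logic.
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```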
