
wait_for_resource func doesn't validate if more than one pod takes time to be initialized #68

Open
vprashar2929 opened this issue Mar 31, 2024 · 2 comments

Comments

@vprashar2929
Contributor

When checking pod status during a Prometheus deployment, the wait_for_resource func currently checks whether pods are in the Ready state and, based on the current state, waits until the max retries are exhausted. In situations where some pods in a given namespace don't come up immediately, or are still in the init state while another pod is already Ready, the check skips validating the status of the remaining pods.

A sample run log from the kepler-operator CI:

2024-03-28T10:34:11.7436108Z 
2024-03-28T10:34:11.7437530Z   🔆🔆🔆  Waiting for pods in monitoring to be ready  🔆🔆🔆
2024-03-28T10:34:11.7451616Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2024-03-28T10:34:11.7467141Z 10:34:11 🔔 INFO : Waiting for pods to be in Ready state
2024-03-28T10:34:11.7467532Z 
2024-03-28T10:34:13.7202430Z pod/prometheus-operator-86c875f999-zgpkm condition met
2024-03-28T10:34:13.7770338Z pod/prometheus-operator-86c875f999-zgpkm condition met
2024-03-28T10:34:13.7771429Z error: condition not met for pods/prometheus-k8s-0
2024-03-28T10:34:13.7790130Z     ❌ pods --all -n monitoring failed to be in Ready state
2024-03-28T10:34:13.7790728Z 
2024-03-28T10:34:13.7791175Z     ❌ Pods below failed to run
2024-03-28T10:34:13.7791565Z 
2024-03-28T10:34:13.8257085Z NAME               READY   STATUS     RESTARTS   AGE
2024-03-28T10:34:13.8258340Z prometheus-k8s-0   0/2     Init:0/1   0          1s
2024-03-28T10:34:13.8344266Z fail to setup local-dev-cluster
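
A minimal sketch of the kind of check the issue asks for, assuming wait_for_resource roughly wraps `kubectl wait` in a retry loop; the function and variable names below are illustrative, not the actual lib/utils.sh code:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the real wait_for_resource in lib/utils.sh differs.
# Re-checks *all* pods in the namespace on every retry, so a pod that is still
# in Init (like prometheus-k8s-0 above) is not skipped once another pod is Ready.
wait_for_all_pods_ready() {
    local namespace="$1"
    local max_retries="${2:-20}"
    local interval="${3:-10}"
    local i

    for ((i = 1; i <= max_retries; i++)); do
        # --timeout=0 makes kubectl check once and return immediately
        if kubectl wait --for=condition=Ready pods --all \
            -n "$namespace" --timeout=0 >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
    done

    echo "pods in $namespace failed to become Ready" >&2
    kubectl get pods -n "$namespace" >&2
    return 1
}
```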
@SamYuan1990
Contributor

Add a timeout, or add a retry?

@SamYuan1990
Contributor

I suppose it is a bug, as https://github.com/sustainable-computing-io/local-dev-cluster/blob/main/lib/utils.sh#L85-L86 @vprashar2929. I suppose https://github.com/sustainable-computing-io/kepler-model-server/blob/main/hack/k8s_helper.sh#L37, or just using the default timeout from kubectl, is another option for us? Would you like to open a PR for a fix?
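
For reference, the simpler variant this comment suggests, a single `kubectl wait` call with an explicit (or default) timeout instead of a hand-rolled retry loop, would look roughly like this; the 300s value is only an example, not what either repo actually uses:

```bash
# Sketch of the alternative: let `kubectl wait` own the waiting/timeout logic.
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```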
