Retry failed connections to the APIserver #1044
Labels
impact/quality
impact/reliability: Something that feels unreliable or flaky
kind/enhancement: Improvements or new features
Problem description
During network blips, a common occurrence in CI when many parallel jobs are running, the API server can become unreachable (see logs below), causing the update to fail. Usually a follow-up update runs after the connectivity issue has resolved and succeeds.
The current implementation does not retry when the API server is unreachable, because errors of this kind are usually user-driven (a misconfiguration, or an attempt to reach a deleted cluster) and do not warrant retries.
Errors & Logs
As you can see, certain k8s resources are created but the pods are not, which means the API server was reachable for some time, and the update then errored out partway through during what seems to be a network blip.
error: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: the server has asked for the client to provide credentials
Log snippet:
Suggestions for a fix
Retry the connection to the API server up to a fixed maximum (say, 3 attempts) before returning an error.