Wait and Prune options do not work as expected with Kustomize configMapGenerator #1180
Comments
This is not an option; changing the order would be a major breaking change requiring Flux v3. You could annotate the generated ConfigMaps with …
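(The comment is truncated here. As a hedged guess at what is being referred to: Flux documents a kustomize.toolkit.fluxcd.io/prune: disabled annotation that excludes an object from garbage collection, and it can be attached to generated ConfigMaps through the generator's options. A minimal sketch, assuming that is the annotation meant — the generator name and literal are illustrative:)

```yaml
# kustomization.yaml — minimal sketch, assuming the truncated comment refers to
# Flux's prune-skipping annotation; "test" and the literal value are illustrative.
configMapGenerator:
  - name: test
    literals:
      - LOG_LEVEL=info
    options:
      annotations:
        # kustomize-controller never garbage-collects objects carrying this annotation
        kustomize.toolkit.fluxcd.io/prune: disabled
```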
Rolling back an unhealthy rollout is a typical use case.
You would roll forward instead of backwards: commit the fix to Git and Flux will move the deployment to a new revision. A safe way to roll back is provided by Flagger, which maintains a clone of the Flux-managed deployment and all configmap/secret refs.
If you haven't changed both interval and timeout from their defaults, … Careful with setting a longer-than-default timeout. There are some fiddly things we should have documented better, like this one. I used to tell everyone to set timeout shorter than interval, without regard for the fact that some things can work differently when these values are not set at their defaults.
Thanks both for the comments! Indeed, as mentioned in the opening comment, rollback is not really the issue here.
The main problem is really that, given the use of Kustomize configMapGenerator and the order of health checks and garbage collection, the configmaps get cleaned up, leaving the environment potentially unhealthy.

About Flagger: …

One more point: …
About timeout/interval: we currently have the following setting for the majority of Kustomizations (as far as I understand, this should not have any downsides when using the Kustomize controller): …
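(The actual values are not included in this excerpt. Purely for illustration, interval and timeout on a Flux Kustomization look like the sketch below — the name, path, sourceRef and durations are made up, not the reporter's:)

```yaml
# Illustrative only — the reporter's actual interval/timeout values are not shown above.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m   # how often the Kustomization is reconciled
  timeout: 5m     # time budget for validation, apply and health checking
  prune: true
  path: ./apps
  sourceRef:
    kind: GitRepository
    name: flux-system
```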
We also do not have problems atm with kind: HelmRelease.

To summarise: I still believe that garbage collection being executed before the health checks causes issues in the scenarios shown above, and maybe we could fix it in Flux v3 😉; everything else is on our side to solve. Thanks again!
We use Kustomize configMapGenerator to generate configmaps, and Flux to deploy them using a Kustomization CR. Our Flux Kustomization CR has both the wait and prune parameters set to true. We use wait to ensure that our deployments are "Ready" (i.e., working nominally) before removing the old version of the application. We have at times experienced that, during rollouts of a new version (of a deployment, for instance), configmaps referenced in the old version are deleted before the reconciliation of the Kustomization completes.
I assume this is due to pruning happening before the health check validation in the reconciliation loop (specifically, because the wait parameter is evaluated as part of the checkHealth function).
Why is this a problem? Because if the new deployment is unhealthy, the old configmaps have already been deleted, and old pods get restarted, those pods will fail with a "configmap not found" error. This can result in our cluster going unhealthy during a rollout.
Can you please confirm if my reasoning is correct, if this is expected behaviour, or if we could move the "pruning" part after the health checks?
More info below. Please let me know if I should add more details.
I have not managed to create a test for this yet, but I have run the tests with garbage collection after the health checks (wait) and this does not seem to introduce any regressions.
Kustomize configMapGenerator adds a content-hash suffix to all configmaps it generates. For example, this configuration results in a configmap named like test-naskvbw (generated suffix). When rolling out a new version (commit) that changes the configmap contents, the new configmap will have a different suffix (e.g. test-dagsreg).
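(The configuration itself is not shown in this excerpt. A minimal sketch of what such a generator might look like — the generator name "test" and its literal are assumptions for illustration:)

```yaml
# kustomization.yaml — minimal sketch; "test" and the literal value are illustrative.
configMapGenerator:
  - name: test
    literals:
      - LOG_LEVEL=info
# `kustomize build` emits a ConfigMap named test-<hash>, e.g. test-naskvbw;
# the hash (and therefore the name) changes whenever the data changes.
```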
Let's now assume I have this Kustomization CR, which has both wait and prune set to true.
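(The CR is omitted above. A minimal sketch of such a Kustomization — the name, path and sourceRef are illustrative, not the reporter's:)

```yaml
# Minimal sketch — name, path and sourceRef are placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: test-app
  namespace: flux-system
spec:
  interval: 10m
  path: ./apps/test
  prune: true   # garbage-collect objects that disappear from the rendered manifests
  wait: true    # wait for all applied resources to become Ready before the reconciliation succeeds
  sourceRef:
    kind: GitRepository
    name: flux-system
```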
And assume that the deployment references the configmap (some parts are omitted for brevity).
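(The Deployment manifest is also omitted above. A minimal sketch of a Deployment referencing the generated configmap — image, labels and names are placeholders; the relevant part is the configMapRef, whose name kustomize rewrites to the suffixed one at build time:)

```yaml
# Minimal sketch — image, labels and names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: app
          image: registry.example.com/test-app:latest
          envFrom:
            - configMapRef:
                name: test   # rewritten by kustomize to the generated, suffixed name (e.g. test-naskvbw)
```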
I would expect that the Kustomize Controller (Flux) waits for the health checks (of all reconciled resources, as determined by the wait parameter, see docs here) to be green before performing the garbage collection, i.e. before removing the old configmap (test-naskvbw in the example).