[RFE] Allow for specifying the Flatcar OS version or to extend validation steps #202

Open
ewassef opened this issue Mar 7, 2024 · 2 comments

Comments

@ewassef

ewassef commented Mar 7, 2024

Current situation

The latest version (3815.2.0) had some significant changes with upstream systems that caused outages. The auto-update process validates that the OS update completed successfully, but it does not easily allow for subsequent checks (or at least not through k8s).

Impact

This causes outages, and the only fix is to log in to each node, manually roll back to the previous version, and then pause updates.

Ideal future situation

Allow for OS version pinning via an annotation.
OR
Allow for a CM with additional scripts that can be used to verify successful updates

Implementation options

An annotation with a pinned version that is passed over D-Bus to the update-agent, which then calls flatcar-update

OR

Some mechanism (maybe also over D-Bus) to send down a script that can update the after-reboot checks and trigger a rollback if they fail.
This is not the same as the post-install update hook (https://www.flatcar.org/docs/latest/setup/releases/update-strategies/#configure-a-post-install-update-hook), because that hook would keep the node in a bad state (although it is a good final catch).

Additional information

@jepio
Member

jepio commented Mar 8, 2024

Allow for a CM with additional scripts that can be used to verify successful updates

Have you seen the after-reboot checks? The docs are here: https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/doc/before-after-reboot-checks.md and there is an example DaemonSet here: https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/examples/reboot-annotations/after-reboot-daemonset.yaml

You also have the option of defining a node-level health check as a systemd service and making it a dependency of update_engine (and kubelet) at the systemd level; see https://www.flatcar.org/docs/latest/setup/debug/manual-rollbacks/#automated-rollbacks. That way the node automatically performs a rollback when you reboot it after a failed update.
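A node-level health check of this kind could be sketched as a small static binary run by a oneshot systemd service. This is only an illustration under assumptions: the list of units in `requiredUnits` and the helper names are hypothetical, and the only external command used is `systemctl is-active --quiet`. A non-zero exit makes the oneshot service fail, so any unit that requires it does not start.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// requiredUnits is an assumption: the units whose health should gate the update.
var requiredUnits = []string{"containerd.service"}

// unitIsActive reports whether `systemctl is-active --quiet <unit>` exits with 0.
func unitIsActive(unit string) bool {
	return exec.Command("systemctl", "is-active", "--quiet", unit).Run() == nil
}

// failedUnits returns the units for which isActive reports false.
func failedUnits(units []string, isActive func(string) bool) []string {
	var failed []string
	for _, u := range units {
		if !isActive(u) {
			failed = append(failed, u)
		}
	}
	return failed
}

func main() {
	// Exit non-zero so that the systemd service wrapping this check fails.
	if failed := failedUnits(requiredUnits, unitIsActive); len(failed) > 0 {
		fmt.Fprintf(os.Stderr, "health check failed: %v\n", failed)
		os.Exit(1)
	}
}
```

The check function is passed in as a parameter so the logic can be exercised without a running systemd. A real check would probably need to wait or retry, since a unit may still be starting when the check runs.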

I'm also interested in finding out more about the issues you faced:

The latest version (3815.2.0) had some significant changes with upstream systems that caused outages.

If I recall correctly, you had issues with containerd not launching correctly. Were there others?

Here's an example of what could have worked in this case (you would need to evaluate the level of dependency required, Requires= or BindsTo=). If you defined containerd as a dependency of kubelet, and both kubelet and containerd as dependencies of update_engine:

  • update_engine would not mark the update as successful and the node would reboot into the previous Flatcar version on failure
  • kubelet would not start and the node would not show up as having completed the update in FLUO. FLUO would have prevented more nodes from rebooting because it uses a default max of 1 node rebooting at a time.
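The dependency layout described above could be expressed as two systemd drop-ins. The sketch below only renders their contents; the unit names `kubelet.service`, `containerd.service` and `update-engine.service`, as well as the drop-in paths in the comments, are assumptions and should be checked against the actual node. `dropIn` is a hypothetical helper, not part of any Flatcar tooling.

```go
package main

import (
	"fmt"
	"strings"
)

// dropIn renders a systemd drop-in that makes a unit depend on deps.
// kind is the dependency directive, for example "Requires" or "BindsTo".
// After= is added so the unit is also ordered after its dependencies.
func dropIn(kind string, deps []string) string {
	list := strings.Join(deps, " ")
	return fmt.Sprintf("[Unit]\n%s=%s\nAfter=%s\n", kind, list, list)
}

func main() {
	// Assumed path: /etc/systemd/system/kubelet.service.d/10-deps.conf
	fmt.Print(dropIn("Requires", []string{"containerd.service"}))
	fmt.Println()
	// Assumed path: /etc/systemd/system/update-engine.service.d/10-deps.conf
	fmt.Print(dropIn("Requires", []string{"containerd.service", "kubelet.service"}))
}
```

Swapping `"Requires"` for `"BindsTo"` gives the stricter variant mentioned above, where the dependent unit is also stopped when its dependency stops.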

@invidian
Member

Thanks for helping out, @jepio. I was going to suggest a custom dependency on the update_engine service too, since as far as I know this is the official mechanism for extending self-update validation on Flatcar. @ewassef and other upvoters, could you try it out and report whether it solves your issue?
