[RFE] Allow for specifying the Flatcar OS version or to extend validation steps #202

ewassef · 2024-03-07T20:07:15Z

Current situation

The latest version (3815.2.0) had some significant changes in upstream systems that caused outages. The auto-update process validates that the OS update completed successfully, but it doesn't easily allow for subsequent checks (or at least not through k8s). This causes outages, and the only fix is to log in to each node, manually roll back to the previous version, and then pause updates.

Impact

This causes outages, and the only fix is to log in to each node, manually roll back to the previous version, and then pause updates.

Ideal future situation

Allow for OS version pinning via an annotation.
OR
Allow for a ConfigMap (CM) with additional scripts that can be used to verify successful updates.

Implementation options

An annotation with a pinned version that is passed via D-Bus to the update-agent, which then calls flatcar-update.

OR

Some mechanism (maybe also over D-Bus) to send down a script that can update the after-reboot checks and trigger a rollback if they fail.
This is not the same as the https://www.flatcar.org/docs/latest/setup/releases/update-strategies/#configure-a-post-install-update-hook hook, as that hook will keep the node in a bad state (although it is a good final catch).

Additional information

jepio · 2024-03-08T14:24:39Z

> Allow for a CM with additional scripts that can be used to verify successful updates

Have you seen the after-reboot checks? Docs here https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/doc/before-after-reboot-checks.md and example daemonset here: https://github.com/flatcar/flatcar-linux-update-operator/blob/030e43574c229eeb5a8858f03bdcc997f38131d9/examples/reboot-annotations/after-reboot-daemonset.yaml.

You also have the option of defining a health check on the node level as a systemd service and making it a dependency of update_engine (and kubelet) at the systemd level https://www.flatcar.org/docs/latest/setup/debug/manual-rollbacks/#automated-rollbacks. That way the node automatically performs a rollback when you reboot it from a failed update.
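A node-level health check of this kind could be sketched as a small script that a systemd oneshot service runs and that exits non-zero when a required service is down. This is only an illustration under assumptions: the unit name `containerd.service`, the function names, and the idea of checking via `systemctl is-active` are not taken from the Flatcar documentation, and the systemd unit that would wrap the script is not shown.

```python
#!/usr/bin/env python3
"""Sketch of a node health check intended to be run by a systemd oneshot unit."""
import subprocess
import sys
from typing import Callable, Iterable, List

# Assumed unit name; list whatever services must be healthy after an update.
REQUIRED_UNITS = ["containerd.service"]


def systemd_is_active(unit: str) -> bool:
    """Return True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def failed_units(
    units: Iterable[str],
    is_active: Callable[[str], bool] = systemd_is_active,
) -> List[str]:
    """Return the units that are not active, preserving input order."""
    return [unit for unit in units if not is_active(unit)]


def main() -> int:
    """Report inactive units on stderr and return a process exit status."""
    failed = failed_units(REQUIRED_UNITS)
    for unit in failed:
        print(f"health check failed: {unit} is not active", file=sys.stderr)
    return 1 if failed else 0
```

Ending the script with `sys.exit(main())` would make a failed check fail the wrapping service, which in turn would fail anything that requires it. The wrapping unit would also need to be ordered after the services it inspects, otherwise the check could run while they are still starting.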

I'm also interested in finding out more about the issues you faced:

> The latest version (3815.2.0) had some significant changes with upstream systems that causes outages.

If I recall correctly, you had issues with containerd not launching correctly. Were there others?

Here's an example of what could have worked in this case (you would need to evaluate the level of dependency required, Requires= or BindsTo=). If you defined containerd as a dependency of kubelet and both kubelet and containerd as a dependency of update_engine:

update_engine would not mark the update as successful and the node would reboot into the previous Flatcar version on failure
kubelet would not start and the node would not show up as having completed the update in FLUO. FLUO would have prevented more nodes from rebooting because it uses a default max of 1 node rebooting at a time.
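The dependency chain described above could be expressed as systemd drop-in files. The helper below is a hedged sketch that only renders the drop-in text, so the `Requires=` versus `BindsTo=` choice stays explicit. The unit names (`containerd.service`, `kubelet.service`, `update-engine.service`), the function name, and the added `After=` ordering line are assumptions for illustration, not something confirmed in this thread.

```python
from typing import Iterable

# The two dependency strengths mentioned above.
ALLOWED_DIRECTIVES = ("Requires", "BindsTo")


def render_dropin(directive: str, units: Iterable[str]) -> str:
    """Render a systemd drop-in that makes a unit depend on `units`.

    An After= line is added so the dependencies are started first;
    that ordering is an assumption of this sketch.
    """
    if directive not in ALLOWED_DIRECTIVES:
        raise ValueError(f"unsupported directive: {directive}")
    unit_list = " ".join(units)
    return f"[Unit]\n{directive}={unit_list}\nAfter={unit_list}\n"


# Hypothetical drop-in for kubelet.service: depend on containerd.
kubelet_dropin = render_dropin("Requires", ["containerd.service"])

# Hypothetical drop-in for update-engine.service: depend on both.
update_engine_dropin = render_dropin(
    "Requires", ["containerd.service", "kubelet.service"]
)
```

Each rendered string would be written to a `.conf` file in the matching drop-in directory, for example under `/etc/systemd/system/kubelet.service.d/`, followed by `systemctl daemon-reload`. Whether `Requires=` is strong enough or `BindsTo=` is needed still has to be evaluated per cluster, as noted above.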

invidian · 2024-03-24T16:04:28Z

Thanks for helping out, @jepio. I was going to suggest a custom dependency on the update_engine service too, since as far as I know this is the official mechanism for extending self-update validation on Flatcar. @ewassef and other upvoters, could you try it out and see if it solves your issue?

github-actions bot mentioned this issue Mar 22, 2024

Monthly contributions report 2024-02-22 - 2024-03-21 flatcar/Flatcar#1398

Closed
