Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change in KurReD Behaviour between 1.15 and 1.16 #1066

Closed
abinet opened this issue Jan 23, 2025 · 9 comments
Closed

Change in KurReD Behaviour between 1.15 and 1.16 #1066

abinet opened this issue Jan 23, 2025 · 9 comments

Comments

@abinet
Copy link

abinet commented Jan 23, 2025

Hello,

firstly I would like to thank all the tool maintainers for doing a great job. We are using KuReD to reboot our RHEL nodes and it worked well most of the time.

Lastly we have tried to upgrade KuReD from 1.15.1 to 1.16.2 and noticed one important behavior change. Old version of KuReD run reboot-checks all the time, regardless of scheduled reboot time. This was very convenient for production clusters to get notified by KuReD about required reboot but not reboot servers outside of maintenance window. Since Version 1.16 the behavior have changed. Now KuReD only verifies if reboot is needed within defined reboot time interval.

Was this made intentionally? I have failed to find an corresponding issue, so I am missing the discussion about this :-(
It is not a big deal, however IMHO an important behavioral change and must be highlighted somehow.

@evrardjp
Copy link
Collaborator

Hello!

Thanks for the kind words.

  1. FYI, the metrics are still gathered outside the main loop, but we got messages/issues about the discrepency between the metrics and the behaviour of the main loop. That might help you for a while, until it gets removed.
  2. We got conflicting requests about whether kured should not do ANYTHING AT ALL outside the maintenance window and behaviour like "checking". As "checking" behaviour coud be more than check, we cannot judge. It's all about whether we decide that kured should check whether a reboot needs to happen during the maintenance window, or maintain the state. We are currently in the middle, and that bothers me a bit to be honest. That's why I have a second part of refactors making the experience "consistent" albeit different than what you might expect.
  3. I am sorry if a part of the change of experience was shipped inside the refactor part, as it should not have been that way. All behaviour changes should be documented, as you rightfully mention.

@evrardjp
Copy link
Collaborator

evrardjp commented Jan 26, 2025

I will, however, dig deeper into what happened... should my memory allow :)

@evrardjp
Copy link
Collaborator

evrardjp commented Jan 26, 2025

Just to be sure we are talking about the same thing. Did you mean:
In 1.15.1 and before, some part of the code "notified" you (to be clarified what: logs, real notification, metric -- be sure to clarify in your answer) that the node needed reboot but was not acted upon (and not the fact it was cleaning whether no reboot was required); in 1.16.2 and above, that part of the code has changed: you want kured to still maintain whether a reboot is actually required, regardless of this happens during maintenance or not.

Also, did you try in 1.16.0/1.16.1 for making the bisecting easier?

@abinet
Copy link
Author

abinet commented Jan 26, 2025

Hello,
thank you for the response.

With following configuration:

      - args:
        - --ds-name=kured
        - --ds-namespace=kured
        - --metrics-port=8080
        - --end-time=16:59
        - --period=0h1m0s
        - --reboot-days=su
        - --reboot-days=mo
        - --reboot-days=tu
        - --reboot-days=we
        - --reboot-days=th
        - --reboot-days=fr
        - --reboot-days=sa
        - --reboot-sentinel-command=sh -c "! /usr/bin/needs-restarting --reboothint"
        - --reboot-command=/bin/systemctl reboot
        - --reboot-delay=15m
        - --start-time=08:00
        - --log-format=text
        - --concurrency=1

KuReD 1.16.2 writes in 1 min cycle in info-log:

time="2025-01-26T09:23:52Z" level=info msg="Reboot not required"
time="2025-01-26T09:24:51Z" level=info msg="Reboot not required"
time="2025-01-26T09:25:51Z" level=info msg="Reboot not required"

and stops logging anything outside the configured reboot time.

KuReD 1.15.1 and 1.16.1! writes more logs and do this all the time regardless of the configured reboot time:

time="2025-01-26T09:40:04Z" level=info msg="Updating Subscription Management repositories." cmd=/usr/bin/nsenter std=out
time="2025-01-26T09:40:05Z" level=info msg="No core libraries or services have been updated since boot-up." cmd=/usr/bin/nsenter std=out
time="2025-01-26T09:40:05Z" level=info msg="Reboot should not be necessary." cmd=/usr/bin/nsenter std=out
time="2025-01-26T09:40:05Z" level=info msg="Reboot not required"

So I assume the change is in 1.16.2 version.

@evrardjp
Copy link
Collaborator

I will dig into this.

@evrardjp
Copy link
Collaborator

evrardjp commented Feb 1, 2025

I tried to reproduce this in 1.16.1 and 1.16.2, and in both cases I don't have extra logs. See also: https://gist.github.com/evrardjp/7b503a8a5079ab200de11bd5e79cb466

I can dig deeper if you wish.

I would be happy to have users opinion here:
Do you really want kured to do any operation outside maintenance window? What would be the allowed operations outside maintenance window: Checking whether a reboot is required or other tasks too? This will help my future changes.

@evrardjp
Copy link
Collaborator

evrardjp commented Feb 3, 2025

Closing the issue because I don't think there is anything actionnable in the current state, but feel free to comment, and we'll improve in our next releases. Especially if you want to explain the behaviour you're expecting in more details (your use case), that would be perfect!

@evrardjp evrardjp closed this as completed Feb 3, 2025
@abinet
Copy link
Author

abinet commented Feb 3, 2025

So, I've investigated the behavior of both 1.6.1 and 1.6.2 version with following results.

Both version have two cycles: one for metrics another for reboots. Metrics cycle has an 1 minute interval and runs always, and reboot cycle interval and run times are configurable. Additional logs I mentioned previously are the metrics cycle logs for the case when need-restarting command have been used. They are visible in version 1.6.1 and not visible in version 1.6.2.

Version 1.6.1 with `--reboot-sentinel-command=sh -c "! /usr/bin/needs-restarting --reboothint"``

------metric logs begin -----
time="2025-02-03T07:47:09Z" level=info msg="Updating Subscription Management repositories." cmd=/usr/bin/nsenter std=out
time="2025-02-03T07:47:10Z" level=info msg="No core libraries or services have been updated since boot-up." cmd=/usr/bin/nsenter std=out
time="2025-02-03T07:47:10Z" level=info msg="Reboot should not be necessary." cmd=/usr/bin/nsenter std=out
------metric logs end -----
------reboot logs begin -----
time="2025-02-03T07:47:10Z" level=info msg="Reboot not required"
------reboot logs end -----

Version 1.6.2 with `--reboot-sentinel-command=sh -c "! /usr/bin/needs-restarting --reboothint"``

------reboot logs begin -----
time="2025-02-03T07:47:10Z" level=info msg="Reboot not required"
------reboot logs end -----

We have build our elastic dashboard for reboot notifications based on the behavior of the version 1.6.1 and analyzing need-restarting logs from metrics cycle which are not visible since 1.6.2. It looks like a very edge case and more generic approach would be to use metrics to monitor if reboot is required.

Could you please confirm my assumptions? Are metrics being calculated regardless of the reboot interval config?

@evrardjp
Copy link
Collaborator

evrardjp commented Feb 3, 2025

You're correct in the fact that there is currently two go routines - one to keep metrics done by default every minute and not configurable, one to keep the rest of the code.
There is a question of consistency that made me move towards using a single routine in a PR that hasn't been merged yet.

In that refactor, the intent is to have the test whether a reboot is required quite early, and expose the metric. The usage and behaviour would then depend on the maitnenance window. But that's for the future. I didn't believe it was useful to keep the log to say "reboot not required", BUT I will make sure it's still there in the rewrite, you have a good use case.
Yet, don't hesitate to use the metrics too.

Interestingly, I didn't see the metrics logs in my test BUT I realised I don't have the same vars: I don't have -log-format=text in my test env. I will do that in my next test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants