
Do not requeue for running Job #237

Draft
wants to merge 1 commit into base branch main

Conversation

@gibizer gibizer commented Apr 13, 2023

Polling the Job status unnecessarily causes k8s API load and noisy logging. The caller of DoJob() should declare that it owns a Job via Owns(&batchv1.Job{}) in its SetupWithManager call. Then the caller will be automatically reconciled when the Job status changes.

There is an edge case when the Job is just created and immediately read back from the API, causing a temporary NotFound. In this case returning the NotFound error is simpler than asking for an explicit Requeue, and it will still trigger the automatic requeue.
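For illustration, a minimal sketch of the caller-side SetupWithManager wiring this relies on (not part of this PR; the CRD type and its API package are hypothetical placeholders):

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	myv1 "example.com/my-operator/api/v1" // hypothetical CRD package, for illustration only
)

// MyServiceReconciler is a stand-in for a service operator's reconciler.
type MyServiceReconciler struct{}

// Reconcile is where DoJob would be called; elided here.
func (r *MyServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager declares that this controller owns the Jobs it creates, so
// any status change of an owned Job automatically triggers a new reconcile of
// the owning custom resource; no polling or explicit requeue is needed.
func (r *MyServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&myv1.MyService{}).
		Owns(&batchv1.Job{}).
		Complete(r)
}
```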

gibizer commented Apr 13, 2023

I marked this as Draft because I need to ensure that every DoJob user declares that it Owns the Job it creates.

@gibizer gibizer requested a review from stuggi April 13, 2023 09:05
gibizer commented Apr 13, 2023

We also still need a way for the caller to decide when the Job is finished successfully ...

stuggi commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...

Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?

gibizer commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
>
> Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?

Sorry, I wasn't clear. If we remove all the requeue returns from DoJob, then it either returns an error or returns err=nil. The latter can mean either that the Job is still running or that the Job finished successfully. Most of the DBSync logic in the caller needs explicit knowledge of when the Job is finished in order to store the hash in the status: https://github.com/openstack-k8s-operators/keystone-operator/blob/203185e59d8912e4a0fc3d169d09bb6a2b468ae8/controllers/keystoneapi_controller.go#L310-L330 I guess we don't want to store the Job hash while the Job is still running. Or do we?
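To make the concern concrete, a hedged sketch of the caller-side logic (hypothetical names and a hypothetical DoJob shape, not the actual keystone-operator code): the hash is only written to the status once the Job is reported as finished.

```go
package controllers

import "context"

// serviceStatus is a stand-in for the real CR status struct.
type serviceStatus struct {
	Hash map[string]string
}

// reconcileDBSync assumes a hypothetical DoJob variant that reports whether
// the Job has finished, together with the Job hash and any error.
func reconcileDBSync(
	ctx context.Context,
	status *serviceStatus,
	doJob func(context.Context) (finished bool, hash string, err error),
) error {
	finished, hash, err := doJob(ctx)
	if err != nil {
		return err
	}
	if finished {
		if status.Hash == nil {
			status.Hash = map[string]string{}
		}
		// Store the hash only after the Job completed successfully.
		status.Hash["dbsync"] = hash
	}
	// If the Job is still running, simply return; the watch on owned Jobs
	// triggers a new reconcile when the Job status changes.
	return nil
}
```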

stuggi commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
>
> Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?
>
> Sorry, I wasn't clear. If we remove all the requeue returns from DoJob, then it either returns an error or returns err=nil. The latter can mean either that the Job is still running or that the Job finished successfully. Most of the DBSync logic in the caller needs explicit knowledge of when the Job is finished in order to store the hash in the status: https://github.com/openstack-k8s-operators/keystone-operator/blob/203185e59d8912e4a0fc3d169d09bb6a2b468ae8/controllers/keystoneapi_controller.go#L310-L330 I guess we don't want to store the Job hash while the Job is still running. Or do we?

Right, thanks for the explanation; I hadn't thought about that when checking this PR the first time. We could do one of the following: either return an additional bool for completed or not, or return a custom error for "still running" and let the caller evaluate the error type, similar to if k8s_errors.Is(err, StillRunning) { ... }

We could probably store the hash while the Job is still running to identify whether it changed during execution and recreate it with the new definition, but it might be better to wait for it to finish and then trigger a new one rather than cancel it in the middle.
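A rough sketch of the custom-error option (the ErrJobStillRunning sentinel is hypothetical, not an existing lib-common or apimachinery error), using the standard errors.Is check:

```go
package controllers

import (
	"errors"
	"fmt"
)

// ErrJobStillRunning is a hypothetical sentinel error that DoJob could return
// while the Job has neither succeeded nor failed yet.
var ErrJobStillRunning = errors.New("job is still running")

// handleJob shows how a caller could branch on the sentinel instead of requeueing.
func handleJob(doJob func() error) error {
	err := doJob()
	switch {
	case errors.Is(err, ErrJobStillRunning):
		// Not an error from the caller's point of view: just return and let
		// the watch on the owned Job trigger the next reconcile.
		return nil
	case err != nil:
		return fmt.Errorf("DoJob failed: %w", err)
	default:
		// Job finished successfully; safe to record the Job hash in the status.
		return nil
	}
}
```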

```diff
@@ -39,15 +39,15 @@ func NewJob(
 	job *batchv1.Job,
 	jobType string,
 	preserve bool,
-	timeout time.Duration,
+	timeout time.Duration, // unused
```
Contributor

Wondering if we should still support this, e.g. when 0 is passed it is not used, and a higher value is considered for requeue?

Contributor Author

My goal is to get rid of the requeue within this helper. I can even drop the field from the Job struct, as I need to do adaptation PRs in the other operators anyhow, so I can do the signature change at the same time.

If we return the status of the Job to the caller, then the caller can decide to requeue or just return. Still, I believe we should never need to requeue explicitly due to the status of a k8s-managed resource, as we can watch it instead. (The situation can be different for non-k8s resource statuses, e.g. when we depend on something happening in OpenStack or Galera itself.)

Contributor

Should we proceed with this? It looks like a valid and nice-to-have fix.

Contributor Author

Yes, we should. If you have time, feel free to pick it up. This needs adaptation in all our service operators.

gibizer commented Apr 14, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
> [snip]
> We could do one of the following: either return an additional bool for completed or not, or return a custom error for "still running" and let the caller evaluate the error type, similar to if k8s_errors.Is(err, StillRunning) { ... }

From the DoJob perspective, a still-running Job is not an error condition. So I lean towards returning either a bool or even an enum{Active, Succeeded, Failed} together with the error, which can be nil.
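A possible shape of that enum-style return (a sketch only, not the actual lib-common API), deriving the state from the observed batchv1.Job status:

```go
package job

import batchv1 "k8s.io/api/batch/v1"

// JobResult is a hypothetical enum for the outcome DoJob could report
// alongside a nil or non-nil error.
type JobResult int

const (
	JobActive    JobResult = iota // Job created but not finished yet
	JobSucceeded                  // job.Status.Succeeded > 0
	JobFailed                     // job.Status.Failed > 0
)

// resultFromStatus maps the observed Job status to the enum; the caller can
// then store the Job hash only when the result is JobSucceeded.
func resultFromStatus(job *batchv1.Job) JobResult {
	switch {
	case job.Status.Succeeded > 0:
		return JobSucceeded
	case job.Status.Failed > 0:
		return JobFailed
	default:
		return JobActive
	}
}
```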

> We could probably store the hash while the Job is still running to identify whether it changed during execution and recreate it with the new definition, but it might be better to wait for it to finish and then trigger a new one rather than cancel it in the middle.

Good point. I need to be careful about the re-creation of the Job if the hash changes in the middle of the Job execution. I agree that killing a Job in the middle can cause unwanted side effects.

Bottom line: I have to work on this PR more and verify the integration of the change with our existing operators. Thanks for the feedback so far, I think I got my answer. At this point I only wanted to verify that the idea of this PR is not controversial. I will ping you again when I have a completed PR and some example integration changes in other operators.
