
Do not requeue for running Job #237

Draft
wants to merge 1 commit into base branch main

Conversation

@gibizer gibizer commented Apr 13, 2023

Polling the Job status unnecessarily causes k8s API load and noisy logging. The caller of DoJob() should declare that it owns a Job via Owns(&batchv1.Job{}) in its SetupWithManager call. Then the caller will be automatically reconciled when the Job status changes.

There is an edge case when the Job is just created and immediately read back from the API, causing a temporary NotFound. In this case returning the NotFound error is simpler than asking for an explicit Requeue, and it will still trigger the automatic requeue.
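For illustration, a minimal sketch of the caller-side SetupWithManager wiring this relies on (not part of this PR; the CRD type and its API package are hypothetical placeholders):

```go
package controllers

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"

	myv1 "example.com/my-operator/api/v1" // hypothetical CRD package, for illustration only
)

// MyServiceReconciler is a stand-in for a service operator's reconciler.
type MyServiceReconciler struct{}

// Reconcile is where DoJob would be called; elided here.
func (r *MyServiceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager declares that this controller owns the Jobs it creates, so
// any status change of an owned Job automatically triggers a new reconcile of
// the owning custom resource; no polling or explicit requeue is needed.
func (r *MyServiceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&myv1.MyService{}).
		Owns(&batchv1.Job{}).
		Complete(r)
}
```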

gibizer commented Apr 13, 2023

I marked this as Draft because I need to ensure that every DoJob user declares that it Owns the Job it creates.

@gibizer gibizer requested a review from stuggi April 13, 2023 09:05
gibizer commented Apr 13, 2023

We also still need a way for the caller to decide when the Job is finished successfully ...

stuggi commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...

Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?

gibizer commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
>
> Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?

Sorry, I wasn't clear. If we remove all the requeue returns from DoJob, then it either returns an error or returns err=nil. The latter can mean either that the Job is still running or that the Job finished successfully. Most of the DBSync logic in the caller needs explicit knowledge of when the Job is finished in order to store the hash in the status: https://github.com/openstack-k8s-operators/keystone-operator/blob/203185e59d8912e4a0fc3d169d09bb6a2b468ae8/controllers/keystoneapi_controller.go#L310-L330 I guess we don't want to store the Job hash while the Job is still running. Or do we?
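To make the concern concrete, a hedged sketch of the caller-side logic (hypothetical names and a hypothetical DoJob shape, not the actual keystone-operator code): the hash is only written to the status once the Job is reported as finished.

```go
package controllers

import "context"

// serviceStatus is a stand-in for the real CR status struct.
type serviceStatus struct {
	Hash map[string]string
}

// reconcileDBSync assumes a hypothetical DoJob variant that reports whether
// the Job has finished, together with the Job hash and any error.
func reconcileDBSync(
	ctx context.Context,
	status *serviceStatus,
	doJob func(context.Context) (finished bool, hash string, err error),
) error {
	finished, hash, err := doJob(ctx)
	if err != nil {
		return err
	}
	if finished {
		if status.Hash == nil {
			status.Hash = map[string]string{}
		}
		// Store the hash only after the Job completed successfully.
		status.Hash["dbsync"] = hash
	}
	// If the Job is still running, simply return; the watch on owned Jobs
	// triggers a new reconcile when the Job status changes.
	return nil
}
```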

stuggi commented Apr 13, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
>
> Not sure I understand that; is there a case other than job.Status.Succeeded > 0 for a job to be finished successfully?
>
> Sorry, I wasn't clear. If we remove all the requeue returns from DoJob, then it either returns an error or returns err=nil. The latter can mean either that the Job is still running or that the Job finished successfully. Most of the DBSync logic in the caller needs explicit knowledge of when the Job is finished in order to store the hash in the status: https://github.com/openstack-k8s-operators/keystone-operator/blob/203185e59d8912e4a0fc3d169d09bb6a2b468ae8/controllers/keystoneapi_controller.go#L310-L330 I guess we don't want to store the Job hash while the Job is still running. Or do we?

Right, thanks for the explanation; I hadn't thought about that when checking this PR the first time. We could do one of the following: either return an additional bool for completed or not, or return a custom error for "still running" and let the caller evaluate the error type, similar to if k8s_errors.Is(err, StillRunning) { ... }

We could probably store the hash while the Job is still running to identify whether it changed during execution and recreate it with the new definition, but it might be better to wait for it to finish and then trigger a new one rather than cancel it in the middle.
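A rough sketch of the custom-error option (the ErrJobStillRunning sentinel is hypothetical, not an existing lib-common or apimachinery error), using the standard errors.Is check:

```go
package controllers

import (
	"errors"
	"fmt"
)

// ErrJobStillRunning is a hypothetical sentinel error that DoJob could return
// while the Job has neither succeeded nor failed yet.
var ErrJobStillRunning = errors.New("job is still running")

// handleJob shows how a caller could branch on the sentinel instead of requeueing.
func handleJob(doJob func() error) error {
	err := doJob()
	switch {
	case errors.Is(err, ErrJobStillRunning):
		// Not an error from the caller's point of view: just return and let
		// the watch on the owned Job trigger the next reconcile.
		return nil
	case err != nil:
		return fmt.Errorf("DoJob failed: %w", err)
	default:
		// Job finished successfully; safe to record the Job hash in the status.
		return nil
	}
}
```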

```diff
@@ -39,15 +39,15 @@ func NewJob(
 	job *batchv1.Job,
 	jobType string,
 	preserve bool,
-	timeout time.Duration,
+	timeout time.Duration, // unused
```
Contributor

Wondering if we should still support this, e.g. when 0 is passed it is not used, and a higher value is considered for requeue?

Contributor Author

My goal is to get rid of the requeue within this helper. I can even drop the field from the Job struct, as I need to do adaptation PRs in the other operators anyhow, so I can do the signature change at the same time.

If we return the status of the Job to the caller, then the caller can decide to requeue or just return. Still, I believe we should never need to requeue explicitly due to the status of a k8s-managed resource, as we can watch it instead. (The situation can be different for non-k8s resource statuses, e.g. when we depend on something happening in OpenStack or Galera itself.)

Contributor

Should we proceed with this? It looks like a valid and nice-to-have fix.

Contributor Author

Yes, we should. If you have time, feel free to pick it up. This needs adaptation in all our service operators.

gibizer commented Apr 14, 2023

> We also still need a way for the caller to decide when the Job is finished successfully ...
> [snip]
> We could do one of the following: either return an additional bool for completed or not, or return a custom error for "still running" and let the caller evaluate the error type, similar to if k8s_errors.Is(err, StillRunning) { ... }

From the DoJob perspective, a still-running Job is not an error condition. So I lean towards returning either a bool or even an enum{Active, Succeeded, Failed} together with the error, which can be nil.
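A possible shape of that enum-style return (a sketch only, not the actual lib-common API), deriving the state from the observed batchv1.Job status:

```go
package job

import batchv1 "k8s.io/api/batch/v1"

// JobResult is a hypothetical enum for the outcome DoJob could report
// alongside a nil or non-nil error.
type JobResult int

const (
	JobActive    JobResult = iota // Job created but not finished yet
	JobSucceeded                  // job.Status.Succeeded > 0
	JobFailed                     // job.Status.Failed > 0
)

// resultFromStatus maps the observed Job status to the enum; the caller can
// then store the Job hash only when the result is JobSucceeded.
func resultFromStatus(job *batchv1.Job) JobResult {
	switch {
	case job.Status.Succeeded > 0:
		return JobSucceeded
	case job.Status.Failed > 0:
		return JobFailed
	default:
		return JobActive
	}
}
```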

> We could probably store the hash while the Job is still running to identify whether it changed during execution and recreate it with the new definition, but it might be better to wait for it to finish and then trigger a new one rather than cancel it in the middle.

Good point. I need to be careful about the re-creation of the Job if the hash changes in the middle of the Job execution. I agree that killing a Job in the middle can cause unwanted side effects.

Bottom line: I have to work on this PR more and verify the integration of the change with our existing operators. Thanks for the feedback so far, I think I got my answer. At this point I only wanted to verify that the idea of this PR is not controversial. I will ping you again when I have a completed PR and some example integration changes in other operators.
