Allow scaling system jobs to 0 #24363
Thanks @Juanadelacuesta, the code here LGTM, but I have a couple of comments and questions that I believe should be resolved before merging.
The job endpoint test seems to check that we can call the RPC with the given parameters, but we do not have any additional tests to ensure Nomad runs and stops allocations as we expect. Should we add some e2e or additional tests to ensure the correct behaviour?
What is our backwards compatibility stance here? Currently operators can submit system jobs without specifying a count and expect Nomad to default the value to 1. This change would mean these job specifications no longer result in allocations, so anyone upgrading must modify every job specification that relies on this behaviour, which is a breaking change.
I think it would also be useful to document some nuances of this feature: how it works and what behaviour to expect. One example I immediately thought of: how does this interact with Nomad GC if I leave a job scaled to zero for an extended period of time?
The default of 1 for system jobs is not affected here; it is still maintained and this PR does not change it, so thankfully there are no backwards compatibility issues. As for the garbage collector, how does it behave with any other type of job? Is it different for system jobs? Thinking more about it, the idea behind the feature is to be able to "pause" a job. If the stopped allocation is garbage collected, it won't be rescheduled; the job will "unpause" on rescaling, with no need to re-run it. Am I missing something?
This PR introduces the possibility to temporarily stop a system job by scaling it to 0 and then restart it again by scaling it back up to 1, without having to resubmit it.
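To make the intended semantics concrete, here is a minimal, self-contained sketch of the behaviour discussed in this thread: an unset count still defaults to 1, and an explicit 0 "pauses" the job. The helper names `effectiveCount` and `validateSystemScale` are hypothetical and are not taken from the Nomad codebase; the assumption that only 0 and 1 are accepted is inferred from the description above, not confirmed by the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// effectiveCount is a hypothetical helper: when a system job omits the
// count, it falls back to 1, which is the existing default the PR keeps.
func effectiveCount(count *int) int {
	if count == nil {
		return 1
	}
	return *count
}

// validateSystemScale is a hypothetical check that assumes a system job
// may only be scaled to 0 (paused) or 1 (running).
func validateSystemScale(count int) error {
	if count < 0 || count > 1 {
		return errors.New("system job count must be 0 or 1")
	}
	return nil
}

func main() {
	zero := 0
	fmt.Println(effectiveCount(nil))   // count omitted: defaults to 1
	fmt.Println(effectiveCount(&zero)) // explicit 0: job is paused
	fmt.Println(validateSystemScale(2) != nil)
}
```

Using a pointer for the count is what lets the sketch distinguish "not specified" from "explicitly 0", which is the distinction the backwards compatibility question hinges on.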