Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow scaling system jobs to 0 #24363

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

Conversation

Juanadelacuesta
Copy link
Member

@Juanadelacuesta Juanadelacuesta commented Nov 5, 2024

This PR introduces the possibility to temporarily stop a system job by scaling it to 0 and then restart it again by scaling it back up to 1, without having to resubmit it.

@Juanadelacuesta Juanadelacuesta marked this pull request as ready for review November 5, 2024 15:59
@Juanadelacuesta Juanadelacuesta changed the title func: remove validation scaling for system jobs and dont canonicalize… Allow scalating system jobs to 0 Nov 5, 2024
@Juanadelacuesta Juanadelacuesta changed the title Allow scalating system jobs to 0 Allow scaling system jobs to 0 Nov 5, 2024
@Juanadelacuesta Juanadelacuesta requested review from tgross, jrasell and mismithhisler and removed request for tgross November 5, 2024 16:06
@Juanadelacuesta Juanadelacuesta added the backport/1.9.x backport to 1.9.x release line label Nov 5, 2024
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Juanadelacuesta, the code here LGTM but I have a couple of comments and questions I believe should be resolved before merging.

The job endpoint test seems to check that we can call the RPC with the given parameters, but we do not have any additional tests to ensure Nomad runs/stops allocations according to what we expect. Should we add some e2e or additional tests to ensure the correct behaviour?

What is our backwards compatibility stance here? Currently operators can submit system jobs without specifying a count and expect Nomad to default the value to 1. This change would mean these job specifications will no longer result in allocations and means upgraders must modify all job specifications which utilise this behaviour which is a breaking change.

I think it would also be useful to document some nuisances around this feature and how it works and expected behaviour. One example I immediately thought of was how does this interact with Nomad GC in the event I leave a job scaled to zero for an extended period of time?

@Juanadelacuesta
Copy link
Member Author

Juanadelacuesta commented Nov 6, 2024

The default to 1 for the system jobs was not done there, it is maintained still, this PR does not change that, so there are no backwards compatibility issues thankfully. As for the garbage colector, how does it behave with any other type of job? Is it different for system jobs? Thinking more about it, the idea behind the feature is to be able to "pause" a job. If the stoped allocation is garbage collected, it wont be rescheduled, the job will "unpause" on rescaling, no need to re run it. Am I missing something?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.9.x backport to 1.9.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants