Allow scaling system jobs to 0 #24363

Juanadelacuesta · 2024-11-05T10:25:08Z

This PR makes it possible to temporarily stop a system job by scaling it to 0 and to restart it by scaling it back up to 1, without having to resubmit it.
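To make the intended behaviour concrete, here is a minimal sketch of the kind of check a scale request for a system job might go through. It is not the code from this PR: `validateScaleCount` and `JobTypeSystem` are hypothetical names, and the sketch assumes the only valid counts for a system job are 0 (paused) and 1 (running).

```go
package main

import (
	"errors"
	"fmt"
)

// JobTypeSystem is a stand-in constant for the "system" job type
// (hypothetical name, used only in this sketch).
const JobTypeSystem = "system"

// validateScaleCount is a hypothetical helper, not Nomad's implementation.
// Assumption: a system job may be scaled to 0 (paused) or 1 (running),
// and no job type accepts a negative count.
func validateScaleCount(jobType string, count int) error {
	if count < 0 {
		return errors.New("count must not be negative")
	}
	if jobType == JobTypeSystem && count > 1 {
		return fmt.Errorf("system jobs can only be scaled to 0 or 1, got %d", count)
	}
	return nil
}

func main() {
	// Try the two accepted values and one rejected value.
	for _, c := range []int{0, 1, 2} {
		fmt.Printf("count=%d err=%v\n", c, validateScaleCount(JobTypeSystem, c))
	}
}
```

Under these assumptions, scaling to 0 and back to 1 both pass the check, while any larger count is rejected with an explicit message.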

jrasell

Thanks @Juanadelacuesta, the code here LGTM but I have a couple of comments and questions I believe should be resolved before merging.

The job endpoint test seems to check that we can call the RPC with the given parameters, but we do not have any additional tests to ensure Nomad runs/stops allocations according to what we expect. Should we add some e2e or additional tests to ensure the correct behaviour?

What is our backwards compatibility stance here? Currently operators can submit system jobs without specifying a count and expect Nomad to default the value to 1. With this change, those job specifications would no longer result in allocations, so anyone upgrading would have to modify every job specification that relies on this behaviour, which is a breaking change.

I think it would also be useful to document some nuances of this feature: how it works and what behaviour to expect. One example I immediately thought of: how does this interact with Nomad GC if I leave a job scaled to zero for an extended period of time?

Juanadelacuesta · 2024-11-06T13:34:50Z

The default to 1 for system jobs was not done there and is still maintained; this PR does not change that, so thankfully there are no backwards compatibility issues. As for the garbage collector, how does it behave with any other type of job? Is it different for system jobs? Thinking more about it, the idea behind the feature is to be able to "pause" a job. If the stopped allocation is garbage collected, it won't be rescheduled; the job will "unpause" on rescaling, with no need to re-run it. Am I missing something?
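The distinction between "count not specified" and "count explicitly set to 0" can be illustrated with a small sketch. This is an assumption about how such a default could be kept, not the code in this PR: `TaskGroup`, `canonicalize`, and `intPtr` are pared-down, hypothetical stand-ins, and the sketch assumes an unset count is represented as a nil pointer.

```go
package main

import "fmt"

// TaskGroup is a hypothetical, pared-down stand-in for a task group.
// Assumption: a nil Count means the job specification did not set one.
type TaskGroup struct {
	Name  string
	Count *int
}

// canonicalize fills in the default count of 1 only when no count was
// given, so an explicit 0 is left untouched.
func (tg *TaskGroup) canonicalize() {
	if tg.Count == nil {
		one := 1
		tg.Count = &one
	}
}

// intPtr returns a pointer to i (helper for this sketch only).
func intPtr(i int) *int { return &i }

func main() {
	unset := &TaskGroup{Name: "unset"}
	paused := &TaskGroup{Name: "paused", Count: intPtr(0)}

	unset.canonicalize()
	paused.canonicalize()

	fmt.Println(*unset.Count, *paused.Count) // prints "1 0"
}
```

With this shape, existing job specifications that omit the count keep running with the default of 1, and only an explicit 0 pauses the job.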

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from 427f9d0 to b051e59 Compare November 5, 2024 15:34

Juanadelacuesta marked this pull request as ready for review November 5, 2024 15:59

Juanadelacuesta changed the title ~~func: remove validation scaling for system jobs and dont canonicalize…~~ Allow scalating system jobs to 0 Nov 5, 2024

Juanadelacuesta changed the title ~~Allow scalating system jobs to 0~~ Allow scaling system jobs to 0 Nov 5, 2024

Juanadelacuesta requested review from tgross, jrasell and mismithhisler and removed request for tgross November 5, 2024 16:06

Juanadelacuesta added the backport/1.9.x backport to 1.9.x release line label Nov 5, 2024

jrasell reviewed Nov 6, 2024

Juanadelacuesta requested a review from jrasell November 6, 2024 13:34

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from 2202051 to f9a3fe9 Compare November 6, 2024 15:08

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from f9a3fe9 to d364aae Compare November 6, 2024 15:21

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from d364aae to e9c96a6 Compare November 6, 2024 15:52

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from e9c96a6 to 437c6ac Compare November 7, 2024 08:32

Juanadelacuesta added 2 commits November 8, 2024 09:49

func: remove validation scaling for system jobs and dont canonicalize…

63220c8

… to 1

test: update test to validate for 0 and improve error message

6f5a2b0

Juanadelacuesta added 6 commits November 8, 2024 09:49

func: remove the canonicalization to 1 from system jobs

af568b0

docs: add changelog

aae71d8

func: add test for scaling system jobs

2917d48

temp: add logging to debug test

e81aa48

fix: clean up after test is done

186bcf7

fix: scaled down jobs will still have the stop allocation, update tes…

215746a

…t to account for it

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from d0e8c48 to 215746a Compare November 8, 2024 08:49

mismithhisler reviewed Nov 11, 2024

e2e/scaling/scaling.go (outdated, resolved)

Juanadelacuesta requested review from mismithhisler and hc-github-team-nomad-core November 12, 2024 09:07

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from 9449c1c to 84a592e Compare November 12, 2024 11:30

Update the e2e test to accomodate for system jobs to have an alloc pe…

77bf227

…r node

Juanadelacuesta force-pushed the f-NET-9976-system-jobs branch from 84a592e to 77bf227 Compare November 12, 2024 11:55

fix: filter to only count ready nodes on the node count

f6ee8bb

fix: remove the datacenter constrain from the system job definition

a91726b

fix: compare alloc IDs to avoid flaky tests when verifying no alloc w…

194d4e9

…as stoped

fix: remove duplicated code

2e99c71

mismithhisler reviewed Nov 13, 2024

nomad/job_endpoint.go (resolved)

Juanadelacuesta requested a review from mismithhisler November 15, 2024 08:55

mismithhisler approved these changes Nov 15, 2024
