
Nomad UI shows failed for jobs that are scaled to 0 #23591

Closed
caiodelgadonew opened this issue Jul 14, 2024 · 14 comments · Fixed by #23829

Comments

@caiodelgadonew

Hello,

We see this behavior in Nomad: when we run nomad job scale <job> <group> 0, the job shows as Failed in the UI.

It would be nice to have a better message for this state instead of Failed. I know the job is not healthy since the service checks can't pass because it's scaled to zero, but maybe a label like "Downscaled" would be a good thing to have.

Is this on track? Are there any plans for it?

Example situation:
[Screenshot: CleanShot 2024-07-14 at 12 02 03]

Expected:
[Screenshot: CleanShot 2024-07-14 at 12 10 43]
or
[Screenshot: CleanShot 2024-07-14 at 12 13 30]

@tgross
Member

tgross commented Jul 15, 2024

@caiodelgadonew in recent versions of the UI we're presenting a sort of "aggregate state" based on the job status and the allocation status. Can you verify that the API is returning Running for the job status, and not Dead? (That is, what does nomad job status $myjob show for the job status?)
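For example, either of these should show the job-level status (a sketch; $myjob is a placeholder and $NOMAD_ADDR should point at your cluster):

$ nomad job status $myjob
$ curl -s "$NOMAD_ADDR/v1/job/$myjob" | jq .Status

The first prints a "Status" line in its summary; the second reads the Status field directly off the job object via the HTTP API.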

@caiodelgadonew
Author

@tgross no, it's not returning running, it's returning dead.
I've deployed an example to test it.

To reproduce, deploy any job and then scale its groups to 0.
In my example:

$ nomad job scale testing-scheduler app 0
$ nomad job scale testing-scheduler app2 0

[Screenshot: CleanShot 2024-07-15 at 14 50 15]

[Screenshot: CleanShot 2024-07-15 at 14 50 24]

@tgross
Member

tgross commented Jul 15, 2024

Ok, thanks. I'd expect a job with all complete allocations to be "dead". We made a lot of changes to provide reasonable "aggregate" statuses in 1.8, but there are an awful lot of corner cases 😁 I'll mark this for roadmapping.

@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jul 15, 2024
@tgross tgross added this to Nomad UI Jul 15, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Nomad UI Jul 15, 2024
@caiodelgadonew
Author

Thanks @tgross, you mean Running, right?

@tgross
Member

tgross commented Jul 15, 2024

Actually, no. 😀 The job status is a pretty coarse view of the world and so "dead" just means all allocations are terminal. I could definitely see an argument that this view of the world isn't ideal because it makes the UI design harder, but it's definitely the intended behavior right now. But the UI does want to present a richer "aggregate state" that would account for this case where the job is scaled down.

@philrenaud
Contributor

I've got a note to make this experience better in the UI. The common way this shows itself is when allocations complete and are garbage-collected, but garbage-collected allocations don't leave the UI with a lot of clues as to whether their absence is a bad thing (Failed) or an expected thing (Dormant or something).

In this case, a deliberate scale to 0 seems more detectable. I'll see what I can do to give a status less off-putting than Failed here.

@philrenaud philrenaud self-assigned this Jul 19, 2024
@philrenaud philrenaud moved this from Backlog to Todo in Nomad UI Jul 19, 2024
@caiodelgadonew
Author

Any updates on this, @philrenaud? :)

@sevensolutions
Contributor

I'm also playing around with job scaling to implement a scale-to-zero strategy and came across this issue.
One important thing for me is that a job scaled down to zero must never be removed by the garbage collector, so I can scale it back up at any time.

@philrenaud
Contributor

philrenaud commented Aug 15, 2024

Hi @caiodelgadonew, I've started thinking about a solution to this problem. The current logic for status has something like

    if (failedOrLostAllocs.length >= expectedRunningAllocCount) {
      return { label: 'Failed', state: 'critical' };
    }

Which generally works pretty well, except here, where 0 failed allocs is greater than or equal to 0 expected allocs, triggering the Failed state.

As such, in #23829, I've started putting in a safety valve for exactly this case:

    if (expectedRunningAllocCount === 0) {
      return { label: 'Scaled Down', state: 'neutral' };
    }

This'll show something like this (3rd job shown):
[screenshot]

Is this about what you had in mind?
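
For reference, a minimal sketch of how the two checks fit together (simplified; the function name and the fallback branch are illustrative, not the exact Nomad UI source):

    // Sketch of the aggregate-status derivation discussed above.
    // In the real UI these values come from the job, its task groups,
    // and their allocations; here they're plain parameters.
    function aggregateJobStatus(expectedRunningAllocCount, failedOrLostAllocs) {
      // Guard from #23829: a group deliberately scaled to 0 expects no running
      // allocations, so "0 failed >= 0 expected" must not read as Failed.
      if (expectedRunningAllocCount === 0) {
        return { label: 'Scaled Down', state: 'neutral' };
      }
      if (failedOrLostAllocs.length >= expectedRunningAllocCount) {
        return { label: 'Failed', state: 'critical' };
      }
      return { label: 'Running', state: 'success' };
    }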

========================

A secondary concern (cc @sevensolutions as well) is that these jobs are nevertheless considered Dead as far as the scheduling and garbage collection processes go (they are terminal until a manual change is made to increase the allocation count).

This means that these statuses are at most only temporary: "Scaled Down" is what you'd see until garbage collection takes place / a user runs nomad system gc / etc.

We have been exploring some concepts that would create garbage-collection-avoidance permanence (see Golden Job Versions, for example) that might mitigate this in the future, but I wanted to open a dialogue here to note the temporary nature of this status as implemented in #23829.
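
In the meantime, one stopgap for keeping a scaled-to-zero job around longer is to raise the server's job GC threshold (a sketch of the relevant server agent configuration; the value is illustrative, not a recommendation):

    # Server agent config (sketch; value is illustrative).
    server {
      enabled = true

      # Minimum time a job must sit in a terminal ("dead") state before the
      # periodic garbage collector may remove it (4h by default).
      job_gc_threshold = "720h"
    }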

@caiodelgadonew
Author

@philrenaud That "Scaled Down" looks amazing!
IMHO, if it could be in a different color (maybe yellow/warning?) it would be 10/10.

@sevensolutions
Contributor

@caiodelgadonew I also thought about a different color at first, but I think "Scaled Down" doesn't necessarily need to be a warning. It may be intended.

@caiodelgadonew
Author

Yeah, you're correct, I like your idea. :)

Any plans for shipping it?

@philrenaud
Contributor

Yep, let me test some edge cases and get it tagged for the next minor release. Thanks for your patience with this!

@caiodelgadonew
Author

Amazing @philrenaud many thanks!!!
