
Nomad UI shows failed for jobs that are scaled to 0 #23591

Closed
caiodelgadonew opened this issue Jul 14, 2024 · 14 comments · Fixed by #23829

Comments

@caiodelgadonew

Hello,

We see this behavior in Nomad: when we run nomad job scale <job> <group> 0, the job shows as Failed in the UI.

It would be nice to have a better message for this state instead of Failed. I know the job is not healthy since the service checks can't pass because it's scaled to zero, but maybe a label like "Downscaled" would be a good thing to have.

Is this on track? Are there any plans for it?

Example situation:
[Screenshot: CleanShot 2024-07-14 at 12 02 03]

Expected:
[Screenshot: CleanShot 2024-07-14 at 12 10 43]
or
[Screenshot: CleanShot 2024-07-14 at 12 13 30]

@tgross
Member

tgross commented Jul 15, 2024

@caiodelgadonew in recent versions of the UI we're presenting a sort of "aggregate state" based on the job status and the allocation status. Can you verify that the API is returning Running for the job status, and not Dead? (That is, what does nomad job status $myjob show for the job status?)
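For example, either of these should show the job-level status (a sketch; $myjob is a placeholder and $NOMAD_ADDR should point at your cluster):

$ nomad job status $myjob
$ curl -s "$NOMAD_ADDR/v1/job/$myjob" | jq .Status

The first prints a "Status" line in its summary; the second reads the Status field directly off the job object via the HTTP API.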

@caiodelgadonew
Author

@tgross no, it's not returning running, it's returning dead.
I've deployed an example to test it.

To reproduce, deploy any job and then scale its groups to 0.
In my example:

$ nomad job scale testing-scheduler app 0
$ nomad job scale testing-scheduler app2 0

[Screenshot: CleanShot 2024-07-15 at 14 50 15]

[Screenshot: CleanShot 2024-07-15 at 14 50 24]

@tgross
Member

tgross commented Jul 15, 2024

Ok, thanks. I'd expect a job with all complete allocations to be "dead". We made a lot of changes to provide reasonable "aggregate" statuses in 1.8, but there are an awful lot of corner cases 😁 I'll mark this for roadmapping.

@tgross tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Jul 15, 2024
@tgross tgross added this to Nomad UI Jul 15, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Nomad UI Jul 15, 2024
@caiodelgadonew
Author

Thanks @tgross, you mean Running, right?

@tgross
Member

tgross commented Jul 15, 2024

Actually, no. 😀 The job status is a pretty coarse view of the world and so "dead" just means all allocations are terminal. I could definitely see an argument that this view of the world isn't ideal because it makes the UI design harder, but it's definitely the intended behavior right now. But the UI does want to present a richer "aggregate state" that would account for this case where the job is scaled down.

@philrenaud
Contributor

I've got a note to make this experience better in the UI. The common way this shows itself is when allocations complete and are garbage-collected, but garbage-collected allocations don't leave the UI with a lot of clues as to whether their absence is a bad thing (Failed) or an expected thing (Dormant or something).

In this case, a deliberate scale to 0 seems more detectable. I'll see what I can do to give a status less off-putting than Failed here.

@philrenaud philrenaud self-assigned this Jul 19, 2024
@philrenaud philrenaud moved this from Backlog to Todo in Nomad UI Jul 19, 2024
@caiodelgadonew
Author

Any updates on this, @philrenaud? :)

@sevensolutions
Contributor

I'm also playing around with job scaling to implement a scale-to-zero strategy and came across this issue.
One important thing for me is that a job scaled down to zero must never be removed by the garbage collector, so I can scale it back up at any time.

@philrenaud
Contributor

philrenaud commented Aug 15, 2024

Hi @caiodelgadonew, I've started thinking about a solution to this problem. The current logic for status has something like

    if (failedOrLostAllocs.length >= expectedRunningAllocCount) {
      return { label: 'Failed', state: 'critical' };
    }

Which generally works pretty well, except here, where 0 failed allocs is greater than or equal to 0 expected allocs, triggering the Failed state.

As such, in #23829, I've started putting in a safety valve for exactly this case:

    if (expectedRunningAllocCount === 0) {
      return { label: 'Scaled Down', state: 'neutral' };
    }

This'll show something like this (3rd job shown):
[screenshot]

Is this about what you had in mind?
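
For reference, a minimal sketch of how the two checks fit together (simplified; the function name and the fallback branch are illustrative, not the exact Nomad UI source):

    // Sketch of the aggregate-status derivation discussed above.
    // In the real UI these values come from the job, its task groups,
    // and their allocations; here they're plain parameters.
    function aggregateJobStatus(expectedRunningAllocCount, failedOrLostAllocs) {
      // Guard from #23829: a group deliberately scaled to 0 expects no running
      // allocations, so "0 failed >= 0 expected" must not read as Failed.
      if (expectedRunningAllocCount === 0) {
        return { label: 'Scaled Down', state: 'neutral' };
      }
      if (failedOrLostAllocs.length >= expectedRunningAllocCount) {
        return { label: 'Failed', state: 'critical' };
      }
      return { label: 'Running', state: 'success' };
    }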

========================

A secondary concern (cc @sevensolutions as well) is that these jobs are nevertheless considered Dead as far as the scheduling and garbage collection processes go (they are terminal until a manual change is made to increase the allocation count).

This means that these statuses are at most only temporary: "Scaled Down" is what you'd see until garbage collection takes place / a user runs nomad system gc / etc.

We have been exploring some concepts that would create garbage-collection-avoidance permanence (see Golden Job Versions, for example) that might mitigate this in the future, but I wanted to open a dialogue here to note the temporary nature of this status as implemented in #23829.
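
In the meantime, one stopgap for keeping a scaled-to-zero job around longer is to raise the server's job GC threshold (a sketch of the relevant server agent configuration; the value is illustrative, not a recommendation):

    # Server agent config (sketch; value is illustrative).
    server {
      enabled = true

      # Minimum time a job must sit in a terminal ("dead") state before the
      # periodic garbage collector may remove it (4h by default).
      job_gc_threshold = "720h"
    }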

@caiodelgadonew
Author

@philrenaud That "Scaled Down" looks amazing!
IMHO, if it could be in a different color (maybe yellow/warning?) it would be 10/10.

@sevensolutions
Contributor

@caiodelgadonew I also thought about a different color at first, but I think "Scaled Down" doesn't necessarily need to be a warning. It may be intended.

@caiodelgadonew
Author

Yeah, you're correct, I like your idea. :)

Any plans for shipping it?

@philrenaud
Contributor

Yep, let me test some edge cases and get it tagged for the next minor release. Thanks for your patience with this!

@caiodelgadonew
Author

Amazing @philrenaud many thanks!!!
