The requester node attempts to schedule work on disconnected nodes resulting in the job never running #3784

Closed
frrist opened this issue Apr 11, 2024 · 4 comments · Fixed by #3957
Labels
type/bug Type: Something is not working as expected

Comments

@frrist
Member

frrist commented Apr 11, 2024

Due to changes here:

Current proposal is to:

cc @rossjones & @wdbaruni to weigh in on how the new event system introduced in #3772 can be used to force scheduling of executions when offline compute nodes come online again.

@frrist frrist added the type/bug Type: Something is not working as expected label Apr 11, 2024
@frrist frrist added this to the Release v1.3.1 milestone Apr 11, 2024
@frrist frrist self-assigned this Apr 15, 2024
@wdbaruni
Member

What is the proposal here? I believe the default option now is to auto-approve nodes, and only schedule on approved and connected nodes. Is any of that still missing?

@frrist
Member Author

frrist commented Apr 16, 2024

Yeah, scheduling only on connected and approved nodes is the part that's missing. Currently we schedule on disconnected nodes for some job types and ignore approval state for other job types. Frankly, it's a bit of a mess:

Or rather than modify, allow these aspects of scheduling to be configured.

Further, we need to ensure that work scheduled on an offline node runs when the node comes back online, which we will need #3772 to do. E.g. the orchestrator could listen for connected events and create an evaluation to execute the work.
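To make the "listen for connected events and create an evaluation" idea concrete, here is a minimal sketch in Go. Every name in it (`NodeEvent`, `Execution`, `Evaluation`, `EvaluationsFor`) is hypothetical and invented for illustration; it does not reflect the event system from #3772 or any existing orchestrator type.

```go
package main

import "fmt"

// NodeEvent is a hypothetical connection-state change for a compute node.
type NodeEvent struct {
	NodeID    string
	Connected bool
}

// Execution is a hypothetical record of work assigned to a node but not yet run.
type Execution struct {
	JobID  string
	NodeID string
}

// Evaluation is a hypothetical request asking the scheduler to re-examine a job.
type Evaluation struct {
	JobID  string
	Reason string
}

// EvaluationsFor returns one evaluation per distinct job that has work
// waiting on the node that just connected. Disconnect events yield nothing.
func EvaluationsFor(ev NodeEvent, pending []Execution) []Evaluation {
	if !ev.Connected {
		return nil
	}
	seen := make(map[string]bool)
	var out []Evaluation
	for _, e := range pending {
		if e.NodeID != ev.NodeID || seen[e.JobID] {
			continue
		}
		seen[e.JobID] = true
		out = append(out, Evaluation{JobID: e.JobID, Reason: "node-connected"})
	}
	return out
}

func main() {
	// A buffered channel stands in for whatever event bus is actually used.
	events := make(chan NodeEvent, 2)
	events <- NodeEvent{NodeID: "n1", Connected: false}
	events <- NodeEvent{NodeID: "n1", Connected: true}
	close(events)

	pending := []Execution{
		{JobID: "job-1", NodeID: "n1"},
		{JobID: "job-2", NodeID: "n2"},
	}

	for ev := range events {
		for _, eval := range EvaluationsFor(ev, pending) {
			fmt.Println("enqueue evaluation for", eval.JobID)
		}
	}
}
```

Deduplicating by job keeps a reconnect from flooding the scheduler when one node holds several executions of the same job.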

@frrist
Member Author

frrist commented Apr 22, 2024

Another point to consider:
How can we allow users to define different scheduling heuristics for compute nodes? E.g. nodes in a data center ought to have a stricter requirement on connectedness than nodes that are expected to go offline for longer periods of time (e.g. submarine compute nodes).

@rossjones
Contributor

@wdbaruni previously suggested adding another (a third) timeout in the future, which would allow nodes to be offline for that long before being considered dead.

frrist pushed a commit that referenced this issue Apr 25, 2024
frrist pushed a commit that referenced this issue Apr 25, 2024
@frrist frrist moved this from Inbox to In Progress in Engineering Planning May 1, 2024
@frrist frrist moved this from In Progress to In Review in Engineering Planning May 1, 2024
frrist added a commit that referenced this issue May 8, 2024
- This change modifies the Requester node's scheduling constraints so that
jobs will only be scheduled on nodes that are online and approved.
Disconnected nodes and nodes that are rejected or pending will not be
eligible to run jobs.
- Additionally, this change cleans up some code by making constraints a
parameter to the node selector, which simplifies various parts of
dependency construction.
- Lastly, this change removes some *Param types to avoid the possibility
of NPD.
- fixes #3784

Co-authored-by: frrist <[email protected]>
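The constraint described in the commit message, passed as a parameter to a node selector, can be sketched roughly as follows in Go. This is an illustration under assumptions, not the code from the fixing PR: `Approval`, `NodeState`, `Constraints`, and `SelectNodes` are hypothetical names.

```go
package main

import "fmt"

// Approval is a hypothetical membership state for a compute node.
type Approval int

const (
	Pending Approval = iota
	Approved
	Rejected
)

// NodeState is a hypothetical view of a node as seen by the requester.
type NodeState struct {
	ID        string
	Approval  Approval
	Connected bool
}

// Constraints is handed to the selector by the caller rather than being
// hard-coded inside it, so different callers can ask for different rules.
type Constraints struct {
	RequireApproval  bool
	RequireConnected bool
}

// SelectNodes returns the nodes that satisfy every enabled constraint.
func SelectNodes(nodes []NodeState, c Constraints) []NodeState {
	var eligible []NodeState
	for _, n := range nodes {
		if c.RequireApproval && n.Approval != Approved {
			continue
		}
		if c.RequireConnected && !n.Connected {
			continue
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	nodes := []NodeState{
		{ID: "a", Approval: Approved, Connected: true},
		{ID: "b", Approval: Approved, Connected: false},
		{ID: "c", Approval: Pending, Connected: true},
		{ID: "d", Approval: Rejected, Connected: true},
	}

	strict := Constraints{RequireApproval: true, RequireConnected: true}
	for _, n := range SelectNodes(nodes, strict) {
		fmt.Println("eligible:", n.ID)
	}
}
```

Under the strict constraints only node `a` is eligible; the disconnected, pending, and rejected nodes are filtered out. Passing plain values instead of pointer parameter structs also leaves nothing to be nil.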
@github-project-automation github-project-automation bot moved this from In Review to Done in Engineering Planning May 8, 2024