The requester node attempts to schedule work on disconnected nodes resulting in the job never running #3784
What is the proposal here? I believe the default option now is to auto-approve nodes, and only schedule on approved and connected nodes. Is any of that still missing?
Yeah, scheduling only on connected and approved nodes is the part that is missing. Currently we schedule on disconnected nodes for some job types and ignore approval state for other job types. Frankly, it's a bit of a mess:
Or, rather than modify, allow these aspects of scheduling to be configured. Further, we need to ensure that work scheduled on an offline node runs when the node comes back online, which we will need #3772 to do. E.g. the orchestrator could listen for connected events and create an evaluation to execute the work.
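A minimal sketch of the "listen for connected events and create an evaluation" idea, to make the suggestion concrete. Everything here (`NodeEvent`, `Evaluation`, `Orchestrator`, `Defer`, `HandleNodeEvent`) is a hypothetical name for illustration and not the event API from #3772:

```go
package main

import "fmt"

// NodeEvent is a hypothetical connection-state event for a compute node.
type NodeEvent struct {
	NodeID    string
	Connected bool
}

// Evaluation is a hypothetical request to (re)run scheduling for a job.
type Evaluation struct {
	JobID       string
	TriggeredBy string
}

// Orchestrator tracks jobs whose work is waiting on an offline node.
type Orchestrator struct {
	pending map[string][]string // node ID -> IDs of jobs waiting for that node
}

func NewOrchestrator() *Orchestrator {
	return &Orchestrator{pending: make(map[string][]string)}
}

// Defer records that a job has work waiting for the given node.
func (o *Orchestrator) Defer(nodeID, jobID string) {
	o.pending[nodeID] = append(o.pending[nodeID], jobID)
}

// HandleNodeEvent returns one evaluation per waiting job when a node
// reconnects. Disconnect events produce no evaluations.
func (o *Orchestrator) HandleNodeEvent(ev NodeEvent) []Evaluation {
	if !ev.Connected {
		return nil
	}
	jobs := o.pending[ev.NodeID]
	delete(o.pending, ev.NodeID)

	evals := make([]Evaluation, 0, len(jobs))
	for _, jobID := range jobs {
		evals = append(evals, Evaluation{
			JobID:       jobID,
			TriggeredBy: "node-connected:" + ev.NodeID,
		})
	}
	return evals
}

func main() {
	o := NewOrchestrator()
	o.Defer("node-1", "job-a")
	o.Defer("node-1", "job-b")

	for _, e := range o.HandleNodeEvent(NodeEvent{NodeID: "node-1", Connected: true}) {
		fmt.Println(e.JobID, e.TriggeredBy)
	}
}
```

The pending list is cleared once the evaluations are emitted, so a second connected event for the same node does not re-trigger the same jobs.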
Another point to consider:
@wdbaruni previously suggested adding another (a third) timeout in the future, which would allow nodes to be offline for that long before being considered dead.
- This change modifies the Requester node's scheduling constraints such that jobs will only be scheduled on nodes that are online and approved. Disconnected nodes and nodes that are rejected or pending will not be eligible to run jobs.
- Additionally, this change cleans up some code by making constraints a parameter to the node selector, which simplifies various parts of dependency construction.
- Lastly, this change removes some *Param types to avoid the possibility of nil pointer dereferences (NPD).
- fixes #3784

Co-authored-by: frrist <[email protected]>
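As a rough illustration of passing constraints to the node selector as a parameter, here is a small sketch. The types and names (`NodeState`, `NodeSelectionConstraints`, `SelectNodes`) are assumptions made for this example and are not taken from the actual change:

```go
package main

import "fmt"

// ApprovalState is a hypothetical enum for a node's membership approval.
type ApprovalState int

const (
	ApprovalPending ApprovalState = iota
	ApprovalApproved
	ApprovalRejected
)

// ConnectionState is a hypothetical enum for a node's liveness.
type ConnectionState int

const (
	Disconnected ConnectionState = iota
	Connected
)

// NodeState is a hypothetical view of a compute node as seen by the requester.
type NodeState struct {
	ID         string
	Approval   ApprovalState
	Connection ConnectionState
}

// NodeSelectionConstraints is passed by value, so callers cannot hand the
// selector a nil pointer.
type NodeSelectionConstraints struct {
	RequireApproval  bool
	RequireConnected bool
}

// SelectNodes returns the nodes that satisfy the given constraints.
func SelectNodes(nodes []NodeState, c NodeSelectionConstraints) []NodeState {
	var eligible []NodeState
	for _, n := range nodes {
		if c.RequireApproval && n.Approval != ApprovalApproved {
			continue // pending or rejected nodes are skipped
		}
		if c.RequireConnected && n.Connection != Connected {
			continue // disconnected nodes are skipped
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	nodes := []NodeState{
		{ID: "n1", Approval: ApprovalApproved, Connection: Connected},
		{ID: "n2", Approval: ApprovalApproved, Connection: Disconnected},
		{ID: "n3", Approval: ApprovalPending, Connection: Connected},
		{ID: "n4", Approval: ApprovalRejected, Connection: Connected},
	}
	strict := NodeSelectionConstraints{RequireApproval: true, RequireConnected: true}
	for _, n := range SelectNodes(nodes, strict) {
		fmt.Println(n.ID)
	}
}
```

With both flags set, only the node that is approved and connected is eligible. Leaving a flag unset relaxes that check, which is one way the behaviour could stay configurable per job type.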
Due to changes here:
Current proposal is to:
`NodeInfo` with `unknown` approval status which overrides previous approvals/rejections #3783

cc @rossjones & @wdbaruni to weigh in on how the new event system introduced in #3772 can be used to force scheduling of executions when offline compute nodes come online again.