Prevent per-node scheduling fallback during batch scheduling retry #719

Merged
Andyz26 merged 3 commits into master from andyz/resubmitHandlingWithBatch on Oct 8, 2024

@Andyz26 Andyz26 commented Oct 5, 2024

Context

Currently, when a new job is submitted, all of its workers are scheduled as a batch, in an all-or-nothing manner. However, the job actor's heartbeat check also tries to reschedule any worker that has been "stuck" in the allocation phase for too long (based on the worker-heartbeat-timeout config). As a result, the batch scheduling is invalidated after that timeout, which can be problematic for larger jobs that need more time for the cluster autoscaler to allocate the requested resources.

Behavior changes here:

  • Batch scheduling failures are retried with no attempt limit. (We rely on the cancel request message from the job actor to interrupt the retries.)
  • The JobActor heartbeat routine no longer acts on unscheduled workers.
  • Fixed the akka-tests and added them back to the CI build.
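
The first two behavior changes can be sketched roughly as below. This is a minimal, self-contained illustration and not the code in this PR: the class `BatchSchedulingSketch`, the `WorkerState` values, the `Worker` type, and both method names are hypothetical, and the real scheduler retries asynchronously rather than in a tight loop.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class BatchSchedulingSketch {

    /** Hypothetical worker states; ACCEPTED means "not yet scheduled". */
    public enum WorkerState { ACCEPTED, LAUNCHED, STARTED }

    /** Hypothetical worker record used only for this illustration. */
    public static final class Worker {
        public final int id;
        public final WorkerState state;
        public final Instant lastHeartbeat;

        public Worker(int id, WorkerState state, Instant lastHeartbeat) {
            this.id = id;
            this.state = state;
            this.lastHeartbeat = lastHeartbeat;
        }
    }

    /**
     * Heartbeat routine: returns the ids of workers whose last heartbeat is
     * older than the timeout. Unscheduled workers are skipped so that the
     * pending batch scheduling request is not invalidated.
     */
    public static List<Integer> workersToResubmit(
            List<Worker> workers, Instant now, Duration heartbeatTimeout) {
        List<Integer> stale = new ArrayList<>();
        for (Worker w : workers) {
            if (w.state == WorkerState.ACCEPTED) {
                continue; // still waiting on batch scheduling; leave it alone
            }
            if (Duration.between(w.lastHeartbeat, now).compareTo(heartbeatTimeout) > 0) {
                stale.add(w.id);
            }
        }
        return stale;
    }

    /**
     * Batch scheduling: retries the all-or-nothing allocation with no attempt
     * limit. Only a cancel signal (set by the job actor in this sketch) stops
     * the retries. Returns true if the batch was scheduled, false if cancelled.
     * A real scheduler would wait between attempts instead of looping tightly.
     */
    public static boolean scheduleBatchUntilCancelled(
            BooleanSupplier tryAllocateAll, AtomicBoolean cancelled) {
        while (!cancelled.get()) {
            if (tryAllocateAll.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Worker> workers = List.of(
                new Worker(1, WorkerState.ACCEPTED, now.minusSeconds(600)),
                new Worker(2, WorkerState.STARTED, now.minusSeconds(600)),
                new Worker(3, WorkerState.STARTED, now.minusSeconds(10)));
        // Worker 1 is skipped even though it has been waiting longer than the timeout.
        System.out.println(workersToResubmit(workers, now, Duration.ofSeconds(120)));

        int[] attempts = {0};
        boolean scheduled = scheduleBatchUntilCancelled(
                () -> ++attempts[0] >= 3, new AtomicBoolean(false));
        System.out.println(scheduled + " after " + attempts[0] + " attempts");
    }
}
```

The point of the sketch is the division of responsibility: the heartbeat check only handles workers that have already been scheduled, while the batch request stays pending until it either succeeds or the job actor cancels it.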

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable

Andyz26 commented Oct 5, 2024

@fdc-ntflx ptal

github-actions bot commented Oct 5, 2024

Test Results

  • 614 tests (+75): 604 ✅ passed (+71), 10 💤 skipped (+4), 0 ❌ failed (±0)
  • 142 suites (+3), 142 files (+3)
  • Duration: 8m 7s (+11s)

Results for commit a527db3. ± Comparison against base commit 65a7949.

♻️ This comment has been updated with latest results.

@Andyz26 Andyz26 requested a deployment to Integrate Pull Request October 7, 2024 23:28 — with GitHub Actions Waiting
@Andyz26 Andyz26 force-pushed the andyz/resubmitHandlingWithBatch branch from 46964da to a527db3 Compare October 7, 2024 23:44
@Andyz26 Andyz26 requested a deployment to Integrate Pull Request October 7, 2024 23:44 — with GitHub Actions Waiting
fdc91

This comment was marked as duplicate.

@fdc-ntflx fdc-ntflx left a comment

Sorry, I reviewed it with the wrong account. LGTM.

@Andyz26 Andyz26 merged commit 0aabc16 into master Oct 8, 2024
4 of 5 checks passed
@Andyz26 Andyz26 deleted the andyz/resubmitHandlingWithBatch branch October 8, 2024 16:24
3 participants