Prevent per-node scheduling fallback during batch scheduling retry #719

Andyz26 · 2024-10-05T01:44:50Z

Context

Currently, when a new job is submitted, all of its workers are scheduled as a single batch so that allocation is all-or-nothing. However, the job actor's heartbeat check also tries to re-schedule any worker that has been "stuck" in the allocation phase for too long (based on the worker-heartbeat-timeout config). As a result, batch scheduling is invalidated once that timeout elapses and falls back to per-node scheduling, which is problematic for larger jobs that need more time for the cluster autoscaler to allocate the requested resources.

Behavior changes here:

A failed batch scheduling request is retried with no attempt limit. (We rely on the cancel request message from the job actor to interrupt the retries.)
The JobActor heartbeat routine no longer acts on unscheduled workers.
Fixed the akka-tests and added them back to the CI build.
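
To illustrate the first two changes, here is a minimal sketch in Java. It is not the code from this PR: the class, the `WorkerState` enum, the `Worker` type, and both method names are hypothetical, and a real scheduler would wait between attempts instead of looping immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

class BatchScheduleSketch {
    /** Hypothetical worker states; ACCEPTED means "not scheduled yet". */
    enum WorkerState { ACCEPTED, SCHEDULED, STARTED }

    /** Hypothetical worker record used only for this illustration. */
    static final class Worker {
        final int id;
        final WorkerState state;
        final long lastHeartbeatMs;

        Worker(int id, WorkerState state, long lastHeartbeatMs) {
            this.id = id;
            this.state = state;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Retries an all-or-nothing batch request with no attempt limit.
     * Only the cancel flag (assumed to be set when the job actor sends
     * its cancel request) stops the loop.
     * Returns the number of attempts made, or -1 if cancelled first.
     */
    static int scheduleBatchWithRetry(BooleanSupplier tryScheduleBatch,
                                      AtomicBoolean cancelled) {
        int attempts = 0;
        while (!cancelled.get()) {
            attempts++;
            if (tryScheduleBatch.getAsBoolean()) {
                return attempts;
            }
            // A real implementation would back off here before retrying.
        }
        return -1;
    }

    /**
     * Heartbeat check: returns the ids of workers whose heartbeat timed out,
     * skipping workers that have not been scheduled yet so that the batch
     * retry above stays the only path that places them.
     */
    static List<Integer> workersToResubmit(List<Worker> workers,
                                           long nowMs, long timeoutMs) {
        List<Integer> result = new ArrayList<>();
        for (Worker w : workers) {
            if (w.state == WorkerState.ACCEPTED) {
                continue; // unscheduled: leave it to the batch retry
            }
            if (nowMs - w.lastHeartbeatMs > timeoutMs) {
                result.add(w.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The simulated batch request succeeds on the third attempt.
        int attempts = scheduleBatchWithRetry(() -> ++calls[0] >= 3,
                                              new AtomicBoolean(false));
        System.out.println("attempts = " + attempts);

        List<Worker> workers = new ArrayList<>();
        workers.add(new Worker(1, WorkerState.ACCEPTED, 0L));    // unscheduled, skipped
        workers.add(new Worker(2, WorkerState.STARTED, 0L));     // heartbeat timed out
        workers.add(new Worker(3, WorkerState.STARTED, 9_000L)); // heartbeat still fresh
        System.out.println("resubmit = " + workersToResubmit(workers, 10_000L, 5_000L));
    }
}
```

In this sketch, worker 1 is never handed to per-node re-scheduling even though its heartbeat timestamp is old, which is the fallback this PR is meant to prevent.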

Checklist

./gradlew build compiles code correctly
Added new tests where applicable
./gradlew test passes all tests
Extended README or added javadocs where applicable

Andyz26 · 2024-10-05T01:45:10Z

@fdc-ntflx ptal

github-actions · 2024-10-05T01:51:21Z

Test Results

614 tests +75 · 604 ✅ +71 · 8m 7s ⏱️ +11s
142 suites +3 · 10 💤 +4
142 files +3 · 0 ❌ ±0

Results for commit a527db3. ± Comparison against base commit 65a7949.

♻️ This comment has been updated with latest results.

fdc-ntflx

Sorry, I reviewed it with the wrong account. LGTM.

Andyz26 added 2 commits October 4, 2024 17:48

batch schedule retry

93113d4

fix akka test

aa2a666

Andyz26 requested review from calvin681, sundargates, hmitnflx and fdc-ntflx as code owners October 5, 2024 01:44

Andyz26 requested a deployment to Integrate Pull Request October 5, 2024 01:45 — with GitHub Actions Waiting

Andyz26 requested a deployment to Integrate Pull Request October 7, 2024 23:28 — with GitHub Actions Waiting

fix ut

a527db3

Andyz26 force-pushed the andyz/resubmitHandlingWithBatch branch from 46964da to a527db3 Compare October 7, 2024 23:44

Andyz26 requested a deployment to Integrate Pull Request October 7, 2024 23:44 — with GitHub Actions Waiting

This comment was marked as duplicate.

fdc-ntflx approved these changes Oct 8, 2024

Andyz26 merged commit 0aabc16 into master Oct 8, 2024
4 of 5 checks passed

Andyz26 deleted the andyz/resubmitHandlingWithBatch branch October 8, 2024 16:24
