[PENG-2342] Jobbergate Agent continually resubmits the same job if the job status update fails #607

Merged
fschuch merged 3 commits into main from fschuch/PENG-2342--cache-submited-jobs-on-the-agent
Sep 5, 2024

Member

@fschuch fschuch commented Aug 28, 2024

What

Create a local cache of job submissions for the Jobbergate Agent:

  • Check the local cache to see whether a job submission was already dispatched to Slurm
  • If the job was not already dispatched:
    • Submit the job to Slurm
    • Create an entry in the cache to indicate that the job was submitted
  • Attempt to update the Job Submission in the Jobbergate API
  • If the update was successful, delete the entry from the cache
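
The steps above could be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `SubmissionCache`, `process_pending`, and the `submit_to_slurm` / `update_api` callables are hypothetical names, and storing one JSON file per submission is an assumption about how such a cache might be persisted.

```python
import json
from pathlib import Path
from typing import Callable, Optional


class SubmissionCache:
    """Hypothetical on-disk cache: job submission id -> Slurm job id."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, submission_id: int) -> Path:
        return self.cache_dir / f"{submission_id}.json"

    def get(self, submission_id: int) -> Optional[int]:
        path = self._path(submission_id)
        if not path.exists():
            return None
        return json.loads(path.read_text())["slurm_job_id"]

    def put(self, submission_id: int, slurm_job_id: int) -> None:
        payload = json.dumps({"slurm_job_id": slurm_job_id})
        self._path(submission_id).write_text(payload)

    def delete(self, submission_id: int) -> None:
        self._path(submission_id).unlink(missing_ok=True)


def process_pending(
    submission_id: int,
    cache: SubmissionCache,
    submit_to_slurm: Callable[[int], int],
    update_api: Callable[[int, int], None],
) -> int:
    # 1. Check whether this submission was already dispatched to Slurm.
    slurm_job_id = cache.get(submission_id)
    if slurm_job_id is None:
        # 2. Not dispatched yet: submit it and record that fact locally.
        slurm_job_id = submit_to_slurm(submission_id)
        cache.put(submission_id, slurm_job_id)
    # 3. Report to the API. If this raises, the cache entry survives,
    #    so the next cycle skips step 2 and only retries the update.
    update_api(submission_id, slurm_job_id)
    # 4. The API is up to date, so the entry is no longer needed.
    cache.delete(submission_id)
    return slurm_job_id
```

Writing the cache entry before calling the API is the important ordering here: a failed update then leaves behind evidence that the job already reached Slurm.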

Why

We found a situation where the Jobbergate Agent repeatedly resubmits a job to Slurm even though the job was already submitted successfully.
After the Jobbergate Agent successfully submits a pending job submission to Slurm, it attempts to update the job's status in the Jobbergate API to indicate that the job was submitted. However, if that update call fails for any reason, the job submission is left in the pending status. On its next cycle, the Jobbergate Agent therefore still sees the job as never having been submitted and tries to submit it again.
Task: https://app.clickup.com/t/18022949/PENG-2342
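
To make the failure mode concrete, here is a small simulation of an agent that keeps no local record of what it has dispatched. Everything in it is hypothetical: `run_agent_cycles`, the `"SUBMITTED"` status name, and the fake Slurm job ids are stand-ins, not Jobbergate code.

```python
from typing import Callable, List, Tuple


def run_agent_cycles(
    n_cycles: int, update_api: Callable[[], None]
) -> Tuple[str, List[int]]:
    """Simulate agent cycles for one pending submission, with no local cache."""
    status = "PENDING"
    slurm_job_ids: List[int] = []
    for cycle in range(n_cycles):
        if status != "PENDING":
            break  # the API knows the job was submitted; nothing left to do
        # Stand-in for a real Slurm submission: every call creates a new job.
        slurm_job_ids.append(1000 + cycle)
        try:
            update_api()
            status = "SUBMITTED"
        except ConnectionError:
            # The update failed, so the API still reports PENDING and the
            # next cycle cannot tell that the job already reached Slurm.
            pass
    return status, slurm_job_ids


def always_fails() -> None:
    raise ConnectionError("API unreachable")


status, jobs = run_agent_cycles(3, always_fails)
print(status, jobs)  # PENDING [1000, 1001, 1002]
```

Three cycles produce three Slurm jobs for a single job submission, which is the behaviour this PR is meant to prevent.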


Peer Review

Please follow the upstream Omnivector documentation on peer-review guidelines.


codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.17%. Comparing base (1223c89) to head (e205c72).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
+ Coverage   92.16%   92.17%   +0.01%     
==========================================
  Files          83       83              
  Lines        4417     4423       +6     
==========================================
+ Hits         4071     4077       +6     
  Misses        346      346              
Flag Coverage Δ
agent 93.48% <ø> (+0.06%) ⬆️
api 95.09% <ø> (ø)
cli 87.70% <ø> (ø)
core 96.32% <ø> (ø)


Member

@dusktreader dusktreader left a comment


Nice work! I added one very minor comment that you may choose to ignore if you wish.

jobbergate-agent/tests/jobbergate/test_submit.py (outdated, resolved)
@fschuch fschuch force-pushed the fschuch/PENG-2342--cache-submited-jobs-on-the-agent branch from 7e4dd5c to e205c72 Compare September 5, 2024 20:20
@fschuch fschuch merged commit 119c973 into main Sep 5, 2024
12 checks passed
@fschuch fschuch deleted the fschuch/PENG-2342--cache-submited-jobs-on-the-agent branch September 5, 2024 20:27

3 participants