Our Gadi-related work has for the most part consisted of short jobs that don't run afoul of GitHub Actions' limits on job run time, of the ssh connection to Gadi being interrupted, or of Gadi itself wondering why we are sitting on a login node doing nothing.
This might change for longer-running jobs, like performance testing.
Determine solutions to get around this. Self-hosted runners with custom polling logic seem like the best solution, but security issues are rife.
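To make the "custom polling logic" idea concrete, here is a minimal sketch of a script that could run on a self-hosted runner, outside any workflow job. It assumes passwordless ssh access to Gadi, an authenticated GitHub CLI (`gh`), a PBS job ID captured at submission time, and a hypothetical downstream workflow named `collect-results.yml`; none of these names come from the issue itself.

```bash
#!/usr/bin/env bash
# Hypothetical polling script run on a self-hosted runner, outside any GHA job.
# Assumes: passwordless ssh to Gadi, an authenticated `gh` CLI, and a downstream
# workflow `collect-results.yml` that accepts a `job-id` input.
set -euo pipefail

GADI_HOST="gadi.nci.org.au"
PBS_JOB_ID="$1"        # e.g. the job ID returned by qsub at submission time
POLL_INTERVAL=300      # seconds between checks

while true; do
    # qstat exits non-zero once the job has left the queue (this simple sketch
    # doesn't distinguish that from a failed ssh connection).
    if ! ssh "${GADI_HOST}" "qstat ${PBS_JOB_ID}" >/dev/null 2>&1; then
        break
    fi
    sleep "${POLL_INTERVAL}"
done

# Kick off the follow-up workflow that reconnects to Gadi and gathers return
# codes, logs and output files. Run this from within a clone of the repository
# (or set the GH_REPO environment variable) so `gh` knows which repo to target.
gh workflow run collect-results.yml -f job-id="${PBS_JOB_ID}"
```

The dispatched workflow would then be an ordinary short job that reconnects to Gadi and pulls back the results.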
References https://github.com/ACCESS-NRI/access-om2-configs/pull/2/files#r1477520036, quoted below:

This is more of a comment for future thinking, so feel free to resolve.
The CI pipeline will have to maintain a connection to Gadi for the duration of the run, so we might need to think up a different solution in CI/CD land for especially long-running jobs, lest Gadi or GitHub terminate our connection/job early.
This is why self-hosted runners would be great, because we can have the completion of the qsub run trigger another workflow via custom logic on the runner itself.
For example, the entire ssh process to Gadi (and the steps within it) can be disowned or taken out of the runner environment entirely, so it isn't killed on workflow completion, and the workflow completes immediately. Then, when the disowned ssh connection finishes (or when we check in periodically on the runner, not as part of a job but as code running on the runner itself), we kick off another job via on.workflow_dispatch that connects to Gadi again and gathers all the return codes, files, etc.
Although there is still that little chestnut of the security issues with self-hosted runners... For something like this I believe we should back up vk83 at minimum and make a separate key for tm70_ci that can only access the subset of projects and files it needs to do this testing.
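As a rough sketch of the detached-ssh variant described in the quote, assuming a self-hosted runner configured so that detached processes outlive the job, and using hypothetical names for the watcher script (`gadi-watcher.sh`), the PBS job script (`run_perf_test.sh`), and the follow-up workflow (`collect-results.yml`), the workflow's final step could launch a watcher and exit immediately:

```bash
# Final step of the workflow on the self-hosted runner: start the watcher in a
# new session so the runner's post-job cleanup doesn't reap it, then let the
# workflow finish straight away.
setsid nohup ./gadi-watcher.sh "${GITHUB_SHA}" > watcher.log 2>&1 &
```

The watcher then holds the ssh connection open itself (rather than inside a time-limited job) and dispatches the follow-up workflow when the PBS job ends:

```bash
#!/usr/bin/env bash
# gadi-watcher.sh (hypothetical): runs on the self-hosted runner, outside any job.
set -euo pipefail
COMMIT_SHA="$1"

# `qsub -W block=true` only returns once the PBS job has finished, and exits
# with the job's exit status; capture it rather than aborting on failure.
rc=0
ssh gadi.nci.org.au "qsub -W block=true run_perf_test.sh" || rc=$?

# On completion, dispatch the follow-up workflow that reconnects to Gadi and
# collects return codes, logs and output files.
gh workflow run collect-results.yml -f sha="${COMMIT_SHA}" -f exit-code="${rc}"
```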