Our Gadi-related work has for the most part consisted of short jobs that don't run afoul of GitHub Actions' limits on job run time, of the ssh connection to Gadi being interrupted, or of Gadi itself wondering why we are sitting on a login node doing nothing.
This might change for longer-running jobs, like performance testing.
Determine solutions to get around this. Self-hosted runners with custom polling logic seem like the best solution, but security issues are rife.
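To make the "custom polling logic" idea concrete, here is a minimal sketch of a script that could run on a self-hosted runner, outside any workflow job. It assumes passwordless ssh access to Gadi, an authenticated GitHub CLI (`gh`), a PBS job ID captured at submission time, and a hypothetical downstream workflow named `collect-results.yml`; none of these names come from the issue itself.

```bash
#!/usr/bin/env bash
# Hypothetical polling script run on a self-hosted runner, outside any GHA job.
# Assumes: passwordless ssh to Gadi, an authenticated `gh` CLI, and a downstream
# workflow `collect-results.yml` that accepts a `job-id` input.
set -euo pipefail

GADI_HOST="gadi.nci.org.au"
PBS_JOB_ID="$1"        # e.g. the job ID returned by qsub at submission time
POLL_INTERVAL=300      # seconds between checks

while true; do
    # qstat exits non-zero once the job has left the queue (this simple sketch
    # doesn't distinguish that from a failed ssh connection).
    if ! ssh "${GADI_HOST}" "qstat ${PBS_JOB_ID}" >/dev/null 2>&1; then
        break
    fi
    sleep "${POLL_INTERVAL}"
done

# Kick off the follow-up workflow that reconnects to Gadi and gathers return
# codes, logs and output files. Run this from within a clone of the repository
# (or set the GH_REPO environment variable) so `gh` knows which repo to target.
gh workflow run collect-results.yml -f job-id="${PBS_JOB_ID}"
```

The dispatched workflow would then be an ordinary short job that reconnects to Gadi and pulls back the results.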
References https://github.com/ACCESS-NRI/access-om2-configs/pull/2/files#r1477520036, quoted below:

This is more of a comment for future thinking, so feel free to resolve.
The CI pipeline will have to maintain a connection to Gadi for the duration of the run, so we might need to think up a different solution in CI/CD land for especially long-running jobs, lest Gadi or GitHub terminate our connection/job early.
This is why self-hosted runners would be great, because we can have the completion of the qsub run trigger another workflow via custom logic on the runner itself.
For example, the entire ssh process to Gadi (and the steps within it) can be disowned or taken out of the runner environment entirely, so it isn't killed on workflow completion, and the workflow completes immediately. Then, when the disowned ssh connection finishes (or when we check in periodically on the runner, not as part of a job but as code running on the runner itself), we kick off another job via on.workflow_dispatch that connects to Gadi again and gathers all the return codes, files, etc.
Although there is still that little chestnut of the security issues with self-hosted runners... For something like this I believe we should back up vk83 at minimum and make a separate key for tm70_ci that can only access the subset of projects and files it needs to do this testing.
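As a rough sketch of the detached-ssh variant described in the quote, assuming a self-hosted runner configured so that detached processes outlive the job, and using hypothetical names for the watcher script (`gadi-watcher.sh`), the PBS job script (`run_perf_test.sh`), and the follow-up workflow (`collect-results.yml`), the workflow's final step could launch a watcher and exit immediately:

```bash
# Final step of the workflow on the self-hosted runner: start the watcher in a
# new session so the runner's post-job cleanup doesn't reap it, then let the
# workflow finish straight away.
setsid nohup ./gadi-watcher.sh "${GITHUB_SHA}" > watcher.log 2>&1 &
```

The watcher then holds the ssh connection open itself (rather than inside a time-limited job) and dispatches the follow-up workflow when the PBS job ends:

```bash
#!/usr/bin/env bash
# gadi-watcher.sh (hypothetical): runs on the self-hosted runner, outside any job.
set -euo pipefail
COMMIT_SHA="$1"

# `qsub -W block=true` only returns once the PBS job has finished, and exits
# with the job's exit status; capture it rather than aborting on failure.
rc=0
ssh gadi.nci.org.au "qsub -W block=true run_perf_test.sh" || rc=$?

# On completion, dispatch the follow-up workflow that reconnects to Gadi and
# collects return codes, logs and output files.
gh workflow run collect-results.yml -f sha="${COMMIT_SHA}" -f exit-code="${rc}"
```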