Set up GPU CI #1067
New issue: caching. My understanding is that all GitHub Actions caching stores data on GitHub's servers. That doesn't reduce data ingress when our job isn't running on GitHub's servers. Right now this job is downloading about 1 GB of data per run. We should try to enable caching through AWS.
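One possible shape for that (just a sketch, not tested against our setup: the bucket name, install command, and test step are placeholders, and the runner would need S3 access, e.g. via an instance profile): sync pip's wheel cache to an S3 bucket in the same region as the runner, so repeat downloads and wheel builds stay inside AWS.

```yaml
# Hypothetical sketch: restore/save pip's cache from an S3 bucket instead of
# GitHub's cache backend, so the data never leaves AWS.
steps:
  - uses: actions/checkout@v4
  - name: Restore pip cache from S3
    run: |
      mkdir -p ~/.cache/pip
      # bucket name is a placeholder; sync is a no-op if the prefix is empty
      aws s3 sync s3://our-gpu-ci-cache/pip ~/.cache/pip --region eu-north-1 --quiet
  - name: Install package
    run: pip install -e ".[dev,test]"  # extras are an assumption, adjust to the project
  - name: Run tests
    run: pytest
  - name: Save pip cache back to S3
    if: always()
    run: aws s3 sync ~/.cache/pip s3://our-gpu-ci-cache/pip --region eu-north-1 --quiet
```

The same pattern could cover other large downloads (e.g. test data), assuming the AWS CLI is available on the image.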
I wouldn't be surprised if fetching the image and connecting to GitHub Actions takes some time? But I guess @aktech knows this better...
GitHub only reports the time it took to run the job itself, nothing before or after. There are the following times to consider:
Using your own custom AMI (it's basically just spinning up an Ubuntu machine with a GPU, installing the NVIDIA drivers yourself, and creating an AMI from it) would reduce the spin-up time significantly. I can get on a call to help with this if required.
Here are the docs on how to set up custom images with CiRun: https://docs.cirun.io/custom-images/cloud-custom-images
You want the second section in that doc: https://docs.cirun.io/custom-images/cloud-custom-images#aws-building-custom-images-with-user-modification (the first one uses the NVIDIA image)
Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out it gets more complicated to write Dockerfiles for GPU setups 😢. Right now I am trying to see how long the instance was actually around for, but I'm not sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.
Are you planning to run tests inside a Docker container? You'd still need nvidia/cuda in the base VM image, I think.
Currently I don't think we have that statistic in the UI anywhere, but I can consider adding it in the check run. Meanwhile, while the instance is still visible in the AWS dashboard (it usually stays visible for some time), you can run the following commands to see how long it was alive for:

```sh
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].LaunchTime' --region eu-north-1
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].StateTransitionReason' --region eu-north-1
```
No, but this was just following the Amazon ECR instructions for "how to create an image". I believe you need
I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!

I found out that I had managed to get logged into a scope with very little access, which is what was making it so difficult to see anything... Still no idea how I did that, I think maybe via the Rackspace site? But now I can look at CloudTrail and have set up Config, so I think we can use that.

I think times are down now. It was taking about 12 min a run last Friday (according to Rackspace); now it's more like 4.5 (according to AWS). GitHub still says it's more like 2, so there's room for improvement, but still better. Of course it will be good to compare measurements from the same place.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here, since it's on GitHub's servers.
Yes, correct.
Yep, makes sense. We would have built these for customers, but NVIDIA's license doesn't allow distribution. If we had the CI setup for automating this you could have used that, but currently we don't have one public; it's a WIP.
I didn't see that, which workflow? This one seems to take less than 2.5 minutes: https://github.com/scverse/anndata/actions/runs/5716599171/job/15494250907?pr=1084
It's not that it takes a long time, it's that it takes longer to set up than to run the tests, so I'd like to bring that down.
Triggering GPU CI

So after billing was a little higher than expected (which may be fixed, but need to confirm once billing updates), we decided not to run CI on every commit. We set the action to run on

So, we need something else to trigger this. It seems our options are:
Implementation (I think):

```yaml
on:
  pull_request:
    types:
      - labeled
      - edited
      - synchronize

jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
```
Currently thinking a tag makes the most sense, since we can easily enable and disable it, and it isn't necessarily linked with merging. It could be that either a label or auto merge would be enough.
Yup, as I thought. Except for the merge queue, all of these of course mean that it'll run for all commits after the label/comment/whatever is added. One option would be to have the workflow remove the label again:
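Something along these lines could work (a rough sketch, not from an actual workflow: the label name, runner label, and token permissions are assumptions based on the config in this thread):

```yaml
# Rough sketch: drop the trigger label once the GPU job starts, so later
# pushes don't keep running on the paid runner until the label is re-added.
jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
    runs-on: cirun-aws-gpu
    permissions:
      pull-requests: write  # so GITHUB_TOKEN may edit PR labels (exact scope is an assumption)
    steps:
      - name: Remove the trigger label again
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.removeLabel({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              name: 'run-gpu-ci',
            })
      # ...checkout, install, and the actual GPU tests would follow here
```

Removing the label as the first step means a failed test run still drops it; it could also live in its own small job.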
@ivirshup the RAPIDS team does this with a comment from a member; this triggers a CI run. But from what I can tell they use
Let's avoid this if possible. It might be possible to manually call the GitHub API to list all PRs for a branch and then create and update a check for the PR that's found, but I'd rather not go down that road when it looks like there's a much simpler solution.
I think there is value in giving a PR the green light to use paid CI, and not needing to approve each individual commit. The one-off case could be useful too, but I think triggering via a comment makes more sense in that case.
@Intron7, I think RAPIDS are using API calls from checks to trigger
@ivirshup Another tip: to reduce cost you can use preemptible (spot) instances:

```yaml
runners:
  - name: aws-gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge
    machine_image: ami-067a4ba2816407ee9
    region: eu-north-1
    preemptible:
      - true
      - false
    labels:
      - cirun-aws-gpu
```

Docs: https://docs.cirun.io/reference/fallback-runners#example-3-preemptiblenon-preemptible-instances

This would try to spin up a preemptible instance first, and if that fails, it will spin up an on-demand instance.
We've still got a little room for improvement on GPU CI, but I think it's pretty much set up! Costs per run are now down to about 1 cent for anndata.
Please describe your wishes and possible alternatives to achieve the desired result.
What does GPU CI need?
import anndata, cupy, cupyx.scipy.sparse; cupyx.scipy.sparse.random(100, 50, format="csr")
Will be partially solved by: #1066
cc @Intron7