Set up GPU CI #1067
New issue: caching. My understanding is that all GitHub Actions caching stores data on GitHub's servers. That doesn't reduce data ingress when our job isn't running on GitHub's servers. Right now this job is downloading about 1 GB of data per run. We should try to enable caching through AWS.
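One possible shape for that (just a sketch, not tested against our setup: the bucket name, install command, and test step are placeholders, and the runner would need S3 access, e.g. via an instance profile): sync pip's wheel cache to an S3 bucket in the same region as the runner, so repeat downloads and wheel builds stay inside AWS.

```yaml
# Hypothetical sketch: restore/save pip's cache from an S3 bucket instead of
# GitHub's cache backend, so the data never leaves AWS.
steps:
  - uses: actions/checkout@v4
  - name: Restore pip cache from S3
    run: |
      mkdir -p ~/.cache/pip
      # bucket name is a placeholder; sync is a no-op if the prefix is empty
      aws s3 sync s3://our-gpu-ci-cache/pip ~/.cache/pip --region eu-north-1 --quiet
  - name: Install package
    run: pip install -e ".[dev,test]"  # extras are an assumption, adjust to the project
  - name: Run tests
    run: pytest
  - name: Save pip cache back to S3
    if: always()
    run: aws s3 sync ~/.cache/pip s3://our-gpu-ci-cache/pip --region eu-north-1 --quiet
```

The same pattern could cover other large downloads (e.g. test data), assuming the AWS CLI is available on the image.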
I wouldn't be surprised if fetching the image and connecting to GitHub Actions takes some time? But I guess @aktech knows this better...
GitHub only reports the time it took to run the job itself, nothing before or after. There are the following times to consider:
Using your own custom AMI (it's basically just spinning up an Ubuntu machine with a GPU, installing the NVIDIA drivers yourself, and creating an AMI from it) would reduce the spin-up time significantly. I can get on a call to help with this if required.
Here are the docs on how to set up custom images with CiRun: https://docs.cirun.io/custom-images/cloud-custom-images
You want the second section in that doc: https://docs.cirun.io/custom-images/cloud-custom-images#aws-building-custom-images-with-user-modification (the first one uses the NVIDIA image)
Thanks for the info @aktech! I've been able to get something running using that. I had been trying to create a Dockerfile for this from some AWS docs, but it turns out it gets more complicated to write Dockerfiles for GPU setups 😢. Right now I am trying to see how long the instance was actually around for, but I'm not sure where I can see logs for this. I think our setup is a little obfuscated here, and the view I have doesn't seem to update quickly.
Are you planning to run tests inside a Docker container? You'd still need nvidia/cuda in the base VM image, I think.
Currently I don't think we have that statistic in the UI anywhere, but I can consider adding it in the check run. Meanwhile, while the instance is still visible in the AWS dashboard (it usually stays visible for some time), you can run the following commands to see how long it was alive for:

```sh
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].LaunchTime' --region eu-north-1
aws ec2 describe-instances --instance-ids INSTANCE_ID --query 'Reservations[0].Instances[0].StateTransitionReason' --region eu-north-1
```
No, but this was just following the Amazon ECR instructions for "how to create an image". I believe you need
I would like to have a more programmatic way to construct these containers, so I will look into this. Thanks!

I found out that I had managed to get logged into a scope with very little access, which is what was making it so difficult to see anything... Still no idea how I did that, I think maybe via the Rackspace site? But now I can look at CloudTrail and have set up Config, so I think we can use that.

I think times are down now. It was taking about 12 min a run last Friday (according to Rackspace); now it's more like 4.5 (according to AWS). GitHub still says it's more like 2, so there's room for improvement, but still better. Of course it will be good to compare measurements from the same place.

@aktech, do you have any suggestions for how we could do caching for our CI? A non-trivial amount of time is spent building wheels and downloading things, which I think we could get down. However, I don't think that GitHub Actions caching is going to help a ton here, since it's on GitHub's servers.
Yes, correct.
Yep, makes sense. We would have built these for customers, but NVIDIA's license doesn't allow distribution. If we had the CI setup for automating this you could have used that, but currently we don't have one public; it's a WIP.
I didn't see that, which workflow? This one seems to take less than 2.5 minutes: https://github.com/scverse/anndata/actions/runs/5716599171/job/15494250907?pr=1084
It's not that it takes a long time, it's that it takes longer to set up than to run the tests, so I'd like to bring that down.
Triggering GPU CI

So after billing was a little higher than expected (which may be fixed, but need to confirm once billing updates), we decided not to run CI on every commit. We set the action to run on

So, we need something else to trigger this. It seems our options are:
Implementation (I think):

```yaml
on:
  pull_request:
    types:
      - labeled
      - edited
      - synchronize

jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
```
Currently thinking a tag makes the most sense, since we can easily enable and disable it, and it isn't necessarily linked with merging. It could be that either a label or auto merge would be enough.
Yup, as I thought. Except for the merge queue, all of these of course mean that it'll run for all commits after the label/comment/whatever is added. One option would be to have the workflow remove the label again:
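Something along these lines could work (a rough sketch, not from an actual workflow: the label name, runner label, and token permissions are assumptions based on the config in this thread):

```yaml
# Rough sketch: drop the trigger label once the GPU job starts, so later
# pushes don't keep running on the paid runner until the label is re-added.
jobs:
  test:
    if: ${{ contains(github.event.pull_request.labels.*.name, 'run-gpu-ci') }}
    runs-on: cirun-aws-gpu
    permissions:
      pull-requests: write  # so GITHUB_TOKEN may edit PR labels (exact scope is an assumption)
    steps:
      - name: Remove the trigger label again
        uses: actions/github-script@v6
        with:
          script: |
            await github.rest.issues.removeLabel({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              name: 'run-gpu-ci',
            })
      # ...checkout, install, and the actual GPU tests would follow here
```

Removing the label as the first step means a failed test run still drops it; it could also live in its own small job.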
@ivirshup the RAPIDS team does this with a comment from a member; this triggers a CI run. But from what I can tell they use
Let's avoid this if possible. It might be possible to manually call the GitHub API to list all PRs for a branch and then create and update a check for the PR that's found, but I'd rather not go down that road when it looks like there's a much simpler solution.
I think there is value in giving a PR the green light to use paid CI, and not needing to approve each individual commit. The one-off case could be useful too, but I think triggering via a comment makes more sense in that case.
@Intron7, I think RAPIDS are using API calls from checks to trigger
@ivirshup Another tip: to reduce cost you can use preemptible (spot) instances:

```yaml
runners:
  - name: aws-gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge
    machine_image: ami-067a4ba2816407ee9
    region: eu-north-1
    preemptible:
      - true
      - false
    labels:
      - cirun-aws-gpu
```

Docs: https://docs.cirun.io/reference/fallback-runners#example-3-preemptiblenon-preemptible-instances

This would try to spin up a preemptible instance first, and if that fails, it will spin up an on-demand instance.
We've still got a little room for improvement on GPU CI, but I think it's pretty much set up! Costs per run are now down to about 1 cent for anndata.
Please describe your wishes and possible alternatives to achieve the desired result.
What does GPU CI need?
import anndata, cupy, cupyx.scipy.sparse; cupyx.scipy.sparse.random(100, 50, format="csr")
Will be partially solved by: #1066
cc @Intron7