CICD: Runs a full GPU install on an EC2 instance #157

Open
wants to merge 126 commits into
base: main

Member

@robotrapta robotrapta commented Jan 8, 2025

Pretty big (but useful!) change. Adds a GHA workflow step that, fully automatically:

  • Creates a new g4 EC2 instance
  • Installs EE on it. (Installs K3s, and installs our YAMLs.)
  • Checks that the EE install script runs successfully
  • Checks that the k8s deployments come up
  • Checks that the SDK can do minimal things through the EE (whoami, list-detectors)
  • (Does not tear down the EC2 infra. A sweeper runs every 30 minutes and does that asynchronously, so the pipeline doesn't have to wait the ~7 minutes it takes to turn off a G4 instance.)
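To make the "deployments come up" check concrete, here is a minimal sketch of a readiness poll. It is not the code in this PR: `deployments_ready` and `wait_for_deployments` are hypothetical names, and the input is assumed to be the parsed JSON produced by `kubectl get deployments -o json`.

```python
import time
from typing import Callable


def deployments_ready(deployment_list: dict) -> bool:
    """Return True when every Deployment reports all desired replicas ready.

    Assumes the parsed JSON shape of `kubectl get deployments -o json`.
    """
    items = deployment_list.get("items", [])
    if not items:
        return False  # nothing deployed yet is treated as "not up"
    for item in items:
        desired = item.get("spec", {}).get("replicas", 1)
        # readyReplicas is omitted from the status while no pod is ready
        ready = item.get("status", {}).get("readyReplicas", 0)
        if ready < desired:
            return False
    return True


def wait_for_deployments(
    fetch: Callable[[], dict],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `fetch` until all deployments are ready or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if deployments_ready(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

In a real pipeline, `fetch` would presumably shell out to `kubectl` on the instance and parse its JSON output. Injecting it (along with `sleep` and `clock`) keeps the polling logic testable without a cluster.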

It uses Pulumi for infra, and relies on Pulumi infra defined in the GL_Public AWS account.

It does NOT yet:

  • Check that any inference models work.

(Note: this overlaps with #169, which directly improves the workflow YAML and its validation.)
(Note: this relies on resources in the gl_public account, which are defined in the internal ci-infra repo.)

@robotrapta
Member Author

Okay, I think this is actually working again and really ready for review! I'm gonna deliberately break the k8s YAML, make sure the test fails (it wouldn't have before), and then restore it.

@robotrapta robotrapta marked this pull request as draft January 20, 2025 20:17
@robotrapta
Member Author

robotrapta commented Jan 20, 2025

It's spec not shrek :)

Error from server (BadRequest): error when creating "deploy/k3s/edge_deployment/edge_deployment.yaml.tmp": Deployment in version "v1" cannot be handled as a Deployment: strict decoding error: unknown field "shrek"
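A failure like this one can also be picked out of the install output programmatically. The sketch below is an illustration only, not what this PR's test does. The helper names `apply_errors` and `unknown_fields` are hypothetical, and it assumes the error lines look like the one quoted above.

```python
import re

# Matches the field name in a strict-decoding error such as: unknown field "shrek"
_UNKNOWN_FIELD = re.compile(r'unknown field "([^"]+)"')


def apply_errors(output: str) -> list[str]:
    """Return the lines of captured output that report a server-side error."""
    return [line for line in output.splitlines() if line.startswith("Error from server")]


def unknown_fields(output: str) -> list[str]:
    """Return every field name the API server rejected as unknown."""
    return _UNKNOWN_FIELD.findall(output)
```

A test step could fail whenever `apply_errors` returns a non-empty list, and include `unknown_fields` in the failure message to point at the typo.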

@robotrapta robotrapta marked this pull request as ready for review January 20, 2025 20:28
@@ -0,0 +1,3 @@
echo "This is a uv project. Remember to 'uv run ...' everything"
Member

Oooohhh

Member

Is this a worry b/c Dependabot won't check uv.lock files yet?

Member Author

It's not ideal. But this dependency closure is pretty tiny - only a handful of packages in here. And it looks like Dependabot is going to add uv support this quarter.

Member

@tyler-romero tyler-romero left a comment

I mean it looks good to me!
