CICD: Runs a full GPU install on an EC2 instance (#157)

Pretty big (but useful!) change. Adds a GHA workflow step that in a fully automatic way: - Creates a new g4 EC2 instance - Installs EE on it. (Installs K3s, and installs our YAMLs.) - Checks that the EE install script runs successfully - Checks that the k8 deployments come up - Checks that the SDK can do minimal things through the EE (whoami, list-detectors) - (Does not tears down the EC2 infra - there's a sweeper which will run every 30 minutes and do that async so the pipeline doesn't have to wait for it, because it takes ~7 minutes to turn off a G4 instance.) It uses pulumi for infra, and relies on pulumi infra defined in the GL_Public AWS account. It does NOT yet: - Check that any inference models work. (Note: there is replication of #169 which improves the workflow YAML and validation thereof directly.) (Note: this relies on resources defined in the `gl_public` account defined in the internal `ci-infra` repo.) --------- Co-authored-by: Ubuntu <[email protected]> Co-authored-by: Ubuntu <robotrapta@groundlight>
groundlight · Jan 24, 2025 · 1695c3f · 1695c3f
1 parent 063e8d3
commit 1695c3f
Show file tree

Hide file tree

Showing 16 changed files with 1,724 additions and 5 deletions.
diff --git a/.github/.yamllint.yaml b/.github/.yamllint.yaml
@@ -12,3 +12,4 @@ rules:
   comments: disable
   trailing-spaces: disable
   empty-lines: disable
+  new-line-at-end-of-file: disable
diff --git a/.github/actionlint.yaml b/.github/actionlint.yaml
@@ -21,3 +21,8 @@ paths:
       - '.*was deprecated.*'
       - '.*shellcheck.*:warning:.*'
       - '.*shellcheck.*:info:.*'
+
+      # The security warning of head.ref being dangerous is painfully stupid.
+      # It's worried that the commit hash string could be malicious.  (Never mind that
+      # an attacker generating PR's can much more easily just execute malicious code.)
+      - '.*github.event.pull_request.head.ref.*is potentially untrusted.*'
diff --git a/.github/workflows/pipeline.yaml b/.github/workflows/pipeline.yaml
@@ -28,7 +28,7 @@ jobs:
 
       - name: Set up python
         id: setup_python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
 
@@ -179,7 +179,7 @@ jobs:
         uses: actions/checkout@v3
 
       - name: Set up python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ env.PYTHON_VERSION }}
 
@@ -252,13 +252,130 @@ jobs:
         if: always()
         run: docker stop ${{ steps.start_container.outputs.container_id }}
 
+  G4-end-to-end:
+    # Note this job can run multiple times in parallel because the stack name is unique
+    # for the run.  How much we want to do this is TBD.
+    runs-on: self-hosted
+
+    # Run this on any PR.
+    # Question: Should we wait until the other tests pass before running this?
+    #needs:
+    #  - validate-setup-ee
+    #  - test-with-k3s
+    #  - test-sdk
+
+    env:
+      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
+      PYTHONUNBUFFERED: 1
+    defaults:
+      run:
+        working-directory: cicd/pulumi
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v3
+
+      - name: Name the stack
+        run: |
+          # Set to expire in 60 minutes
+          EXPIRATION_TIME=$(($(date +%s) + 60 * 60))
+          STACK_NAME=ee-cicd-${{ github.run_id }}-expires-${EXPIRATION_TIME}
+          echo "STACK_NAME=${STACK_NAME}" | tee -a $GITHUB_ENV
+          # We give the stack a name including its expiration time so that the sweeper
+          # (in sweeper-eeut.yaml) knows when to get rid of it.
+          # This saves us having to clean up here, which can be quite slow (~7 minutes for a g4)
+
+      - name: Check that aws credentials are set
+        # Credentials come from an IAM profile on the runner instance
+        run: |
+          aws sts get-caller-identity
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Install Pulumi
+        run: |
+          curl -fsSL https://get.pulumi.com | sh
+          export HOME=$(eval echo ~$(whoami))
+          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Make sure uv is working
+        run: |
+          uv --version
+          uv sync
+          uv run python --version
+
+      - name: Check that pulumi is installed and authenticated
+        run: |
+          uv run pulumi whoami
+
+      - name: Prepare pulumi stack
+        run: |
+          uv run pulumi stack init ${STACK_NAME}
+          uv run pulumi config
+
+      - name: Pick which commit we will test
+        run: |
+          echo "This is a bit subtle."
+          echo "We can't just test on 'main' for fairly obvious reasons - we"
+          echo "want to test the code in this PR's branch. The current commit"
+          echo "right here is ${GITHUB_SHA}, which is likely a merge commit."
+          echo "Merge commits are challenging. They are what would happen if"
+          echo "this PR were to be merged into its base branch. But they are"
+          echo "ephemeral things and not available in the public repo. So the"
+          echo "EEUT can't just check them out. Making them available to the"
+          echo "EEUT would require pushing them and polluting the repo. So,"
+          echo "for now, we are going to use the PR's head ref"
+          echo "${{ github.event.pull_request.head.ref }}, which is the commit"
+          echo "that was used to create the PR. Recognizing that this doesn't"
+          echo "reflect what will happen after merge. But it's simpler."
+
+          # TODO: test on the merge commit by pushing it to the repo with a temporary
+          # branch, and then clean up the branch later.
+
+          COMMIT_TO_TEST=${{ github.event.pull_request.head.ref }}
+          uv run pulumi config set ee-cicd:targetCommit ${COMMIT_TO_TEST}
+
+      - name: Create the EEUT instance
+        run: |
+          uv run pulumi up --yes 
+
+      - name: Check that EE install succeeded
+        run: |
+          uv run fab connect --patience=150
+          uv run fab wait-for-ee-setup
+
+      - name: Wait for K8 to load everything
+        run: |
+          uv run fab check-k8-deployments
+          uv run fab check-server-port
+
+      - name: Use groundlight sdk through EE
+        run: |
+          EEUT_IP=$(uv run pulumi stack output eeut_private_ip)
+          export GROUNDLIGHT_ENDPOINT=http://${EEUT_IP}:30101
+          uv run groundlight whoami
+          uv run groundlight list-detectors
+
+      - name: Thank the worker and shut down
+        if: always()
+        run: |
+          echo "Strong work, G4! Now go to sleep. The grim sweeper will visit soon."
+          # This saves money and frees up resources
+          uv run fab shutdown-instance
+
   build-push-edge-endpoint-multiplatform:
     if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch' }}
     # We only run this action if all the prior test actions succeed
     needs:
       - test-general-edge-endpoint
       - test-sdk
       - validate-setup-ee
+      - G4-end-to-end
     runs-on: ubuntu-22.04
     steps:
       - name: Configure AWS credentials

diff --git a/.github/workflows/sweeper-eeut.yaml b/.github/workflows/sweeper-eeut.yaml
@@ -0,0 +1,61 @@
+name: sweeper-eeut
+# This workflow tears down old EEUT stacks from pulumi.
+# We do this as a background sweeper job, because the teardown is VERY slow (~7 minutes for a g4)
+# and we don't want to slow down the main pipeline for that.
+on:
+  schedule:
+    - cron: '*/15 * * * *'  # Every 15 minutes
+      # Note cron workflows only run from the main branch.
+  push:
+    branches:
+      # If you're working on this stuff, name your branch e2e-something and this will run.
+      - e2e*
+concurrency:
+  group: sweeper-eeut
+env:
+  PYTHON_VERSION: "3.11"
+
+jobs:
+  destroy-expired-eeut-stacks:
+    #runs-on: ubuntu-22.04  # preferably
+    # Currently running on self-hosted because something is wrong with the AWS perms on the GH runners.
+    runs-on: self-hosted
+    env:
+      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
+    defaults:
+      run:
+        working-directory: cicd/pulumi
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v3
+
+      - name: Set AWS credentials
+        uses: aws-actions/configure-aws-credentials@v2
+        with:
+          aws-region: us-west-2
+          # TODO: move these back to GH-provided secrets
+          # Currently using IAM roles on the self-hosted runner instance.
+          #aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          #aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          #aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Install Pulumi
+        run: |
+          curl -fsSL https://get.pulumi.com | sh
+          export HOME=$(eval echo ~$(whoami))
+          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH
+
+      - name: Check that pulumi is installed and authenticated
+        run: |
+          set -ex
+          pulumi whoami
+
+      - name: Destroy old EEUT stacks
+        working-directory: cicd/pulumi
+        run: |
+          ./sweep-destroy-eeut-stacks.sh
diff --git a/cicd/bin/install-on-ubuntu.sh b/cicd/bin/install-on-ubuntu.sh
@@ -0,0 +1,102 @@
+#! /bin/bash
+# This script is intended to run on a new ubuntu instance to set it up 
+# Sets up an edge-endpoint environment.
+# It is tested in the CICD pipeline to install the edge-endpoint on a new 
+# g4dn.xlarge EC2 instance with Ubuntu 22.04LTS.
+
+# As a user-data script on ubuntu, this file probably lands at
+# /var/lib/cloud/instance/user-data.txt
+echo "Setting up Groundlight Edge Endpoint.  Follow along at /var/log/cloud-init-output.log" > /etc/motd
+
+echo "Starting cloud init.  Uptime: $(uptime)"
+
+# Set up signals about the status of the installation
+mkdir -p /opt/groundlight/ee-install-status
+touch /opt/groundlight/ee-install-status/installing
+SETUP_COMPLETE=0
+record_result() {
+    if [ "$SETUP_COMPLETE" -eq 0 ]; then
+        echo "Setup failed at $(date)"
+        touch /opt/groundlight/ee-install-status/failed
+        echo "Groundlight Edge Endpoint setup FAILED.  See /var/log/cloud-init-output.log for details." > /etc/motd
+    else
+        echo "Setup complete at $(date)"
+        echo "Groundlight Edge Endpoint setup complete.  See /var/log/cloud-init-output.log for details." > /etc/motd
+        touch /opt/groundlight/ee-install-status/success
+    fi
+    # Remove "installing" at the end to avoid a race where there is no status
+    rm -f /opt/groundlight/ee-install-status/installing
+}
+trap record_result EXIT
+
+set -e  # Exit on error of any command.
+
+wait_for_apt_lock() {
+    # We wait for any apt or dpkg processes to finish to avoid lock collisions
+    # Unattended-upgrades can hold the lock and cause the install to fail
+    while sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
+        echo "Another apt/dpkg process is running. Waiting for it to finish..."
+        sleep 5
+    done
+}
+
+# Install basic tools
+wait_for_apt_lock
+sudo apt update
+wait_for_apt_lock
+sudo apt install -y \
+    git \
+    vim \
+    tmux \
+    htop \
+    curl \
+    wget \
+    tree \
+    bash-completion \
+    ffmpeg
+
+# Download the edge-endpoint code
+CODE_BASE=/opt/groundlight/src/
+mkdir -p ${CODE_BASE}
+cd ${CODE_BASE}
+git clone https://github.com/groundlight/edge-endpoint
+cd edge-endpoint/
+# The launching script should update this to a specific commit.
+SPECIFIC_COMMIT="__EE_COMMIT_HASH__"
+if [ -n "$SPECIFIC_COMMIT" ]; then
+    # See if the string got substituted.  Note can't compare to the whole thing
+    # because that would be substituted too!
+    if [ "${SPECIFIC_COMMIT:0:10}" != "__EE_COMMIT" ]; then
+        echo "Checking out commit ${SPECIFIC_COMMIT}"
+        git checkout ${SPECIFIC_COMMIT}
+    else
+        echo "It appears the commit hash was not substituted.  Staying on main."
+    fi
+else
+    echo "A blank commit hash was provided.  Staying on main."
+fi
+
+# Set up k3s with GPU support
+./deploy/bin/install-k3s-nvidia.sh
+
+# Set up some shell niceties
+TARGET_USER="ubuntu"
+echo "alias k='kubectl'" >> /home/${TARGET_USER}/.bashrc
+echo "source <(kubectl completion bash)" >> /home/${TARGET_USER}/.bashrc
+echo "complete -F __start_kubectl k" >> /home/${TARGET_USER}/.bashrc
+echo "set -o vi" >> /home/${TARGET_USER}/.bashrc
+
+# Configure the edge-endpoint with environment variables
+export DEPLOYMENT_NAMESPACE="gl-edge"
+export INFERENCE_FLAVOR="GPU"
+export GROUNDLIGHT_API_TOKEN="api_token_not_set"
+
+# Install the edge-endpoint
+kubectl create namespace gl-edge
+kubectl config set-context edge --namespace=gl-edge --cluster=default --user=default
+kubectl config use-context edge
+./deploy/bin/setup-ee.sh
+
+# Indicate that setup is complete
+SETUP_COMPLETE=1
+echo "EE is installed into kubernetes, which will attempt to finish the setup."
diff --git a/cicd/pulumi/.envrc b/cicd/pulumi/.envrc
@@ -0,0 +1,3 @@
+echo "This is a uv project.  Remember to 'uv run ...' everything"
+uv sync
+
diff --git a/cicd/pulumi/.gitignore b/cicd/pulumi/.gitignore
@@ -0,0 +1,5 @@
+
+*.pyc
+venv/
+.venv/
+__pycache__/
diff --git a/cicd/pulumi/Pulumi.yaml b/cicd/pulumi/Pulumi.yaml
@@ -0,0 +1,11 @@
+name: ee-cicd
+runtime:
+  name: python
+  options:
+    toolchain: uv
+description: CI/CD for Edge Endpoint
+config:
+  ee-cicd:instanceType: g4dn.xlarge
+  # Default to "main" so things are sensible if this doesn't get customized.
+  # But for testing purposes, this should be set to the specific commit you want to test.
+  ee-cicd:targetCommit: main
diff --git a/cicd/pulumi/README.md b/cicd/pulumi/README.md
@@ -0,0 +1,5 @@
+# Pulumi automation
+
+Pulumi automation to build an EE from scratch in EC2 and run basic integration tests.
+
+
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		echo "This is a uv project. Remember to 'uv run ...' everything"
		uv sync
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,5 @@
		# Pulumi automation

		Pulumi automation to build an EE from scratch in EC2 and run basic integration tests.