CICD: Runs a full GPU install on an EC2 instance #157

Status: Open. Wants to merge 126 commits into base: main.
c9cc231
first crack at pulumi automation for cicd
robotrapta Jan 8, 2025
3c4a81c
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 9, 2025
6fc83cc
Adding e2e test in the main pipeline yaml.
robotrapta Jan 9, 2025
7688c2b
Merge branch 'main' into e2e-cicd
robotrapta Jan 15, 2025
ff802b1
Fixing pulumi typo
robotrapta Jan 15, 2025
f5c493e
moving test-install-g4 onto self-hosted runnner
robotrapta Jan 16, 2025
d18739b
sets default dir
robotrapta Jan 16, 2025
dece567
Commenting out pulumi up
robotrapta Jan 16, 2025
4301572
Changing triggers on main pipeline to only include PR's not every push.
robotrapta Jan 16, 2025
cda512b
Removing redundant runs-on
robotrapta Jan 16, 2025
344aa8d
Adding check on workflow formatting.
robotrapta Jan 16, 2025
8e24ea2
Adding yamllint config
robotrapta Jan 16, 2025
daed2f4
Iterating on yamllint rules.
robotrapta Jan 16, 2025
34f3d29
YAMLlint should be working now.
robotrapta Jan 16, 2025
d0cecdb
Tweaking yamllint. Fixing deliberate failure.
robotrapta Jan 16, 2025
68cbc0e
Working on self-hosted runner check.
robotrapta Jan 16, 2025
7aa119b
Check for this specific PR while developing.
robotrapta Jan 16, 2025
59bf1b6
Fixing path on pulumi
robotrapta Jan 16, 2025
2833ae1
faster iteration
robotrapta Jan 16, 2025
c2b6b8f
Tweaking pulumi auth & install.
robotrapta Jan 16, 2025
bdf03e9
fixing GHA yaml
robotrapta Jan 16, 2025
dcf7bd7
Trying to get pulumi on the path.
robotrapta Jan 16, 2025
cdb1c97
Trying again to set pulumi in the path.
robotrapta Jan 16, 2025
04c1f9e
path path path
robotrapta Jan 16, 2025
39514bb
Switching pulumi to use uv
robotrapta Jan 16, 2025
fbaa838
Iterating on installing uv
robotrapta Jan 17, 2025
dfbec55
iterating.
robotrapta Jan 17, 2025
1d5a5c5
tweak
robotrapta Jan 17, 2025
a7ba440
uv
robotrapta Jan 17, 2025
005a5db
installing python
robotrapta Jan 17, 2025
d7e376f
installing pulumi
robotrapta Jan 17, 2025
5d44aa2
switching to frigging pip
robotrapta Jan 17, 2025
d841730
Cleaning out useless uv stuff.
robotrapta Jan 17, 2025
e0a4bfa
Getting the names right of the network resources.
robotrapta Jan 17, 2025
aa69069
name tag, not name.
robotrapta Jan 17, 2025
2e964ec
Find the firstrun script.
robotrapta Jan 17, 2025
f2e2444
Actually stand up the stack!!!
robotrapta Jan 17, 2025
fce0874
Adding some automated reporting on setup success/failure.
robotrapta Jan 17, 2025
a6ebbad
Using smaller (non-gpu) instance type - maybe faster?
robotrapta Jan 17, 2025
80d09d8
Adding first crack at fabric commands to verify if EEUT is working.
robotrapta Jan 17, 2025
6009351
Adding fab tests, which can't possibly pass yet.
robotrapta Jan 17, 2025
6c2090a
actually gets the private ip of the eeut
Jan 17, 2025
24e90cb
Fab can connect to EEUT
Jan 17, 2025
c2e1a32
Adding a script to connect to eeut.
Jan 17, 2025
b185ba9
rename
Jan 17, 2025
b84a265
Activate fab!
robotrapta Jan 17, 2025
5e8071b
Make fab more patient to connect over ssh
robotrapta Jan 17, 2025
a5a29d1
Disabling ipv6 in EEUT. Fixing fab call for ee-setup check
robotrapta Jan 17, 2025
edfdc1a
More patience waiting for init script to run.
robotrapta Jan 17, 2025
f2f4242
Tweaking EEUT install tests.
robotrapta Jan 17, 2025
0a0fcc7
Give the EEUT a public IP.
robotrapta Jan 18, 2025
fb2b35c
yamllint is not a workflow.
robotrapta Jan 18, 2025
9c08a38
Switching to g4 for test.
robotrapta Jan 18, 2025
f4c355d
Comment on script.
robotrapta Jan 18, 2025
2eaed9e
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 18, 2025
f35cded
Adding workflow to validate workflow yamls.
robotrapta Jan 18, 2025
32dff5f
Taking out the TODO's in the workflows pipeline.
robotrapta Jan 18, 2025
2007c53
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 18, 2025
a4cfb5c
Delays deleting stacks until sweeper runs, to speed up the pipeline.
robotrapta Jan 18, 2025
380ff5d
Tweaking GHA rules.
robotrapta Jan 19, 2025
471d498
yaml lint
robotrapta Jan 19, 2025
e8502b7
Improving the workflow validation to catch semantic errors.
robotrapta Jan 19, 2025
09bfde7
FIxing sweeper-eeut gha yaml
robotrapta Jan 19, 2025
56162e5
Merge remote-tracking branch 'origin/main' into validate-workflow-yamls
robotrapta Jan 19, 2025
cb0b882
Improving the workflow validation to catch semantic errors.
robotrapta Jan 19, 2025
2fc60a9
Fixing comment.
robotrapta Jan 19, 2025
38c4c5f
Runs actionlint twice - once for errors, again for warnings.
robotrapta Jan 19, 2025
0c6861c
Ignoring shellcheck warnings.
robotrapta Jan 19, 2025
5981d7c
Merge branch 'validate-workflow-yamls' into e2e-cicd
robotrapta Jan 19, 2025
d2f1a68
Setting aws region.
robotrapta Jan 19, 2025
1f788a6
Setting wd
robotrapta Jan 19, 2025
4972363
Correct filename
robotrapta Jan 19, 2025
e83f546
Cleanup output on sweep-destroy.
robotrapta Jan 19, 2025
35615ae
Using instance profile with rights to pull from ECR
robotrapta Jan 19, 2025
eacb689
Serious crack at checking k8
robotrapta Jan 19, 2025
911fab9
Finds the instance profile properly
Jan 19, 2025
c90f9ec
Decent looking k8 test.
Jan 19, 2025
c4029c6
Runs the e2e test on all PRs
Jan 19, 2025
02eb995
Runs the check k8 deployment test e2e
Jan 19, 2025
80425a2
Refactoring some checking and expiration code.
Jan 19, 2025
c2c44a8
Further refactoring.
Jan 19, 2025
a33ea3f
Adding a server-port check.
Jan 19, 2025
c393bbb
Using serverport check
Jan 19, 2025
9143ae2
(Barely) functional SDK test
Jan 19, 2025
18a86ed
More disk!
Jan 19, 2025
ba77382
Adding full-check.
Jan 19, 2025
89e4868
Fixup pipeline dependency naming miss.
Jan 19, 2025
88a2a11
Basic OO fail
Jan 19, 2025
32c020d
Avoid collision with unattended-upgrade
Jan 19, 2025
c3df51f
Reordering things.
robotrapta Jan 19, 2025
4946f2d
Longer timeout for GPU to come online. Also installing into /opt/gro…
robotrapta Jan 20, 2025
ebca6c9
bugfix on expiring the stack
robotrapta Jan 20, 2025
4dc4656
Don't rename the stack. Don't `rm` the stack because it's not workin…
robotrapta Jan 20, 2025
d67e1eb
Always terminate g4 at the end.
robotrapta Jan 20, 2025
72e5ac3
Forgot to activate venv
robotrapta Jan 20, 2025
d6d8a17
typo in fab
robotrapta Jan 20, 2025
060b413
Switching to uv for faster pipelines.
robotrapta Jan 20, 2025
eeac387
worfklow syntax error.
robotrapta Jan 20, 2025
2039b38
Tweaking uv setup
robotrapta Jan 20, 2025
338bf54
activating uv's venv
robotrapta Jan 20, 2025
dd713e9
syntax error in uv cache.
robotrapta Jan 20, 2025
09fde70
losing uv venv
robotrapta Jan 20, 2025
cccc3fe
Explicitly installing pulumi again.
robotrapta Jan 20, 2025
d0de5e0
Taking out comments in pipeline.
robotrapta Jan 20, 2025
2588af2
Adding uv sync.
robotrapta Jan 20, 2025
797b38e
Swallows error shutting down instance.
robotrapta Jan 20, 2025
096faad
Makes sure the EEUT uses the code in our current branch - Derp!
robotrapta Jan 20, 2025
5fc91bf
forgot import - tired.
robotrapta Jan 20, 2025
a9cdd0a
WOrking around pulumi stupid
robotrapta Jan 20, 2025
8439e93
tweak
robotrapta Jan 20, 2025
80bbd57
robustificating again.
robotrapta Jan 20, 2025
8998ef4
Trying again to load the correct code.
robotrapta Jan 20, 2025
bd4d841
ANother attempt to set the proper code into the test envirohnment.
robotrapta Jan 20, 2025
28ec988
Simpler
robotrapta Jan 20, 2025
1d91a88
Moving sweeper to self-hosted runners.
robotrapta Jan 20, 2025
d9ae8ef
Trying to understand commit hashes
robotrapta Jan 20, 2025
375b903
USing self-hosted runner aws creds
robotrapta Jan 20, 2025
c7148cc
iterating debugging
robotrapta Jan 20, 2025
f625676
trying more
robotrapta Jan 20, 2025
f7c1d0e
AVoiding merge commit for test.
robotrapta Jan 20, 2025
4867cbf
Taking out the debugging job.
robotrapta Jan 20, 2025
4aee362
minor comments
robotrapta Jan 20, 2025
78f918f
upping GPU ready timeout to 10 minutes
robotrapta Jan 20, 2025
e086943
Deliberately broken YAML for edge deployment.
robotrapta Jan 20, 2025
7c59b30
fixing deliberately broken YAML
robotrapta Jan 20, 2025
8a1adaa
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 24, 2025
1 change: 1 addition & 0 deletions .github/.yamllint.yaml
@@ -12,3 +12,4 @@ rules:
  comments: disable
  trailing-spaces: disable
  empty-lines: disable
  new-line-at-end-of-file: disable
5 changes: 5 additions & 0 deletions .github/actionlint.yaml
@@ -21,3 +21,8 @@ paths:
- '.*was deprecated.*'
- '.*shellcheck.*:warning:.*'
- '.*shellcheck.*:info:.*'

# The security warning of head.ref being dangerous is painfully stupid.
# It's worried that the branch name string could be malicious. (Never mind that
# an attacker generating PRs can much more easily just execute malicious code.)
- '.*github.event.pull_request.head.ref.*is potentially untrusted.*'
117 changes: 117 additions & 0 deletions .github/workflows/pipeline.yaml
@@ -252,13 +252,130 @@ jobs:
        if: always()
        run: docker stop ${{ steps.start_container.outputs.container_id }}

  G4-end-to-end:
    # Note this job can run multiple times in parallel because the stack name is unique
    # for the run. How much we want to do this is TBD.
    runs-on: self-hosted

    # Run this on any PR.
    # Question: Should we wait until the other tests pass before running this?
    #needs:
    #  - validate-setup-ee
    #  - test-with-k3s
    #  - test-sdk

    env:
      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
      PYTHONUNBUFFERED: 1
    defaults:
      run:
        working-directory: cicd/pulumi
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Name the stack
        run: |
          # Set to expire in 60 minutes
          EXPIRATION_TIME=$(($(date +%s) + 60 * 60))
          STACK_NAME=ee-cicd-${{ github.run_id }}-expires-${EXPIRATION_TIME}
          echo "STACK_NAME=${STACK_NAME}" | tee -a $GITHUB_ENV
          # We give the stack a name including its expiration time so that the sweeper
          # (in sweeper-eeut.yaml) knows when to get rid of it.
          # This saves us having to clean up here, which can be quite slow (~7 minutes for a g4)

      - name: Check that aws credentials are set
        # Credentials come from an IAM profile on the runner instance
        run: |
          aws sts get-caller-identity

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install Pulumi
        run: |
          curl -fsSL https://get.pulumi.com | sh
          export HOME=$(eval echo ~$(whoami))
          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Make sure uv is working
        run: |
          uv --version
          uv sync
          uv run python --version

      - name: Check that pulumi is installed and authenticated
        run: |
          uv run pulumi whoami

      - name: Prepare pulumi stack
        run: |
          uv run pulumi stack init ${STACK_NAME}
          uv run pulumi config

      - name: Pick which commit we will test
        run: |
          echo "This is a bit subtle."
          echo "We can't just test on 'main' for fairly obvious reasons - we"
          echo "want to test the code in this PR's branch. The current commit"
          echo "right here is ${GITHUB_SHA}, which is likely a merge commit."
          echo "Merge commits are challenging. They are what would happen if"
          echo "this PR were to be merged into its base branch. But they are"
          echo "ephemeral things and not available in the public repo. So the"
          echo "EEUT can't just check them out. Making them available to the"
          echo "EEUT would require pushing them and polluting the repo. So,"
          echo "for now, we are going to use the PR's head ref"
          echo "${{ github.event.pull_request.head.ref }}, which is the branch"
          echo "the PR was opened from. We recognize that this doesn't"
          echo "reflect what will happen after merge, but it's simpler."

          # TODO: test on the merge commit by pushing it to the repo with a temporary
          # branch, and then clean up the branch later.

          COMMIT_TO_TEST=${{ github.event.pull_request.head.ref }}
          uv run pulumi config set ee-cicd:targetCommit ${COMMIT_TO_TEST}

      - name: Create the EEUT instance
        run: |
          uv run pulumi up --yes

      - name: Check that EE install succeeded
        run: |
          uv run fab connect --patience=150
          uv run fab wait-for-ee-setup

      - name: Wait for K8 to load everything
        run: |
          uv run fab check-k8-deployments
          uv run fab check-server-port

      - name: Use groundlight sdk through EE
        run: |
          EEUT_IP=$(uv run pulumi stack output eeut_private_ip)
          export GROUNDLIGHT_ENDPOINT=http://${EEUT_IP}:30101
          uv run groundlight whoami
          uv run groundlight list-detectors

      - name: Thank the worker and shut down
        if: always()
        run: |
          echo "Strong work, G4! Now go to sleep. The grim sweeper will visit soon."
          # This saves money and frees up resources
          uv run fab shutdown-instance

  build-push-edge-endpoint-multiplatform:
    if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch' }}
    # We only run this action if all the prior test actions succeed
    needs:
      - test-general-edge-endpoint
      - test-sdk
      - validate-setup-ee
      - G4-end-to-end
    runs-on: ubuntu-22.04
    steps:
      - name: Configure AWS credentials
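The "Name the stack" step above encodes an expiration time in the stack name so that a separate sweeper can later tell when a stack is stale. As a rough illustration of that naming scheme only, here is a minimal Python sketch. The workflow itself does this in bash; the function name `make_stack_name` and its parameters are hypothetical and not part of this PR.

```python
import time
from typing import Optional


def make_stack_name(run_id: str, ttl_seconds: int = 60 * 60,
                    now: Optional[float] = None) -> str:
    """Build a stack name that embeds its own expiration as a Unix timestamp.

    Mirrors the shape used in the workflow:
    ee-cicd-<run id>-expires-<unix timestamp>
    (hypothetical helper, for illustration only)
    """
    if now is None:
        now = time.time()
    expires_at = int(now) + ttl_seconds
    return f"ee-cicd-{run_id}-expires-{expires_at}"


# With a fixed clock, a 60-minute TTL adds 3600 seconds to the timestamp.
print(make_stack_name("12345", now=1_700_000_000))
# prints: ee-cicd-12345-expires-1700003600
```

Putting the deadline in the name means no extra state store is needed: anything that can list stack names can decide what to delete.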
61 changes: 61 additions & 0 deletions .github/workflows/sweeper-eeut.yaml
@@ -0,0 +1,61 @@
name: sweeper-eeut
# This workflow tears down old EEUT stacks from pulumi.
# We do this as a background sweeper job, because the teardown is VERY slow (~7 minutes for a g4)
# and we don't want to slow down the main pipeline for that.
on:
  schedule:
    - cron: '*/15 * * * *'  # Every 15 minutes
    # Note cron workflows only run from the main branch.
  push:
    branches:
      # If you're working on this stuff, name your branch e2e-something and this will run.
      - e2e*
concurrency:
  group: sweeper-eeut
env:
  PYTHON_VERSION: "3.11"

jobs:
  destroy-expired-eeut-stacks:
    #runs-on: ubuntu-22.04  # preferably
    # Currently running on self-hosted because something is wrong with the AWS perms on the GH runners.
    runs-on: self-hosted
    env:
      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
    defaults:
      run:
        working-directory: cicd/pulumi
    steps:
      - name: Check out code
        uses: actions/checkout@v3

      - name: Set AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-region: us-west-2
          # TODO: move these back to GH-provided secrets
          # Currently using IAM roles on the self-hosted runner instance.
          #aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          #aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          #aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install Pulumi
        run: |
          curl -fsSL https://get.pulumi.com | sh
          export HOME=$(eval echo ~$(whoami))
          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH

      - name: Check that pulumi is installed and authenticated
        run: |
          set -ex
          pulumi whoami

      - name: Destroy old EEUT stacks
        working-directory: cicd/pulumi
        run: |
          ./sweep-destroy-eeut-stacks.sh
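The sweeper script `sweep-destroy-eeut-stacks.sh` is not shown in this diff, so the following is only a sketch of the idea. It assumes stack names end in `-expires-<unix timestamp>`, as produced by the "Name the stack" step in pipeline.yaml. The names `expiration_of` and `expired_stacks` are hypothetical, and the real script may work differently.

```python
import re
import time
from typing import Iterable, List, Optional

# Matches the trailing "-expires-<digits>" suffix of a stack name.
_EXPIRES_RE = re.compile(r"-expires-(\d+)$")


def expiration_of(stack_name: str) -> Optional[int]:
    """Return the Unix timestamp encoded in the name, or None if absent."""
    match = _EXPIRES_RE.search(stack_name)
    return int(match.group(1)) if match else None


def expired_stacks(stack_names: Iterable[str],
                   now: Optional[float] = None) -> List[str]:
    """Select the stacks whose encoded expiration time has passed.

    Names without an expiration suffix are never selected, so stacks
    created by hand are left alone.
    """
    if now is None:
        now = time.time()
    selected = []
    for name in stack_names:
        expires_at = expiration_of(name)
        if expires_at is not None and expires_at <= now:
            selected.append(name)
    return selected


names = ["ee-cicd-1-expires-100", "ee-cicd-2-expires-300", "dev"]
print(expired_stacks(names, now=200))
# prints: ['ee-cicd-1-expires-100']
```

Each selected name would then be handed to whatever teardown command the sweeper uses; that part is deliberately left out here.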
102 changes: 102 additions & 0 deletions cicd/bin/install-on-ubuntu.sh
@@ -0,0 +1,102 @@
#! /bin/bash
# This script is intended to run on a new Ubuntu instance.
# It sets up an edge-endpoint environment.
# It is tested in the CICD pipeline to install the edge-endpoint on a new
# g4dn.xlarge EC2 instance with Ubuntu 22.04LTS.

# As a user-data script on ubuntu, this file probably lands at
# /var/lib/cloud/instance/user-data.txt
echo "Setting up Groundlight Edge Endpoint. Follow along at /var/log/cloud-init-output.log" > /etc/motd

echo "Starting cloud init. Uptime: $(uptime)"

# Set up signals about the status of the installation
mkdir -p /opt/groundlight/ee-install-status
touch /opt/groundlight/ee-install-status/installing
SETUP_COMPLETE=0
record_result() {
    if [ "$SETUP_COMPLETE" -eq 0 ]; then
        echo "Setup failed at $(date)"
        touch /opt/groundlight/ee-install-status/failed
        echo "Groundlight Edge Endpoint setup FAILED. See /var/log/cloud-init-output.log for details." > /etc/motd
    else
        echo "Setup complete at $(date)"
        echo "Groundlight Edge Endpoint setup complete. See /var/log/cloud-init-output.log for details." > /etc/motd
        touch /opt/groundlight/ee-install-status/success
    fi
    # Remove "installing" at the end to avoid a race where there is no status
    rm -f /opt/groundlight/ee-install-status/installing
}
trap record_result EXIT

set -e # Exit on error of any command.

wait_for_apt_lock() {
    # We wait for any apt or dpkg processes to finish to avoid lock collisions
    # Unattended-upgrades can hold the lock and cause the install to fail
    while sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
        echo "Another apt/dpkg process is running. Waiting for it to finish..."
        sleep 5
    done
}

# Install basic tools
wait_for_apt_lock
sudo apt update
wait_for_apt_lock
sudo apt install -y \
    git \
    vim \
    tmux \
    htop \
    curl \
    wget \
    tree \
    bash-completion \
    ffmpeg

# Download the edge-endpoint code
CODE_BASE=/opt/groundlight/src/
mkdir -p ${CODE_BASE}
cd ${CODE_BASE}
git clone https://github.com/groundlight/edge-endpoint
cd edge-endpoint/
# The launching script should update this to a specific commit.
SPECIFIC_COMMIT="__EE_COMMIT_HASH__"
if [ -n "$SPECIFIC_COMMIT" ]; then
    # See if the string got substituted. Note can't compare to the whole thing
    # because that would be substituted too! The prefix we compare against is
    # 11 characters long, so we take the first 11 characters.
    if [ "${SPECIFIC_COMMIT:0:11}" != "__EE_COMMIT" ]; then
        echo "Checking out commit ${SPECIFIC_COMMIT}"
        git checkout "${SPECIFIC_COMMIT}"
    else
        echo "It appears the commit hash was not substituted. Staying on main."
    fi
else
    echo "A blank commit hash was provided. Staying on main."
fi

# Set up k3s with GPU support
./deploy/bin/install-k3s-nvidia.sh

# Set up some shell niceties
TARGET_USER="ubuntu"
echo "alias k='kubectl'" >> /home/${TARGET_USER}/.bashrc
echo "source <(kubectl completion bash)" >> /home/${TARGET_USER}/.bashrc
echo "complete -F __start_kubectl k" >> /home/${TARGET_USER}/.bashrc
echo "set -o vi" >> /home/${TARGET_USER}/.bashrc

# Configure the edge-endpoint with environment variables
export DEPLOYMENT_NAMESPACE="gl-edge"
export INFERENCE_FLAVOR="GPU"
export GROUNDLIGHT_API_TOKEN="api_token_not_set"

# Install the edge-endpoint
kubectl create namespace gl-edge
kubectl config set-context edge --namespace=gl-edge --cluster=default --user=default
kubectl config use-context edge
./deploy/bin/setup-ee.sh

# Indicate that setup is complete
SETUP_COMPLETE=1
echo "EE is installed into kubernetes, which will attempt to finish the setup."
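The install script reports its progress through marker files in `/opt/groundlight/ee-install-status` (`installing`, then `success` or `failed`). How the `fab wait-for-ee-setup` task reads them is not shown in this diff. As a sketch of one way to interpret the markers, assuming the check runs on the instance itself, something like the following would work. The names `read_install_status` and `wait_for_install` are hypothetical.

```python
import time
from pathlib import Path

# Directory the install script writes its marker files into.
STATUS_DIR = Path("/opt/groundlight/ee-install-status")


def read_install_status(status_dir: Path = STATUS_DIR) -> str:
    """Return 'success', 'failed', 'installing', or 'unknown'.

    Terminal markers are checked first, because the script creates
    'success' or 'failed' before it removes 'installing', so both can
    exist for a moment.
    """
    if (status_dir / "success").exists():
        return "success"
    if (status_dir / "failed").exists():
        return "failed"
    if (status_dir / "installing").exists():
        return "installing"
    return "unknown"


def wait_for_install(status_dir: Path = STATUS_DIR,
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> str:
    """Poll until the install reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = read_install_status(status_dir)
        if status in ("success", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"install still '{status}' after {timeout_s} seconds")
        time.sleep(poll_s)
```

Treating `unknown` as "keep waiting" covers the window before cloud-init has started the script at all.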
3 changes: 3 additions & 0 deletions cicd/pulumi/.envrc
@@ -0,0 +1,3 @@
echo "This is a uv project. Remember to 'uv run ...' everything"
Review comment (Member): Oooohhh

uv sync

5 changes: 5 additions & 0 deletions cicd/pulumi/.gitignore
@@ -0,0 +1,5 @@

*.pyc
venv/
.venv/
__pycache__/
11 changes: 11 additions & 0 deletions cicd/pulumi/Pulumi.yaml
@@ -0,0 +1,11 @@
name: ee-cicd
runtime:
  name: python
  options:
    toolchain: uv
description: CI/CD for Edge Endpoint
config:
  ee-cicd:instanceType: g4dn.xlarge
  # Default to "main" so things are sensible if this doesn't get customized.
  # But for testing purposes, this should be set to the specific commit you want to test.
  ee-cicd:targetCommit: main
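The `ee-cicd:targetCommit` value has to reach the instance somehow, and `install-on-ubuntu.sh` carries a `__EE_COMMIT_HASH__` placeholder that "the launching script should update". The Pulumi program that does this is not part of the diff shown here, so the following is only a guess at the substitution step, written as plain Python with no Pulumi calls. `render_user_data` is a hypothetical name.

```python
# Placeholder string that appears in cicd/bin/install-on-ubuntu.sh.
PLACEHOLDER = "__EE_COMMIT_HASH__"


def render_user_data(template: str, target_commit: str) -> str:
    """Substitute the target commit (or branch name) into the install script.

    Raises ValueError if the placeholder is missing, so a renamed
    placeholder fails loudly instead of silently testing 'main'.
    """
    if PLACEHOLDER not in template:
        raise ValueError(f"placeholder {PLACEHOLDER} not found in template")
    return template.replace(PLACEHOLDER, target_commit)


script = 'SPECIFIC_COMMIT="__EE_COMMIT_HASH__"\n'
print(render_user_data(script, "abc123"), end="")
# prints: SPECIFIC_COMMIT="abc123"
```

The rendered text would then be passed as the instance's user data. Note that this replaces every occurrence of the full placeholder, which is why the install script compares against only a prefix of it.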
5 changes: 5 additions & 0 deletions cicd/pulumi/README.md
@@ -0,0 +1,5 @@
# Pulumi automation

Pulumi automation to build an EE from scratch in EC2 and run basic integration tests.

