CICD: Runs a full GPU install on an EC2 instance #157

Open: wants to merge 126 commits into base: main
Changes from 125 commits
Commits
c9cc231
first crack at pulumi automation for cicd
robotrapta Jan 8, 2025
3c4a81c
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 9, 2025
6fc83cc
Adding e2e test in the main pipeline yaml.
robotrapta Jan 9, 2025
7688c2b
Merge branch 'main' into e2e-cicd
robotrapta Jan 15, 2025
ff802b1
Fixing pulumi typo
robotrapta Jan 15, 2025
f5c493e
moving test-install-g4 onto self-hosted runner
robotrapta Jan 16, 2025
d18739b
sets default dir
robotrapta Jan 16, 2025
dece567
Commenting out pulumi up
robotrapta Jan 16, 2025
4301572
Changing triggers on main pipeline to only include PR's not every push.
robotrapta Jan 16, 2025
cda512b
Removing redundant runs-on
robotrapta Jan 16, 2025
344aa8d
Adding check on workflow formatting.
robotrapta Jan 16, 2025
8e24ea2
Adding yamllint config
robotrapta Jan 16, 2025
daed2f4
Iterating on yamllint rules.
robotrapta Jan 16, 2025
34f3d29
YAMLlint should be working now.
robotrapta Jan 16, 2025
d0cecdb
Tweaking yamllint. Fixing deliberate failure.
robotrapta Jan 16, 2025
68cbc0e
Working on self-hosted runner check.
robotrapta Jan 16, 2025
7aa119b
Check for this specific PR while developing.
robotrapta Jan 16, 2025
59bf1b6
Fixing path on pulumi
robotrapta Jan 16, 2025
2833ae1
faster iteration
robotrapta Jan 16, 2025
c2b6b8f
Tweaking pulumi auth & install.
robotrapta Jan 16, 2025
bdf03e9
fixing GHA yaml
robotrapta Jan 16, 2025
dcf7bd7
Trying to get pulumi on the path.
robotrapta Jan 16, 2025
cdb1c97
Trying again to set pulumi in the path.
robotrapta Jan 16, 2025
04c1f9e
path path path
robotrapta Jan 16, 2025
39514bb
Switching pulumi to use uv
robotrapta Jan 16, 2025
fbaa838
Iterating on installing uv
robotrapta Jan 17, 2025
dfbec55
iterating.
robotrapta Jan 17, 2025
1d5a5c5
tweak
robotrapta Jan 17, 2025
a7ba440
uv
robotrapta Jan 17, 2025
005a5db
installing python
robotrapta Jan 17, 2025
d7e376f
installing pulumi
robotrapta Jan 17, 2025
5d44aa2
switching to frigging pip
robotrapta Jan 17, 2025
d841730
Cleaning out useless uv stuff.
robotrapta Jan 17, 2025
e0a4bfa
Getting the names right of the network resources.
robotrapta Jan 17, 2025
aa69069
name tag, not name.
robotrapta Jan 17, 2025
2e964ec
Find the firstrun script.
robotrapta Jan 17, 2025
f2e2444
Actually stand up the stack!!!
robotrapta Jan 17, 2025
fce0874
Adding some automated reporting on setup success/failure.
robotrapta Jan 17, 2025
a6ebbad
Using smaller (non-gpu) instance type - maybe faster?
robotrapta Jan 17, 2025
80d09d8
Adding first crack at fabric commands to verify if EEUT is working.
robotrapta Jan 17, 2025
6009351
Adding fab tests, which can't possibly pass yet.
robotrapta Jan 17, 2025
6c2090a
actually gets the private ip of the eeut
Jan 17, 2025
24e90cb
Fab can connect to EEUT
Jan 17, 2025
c2e1a32
Adding a script to connect to eeut.
Jan 17, 2025
b185ba9
rename
Jan 17, 2025
b84a265
Activate fab!
robotrapta Jan 17, 2025
5e8071b
Make fab more patient to connect over ssh
robotrapta Jan 17, 2025
a5a29d1
Disabling ipv6 in EEUT. Fixing fab call for ee-setup check
robotrapta Jan 17, 2025
edfdc1a
More patience waiting for init script to run.
robotrapta Jan 17, 2025
f2f4242
Tweaking EEUT install tests.
robotrapta Jan 17, 2025
0a0fcc7
Give the EEUT a public IP.
robotrapta Jan 18, 2025
fb2b35c
yamllint is not a workflow.
robotrapta Jan 18, 2025
9c08a38
Switching to g4 for test.
robotrapta Jan 18, 2025
f4c355d
Comment on script.
robotrapta Jan 18, 2025
2eaed9e
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 18, 2025
f35cded
Adding workflow to validate workflow yamls.
robotrapta Jan 18, 2025
32dff5f
Taking out the TODO's in the workflows pipeline.
robotrapta Jan 18, 2025
2007c53
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 18, 2025
a4cfb5c
Delays deleting stacks until sweeper runs, to speed up the pipeline.
robotrapta Jan 18, 2025
380ff5d
Tweaking GHA rules.
robotrapta Jan 19, 2025
471d498
yaml lint
robotrapta Jan 19, 2025
e8502b7
Improving the workflow validation to catch semantic errors.
robotrapta Jan 19, 2025
09bfde7
Fixing sweeper-eeut gha yaml
robotrapta Jan 19, 2025
56162e5
Merge remote-tracking branch 'origin/main' into validate-workflow-yamls
robotrapta Jan 19, 2025
cb0b882
Improving the workflow validation to catch semantic errors.
robotrapta Jan 19, 2025
2fc60a9
Fixing comment.
robotrapta Jan 19, 2025
38c4c5f
Runs actionlint twice - once for errors, again for warnings.
robotrapta Jan 19, 2025
0c6861c
Ignoring shellcheck warnings.
robotrapta Jan 19, 2025
5981d7c
Merge branch 'validate-workflow-yamls' into e2e-cicd
robotrapta Jan 19, 2025
d2f1a68
Setting aws region.
robotrapta Jan 19, 2025
1f788a6
Setting wd
robotrapta Jan 19, 2025
4972363
Correct filename
robotrapta Jan 19, 2025
e83f546
Cleanup output on sweep-destroy.
robotrapta Jan 19, 2025
35615ae
Using instance profile with rights to pull from ECR
robotrapta Jan 19, 2025
eacb689
Serious crack at checking k8
robotrapta Jan 19, 2025
911fab9
Finds the instance profile properly
Jan 19, 2025
c90f9ec
Decent looking k8 test.
Jan 19, 2025
c4029c6
Runs the e2e test on all PRs
Jan 19, 2025
02eb995
Runs the check k8 deployment test e2e
Jan 19, 2025
80425a2
Refactoring some checking and expiration code.
Jan 19, 2025
c2c44a8
Further refactoring.
Jan 19, 2025
a33ea3f
Adding a server-port check.
Jan 19, 2025
c393bbb
Using serverport check
Jan 19, 2025
9143ae2
(Barely) functional SDK test
Jan 19, 2025
18a86ed
More disk!
Jan 19, 2025
ba77382
Adding full-check.
Jan 19, 2025
89e4868
Fixup pipeline dependency naming miss.
Jan 19, 2025
88a2a11
Basic OO fail
Jan 19, 2025
32c020d
Avoid collision with unattended-upgrade
Jan 19, 2025
c3df51f
Reordering things.
robotrapta Jan 19, 2025
4946f2d
Longer timeout for GPU to come online. Also installing into /opt/gro…
robotrapta Jan 20, 2025
ebca6c9
bugfix on expiring the stack
robotrapta Jan 20, 2025
4dc4656
Don't rename the stack. Don't `rm` the stack because it's not workin…
robotrapta Jan 20, 2025
d67e1eb
Always terminate g4 at the end.
robotrapta Jan 20, 2025
72e5ac3
Forgot to activate venv
robotrapta Jan 20, 2025
d6d8a17
typo in fab
robotrapta Jan 20, 2025
060b413
Switching to uv for faster pipelines.
robotrapta Jan 20, 2025
eeac387
workflow syntax error.
robotrapta Jan 20, 2025
2039b38
Tweaking uv setup
robotrapta Jan 20, 2025
338bf54
activating uv's venv
robotrapta Jan 20, 2025
dd713e9
syntax error in uv cache.
robotrapta Jan 20, 2025
09fde70
losing uv venv
robotrapta Jan 20, 2025
cccc3fe
Explicitly installing pulumi again.
robotrapta Jan 20, 2025
d0de5e0
Taking out comments in pipeline.
robotrapta Jan 20, 2025
2588af2
Adding uv sync.
robotrapta Jan 20, 2025
797b38e
Swallows error shutting down instance.
robotrapta Jan 20, 2025
096faad
Makes sure the EEUT uses the code in our current branch - Derp!
robotrapta Jan 20, 2025
5fc91bf
forgot import - tired.
robotrapta Jan 20, 2025
a9cdd0a
Working around pulumi stupid
robotrapta Jan 20, 2025
8439e93
tweak
robotrapta Jan 20, 2025
80bbd57
robustificating again.
robotrapta Jan 20, 2025
8998ef4
Trying again to load the correct code.
robotrapta Jan 20, 2025
bd4d841
Another attempt to set the proper code into the test environment.
robotrapta Jan 20, 2025
28ec988
Simpler
robotrapta Jan 20, 2025
1d91a88
Moving sweeper to self-hosted runners.
robotrapta Jan 20, 2025
d9ae8ef
Trying to understand commit hashes
robotrapta Jan 20, 2025
375b903
Using self-hosted runner aws creds
robotrapta Jan 20, 2025
c7148cc
iterating debugging
robotrapta Jan 20, 2025
f625676
trying more
robotrapta Jan 20, 2025
f7c1d0e
Avoiding merge commit for test.
robotrapta Jan 20, 2025
4867cbf
Taking out the debugging job.
robotrapta Jan 20, 2025
4aee362
minor comments
robotrapta Jan 20, 2025
78f918f
upping GPU ready timeout to 10 minutes
robotrapta Jan 20, 2025
e086943
Deliberately broken YAML for edge deployment.
robotrapta Jan 20, 2025
7c59b30
fixing deliberately broken YAML
robotrapta Jan 20, 2025
8a1adaa
Merge remote-tracking branch 'origin/main' into e2e-cicd
robotrapta Jan 24, 2025
1 change: 1 addition & 0 deletions .github/.yamllint.yaml
robotrapta marked this conversation as resolved.
@@ -12,3 +12,4 @@ rules:
   comments: disable
   trailing-spaces: disable
   empty-lines: disable
+  new-line-at-end-of-file: disable
28 changes: 28 additions & 0 deletions .github/actionlint.yaml
@@ -0,0 +1,28 @@
+self-hosted-runner:
+  # Labels of self-hosted runner in array of strings.
+  labels: []
+
+# Configuration variables in array of strings defined in your repository or
+# organization. `null` means disabling configuration variables check.
+# Empty array means no configuration variable is allowed.
+config-variables: null
+
+# Configuration for file paths. The keys are glob patterns to match to file
+# paths relative to the repository root. The values are the configurations for
+# the file paths. Note that the path separator is always '/'.
+# The following configurations are available.
+# NOTE: Everything from here down is removed in the "Warnings" run of actionlint in the workflow.
+paths:
+  # "ignore" is an array of regular expression patterns. Matched error messages
+  # are ignored. This is similar to the "-ignore" command line option.
+  .github/workflows/**/*.{yml,yaml}:
+    ignore:
+      - '.*action is too old to run on GitHub Actions.*'
+      - '.*was deprecated.*'
+      - '.*shellcheck.*:warning:.*'
+      - '.*shellcheck.*:info:.*'
+
+      # The security warning of head.ref being dangerous is painfully stupid.
+      # It's worried that the commit hash string could be malicious. (Never mind that
+      # an attacker generating PR's can much more easily just execute malicious code.)
+      - '.*github.event.pull_request.head.ref.*is potentially untrusted.*'
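To make the effect of the `ignore` list concrete, here is a minimal Python sketch of regex-based message filtering. It is not actionlint's implementation (actionlint is written in Go and uses its own regex engine). The helper name `is_ignored` is hypothetical, and the sample messages in the usage note are only approximations of what actionlint prints.

```python
import re

# Patterns copied from the `ignore` list in .github/actionlint.yaml.
IGNORE_PATTERNS = [
    r".*action is too old to run on GitHub Actions.*",
    r".*was deprecated.*",
    r".*shellcheck.*:warning:.*",
    r".*shellcheck.*:info:.*",
    r".*github.event.pull_request.head.ref.*is potentially untrusted.*",
]


def is_ignored(message, patterns=IGNORE_PATTERNS):
    """Return True if any ignore pattern matches somewhere in the message."""
    return any(re.search(pattern, message) for pattern in patterns)


def split_messages(messages):
    """Partition messages into (kept, dropped) according to the ignore list."""
    kept = [m for m in messages if not is_ignored(m)]
    dropped = [m for m in messages if is_ignored(m)]
    return kept, dropped
```

Under these assumptions, a shellcheck message tagged `:info:` or `:warning:` is dropped, while a shellcheck `:error:` or a workflow-syntax error is kept. That matches the intent of the first ("serious errors") actionlint run.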
129 changes: 125 additions & 4 deletions .github/workflows/pipeline.yaml
@@ -1,6 +1,9 @@
 name: cicd
 on:
-  push:
[Review comment, Member/Author] "push" means any push, even if not in a PR. I don't think we want that.
+  pull_request:
+    branches:
+      - main
+    types: [opened, synchronize, reopened]
   workflow_dispatch:
     # This allows it to be triggered manually in the github console
     # You could put inputs here, but we don't need them.
@@ -10,7 +13,7 @@ concurrency:
   cancel-in-progress: true
 env:
   PYTHON_VERSION: "3.11"
-  POETRY_VERSION: "1.5.1"
+  POETRY_VERSION: "1.8.3"
   # This is the token associated with "prod-biggies" (with shared credentials on 1password)
   GROUNDLIGHT_API_TOKEN: ${{ secrets.GROUNDLIGHT_API_TOKEN }}
   # This is the NGINX proxy endpoint
@@ -24,6 +27,7 @@ jobs:
         uses: actions/checkout@v3

       - name: Set up python
+        id: setup_python
         uses: actions/setup-python@v4
         with:
           python-version: ${{ env.PYTHON_VERSION }}
@@ -41,7 +45,7 @@ jobs:
         uses: actions/cache@v3
         with:
           path: .venv
-          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{hashFiles('**/poetry.lock') }}
+          key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('**/poetry.lock') }}

       - name: Install edge-endpoint's python dependencies
         run: |
@@ -217,7 +221,7 @@ jobs:
         uses: actions/cache@v3
         with:
           path: .venv
-          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{hashFiles('**/poetry.lock') }}
+          key: venv-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('**/poetry.lock') }}

       # Note that we're pulling the latest main from the SDK repo
       # This might be ahead of what's published to pypi, but it's useful to test things before they're released.
@@ -248,13 +252,130 @@ jobs:
         if: always()
         run: docker stop ${{ steps.start_container.outputs.container_id }}

+  G4-end-to-end:
+    # Note this job can run multiple times in parallel because the stack name is unique
+    # for the run. How much we want to do this is TBD.
+    runs-on: self-hosted
+
+    # Run this on any PR.
+    # Question: Should we wait until the other tests pass before running this?
+    #needs:
+    # - validate-setup-ee
+    # - test-with-k3s
+    # - test-sdk
+
+    env:
+      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
+      PYTHONUNBUFFERED: 1
+    defaults:
+      run:
+        working-directory: cicd/pulumi
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v3
+
+      - name: Name the stack
+        run: |
+          # Set to expire in 60 minutes
+          EXPIRATION_TIME=$(($(date +%s) + 60 * 60))
+          STACK_NAME=ee-cicd-${{ github.run_id }}-expires-${EXPIRATION_TIME}
+          echo "STACK_NAME=${STACK_NAME}" | tee -a $GITHUB_ENV
+          # We give the stack a name including its expiration time so that the sweeper
+          # (in sweeper-eeut.yaml) knows when to get rid of it.
+          # This saves us having to clean up here, which can be quite slow (~7 minutes for a g4)
+
+      - name: Check that aws credentials are set
+        # Credentials come from an IAM profile on the runner instance
+        run: |
+          aws sts get-caller-identity
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Install Pulumi
+        run: |
+          curl -fsSL https://get.pulumi.com | sh
+          export HOME=$(eval echo ~$(whoami))
+          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Make sure uv is working
+        run: |
+          uv --version
+          uv sync
+          uv run python --version
+
+      - name: Check that pulumi is installed and authenticated
+        run: |
+          uv run pulumi whoami
+
+      - name: Prepare pulumi stack
+        run: |
+          uv run pulumi stack init ${STACK_NAME}
+          uv run pulumi config
+
+      - name: Pick which commit we will test
+        run: |
+          echo "This is a bit subtle."
+          echo "We can't just test on 'main' for fairly obvious reasons - we"
+          echo "want to test the code in this PR's branch. The current commit"
+          echo "right here is ${GITHUB_SHA}, which is likely a merge commit."
+          echo "Merge commits are challenging. They are what would happen if"
+          echo "this PR were to be merged into its base branch. But they are"
+          echo "ephemeral things and not available in the public repo. So the"
+          echo "EEUT can't just check them out. Making them available to the"
+          echo "EEUT would require pushing them and polluting the repo. So,"
+          echo "for now, we are going to use the PR's head ref"
+          echo "${{ github.event.pull_request.head.ref }}, which is the commit"
+          echo "that was used to create the PR. Recognizing that this doesn't"
+          echo "reflect what will happen after merge. But it's simpler."
+
+          # TODO: test on the merge commit by pushing it to the repo with a temporary
+          # branch, and then clean up the branch later.
+
+          COMMIT_TO_TEST=${{ github.event.pull_request.head.ref }}
+          uv run pulumi config set ee-cicd:targetCommit ${COMMIT_TO_TEST}
+
+      - name: Create the EEUT instance
+        run: |
+          uv run pulumi up --yes
+
+      - name: Check that EE install succeeded
+        run: |
+          uv run fab connect --patience=150
+          uv run fab wait-for-ee-setup
+
+      - name: Wait for K8 to load everything
+        run: |
+          uv run fab check-k8-deployments
+          uv run fab check-server-port
+
+      - name: Use groundlight sdk through EE
+        run: |
+          EEUT_IP=$(uv run pulumi stack output eeut_private_ip)
+          export GROUNDLIGHT_ENDPOINT=http://${EEUT_IP}:30101
+          uv run groundlight whoami
+          uv run groundlight list-detectors
+
+      - name: Thank the worker and shut down
+        if: always()
+        run: |
+          echo "Strong work, G4! Now go to sleep. The grim sweeper will visit soon."
+          # This saves money and frees up resources
+          uv run fab shutdown-instance
+
   build-push-edge-endpoint-multiplatform:
     if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'workflow_dispatch' }}
     # We only run this action if all the prior test actions succeed
     needs:
       - test-general-edge-endpoint
       - test-sdk
       - validate-setup-ee
+      - G4-end-to-end
     runs-on: ubuntu-22.04
     steps:
       - name: Configure AWS credentials
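The "Pick which commit we will test" step chooses between two values that GitHub exposes on `pull_request` events: `GITHUB_SHA` (the ephemeral merge commit) and `pull_request.head.ref` (the PR's source branch name; the head commit itself is available separately as `pull_request.head.sha`). The Python sketch below illustrates that choice against an event payload. The function name `pick_ref_to_test` and the fallback to `GITHUB_SHA` when there is no PR in the payload are assumptions for illustration; the workflow itself has no fallback.

```python
def pick_ref_to_test(event, github_sha):
    """Choose the git ref the EEUT should check out.

    Prefers the PR's head branch name, which exists in the pushed repo,
    over github_sha, which for pull_request events is a merge commit
    that the EEUT cannot fetch. Falls back to github_sha when the event
    carries no pull request (hypothetical behaviour, not in the workflow).
    """
    pull_request = event.get("pull_request") or {}
    head = pull_request.get("head") or {}
    return head.get("ref") or github_sha
```

For a payload such as `{"pull_request": {"head": {"ref": "e2e-cicd"}}}` this returns `"e2e-cicd"`. On a `workflow_dispatch` run, where the payload has no `pull_request` key, it returns the SHA passed in.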
61 changes: 61 additions & 0 deletions .github/workflows/sweeper-eeut.yaml
@@ -0,0 +1,61 @@
+name: sweeper-eeut
+# This workflow tears down old EEUT stacks from pulumi.
+# We do this as a background sweeper job, because the teardown is VERY slow (~7 minutes for a g4)
+# and we don't want to slow down the main pipeline for that.
+on:
+  schedule:
+    - cron: '*/15 * * * *' # Every 15 minutes
+  # Note cron workflows only run from the main branch.
+  push:
+    branches:
+      # If you're working on this stuff, name your branch e2e-something and this will run.
+      - e2e*
+concurrency:
+  group: sweeper-eeut
+env:
+  PYTHON_VERSION: "3.11"
+
+jobs:
+  destroy-expired-eeut-stacks:
+    #runs-on: ubuntu-22.04 # preferably
+    # Currently running on self-hosted because something is wrong with the AWS perms on the GH runners.
+    runs-on: self-hosted
+    env:
+      PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_CICD_PAT }}
+    defaults:
+      run:
+        working-directory: cicd/pulumi
+    steps:
+      - name: Check out code
+        uses: actions/checkout@v3
+
+      - name: Set AWS credentials
+        uses: aws-actions/configure-aws-credentials@v2
+        with:
+          aws-region: us-west-2
+          # TODO: move these back to GH-provided secrets
+          # Currently using IAM roles on the self-hosted runner instance.
+          #aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          #aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          #aws-session-token: ${{ secrets.AWS_SESSION_TOKEN }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Install Pulumi
+        run: |
+          curl -fsSL https://get.pulumi.com | sh
+          export HOME=$(eval echo ~$(whoami))
+          echo "$HOME/.pulumi/bin" >> $GITHUB_PATH
+
+      - name: Check that pulumi is installed and authenticated
+        run: |
+          set -ex
+          pulumi whoami
+
+      - name: Destroy old EEUT stacks
+        working-directory: cicd/pulumi
+        run: |
+          ./sweep-destroy-eeut-stacks.sh
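The contents of `sweep-destroy-eeut-stacks.sh` are not part of this diff. The Python sketch below only illustrates the naming convention the pipeline relies on: the stack name `ee-cicd-<run_id>-expires-<epoch seconds>` carries its own expiry, so a sweeper can decide what to destroy from the name alone. The function names, the regular expression, and the choice to skip names that do not match are all assumptions for illustration.

```python
import re
import time
from typing import Optional

# Mirrors STACK_NAME=ee-cicd-${run_id}-expires-${EXPIRATION_TIME} in pipeline.yaml.
STACK_NAME_RE = re.compile(r"^ee-cicd-(?P<run_id>\d+)-expires-(?P<expires>\d+)$")


def make_stack_name(run_id: int, now: Optional[float] = None,
                    ttl_seconds: int = 60 * 60) -> str:
    """Build a stack name that expires ttl_seconds after `now` (epoch seconds)."""
    now = time.time() if now is None else now
    return f"ee-cicd-{run_id}-expires-{int(now) + ttl_seconds}"


def is_expired(stack_name: str, now: Optional[float] = None) -> bool:
    """Return True if the name encodes an expiry at or before `now`.

    Names that do not follow the convention are treated as not expired,
    so a sweeper built on this would leave unrelated stacks alone.
    """
    match = STACK_NAME_RE.match(stack_name)
    if match is None:
        return False
    now = time.time() if now is None else now
    return int(match.group("expires")) <= now
```

With a 60-minute lifetime and a sweeper that runs every 15 minutes, a stack following this scheme would be picked up within roughly 75 minutes of creation, assuming the sweeper run itself succeeds.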
36 changes: 27 additions & 9 deletions .github/workflows/validate-workflow-files.yaml
@@ -1,29 +1,25 @@
 name: Workflow YAML check
 # This performs fairly detailed checks on all the .yaml workflow definitions
 # Note that without this, a single minor mistake in a workflow YAML
-# will cause github to SILENTLY FAIL. It will:
+# will cause github to (almost) SILENTLY FAIL. It will:
 # - Not run any part of the workflow
 # - Not even report that there was an error in the file
 # - Show a hard-to-find failure in the "Actions" tab of the repo.
 # This could cause a key set of checks to not run, and thus an important
 # error to slip by unnoticed.

-# TODO: It would be nice to validate the semantics of the workflow files
-# not just their basic syntax, but this is a good start.
-# e.g. if a job has a "needs:" field but nothing listed under it,
-# that will pass linting, but fail at GH. I believe there's a GH API
-# we can post to that will validate the workflow files.
-
on:
pull_request:
paths:
- '.github/workflows/*.yaml'
- '.github/.yamllint.yaml'
- '.github/*.yaml'
types: [opened, synchronize, reopened]
push:
branches:
- main
paths:
- '.github/workflows/*.yaml'
- '.github/.yamllint.yaml'
- '.github/*.yaml'

jobs:
check-workflow-files:
@@ -49,3 +45,25 @@ jobs:

       - name: Run yamllint
         run: yamllint -c ../.yamllint.yaml *.yaml
+
+      - name: Set up Golang
+        uses: actions/setup-go@v4
+        with:
+          go-version: "1.21"
+
+      - name: Install actionlint
+        run: |
+          go install github.com/rhysd/actionlint/cmd/actionlint@latest
+          echo "${HOME}/go/bin" >> $GITHUB_PATH
+
+      - name: Run actionlint looking for serious errors
+        # Actionlint can't find the config file if it's not run from the root
+        working-directory: .
+        run: actionlint -oneline
+
+      - name: Run actionlint loosely for warnings
+        working-directory: .
+        run: |
+          # Delete all the "ignore" lines in the actionlint.yaml file
+          sed -i '/^paths:/,$d' .github/actionlint.yaml
+          actionlint -oneline || echo "actionlint has non-critical warnings"
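The second actionlint run depends on `sed -i '/^paths:/,$d'`, which deletes every line from the first one starting with `paths:` through the end of the file. That strips the whole `ignore` section, so the warnings are reported again. A minimal Python equivalent of that truncation is sketched below to make the behaviour explicit; `drop_from_paths` is a hypothetical name and the workflow does not use this code.

```python
def drop_from_paths(config_text: str) -> str:
    """Mimic `sed '/^paths:/,$d'` on a config file's contents.

    Keeps every line before the first line that starts with "paths:"
    and discards that line and everything after it. If no such line
    exists, the text is returned unchanged.
    """
    kept = []
    for line in config_text.splitlines(keepends=True):
        if line.startswith("paths:"):
            break
        kept.append(line)
    return "".join(kept)
```

Because the match is anchored at the start of the line, comment lines that merely mention paths (for example `# Configuration for file paths.`) do not trigger the cut. This is why the `NOTE` comment in `actionlint.yaml` says everything "from here down" is removed.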