Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GitHub Actions health check workflow #23036

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

KevinMind
Copy link
Contributor

@KevinMind KevinMind commented Jan 29, 2025

Relates to: mozilla/addons#15308

Description

Introduces a new GitHub Actions workflow to periodically check service health and version information for specified endpoints. The workflow includes:

  • Scheduled runs every 5 minutes
  • Checks version information
  • Monitors service status
  • Raises an error if any services are down

Context

This PR adds the workflow for checking the health of our services/dependencies. A future PR will integrate a slack notification when things are found

Testing

Running the workflow

You can run the workflow manually via bash (example here)

gh --ref kevinmind/addons/15308 workflow run

Select "Health Check" and then visit https://github.com/mozilla/addons-server/actions/workflows/health_check.yml to see your run

Running locally

To inspect the logic of the health check script you can run it manually passing in an environment to ping.

Note

You should do this inside a running web container if you don't want to have to install dependencies manually on your host

./scripts/health_check.py --env prod

Warning

Passing no environment should error

Checking for failure

You can manually run the bash script the healthcheck runs and modify the monitors json to test for error state.

  • first run the python script with the below script and expect an output.json file to be created
./scripts/health_check.py --verbose --output output.json
  • in the monitors, set at least one service "state" to false and add a status message.
  • manually run the logic in the bash script and expect errors to be printed out

Checklist

  • Add #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.
  • Successfully verified the change locally.
  • The change is covered by automated tests, or otherwise indicated why doing so is unnecessary/impossible.
  • Add before and after screenshots (Only for changes that impact the UI).
  • Add or update relevant docs reflecting the changes made.

@KevinMind KevinMind changed the base branch from devcontainer to master January 29, 2025 08:58
@KevinMind KevinMind force-pushed the kevinmind/addons/15308 branch 11 times, most recently from 359ed4d to 14df65d Compare January 29, 2025 11:48
@KevinMind KevinMind requested a review from diox January 29, 2025 11:49
Introduces a new GitHub Actions workflow to periodically check service health and version information for specified endpoints. The workflow includes:
- Scheduled runs every 5 minutes
- Checks version information
- Monitors service status
- Raises an error if any services are down

TMP: test

Test it
@KevinMind KevinMind force-pushed the kevinmind/addons/15308 branch from 14df65d to 311332d Compare January 29, 2025 13:15
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
python-version: '3.11'
python-version: '3.12'

For consistency with addons-server itself


try:
response = requests.get(url)
response.raise_for_status()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We return a 500 error in the monitor view if one or more services is down alongside a valid json response, so if you raise for status, data is not set and that causes your script to raise UnboundLocalError. We want to do the data = response.json() bit first because it can work even if the response status is not 20x.

Comment on lines +16 to +17
# TODO: maybe we could use the local environmnet here
('test', ''),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove all the special test bits here and below and just use:

Suggested change
# TODO: maybe we could use the local environmnet here
('test', ''),
# For local environments hit the nginx container as set in docker-compose.yml
('local', 'http://nginx'),

'up': {'state': True},
'down': {'state': False, 'status': 'something is wrong'},
}
return self._fetch('services/monitor.json')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We likely want to check __heartbeat__ and services/__heartbeat__ (separately) instead

Comment on lines +56 to +59
data=$(echo $monitors | jq -r 'to_entries[] | select(.value.state == false) | .key')
for monitor in $data; do
message="$message\n- $monitor: $(echo $monitors | jq -r ".[\"$monitor\"].status")"
done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(goes with my suggestion below to not raise_for_status() too early)

I know this is the first step but we likely want to exit 1 here too if one of the state values is false so that the health check action is recorded as failing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants