updated user-mode data collection and support for flux resource manager #133

Merged Dec 12, 2024 (64 commits)
Commits
e793046
wip progress for caching exporter
koomie Oct 16, 2024
d31afff
add standalone wrapper script
koomie Oct 16, 2024
d0114c4
add a getMemoryUsageMB utility
koomie Jul 17, 2024
feb982e
remove rocm prefix filter, include instance label
koomie Oct 16, 2024
d20bc12
re-enable metric filtering based on rocm|rmsjob_info; add exception
koomie Oct 24, 2024
06fea14
add optional logFile argument for Monitor class init; if present, send
koomie Oct 24, 2024
ae8ae1d
update user-mode helper utility to use standalone exporter with
koomie Oct 24, 2024
a4dfdcb
add option to redirect output to logfile
koomie Oct 24, 2024
ba77180
add command-line arg support for victoriametrics endpoint hostname and…
koomie Oct 24, 2024
dfc3da7
update victoriametrics-based launch to include endpoint hostname
koomie Oct 24, 2024
c39f51b
tweak rmsjob prefix to allow for annotations
koomie Oct 28, 2024
101cac9
update prefix filter for metric caching
koomie Oct 28, 2024
9f5ade2
add caching of rmsjob info based on modification timestamp of rmsJobS…
koomie Oct 29, 2024
1f35347
add caching of annotation data based on modification timestamp of json
koomie Oct 29, 2024
627c1ca
add option to runBGProcess() to supply additional env variables for
koomie Nov 1, 2024
8f54290
refactor standalone collector to periodically push data to
koomie Nov 4, 2024
09aee8c
update stopexporters() in victoriametrics mode to use similar
koomie Nov 4, 2024
517a545
remove unused sampling accumulator; include some memory stats in output
koomie Nov 5, 2024
52da429
remove debug print
koomie Nov 6, 2024
b358d39
include timestamp in log output
koomie Nov 6, 2024
1dc7ec2
remove older __data cache variable and init in Standalone class; update
koomie Nov 6, 2024
9a0a310
Update push logic to victoriametrics endpoint; cached data is now
koomie Nov 6, 2024
e3f849c
add a "--push" command line argument to enable push mode using
koomie Nov 7, 2024
01d3398
remove deprecated dumpCache() routine; bug fix for final data push to
koomie Nov 7, 2024
c21986f
make push model default for user-mode; tweak runtime config variable
koomie Nov 13, 2024
871ec70
update default configfile to include victoriametrics settings
koomie Nov 13, 2024
74fe57b
update default interval for user-mode to be 30 secs; add some error
koomie Nov 13, 2024
14cc636
remove example prometheus settings from configfile; superseded by
koomie Nov 13, 2024
5dde377
rename compose file that uses prometheus
koomie Nov 13, 2024
f454d66
updated compose file to use victoriametrics as replacement for
koomie Nov 13, 2024
d13da38
Update argparsing for --push. Use --no-push to revert to
koomie Nov 13, 2024
b0db707
Fix victoria-metrics overridden datadir
jordap Nov 15, 2024
f20e33f
Add omnistat-standalone to scripts
jordap Nov 15, 2024
083a965
Add victoriametrics endpoint check during init; punt if not working.
koomie Nov 26, 2024
e7da8c1
update periodic push performance via use of alternate data caching
koomie Nov 27, 2024
54ff537
update job query for flux leveraging getattr to determine parent job id
koomie Nov 28, 2024
dc3d1df
pull in support for flux resource manager in standalone user-mode
koomie Nov 28, 2024
629bfa1
apply formatting
koomie Nov 28, 2024
d1c05ee
apply formatting
koomie Nov 28, 2024
8bd4d98
remove unused remote write config
koomie Nov 28, 2024
20b6939
include job step id check with flux
koomie Dec 3, 2024
a8cc8d5
update job stepid conversion to use string to account for flux jobnames
koomie Dec 4, 2024
8b1e6b6
apply json formatter
koomie Dec 4, 2024
ef8b6f8
updated panels in job step summary row - reworked to include the start
koomie Dec 4, 2024
eed8236
show missing legend in job step plots
koomie Dec 6, 2024
2594f36
apply formatter to monitor.py
koomie Dec 9, 2024
2b2c85a
apply formatter to standalone.py
koomie Dec 9, 2024
ba359c7
update existing user-mode test to run in non-push mode (pull mode
koomie Dec 9, 2024
beca950
use more generic name for victoria-metrics server during teardown (for
koomie Dec 9, 2024
f1bd6ec
add a companion user-mode test for default mode which is now
koomie Dec 9, 2024
4ebdcda
test config options for push mode
koomie Dec 9, 2024
122ac38
include victoria-metrics in test container
koomie Dec 9, 2024
501ba4c
include runtime settings for victoria-metrics
koomie Dec 9, 2024
133804b
apply formatter
koomie Dec 9, 2024
bc50464
rename existing user-mode test
koomie Dec 10, 2024
85a3e6d
update test name for pull
koomie Dec 10, 2024
04e53b5
add additional user-mode test for new default config using push model
koomie Dec 10, 2024
19a1a3a
update pytest filename for user-mode pull test
koomie Dec 10, 2024
60bd62b
testing source build only temporarily
koomie Dec 10, 2024
2867e7f
reduce sampling interval for push
koomie Dec 10, 2024
8ebbb00
test addition of sleep after server startup
koomie Dec 10, 2024
b1380b2
apply formatter
koomie Dec 12, 2024
9391de9
restore package-based testing for push mode
koomie Dec 12, 2024
8651b53
tweak top-level CI tests for user mode
koomie Dec 12, 2024
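Commits 9f5ade2 and 1f35347 mention caching job and annotation data keyed on a file's modification timestamp. The PR's own implementation is not visible in this excerpt, so the following is only a minimal sketch of that general pattern. The class name `MtimeCachedJSON`, its `loads` counter, and the assumption that the file holds JSON are all hypothetical.

```python
import json
import os


class MtimeCachedJSON:
    """Reload a JSON file only when its modification time changes.

    Hypothetical helper for illustration; not taken from the PR.
    """

    def __init__(self, path):
        self.path = path
        self._mtime = None  # mtime (ns) recorded at the last actual load
        self._data = None   # parsed contents from the last actual load
        self.loads = 0      # how many times the file was really parsed

    def get(self):
        # A stat() call is much cheaper than re-reading and re-parsing.
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            with open(self.path) as f:
                self._data = json.load(f)
            self._mtime = mtime
            self.loads += 1
        return self._data
```

Calling `get()` on every sampling interval then costs one `stat()` unless the file has been rewritten since the previous call.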
@@ -1,12 +1,12 @@
-name: User mode
+name: User mode - Pull
 on:
   push:
     branches: [ main, dev ]
   pull_request:
     branches: [ main, dev ]
 jobs:
   test:
-    name: User-level Omnistat
+    name: User-level Omnistat (pull/prometheus)
     runs-on: ubuntu-22.04
     strategy:
       matrix:
@@ -39,4 +39,4 @@ jobs:
         run: pip3 install -r test/requirements.txt
       - name: Run tests
         working-directory: ./test
-        run: pytest -v test_job_user.py
+        run: pytest -v test_job_user_pull.py
42 changes: 42 additions & 0 deletions .github/workflows/test-user-push.yml
@@ -0,0 +1,42 @@
+name: User mode - Push
+on:
+  push:
+    branches: [ main, dev ]
+  pull_request:
+    branches: [ main, dev ]
+jobs:
+  test:
+    name: User-level Omnistat (push/victoria)
+    runs-on: ubuntu-22.04
+    strategy:
+      matrix:
+        execution: [ source, package ]
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+      - name: Comment out GPU devices (not available in GitHub)
+        run: sed -i "/devices:/,+2 s/^/#/" test/docker/slurm/compose.yaml
+      - name: Disable SMI collector (won't work in GitHub)
+        run: >
+          sed -i "s/enable_rocm_smi = True/enable_rocm_smi = False/" \
+            test/docker/slurm/omnistat-user.config
+      - name: Start containerized environment
+        env:
+          TEST_OMNISTAT_EXECUTION: ${{ matrix.execution }}
+        run: docker compose -f test/docker/slurm/compose.yaml -f test/docker/slurm/compose-user.yaml up -d
+      - name: Wait for user-level Omnistat
+        run: >
+          timeout 5m bash -c \
+            'for i in controller node1 node2; do \
+               until [[ $(docker logs -n 1 slurm-$i) == READY ]]; do \
+                 echo "Waiting for $i..."; \
+                 sleep 5; \
+               done \
+             done'
+      - name: Display node logs
+        run: for i in node1 node2; do docker logs slurm-$i; done
+      - name: Install test dependencies
+        run: pip3 install -r test/requirements.txt
+      - name: Run tests
+        working-directory: ./test
+        run: pytest -s -v test_job_user_push.py
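The "Wait for user-level Omnistat" step above polls each container every 5 seconds until its last log line reads READY, and gives up after 5 minutes. For readers who want the same pattern in a Python test harness, here is a rough sketch. `wait_until` is a hypothetical helper, not part of this PR, and the injectable `sleep` and `clock` parameters are an assumption made only to keep it testable.

```python
import time


def wait_until(predicate, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False if the deadline passed.
    Hypothetical helper for illustration; not taken from the PR.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A predicate for the workflow's check could run `docker logs -n 1 slurm-<name>` through `subprocess.run(..., capture_output=True, text=True)` and compare the stripped output to `READY`. That wiring is left out here because it depends on which stream the container logs to.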
30 changes: 30 additions & 0 deletions docker/compose.prometheus.yaml
@@ -0,0 +1,30 @@
+services:
+  prometheus:
+    image: prom/prometheus
+    container_name: prometheus
+    command:
+      - '--config.file=/etc/prometheus/prometheus.yml'
+      - '--storage.tsdb.retention.time=1y'
+      - '--query.max-samples=500000000'
+    ports:
+      - 9090:9090
+    restart: unless-stopped
+    volumes:
+      - ./prometheus:/etc/prometheus
+      - ./prometheus-data:/prometheus
+    user: "${PROMETHEUS_USER}"
+  grafana:
+    image: grafana/grafana
+    container_name: grafana
+    ports:
+      - 3000:3000
+    restart: unless-stopped
+    environment:
+      - GF_SECURITY_ADMIN_USER=admin
+      - GF_SECURITY_ADMIN_PASSWORD=grafana
+      - GF_USERS_DEFAULT_THEME=light
+      - GF_AUTH_ANONYMOUS_ENABLED=true
+      - GF_AUTH_ANONYMOUS_ORG_ROLE=Editor
+      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/etc/grafana/provisioning/json-models/index.json
+    volumes:
+      - ./grafana:/etc/grafana/provisioning
10 changes: 3 additions & 7 deletions docker/compose.yaml
@@ -1,18 +1,14 @@
 services:
   prometheus:
-    image: prom/prometheus
+    image: victoriametrics/victoria-metrics
     container_name: prometheus
     command:
-      - '--config.file=/etc/prometheus/prometheus.yml'
-      - '--storage.tsdb.retention.time=1y'
-      - '--query.max-samples=500000000'
+      - '-httpListenAddr=:9090'
     ports:
       - 9090:9090
     restart: unless-stopped
     volumes:
-      - ./prometheus:/etc/prometheus
-      - ./prometheus-data:/prometheus
-    user: "${PROMETHEUS_USER}"
+      - ./prometheus-data/vicdata/:/victoria-metrics-data
   grafana:
     image: grafana/grafana
     container_name: grafana
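With this change the `prometheus` service actually runs VictoriaMetrics, listening on port 9090, and user-mode collection pushes samples to it instead of being scraped. The push code itself is not part of this excerpt. The sketch below only illustrates one way such a push can be formed, using VictoriaMetrics' `/api/v1/import/prometheus` endpoint, which accepts Prometheus text exposition lines with optional millisecond timestamps. The function names, the metric name, and the label values are illustrative assumptions, and whether Omnistat uses this particular endpoint is not confirmed by the diff.

```python
import urllib.request


def format_sample(name, labels, value, timestamp_ms=None):
    """Render one sample as a Prometheus text exposition line.

    Label values are assumed to need no escaping (no quotes, backslashes
    or newlines); a real implementation would have to escape them.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    line = f"{name}{{{label_str}}} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


def build_request(host, port, lines):
    """Build (but do not send) a POST request carrying cached samples."""
    url = f"http://{host}:{port}/api/v1/import/prometheus"
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")
```

Sending the request would be a matter of passing it to `urllib.request.urlopen(req, timeout=...)` and handling connection errors. Batching many cached samples into a single request keeps the per-push overhead low, which matters when the push interval is short.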