updated user-mode data collection and support for flux resource manager #133

Merged: 64 commits merged on Dec 12, 2024

Conversation

koomie
Collaborator

@koomie koomie commented Dec 9, 2024

This PR includes two main additions:

  1. An updated user-mode data collection process that uses a "push" model with VictoriaMetrics as the back end, replacing the previous "pull" model with Prometheus. The existing data collector families are used as is, but a local polling loop now queries them and caches the results. At periodic intervals (5 minutes by default), the cached results are pushed to a VictoriaMetrics server running on the master compute node. The results can be queried the same way as before through the Prometheus-compatible endpoint that VictoriaMetrics provides, so no change is required for user-mode Grafana. With the cached push model, GPU telemetry metrics can be sampled at higher rates (e.g. at intervals down to 10-50 milliseconds).

  2. Support for the Flux resource manager (in addition to SLURM). This affects the collector_rms data collector and enables job identification (including job steps) on systems running Flux. Note that Flux job IDs are not ordinal integers, so there is some corresponding impact on the Grafana dashboard configuration.
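
For readers who want a concrete picture of the cached push model in item 1, here is a minimal sketch. It is not the code from this PR: the class `CachedPusher`, the callables `sample_fn` and `send_fn`, and the metric name used in the usage note are all hypothetical. It assumes samples are serialized in Prometheus text exposition format with millisecond timestamps and posted to VictoriaMetrics' `/api/v1/import/prometheus` import endpoint; the push is handed to a separate thread on a copied batch so the polling loop is not blocked.

```python
import copy
import threading
import time
import urllib.request


class CachedPusher:
    """Hypothetical sketch: poll locally, cache samples, push in batches."""

    def __init__(self, sample_fn, send_fn, push_interval_s=300.0):
        self.sample_fn = sample_fn          # returns {metric_name: value}
        self.send_fn = send_fn              # receives the payload as a str
        self.push_interval_s = push_interval_s
        self.cache = []                     # (name, value, timestamp_ms)
        self.last_push = time.monotonic()

    def poll_once(self):
        # Stamp every sample from this poll with the same millisecond time.
        ts_ms = int(time.time() * 1000)
        for name, value in self.sample_fn().items():
            self.cache.append((name, value, ts_ms))

    def format_payload(self, samples):
        # One "name value timestamp_ms" line per sample (labels omitted).
        return "".join(f"{n} {v} {ts}\n" for n, v, ts in samples)

    def push(self, background=True):
        # Copy the cache and clear it so polling can continue immediately.
        batch = copy.deepcopy(self.cache)
        self.cache.clear()
        self.last_push = time.monotonic()
        if not batch:
            return None
        payload = self.format_payload(batch)
        if background:
            worker = threading.Thread(target=self.send_fn, args=(payload,))
            worker.start()
            return worker
        self.send_fn(payload)
        return None

    def run(self, poll_interval_s, stop_event):
        # Poll until asked to stop, pushing whenever the interval elapses.
        while not stop_event.is_set():
            self.poll_once()
            if time.monotonic() - self.last_push >= self.push_interval_s:
                self.push()
            stop_event.wait(poll_interval_s)
        self.push(background=False)         # final flush on shutdown


def http_sender(url):
    """Build a send_fn that POSTs the payload to the given import URL."""
    def send(payload):
        req = urllib.request.Request(url, data=payload.encode(), method="POST")
        urllib.request.urlopen(req, timeout=10).close()
    return send
```

As a usage example, `CachedPusher(lambda: {"gpu_power_watts": 100.0}, http_sender("http://localhost:8428/api/v1/import/prometheus"))` followed by `run(0.05, threading.Event())` would poll roughly every 50 milliseconds and push every 5 minutes; the host, port, and metric name here are placeholders, not values taken from this PR.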

@koomie koomie added the enhancement New feature or request label Dec 9, 2024
@koomie koomie added this to the 1.1 milestone Dec 9, 2024
koomie added 25 commits December 9, 2024 12:12
check during metrics push

Signed-off-by: Karl W Schulz <[email protected]>
logging output to a file with hostname prepended

Signed-off-by: Karl W Schulz <[email protected]>
victoria metrics back-end. Current enablement hard-coded via
victoriaMode=True setting in main().

Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics during a job; threading added to support flask
endpoint that can be used to terminate the data collector and push
final data (previous file-based termination removed).

Signed-off-by: Karl W Schulz <[email protected]>
"shutdown" endpoint; restrict max number of go processes for
victoriametrics server

Signed-off-by: Karl W Schulz <[email protected]>
Signed-off-by: Karl W Schulz <[email protected]>
Signed-off-by: Karl W Schulz <[email protected]>
getMetrics() method to take timestamp in millisecs directly

Signed-off-by: Karl W Schulz <[email protected]>
(deep)copied and shipped to a separate thread to push the data. This
minimizes blocking of main polling loop.

Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics; remove unused remotewrite configuration for
prometheus

Signed-off-by: Karl W Schulz <[email protected]>
names for victoriametrics settings

Signed-off-by: Karl W Schulz <[email protected]>
checking for victoriametrics path; tweak shutdown timeout for exporter
when using victoriametrics

Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics examples

Signed-off-by: Karl W Schulz <[email protected]>
koomie added 11 commits December 9, 2024 12:13
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
time in the Job Step panel; additional queries and transformations
added to sort by the job step time (since we cannot assume the job
step is an ordinal number, thank you flux). Enabled missing legend in
the Average GPU Power panel.

Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
koomie added 15 commits December 9, 2024 17:14
example, account for binary name in ubuntu)

Signed-off-by: Karl W. Schulz <[email protected]>
push-based using victoria

Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
Signed-off-by: Karl W. Schulz <[email protected]>
@koomie koomie merged commit cf49e2f into main Dec 12, 2024
9 checks passed