updated user-mode data collection and support for flux resource manager #133
Merged
Signed-off-by: Karl W Schulz <[email protected]>
check during metrics push Signed-off-by: Karl W Schulz <[email protected]>
logging output to a file which hostname prepended Signed-off-by: Karl W Schulz <[email protected]>
victoria metrics back-end. Current enablement hard-coded via victoriaMode=True setting in main(). Signed-off-by: Karl W Schulz <[email protected]>
… port Signed-off-by: Karl W Schulz <[email protected]>
command-line setting Signed-off-by: Karl W Schulz <[email protected]>
…tepFile Signed-off-by: Karl W Schulz <[email protected]>
annotation file Signed-off-by: Karl W Schulz <[email protected]>
execution Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics during a job; threading added to support flask endpoint that can be used to terminate the data collector and push final data (previous file-based termination removed). Signed-off-by: Karl W Schulz <[email protected]>
"shutdown" endpoint; restrict max number of go processes for victormetrics server Signed-off-by: Karl W Schulz <[email protected]>
getMetrics() method to take timestamp in millisecs directly Signed-off-by: Karl W Schulz <[email protected]>
(deep)copied and shipped to a separate thread to push the data. This minimizes blocking of main polling loop. Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics; remove unused remotewrite configuration for prometheus Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics Signed-off-by: Karl W Schulz <[email protected]>
names for victoriametrics settings Signed-off-by: Karl W Schulz <[email protected]>
checking for victoriametrics path; tweak shutdown timeout for exporter when using victoriametrics Signed-off-by: Karl W Schulz <[email protected]>
victoriametrics examples Signed-off-by: Karl W Schulz <[email protected]>
time in the Job Step panel; additional queries and transformations added to sort by the job step time (since we cannot assume the job step is an ordinal number, thank you flux). Enabled missing legend in the Average GPU Power panel. Signed-off-by: Karl W. Schulz <[email protected]>
using prometheus) Signed-off-by: Karl W. Schulz <[email protected]>
example, account for binary name in ubuntu) Signed-off-by: Karl W. Schulz <[email protected]>
push-based using victoria Signed-off-by: Karl W. Schulz <[email protected]>
with victoria Signed-off-by: Karl W. Schulz <[email protected]>
This PR includes two main additions:
- An updated user-mode data collection process that uses a "push" model with VictoriaMetrics as the back end, replacing the previous "pull" model with Prometheus. The existing data collector families are reused as is, but a local polling loop now queries them and caches the results. At periodic intervals (default of 5 minutes), the cached results are pushed to a VictoriaMetrics server running on the master compute node. The results can be queried the same way through the Prometheus-compatible endpoint that VictoriaMetrics provides, so no change is required for user-mode Grafana. With the cached push model, GPU telemetry metrics can be sampled at higher rates (e.g. at intervals down to 10-50 milliseconds).
- Support for the Flux resource manager (in addition to SLURM). This impacts the `collector_rms` data collector and enables job identification (along with job steps) on systems running Flux. Note that Flux job IDs are not ordinal integers, so there is some corresponding impact on the Grafana dashboard configuration.
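
For readers unfamiliar with the cached push model described in the first item, here is a minimal Python sketch of the idea. It is not the code from this PR. The class name `CachedPusher`, the metric name `gpu_power_watts`, and the default URL are all hypothetical. The sketch assumes that samples are rendered in Prometheus text exposition format and posted to VictoriaMetrics' `/api/v1/import/prometheus` import endpoint. The deep copy and the separate push thread mirror what the commit notes describe as a way to minimize blocking of the main polling loop.

```python
import copy
import threading
import time
import urllib.request
from typing import Callable, Dict, List, Optional, Tuple

# (metric name, labels, value, timestamp in milliseconds)
Sample = Tuple[str, Dict[str, str], float, int]


def format_line(name: str, labels: Dict[str, str], value: float, ts_ms: int) -> str:
    """Render one sample in Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value} {ts_ms}"


def http_push(batch: List[Sample],
              url: str = "http://localhost:8428/api/v1/import/prometheus") -> None:
    """POST a batch to a VictoriaMetrics server (host and port are assumptions)."""
    body = "\n".join(format_line(*s) for s in batch).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


class CachedPusher:
    """Poll frequently, cache locally, push the cache at a slower cadence."""

    def __init__(self,
                 sample_fn: Callable[[int], List[Sample]],
                 push_fn: Callable[[List[Sample]], None],
                 push_interval_s: float = 300.0) -> None:
        self._sample_fn = sample_fn
        self._push_fn = push_fn
        self._push_interval_s = push_interval_s
        self._cache: List[Sample] = []

    def poll_once(self, now_ms: int) -> None:
        """Query the collectors once and append the results to the cache."""
        self._cache.extend(self._sample_fn(now_ms))

    def flush(self) -> Optional[threading.Thread]:
        """Hand a deep copy of the cache to a worker thread and clear the cache."""
        if not self._cache:
            return None
        batch = copy.deepcopy(self._cache)
        self._cache.clear()
        worker = threading.Thread(target=self._push_fn, args=(batch,), daemon=True)
        worker.start()
        return worker

    def run(self, poll_interval_s: float, stop_event: threading.Event) -> None:
        """Main loop: poll every poll_interval_s, push every push_interval_s."""
        last_push = time.monotonic()
        while not stop_event.is_set():
            self.poll_once(int(time.time() * 1000))
            if time.monotonic() - last_push >= self._push_interval_s:
                self.flush()
                last_push = time.monotonic()
            stop_event.wait(poll_interval_s)
        final = self.flush()  # push whatever is left at shutdown
        if final is not None:
            final.join()
```

A collector would be wired in by passing its sampling function as `sample_fn` and `http_push` (or any other sink) as `push_fn`. Because the push happens on a worker thread, a slow network write delays only the push and not the next sample, which is what makes short polling intervals practical.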