Feature/rmsv2 #111
base: features/m2m
Conversation
I haven't looked at the code yet, but before considering support for multiple jobs on the same node, I think it is important to understand the changes you are proposing to the RMS collector and their impact on query complexity and performance. I'd suggest providing:
Hi @jordap, @koomie. Samples from Grafana (GPU utilization; the Slurm job is using 2 nodes with 2 GPUs each, out of 8 total GPUs in each node). Queries use the amdsmi prefix, to be replaced by the rocm prefix.
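For context, a minimal sketch of the kind of per-job GPU utilization query involved; the metric name `amdsmi_gpu_utilization`, the `instance`/`card` label names, and the Grafana `$jobid` variable are illustrative assumptions, not necessarily what the actual dashboards use:

```
# Hypothetical panel query: GPU utilization restricted to the GPUs that
# rmsjob_info (from this PR) associates with the selected job. Assumes a
# single rmsjob_info series per GPU for that job.
amdsmi_gpu_utilization
  * on (instance, card) group_left (jobid, user)
  rmsjob_info{jobid="$jobid"}
```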
I looked into the changes and performed some additional testing. The approach in this PR can work, particularly at smaller scales. But it introduces a significant number of changes and has an impact on scalability. This is a useful feature, and I think it would be best for us to spend more time figuring out whether this is the best approach to supporting shared nodes in the long run. More detailed comments and a description of some of the issues we need to consider:
I think we need to keep looking into the performance of each individual sample, try to find a way to make the data/queries compatible between exclusive and shared nodes, and figure out what the plans are for combining the data.
For reference, in a hypothetical cluster with 10k nodes, 100k GPUs, and a moderate churn of 10-minute jobs, the number of active metrics grows significantly when adding the per-GPU card label: instead of one rmsjob_info series per node per job there is one per GPU per job, roughly 10x more, and every short-lived job creates a fresh set of series. We could decide that that's OK for this feature, particularly if we can make this work in a subset of nodes. But that's only for GPUs. If we were to add similar labels for other resources like cores, which have far more possible values per node, the cardinality would grow even further.
One option to make the same dashboards work in both scenarios (with and without the card label) is to have every panel issue both forms of the query at once.
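A sketch of what such a combined query could look like, again assuming the hypothetical `amdsmi_gpu_utilization` metric name and `card`/`instance` labels:

```
# Hypothetical combined query: the first branch matches rmsjob_info series
# that carry a card label (shared nodes, this PR); the second matches series
# without it (exclusive nodes). Only one branch returns data for a given job.
(
  amdsmi_gpu_utilization
    * on (instance, card) group_left (jobid)
    rmsjob_info{jobid="$jobid", card!=""}
)
or
(
  amdsmi_gpu_utilization
    * on (instance) group_left (jobid)
    rmsjob_info{jobid="$jobid", card=""}
)
```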
This will attempt to perform both queries, but only one of them will succeed. The downside is the obvious duplication and increased complexity in every query.
Draft PR, to be reviewed and worked on before any merges.
collector_rms proposed changes:
a. add a new method to fetch Slurm job data using scontrol
- using scontrol, we can fetch all data from a central place, no need to collect logs from each compute node
- behavior is changed to be exactly the same as now, with additional data that can come from
scontrol show job id <id> -d --json
(see the sketch after this list)
b. added support for multiple jobs running on the same compute node
c. added support for the GPU ids being used by each job
d. added support for the CPU ids being used by each job
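A minimal sketch of how that centralized fetch could look, assuming Python, a Slurm version whose scontrol supports --json, and illustrative JSON field names (`jobs`, `job_id`, `user_name`, `gres_detail`, ...); the exact schema varies with the Slurm release, so treat the field names as placeholders:

```python
import json
import subprocess


def fetch_active_jobs():
    """Fetch data for all active Slurm jobs with a single scontrol call.

    Sketch only: the JSON field names below are illustrative and depend on
    the Slurm version's --json schema.
    """
    out = subprocess.run(
        ["scontrol", "show", "job", "-d", "--json"],
        capture_output=True, text=True, check=True,
    )
    jobs = []
    for job in json.loads(out.stdout).get("jobs", []):
        jobs.append({
            "jobid": str(job.get("job_id", "")),
            "user": job.get("user_name", ""),
            "partition": job.get("partition", ""),
            "nodes": job.get("nodes", ""),
            # Per-node GRES detail such as "gpu:2(IDX:0-1)"; the GPU indices
            # parsed from here would become the new card label.
            "gres_detail": job.get("gres_detail", []),
        })
    return jobs


if __name__ == "__main__":
    for j in fetch_active_jobs():
        print(j)
```

The point of the sketch is that a single scontrol call per collection interval replaces per-node log collection; the per-GPU and per-CPU id extraction would hang off the same response.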
TODO:
a. feature (label) parity with existing collector_rms - Done
b. test jobs that do not use GPUs - Done
c. grafana dashboard samples with new gpu/cpu ids - Done, will be committed at a later date in a separate PR

Performance wise, we see ~100 ms being added to the Prometheus query on a cluster with 15 active jobs (~30 ms for just amdsmi, ~130 ms for amdsmi + rms v2). Will conduct more performance testing once we get a quorum to continue this work.
Sample output:
rmsjob_info{account="AMD",batchflag="0",card="0",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0
rmsjob_info{account="AMD",batchflag="0",card="1",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0