Feature/rmsv2 #111
base: features/m2m
Conversation
I haven't looked at the code yet, but before considering support for multiple jobs on the same node, I think it is important to understand the changes you are proposing to the RMS collector and their impact on query complexity and performance. I'd suggest providing:
Hi @jordap, @koomie. Samples from Grafana (GPU utilization; the Slurm job is using 2 nodes with 2 GPUs each, out of 8 total GPUs in each node). Queries use the amdsmi prefix, to be replaced by the rocm prefix.
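For context, a minimal sketch of the kind of per-job GPU utilization query involved; the metric name `amdsmi_gpu_utilization`, the `instance`/`card` label names, and the Grafana `$jobid` variable are illustrative assumptions, not necessarily what the actual dashboards use:

```
# Hypothetical panel query: GPU utilization restricted to the GPUs that
# rmsjob_info (from this PR) associates with the selected job. Assumes a
# single rmsjob_info series per GPU for that job.
amdsmi_gpu_utilization
  * on (instance, card) group_left (jobid, user)
  rmsjob_info{jobid="$jobid"}
```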
I looked into the changes and performed some additional testing. The approach in this PR can work, particularly at smaller scales. But it introduces a significant number of changes and has an impact on scalability. This is a useful feature, and I think it would be best for us to spend more time figuring out whether this is the best approach to supporting shared nodes in the long run. More detailed comments and a description of some of the issues we need to consider:
I think we need to keep looking into the performance of each individual sample, try to find a way to make the data/queries compatible between exclusive and shared nodes, and figure out what the plans are for combining the data.
For reference, in a hypothetical cluster with 10k nodes, 100k GPUs, and a moderate churn of 10-minute jobs, the number of active metrics grows significantly when adding the per-GPU card label: instead of one rmsjob_info series per node per job there is one per GPU per job, roughly 10x more, and every short-lived job creates a fresh set of series. We could decide that that's OK for this feature, particularly if we can make this work in a subset of nodes. But that's only for GPUs. If we were to add similar labels for other resources like cores, which have far more possible values per node, the cardinality would grow even further.
One option to make the same dashboards work in both scenarios (with and without the card label) is to have every panel issue both forms of the query at once.
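A sketch of what such a combined query could look like, again assuming the hypothetical `amdsmi_gpu_utilization` metric name and `card`/`instance` labels:

```
# Hypothetical combined query: the first branch matches rmsjob_info series
# that carry a card label (shared nodes, this PR); the second matches series
# without it (exclusive nodes). Only one branch returns data for a given job.
(
  amdsmi_gpu_utilization
    * on (instance, card) group_left (jobid)
    rmsjob_info{jobid="$jobid", card!=""}
)
or
(
  amdsmi_gpu_utilization
    * on (instance) group_left (jobid)
    rmsjob_info{jobid="$jobid", card=""}
)
```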
This will attempt to perform both queries, but only one of them will succeed. The downside is the obvious duplication and increased complexity in every query.
Draft PR, to be reviewed and worked on before any merges.
collector_rms proposed changes:
a. add a new method to fetch Slurm job data using scontrol
- using scontrol, we can fetch all data from a central place, no need to collect logs from each compute node
- behavior is changed to be exactly the same as now, with additional data that can come from
scontrol show job id <id> -d --json
(see the sketch after this list)
b. added support for multiple jobs running on the same compute node
c. added support for the GPU ids being used by each job
d. added support for the CPU ids being used by each job
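A minimal sketch of how that centralized fetch could look, assuming Python, a Slurm version whose scontrol supports --json, and illustrative JSON field names (`jobs`, `job_id`, `user_name`, `gres_detail`, ...); the exact schema varies with the Slurm release, so treat the field names as placeholders:

```python
import json
import subprocess


def fetch_active_jobs():
    """Fetch data for all active Slurm jobs with a single scontrol call.

    Sketch only: the JSON field names below are illustrative and depend on
    the Slurm version's --json schema.
    """
    out = subprocess.run(
        ["scontrol", "show", "job", "-d", "--json"],
        capture_output=True, text=True, check=True,
    )
    jobs = []
    for job in json.loads(out.stdout).get("jobs", []):
        jobs.append({
            "jobid": str(job.get("job_id", "")),
            "user": job.get("user_name", ""),
            "partition": job.get("partition", ""),
            "nodes": job.get("nodes", ""),
            # Per-node GRES detail such as "gpu:2(IDX:0-1)"; the GPU indices
            # parsed from here would become the new card label.
            "gres_detail": job.get("gres_detail", []),
        })
    return jobs


if __name__ == "__main__":
    for j in fetch_active_jobs():
        print(j)
```

The point of the sketch is that a single scontrol call per collection interval replaces per-node log collection; the per-GPU and per-CPU id extraction would hang off the same response.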
TODO:
a. feature (label) parity with existing collector_rms - Done
b. test jobs that do not use GPUs - Done
c. grafana dashboard samples with new gpu/cpu ids - Done, will be committed at a later date in a separate PR

Performance wise, we see ~100 ms being added to the Prometheus query on a cluster with 15 active jobs (~30 ms for just amdsmi, ~130 ms for amdsmi + rms v2). Will conduct more performance testing once we get a quorum to continue this work.
Sample output:
rmsjob_info{account="AMD",batchflag="0",card="0",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0
rmsjob_info{account="AMD",batchflag="0",card="1",jobid="300",jobstep="0",nodes="2",partition="AMD",type="slurm",user="omri-amd"} 1.0