Added support to execute sacct to get job historic data for metrics #857

Open
wants to merge 5 commits into
base: master

abujeda commented Oct 23, 2024

Draft implementation that adds support to the Slurm adapter for running the sacct command to retrieve historic job data used to calculate metrics.

Fixes: #856

abujeda (Author) commented Oct 23, 2024

I am struggling to make it Slurm-agnostic. Currently it returns the raw response from Slurm as an array of hashes.

abujeda (Author) commented Oct 23, 2024

Sample response from sacct command:

[root@c1 /]# sacct -nP --units=G --format=JobId,User,Elapsed,ReqMem,AllocCPUS,ReqCPUS,Timelimit,State,TotalCPU,MaxRSS,Submit,Start,ReqTRES --state=CA,CD,F,OOM,TO,R -S 2024-10-01T00:00:00 -E 2024-10-30T23:59:59
1|ood|01:11:29|0.49G|1|1|01:00:00|TIMEOUT|00:12.565||2024-10-11T09:22:40|2024-10-11T09:22:40|billing=1,cpu=1,mem=0.49G,node=1
1.batch||01:11:29||1|1||CANCELLED|00:12.565|0.11G|2024-10-11T09:22:40|2024-10-11T09:22:40|
2|ood|01:00:01|0.49G|1|1|01:00:00|TIMEOUT|00:05.037||2024-10-14T14:21:35|2024-10-14T14:21:35|billing=1,cpu=1,mem=0.49G,node=1
2.batch||01:00:02||1|1||CANCELLED|00:05.037|0.17G|2024-10-14T14:21:35|2024-10-14T14:21:35|
3|ood|01:00:21|0.49G|1|1|01:00:00|TIMEOUT|00:01.386||2024-10-14T14:21:44|2024-10-14T14:21:45|billing=1,cpu=1,mem=0.49G,node=1
3.batch||01:00:21||1|1||CANCELLED|00:01.386|0.03G|2024-10-14T14:21:45|2024-10-14T14:21:45|
4|ood|01:17:41|0.49G|1|1|01:00:00|TIMEOUT|00:05.184||2024-10-14T22:18:55|2024-10-14T22:18:55|billing=1,cpu=1,mem=0.49G,node=1
4.batch||01:17:42||1|1||CANCELLED|00:05.184|0.18G|2024-10-14T22:18:55|2024-10-14T22:18:55|
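
As a rough illustration of the "array of hashes" shape mentioned earlier, the pipe-delimited output above could be parsed as follows. This is a minimal sketch and not the PR's code: `SACCT_FIELDS` and `parse_sacct_output` are hypothetical names, and the field list simply mirrors the `--format` argument of the sample command.

```ruby
# Field names mirror the --format list of the sample sacct command.
SACCT_FIELDS = %w[
  JobId User Elapsed ReqMem AllocCPUS ReqCPUS Timelimit
  State TotalCPU MaxRSS Submit Start ReqTRES
].freeze

# Hypothetical helper: turns `sacct -nP` output into an array of hashes.
def parse_sacct_output(output)
  output.each_line(chomp: true).reject(&:empty?).map do |line|
    # A limit of -1 keeps trailing empty fields (step rows end with an empty ReqTRES).
    values = line.split('|', -1)
    SACCT_FIELDS.zip(values).to_h
  end
end

sample = <<~OUT
  1|ood|01:11:29|0.49G|1|1|01:00:00|TIMEOUT|00:12.565||2024-10-11T09:22:40|2024-10-11T09:22:40|billing=1,cpu=1,mem=0.49G,node=1
  1.batch||01:11:29||1|1||CANCELLED|00:12.565|0.11G|2024-10-11T09:22:40|2024-10-11T09:22:40|
OUT

rows = parse_sacct_output(sample)
```

Note that the job row carries the user and requested resources, while the `.batch` step row carries `MaxRSS`, which is why both kinds of rows matter for memory metrics.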

treydock (Contributor) commented

I don't think OnDemand supports getting job step data, so maybe this needs the --allocations flag? That won't show memory usage, though. I'd advocate utilizing things like Grafana and the current ability to link Grafana panels to a job's view, as Slurm's accounting database isn't really great for collecting metrics about jobs like memory or CPU usage. If you do not use --allocations, maybe only care about the job ID and the batch step, as I believe the batch step will have usage for the whole job and not an individual step.
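
The two options described in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: `sacct_args`, its `job_steps:` keyword, and `job_or_batch_step?` are hypothetical names. The long options used are the sacct equivalents of the short flags in the sample command (`-n`, `-P`, `-S`, `-E`), plus `--allocations`.

```ruby
# Hypothetical helper: builds the sacct argument list.
# With job_steps: false, --allocations limits output to one row per job.
def sacct_args(from:, to:, states: [], job_steps: true)
  args = ['sacct', '--noheader', '--parsable2', '--units=G']
  args << '--allocations' unless job_steps
  args << "--state=#{states.join(',')}" unless states.empty?
  args << "--starttime=#{from}" << "--endtime=#{to}"
  args
end

# Hypothetical filter for the alternative: keep the job row and its batch step,
# dropping other steps such as "1.extern" or "1.0".
def job_or_batch_step?(job_id)
  !job_id.include?('.') || job_id.end_with?('.batch')
end

with_steps = sacct_args(from: '2024-10-01T00:00:00', to: '2024-10-30T23:59:59',
                        states: %w[CD F TO])
allocations_only = sacct_args(from: '2024-10-01T00:00:00', to: '2024-10-30T23:59:59',
                              job_steps: false)
kept = %w[1 1.batch 1.extern 1.0].select { |id| job_or_batch_step?(id) }
```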

abujeda (Author) commented Oct 24, 2024

Thanks, Trey. We are only interested in overall job data, but we need the data in the job/batch steps for the memory usage.
We are currently merging all the job-related data into a single row, using the max value for CPU, Memory and Elapsed. We do the merging as part of the metrics calculations.
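
A minimal sketch of that merging step, assuming rows are hashes keyed by the sacct field names from the sample output. `to_seconds`, `merge_job_rows`, and the three derived keys are hypothetical names chosen for illustration; the actual metrics code may differ.

```ruby
# Converts a Slurm duration ([DD-][HH:]MM:SS[.mmm]) to seconds.
def to_seconds(value)
  return 0.0 if value.nil? || value.empty?

  days, clock = value.include?('-') ? value.split('-', 2) : ['0', value]
  parts = clock.split(':').map(&:to_f)
  parts.unshift(0.0) while parts.size < 3 # pad missing hours (and minutes)
  hours, minutes, seconds = parts
  days.to_i * 86_400 + hours * 3600 + minutes * 60 + seconds
end

# Groups job and step rows by base job ID and keeps the max of each metric.
def merge_job_rows(rows)
  rows.group_by { |row| row['JobId'].split('.').first }.map do |job_id, group|
    # Prefer the job row as the base, since it carries User, State and ReqTRES.
    parent = group.find { |row| row['JobId'] == job_id } || group.first
    parent.merge(
      'ElapsedSeconds' => group.map { |r| to_seconds(r['Elapsed']) }.max,
      'TotalCPUSeconds' => group.map { |r| to_seconds(r['TotalCPU']) }.max,
      'MaxRSSGigabytes' => group.map { |r| r['MaxRSS'].to_f }.max # "0.17G" -> 0.17
    )
  end
end

rows = [
  { 'JobId' => '2', 'User' => 'ood', 'Elapsed' => '01:00:01', 'State' => 'TIMEOUT',
    'TotalCPU' => '00:05.037', 'MaxRSS' => '' },
  { 'JobId' => '2.batch', 'User' => '', 'Elapsed' => '01:00:02', 'State' => 'CANCELLED',
    'TotalCPU' => '00:05.037', 'MaxRSS' => '0.17G' }
]

merged = merge_job_rows(rows)
```

The merged row keeps the job row's state and user while picking up the memory figure that only the batch step reports.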

We could do the merging inside the adapter code and create a partially populated Info object, but I am not sure the merging belongs at that level.

> I'd advocate utilizing things like Grafana and the current ability to link Grafana panels to a job's view

We will be looking at using Grafana for other metrics after completing the MVP for the Slurm metrics widget.
Could you point me to an example of how to link a Grafana panel to a job's view?

Comment on lines 758 to 760
def sacct_metrics(job_ids: [], states: [], from: nil, to: nil)
  @slurm.sacct_metrics(job_ids, states, from, to)
end
Contributor

I think we should have an actual historic_info API on the adapter class itself to mimic and extend the info API.

Not 100% sure on the method signature here, but these are all keywords so it should be OK for now.

Note that this API should probably respond with an array of Info objects (currently returns an array of hashes?).
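
One possible shape for the suggestion above, sketched under stated assumptions: `SketchAdapter`, `SlurmSketchAdapter`, and `HistoricInfo` are hypothetical stand-ins rather than the library's real classes, and the keyword signature is copied from the method under review. The base class only declares the API; the Slurm-specific class maps parsed sacct rows to objects instead of returning raw hashes.

```ruby
# Hypothetical stand-in for an Info object; only a few fields are shown.
HistoricInfo = Struct.new(:id, :job_owner, :status, keyword_init: true)

# Hypothetical base class declaring the top-level API.
class SketchAdapter
  def historic_info(job_ids: [], states: [], from: nil, to: nil)
    raise NotImplementedError, 'subclass did not define #historic_info'
  end
end

# Hypothetical Slurm implementation; `rows` stands in for parsed sacct output.
class SlurmSketchAdapter < SketchAdapter
  def initialize(rows)
    @rows = rows
  end

  # from/to are accepted for signature parity but ignored in this sketch.
  def historic_info(job_ids: [], states: [], from: nil, to: nil)
    @rows
      .select { |row| job_ids.empty? || job_ids.include?(row['JobId']) }
      .select { |row| states.empty? || states.include?(row['State']) }
      .map do |row|
        HistoricInfo.new(id: row['JobId'], job_owner: row['User'], status: row['State'])
      end
  end
end

adapter = SlurmSketchAdapter.new(
  [
    { 'JobId' => '1', 'User' => 'ood', 'State' => 'TIMEOUT' },
    { 'JobId' => '2', 'User' => 'ood', 'State' => 'COMPLETED' }
  ]
)
timed_out = adapter.historic_info(states: ['TIMEOUT'])
```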

Author

I made some changes to return an array of Info objects and added support for disabling job steps.

Still, when job steps are enabled, they are returned as regular Info objects. I am not sure this is the best approach at this point, but for our use case we need the steps for the memory metrics calculations.

Contributor

I think I also wanted a top-level historic_info on the Adapter class, with this method then being its implementation.

Author

OK - will make the changes and see how that looks.

Author

I updated the PR with a first implementation of the historic info interface.

treydock (Contributor) commented

> We will be looking at using Grafana for other metrics after completing the MVP for the Slurm metrics widget. Could you point me to an example on how to link a Grafana panel to a job's view

https://osc.github.io/ood-documentation/latest/customizations.html?highlight=grafana#grafana-support

https://grafana.com/grafana/dashboards/12093-ondemand-clusters/

Example:
[Image: Screenshot 2024-10-24 at 9 19 39 AM]

treydock (Contributor) commented

The Detailed Metrics link opens the dashboard URL with parameters set so that it shows just that job's data and nodes.

Successfully merging this pull request may close these issues.

Slurm Metrics - sacct command to get job historic data
3 participants