Slurm data for online monitoring - sonar slurmps #240

Open
lars-t-hansen opened this issue Jan 31, 2025 · 0 comments
Labels
enhancement New feature or request Logging

Comments

@lars-t-hansen
Collaborator

This is a little speculative, but it's something we probably want in order to integrate sonar data properly with Slurm data. With sonar slurm we extract information about jobs that have completed in the last hour. But we probably want the dashboard, which has profiling data, to have some Slurm data while the job is still running. So we probably want a lightweight-ish capability to extract information at the state changes (created)->PENDING, PENDING->RUNNING, and (whatever)->(completed). For the PENDING->RUNNING transition we want information about allocated resources, in particular GPU cards, because this information is lost once the job has completed. I'm not sure Slurm gives us anything else interesting at those stages, though clearly the info we get with sonar slurm would be interesting to have once the job reaches the completed state.

This job would run on only a single host on the cluster (maybe with some redundancy) and will not normally put load on the compute nodes, since it can run on a login or admin node. It should run as often as sonar ps, and ideally at roughly the same time.
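
To make the state changes listed above concrete, here is a minimal sketch of one way to detect them by comparing two successive polls of the job queue. It is an illustration only, not sonar's design: the `jobid|STATE` input format, the function names `parse_snapshot` and `transitions`, the `(gone)` marker, and the set of states treated as "completed" are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: the Slurm states this sketch treats as "(completed)".
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY",
}

Event = Tuple[str, str, str]  # (job id, old state, new state)


def parse_snapshot(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form 'jobid|STATE' into {jobid: STATE}."""
    jobs: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state = line.split("|", 1)
        jobs[job_id] = state
    return jobs


def transitions(prev: Dict[str, str], curr: Dict[str, str]) -> List[Event]:
    """Return the state changes of interest between two polls."""
    events: List[Event] = []
    for job_id, state in curr.items():
        old: Optional[str] = prev.get(job_id)
        if old == state:
            continue  # no change since the previous poll
        if state == "PENDING" and old is None:
            events.append((job_id, "(created)", "PENDING"))
        elif state == "RUNNING":
            # This is the point at which allocated resources (e.g. GPU
            # cards) would be captured, before the information is lost.
            events.append((job_id, old or "(created)", "RUNNING"))
        elif state in TERMINAL_STATES:
            events.append((job_id, old or "(unknown)", state))
    # Jobs that disappeared from the listing between polls: their final
    # state has to come from elsewhere (e.g. accounting data).
    for job_id, old in prev.items():
        if job_id not in curr and old not in TERMINAL_STATES:
            events.append((job_id, old, "(gone)"))
    return events


if __name__ == "__main__":
    before = parse_snapshot("1|PENDING\n2|RUNNING\n")
    after = parse_snapshot("1|RUNNING\n2|COMPLETED\n3|PENDING\n")
    for event in transitions(before, after):
        print(event)
```

Because each poll is diffed against the previous one, a job that goes from PENDING through RUNNING to completion between two polls is only seen in its final state, which is one reason to run this as often as sonar ps.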

@lars-t-hansen lars-t-hansen added enhancement New feature or request Logging labels Jan 31, 2025