Slurm data for online monitoring - `sonar slurmps` #240

lars-t-hansen · 2025-01-31T12:42:04Z

This is a little speculative, but it's something we probably want in order to integrate sonar data properly with Slurm data. With `sonar slurm` we extract information about jobs that have completed in the last hour. But we probably want the dashboard, which has profiling data, to have some Slurm data while the job is still running. So we probably want a lightweight-ish capability to extract information at the state changes (created)->PENDING, PENDING->RUNNING, and (whatever)->(completed). For the PENDING->RUNNING transition we want information about allocated resources, in particular GPU cards, since this information is lost once the job has completed. I'm not sure Slurm gives us anything else interesting at those stages, though clearly the info we get with `sonar slurm` would be interesting to have once the job reaches the completed state.

This job runs on only a single host on the cluster (maybe with some redundancy). It will not normally overload the compute nodes, since it can run on a login or admin node. It should run as often as `sonar ps` and ideally at roughly the same time.
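To make the state-change idea concrete, here is a minimal sketch of how two successive samples of the job queue could be compared to produce the transitions described above. It is not part of sonar: the names `Transition`, `diff_snapshots`, and `TERMINAL_STATES`, as well as the record fields `state` and `gpus`, are hypothetical, and the set of terminal states is a deliberately incomplete assumption. How the snapshots are obtained from Slurm is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a small, incomplete set of states treated as "(completed)".
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


@dataclass(frozen=True)
class Transition:
    job_id: str
    old_state: Optional[str]    # None means the job was not in the previous sample
    new_state: str
    gpus: Optional[str] = None  # only filled in when the job enters RUNNING


def diff_snapshots(previous, current):
    """Return the interesting state changes between two samples.

    previous: dict mapping job id -> state seen in the last sample.
    current:  dict mapping job id -> record dict with a 'state' key and,
              for running jobs, an optional 'gpus' key (hypothetical fields).
    """
    events = []
    for job_id, record in sorted(current.items()):
        new_state = record["state"]
        old_state = previous.get(job_id)
        if old_state == new_state:
            continue  # no change since the last sample
        if old_state is None and new_state == "PENDING":
            # (created) -> PENDING
            events.append(Transition(job_id, None, new_state))
        elif new_state == "RUNNING":
            # PENDING -> RUNNING: record allocated GPUs while they are still known
            events.append(Transition(job_id, old_state, new_state, record.get("gpus")))
        elif new_state in TERMINAL_STATES:
            # (whatever) -> (completed)
            events.append(Transition(job_id, old_state, new_state))
    return events


if __name__ == "__main__":
    prev = {"1": "PENDING", "2": "RUNNING"}
    cur = {
        "1": {"state": "RUNNING", "gpus": "gpu:2"},
        "2": {"state": "COMPLETED"},
        "3": {"state": "PENDING"},
    }
    for event in diff_snapshots(prev, cur):
        print(event)
```

This sketch only reports jobs that are still present in the current sample. A job that finishes and vanishes from the queue between two samples would need separate handling, for example by treating a missing job id as completed and leaving the details to `sonar slurm`.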

lars-t-hansen added the enhancement and Logging labels on Jan 31, 2025
