Prometheus exporter #274

Closed · rezib opened this issue May 14, 2024 · 2 comments · Fixed by #363
Assignees: rezib
Labels: feature (New feature or enhancement to develop.), need sponsor (Rackslab needs funding from customers to work on this task.)
Milestone: v4.0.0

Comments

rezib (Contributor) commented May 14, 2024

Add an export endpoint exposing Slurm scheduling, job and node metrics, to be consumed by Prometheus for external monitoring.

rezib added the "feature" label on May 14, 2024
rezib added the "need sponsor" label on Jul 3, 2024
rezib (Contributor, Author) commented Jul 3, 2024

Please note that Rackslab needs financial support from customers to work on this task.

Slurm-web is free software (GPLv3) without licence fees. Rackslab strongly believes in this model, in which everyone can use Slurm-web regardless of their situation. However, Slurm-web development cannot happen without funding from some organizations. This funding is essential to make the project sustainable over the long term.

If your team wants this feature to land in Slurm-web, your organization can order its development from Rackslab. You will have the opportunity to take part in defining its functional specification to match your needs. This is the best way to secure its integration and delivery in the next major release. Contact us to get a quote!

rezib (Contributor, Author) commented Sep 11, 2024

As an example, here are the metrics produced on a development cluster by https://github.com/vpenso/prometheus-slurm-exporter:

# HELP slurm_account_cpus_running Running cpus for account
# TYPE slurm_account_cpus_running gauge
slurm_account_cpus_running{account="biology"} 2
# HELP slurm_account_fairshare FairShare for account
# TYPE slurm_account_fairshare gauge
slurm_account_fairshare{account="admin"} 0
slurm_account_fairshare{account="biology"} 0
slurm_account_fairshare{account="physic"} 0
slurm_account_fairshare{account="root"} 1
slurm_account_fairshare{account="scientists"} 0
# HELP slurm_account_jobs_pending Pending jobs for account
# TYPE slurm_account_jobs_pending gauge
slurm_account_jobs_pending{account="admin"} 2
slurm_account_jobs_pending{account="biology"} 6
slurm_account_jobs_pending{account="optic"} 2
# HELP slurm_account_jobs_running Running jobs for account
# TYPE slurm_account_jobs_running gauge
slurm_account_jobs_running{account="biology"} 2
# HELP slurm_cpus_alloc Allocated CPUs
# TYPE slurm_cpus_alloc gauge
slurm_cpus_alloc 2
# HELP slurm_cpus_idle Idle CPUs
# TYPE slurm_cpus_idle gauge
slurm_cpus_idle 2
# HELP slurm_cpus_other Mix CPUs
# TYPE slurm_cpus_other gauge
slurm_cpus_other 0
# HELP slurm_cpus_total Total CPUs
# TYPE slurm_cpus_total gauge
slurm_cpus_total 4
# HELP slurm_node_cpu_alloc Allocated CPUs per node
# TYPE slurm_node_cpu_alloc gauge
slurm_node_cpu_alloc{node="cn1",status="mixed"} 1
slurm_node_cpu_alloc{node="cn2",status="mixed"} 1
# HELP slurm_node_cpu_idle Idle CPUs per node
# TYPE slurm_node_cpu_idle gauge
slurm_node_cpu_idle{node="cn1",status="mixed"} 1
slurm_node_cpu_idle{node="cn2",status="mixed"} 1
# HELP slurm_node_cpu_other Other CPUs per node
# TYPE slurm_node_cpu_other gauge
slurm_node_cpu_other{node="cn1",status="mixed"} 0
slurm_node_cpu_other{node="cn2",status="mixed"} 0
# HELP slurm_node_cpu_total Total CPUs per node
# TYPE slurm_node_cpu_total gauge
slurm_node_cpu_total{node="cn1",status="mixed"} 2
slurm_node_cpu_total{node="cn2",status="mixed"} 2
# HELP slurm_node_mem_alloc Allocated memory per node
# TYPE slurm_node_mem_alloc gauge
slurm_node_mem_alloc{node="cn1",status="mixed"} 1
slurm_node_mem_alloc{node="cn2",status="mixed"} 1
# HELP slurm_node_mem_total Total memory per node
# TYPE slurm_node_mem_total gauge
slurm_node_mem_total{node="cn1",status="mixed"} 1
slurm_node_mem_total{node="cn2",status="mixed"} 1
# HELP slurm_nodes_alloc Allocated nodes
# TYPE slurm_nodes_alloc gauge
slurm_nodes_alloc 0
# HELP slurm_nodes_comp Completing nodes
# TYPE slurm_nodes_comp gauge
slurm_nodes_comp 0
# HELP slurm_nodes_down Down nodes
# TYPE slurm_nodes_down gauge
slurm_nodes_down 0
# HELP slurm_nodes_drain Drain nodes
# TYPE slurm_nodes_drain gauge
slurm_nodes_drain 0
# HELP slurm_nodes_err Error nodes
# TYPE slurm_nodes_err gauge
slurm_nodes_err 0
# HELP slurm_nodes_fail Fail nodes
# TYPE slurm_nodes_fail gauge
slurm_nodes_fail 0
# HELP slurm_nodes_idle Idle nodes
# TYPE slurm_nodes_idle gauge
slurm_nodes_idle 0
# HELP slurm_nodes_maint Maint nodes
# TYPE slurm_nodes_maint gauge
slurm_nodes_maint 0
# HELP slurm_nodes_mix Mix nodes
# TYPE slurm_nodes_mix gauge
slurm_nodes_mix 2
# HELP slurm_nodes_resv Reserved nodes
# TYPE slurm_nodes_resv gauge
slurm_nodes_resv 0
# HELP slurm_partition_cpus_allocated Allocated CPUs for partition
# TYPE slurm_partition_cpus_allocated gauge
slurm_partition_cpus_allocated{partition="normal"} 2
# HELP slurm_partition_cpus_idle Idle CPUs for partition
# TYPE slurm_partition_cpus_idle gauge
slurm_partition_cpus_idle{partition="normal"} 2
# HELP slurm_partition_cpus_total Total CPUs for partition
# TYPE slurm_partition_cpus_total gauge
slurm_partition_cpus_total{partition="normal"} 4
# HELP slurm_partition_jobs_pending Pending jobs for partition
# TYPE slurm_partition_jobs_pending gauge
slurm_partition_jobs_pending{partition="normal"} 10
# HELP slurm_queue_cancelled Cancelled jobs in the cluster
# TYPE slurm_queue_cancelled gauge
slurm_queue_cancelled 0
# HELP slurm_queue_completed Completed jobs in the cluster
# TYPE slurm_queue_completed gauge
slurm_queue_completed 1
# HELP slurm_queue_completing Completing jobs in the cluster
# TYPE slurm_queue_completing gauge
slurm_queue_completing 0
# HELP slurm_queue_configuring Configuring jobs in the cluster
# TYPE slurm_queue_configuring gauge
slurm_queue_configuring 0
# HELP slurm_queue_failed Number of failed jobs
# TYPE slurm_queue_failed gauge
slurm_queue_failed 0
# HELP slurm_queue_node_fail Number of jobs stopped due to node fail
# TYPE slurm_queue_node_fail gauge
slurm_queue_node_fail 0
# HELP slurm_queue_pending Pending jobs in queue
# TYPE slurm_queue_pending gauge
slurm_queue_pending 10
# HELP slurm_queue_pending_dependency Pending jobs because of dependency in queue
# TYPE slurm_queue_pending_dependency gauge
slurm_queue_pending_dependency 0
# HELP slurm_queue_preempted Number of preempted jobs
# TYPE slurm_queue_preempted gauge
slurm_queue_preempted 0
# HELP slurm_queue_running Running jobs in the cluster
# TYPE slurm_queue_running gauge
slurm_queue_running 2
# HELP slurm_queue_suspended Suspended jobs in the cluster
# TYPE slurm_queue_suspended gauge
slurm_queue_suspended 0
# HELP slurm_queue_timeout Jobs stopped by timeout
# TYPE slurm_queue_timeout gauge
slurm_queue_timeout 0
# HELP slurm_scheduler_backfill_depth_mean Information provided by the Slurm sdiag command, scheduler backfill mean depth
# TYPE slurm_scheduler_backfill_depth_mean gauge
slurm_scheduler_backfill_depth_mean 9
# HELP slurm_scheduler_backfill_last_cycle Information provided by the Slurm sdiag command, scheduler backfill last cycle time in (microseconds)
# TYPE slurm_scheduler_backfill_last_cycle gauge
slurm_scheduler_backfill_last_cycle 1148
# HELP slurm_scheduler_backfill_mean_cycle Information provided by the Slurm sdiag command, scheduler backfill mean cycle time in (microseconds)
# TYPE slurm_scheduler_backfill_mean_cycle gauge
slurm_scheduler_backfill_mean_cycle 1153
# HELP slurm_scheduler_backfilled_heterogeneous_total Information provided by the Slurm sdiag command, number of heterogeneous job components started thanks to backfilling since last Slurm start
# TYPE slurm_scheduler_backfilled_heterogeneous_total gauge
slurm_scheduler_backfilled_heterogeneous_total 0
# HELP slurm_scheduler_backfilled_jobs_since_cycle_total Information provided by the Slurm sdiag command, number of jobs started thanks to backfilling since last time stats where reset
# TYPE slurm_scheduler_backfilled_jobs_since_cycle_total gauge
slurm_scheduler_backfilled_jobs_since_cycle_total 3
# HELP slurm_scheduler_backfilled_jobs_since_start_total Information provided by the Slurm sdiag command, number of jobs started thanks to backfilling since last slurm start
# TYPE slurm_scheduler_backfilled_jobs_since_start_total gauge
slurm_scheduler_backfilled_jobs_since_start_total 3
# HELP slurm_scheduler_cycle_per_minute Information provided by the Slurm sdiag command, number scheduler cycles per minute
# TYPE slurm_scheduler_cycle_per_minute gauge
slurm_scheduler_cycle_per_minute 1
# HELP slurm_scheduler_dbd_queue_size Information provided by the Slurm sdiag command, length of the DBD agent queue
# TYPE slurm_scheduler_dbd_queue_size gauge
slurm_scheduler_dbd_queue_size 0
# HELP slurm_scheduler_last_cycle Information provided by the Slurm sdiag command, scheduler last cycle time in (microseconds)
# TYPE slurm_scheduler_last_cycle gauge
slurm_scheduler_last_cycle 251
# HELP slurm_scheduler_mean_cycle Information provided by the Slurm sdiag command, scheduler mean cycle time in (microseconds)
# TYPE slurm_scheduler_mean_cycle gauge
slurm_scheduler_mean_cycle 971
# HELP slurm_scheduler_queue_size Information provided by the Slurm sdiag command, length of the scheduler queue
# TYPE slurm_scheduler_queue_size gauge
slurm_scheduler_queue_size 0
# HELP slurm_scheduler_threads Information provided by the Slurm sdiag command, number of scheduler threads 
# TYPE slurm_scheduler_threads gauge
slurm_scheduler_threads 3
# HELP slurm_user_cpus_running Running cpus for user
# TYPE slurm_user_cpus_running gauge
slurm_user_cpus_running{user="cgross"} 1
slurm_user_cpus_running{user="dshaw"} 1
# HELP slurm_user_jobs_pending Pending jobs for user
# TYPE slurm_user_jobs_pending gauge
slurm_user_jobs_pending{user="cgross"} 4
slurm_user_jobs_pending{user="jsherman"} 2
slurm_user_jobs_pending{user="rwallace"} 2
slurm_user_jobs_pending{user="sharrison"} 2
# HELP slurm_user_jobs_running Running jobs for user
# TYPE slurm_user_jobs_running gauge
slurm_user_jobs_running{user="cgross"} 1
slurm_user_jobs_running{user="dshaw"} 1

We should explore the possibility of supporting all these metrics with https://github.com/prometheus/client_python.
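
As a rough starting point, here is a minimal sketch of how a couple of the metrics above could be exposed with that library. This is not Slurm-web's actual implementation: the get_slurm_nodes() helper and its static values are hypothetical placeholders for whatever interface Slurm-web uses to query node state.

from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()

# Gauges mirroring two of the metrics listed in the sample output above.
nodes_mix = Gauge("slurm_nodes_mix", "Mix nodes", registry=registry)
node_cpu_alloc = Gauge(
    "slurm_node_cpu_alloc",
    "Allocated CPUs per node",
    labelnames=["node", "status"],
    registry=registry,
)

def get_slurm_nodes():
    # Hypothetical data source; in Slurm-web this information would come
    # from its Slurm backend, not from hard-coded values.
    return [
        {"name": "cn1", "state": "mixed", "alloc_cpus": 1},
        {"name": "cn2", "state": "mixed", "alloc_cpus": 1},
    ]

def render_metrics():
    # Refresh gauge values, then render them in the exposition format
    # shown in the sample output above.
    nodes = get_slurm_nodes()
    nodes_mix.set(sum(1 for node in nodes if node["state"] == "mixed"))
    for node in nodes:
        node_cpu_alloc.labels(node=node["name"], status=node["state"]).set(
            node["alloc_cpus"]
        )
    return generate_latest(registry)

if __name__ == "__main__":
    print(render_metrics().decode())

Serving the output of such a render function from a /metrics HTTP endpoint would be enough for Prometheus to scrape it.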

rezib added this to the v4.0.0 milestone on Sep 30, 2024
rezib self-assigned this on Oct 9, 2024
rezib added a commit that referenced this issue Oct 23, 2024
Add optional /metrics endpoint with various Slurm metrics in OpenMetrics
format designed to be scraped by Prometheus or compatible.

fix #274
rezib added a commit that referenced this issue Oct 24, 2024
Add optional /metrics endpoint with various Slurm metrics in OpenMetrics
format designed to be scraped by Prometheus or compatible.

fix #274
rezib closed this as completed in 93ed234 on Oct 28, 2024