diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5b7aff48..71dc8be9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -25,7 +25,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - show-conf: Introduce `slurm-web-show-conf` utility to dump current
   configuration settings of gateway and agent components with their origin,
   which can either be configuration definition file or site override (#349).
-- docs: Add manpage for `slurm-web-show-conf` command.
+- docs:
+  - Add manpage for `slurm-web-show-conf` command.
+  - Add metrics export configuration documentation.
+  - Mention metrics export optional feature in quickstart guide.
+  - Mention metrics export feature in overview page.
+  - Mention possible Prometheus integration in architecture page.
 
 ### Changed
 - docs: Update configuration reference documentation.
diff --git a/docs/modules/conf/nav.adoc b/docs/modules/conf/nav.adoc
index 98a6965d..84b9087c 100644
--- a/docs/modules/conf/nav.adoc
+++ b/docs/modules/conf/nav.adoc
@@ -6,3 +6,4 @@
 ** xref:conf/gateway.adoc[Gateway]
 ** xref:conf/agent.adoc[Agent]
 * xref:policy.adoc[]
+* xref:metrics.adoc[]
diff --git a/docs/modules/conf/pages/metrics.adoc b/docs/modules/conf/pages/metrics.adoc
new file mode 100644
index 00000000..5300c917
--- /dev/null
+++ b/docs/modules/conf/pages/metrics.adoc
@@ -0,0 +1,121 @@
+= Metrics Export
+
+Slurm-web agent can export metrics in standard OpenMetrics format on `/metrics`
+endpoint. This is designed to be scraped by Prometheus (or compatible) in order
+to store metrics in timeseries databases and draw diagrams of historical data.
+
+This page explains how to enable and secure this feature by
+<<restrict,restricting access>> to specific hosts and
+<<prometheus,configuring Prometheus>> to scrape these metrics. It also provides a
+<<reference,reference of available metrics>>.
+
+== Configuration
+
+The metrics export feature is disabled by default. It can be enabled with the
+following lines in [.path]#`/etc/slurm-web/agent.ini`#:
+
+[source,ini]
+----
+[metrics]
+enabled=yes
+----
+
+.More details
+****
+* xref:conf/agent.adoc#_metrics[Agent configuration metrics section reference documentation].
+****
+
+[#restrict]
+== Host Restriction
+
+For security reasons, Slurm-web agent restricts access to `/metrics` endpoint to
+localhost only. When Prometheus is running on external hosts, you must define
+`restrict` parameter in [.path]#`/etc/slurm-web/agent.ini`# to allow other
+networks explicitly. For example:
+
+[source,ini]
+----
+[metrics]
+enabled=yes
+restrict=
+  192.168.1.0/24
+  10.0.0.251/32
+----
+
+In this example, all IP addresses in range `192.168.1.[0-255]` and `10.0.0.251`
+are permitted to request metrics.
+
+.More details
+****
+* xref:conf/agent.adoc#_metrics[Agent configuration reference documentation for metrics section].
+****
+
+[#prometheus]
+== Prometheus Integration
+
+Prometheus must be configured to request `/metrics` endpoint of Slurm-web agent.
+Edit [.path]#`/etc/prometheus/prometheus.yml`# to add one of the following
+configuration snippets depending on your setup:
+
+* Slurm-web agent running as native service (i.e. with
+`slurm-web-agent.service`):
+
+[source,yaml]
+----
+scrape_configs:
+  - job_name: slurm
+    scrape_interval: 30s
+    static_configs:
+      - targets: ['localhost:5012']
+----
+
+* Slurm-web agent running on xref:wsgi/index.adoc[production HTTP server]:
+
+[source,yaml]
+----
+scrape_configs:
+  - job_name: slurm
+    scrape_interval: 30s
+    metrics_path: /agent/metrics
+    static_configs:
+      - targets: ['localhost:80']
+----
+
+NOTE: You may need to adjust the target hostname, typically if Prometheus is
+running on a remote host, and destination port (for example 443 for HTTPS).
+
+.Reference
+****
+* https://prometheus.io/docs/prometheus/latest/configuration/configuration/[Prometheus Official Configuration Documentation].
+****
+
+[#reference]
+== Available Metrics
+
+This table describes all metrics exported by Slurm-web:
+
+[cols="1l,3a"]
+|===
+|Metric|Description
+
+|slurm_nodes[state]
+|Number of compute nodes in a given state. Supported states are: _idle_,
+_mixed_, _allocated_, _down_, _drain_ and _unknown_.
+
+|slurm_nodes_total
+|Total number of compute nodes managed by Slurm.
+
+|slurm_cores[state]
+|Number of cores of compute nodes in a given state. Supported states are:
+_idle_, _mixed_, _allocated_, _down_, _drain_ and _unknown_.
+
+|slurm_cores_total
+|Total number of cores on compute nodes managed by Slurm.
+
+|slurm_jobs[state]
+|Number of jobs in a given state in Slurm controller queue. Supported states
+are: _running_, _completed_, _completing_, _cancelled_, _pending_ and _unknown_.
+
+|slurm_jobs_total
+|Total number of jobs in Slurm controller queue.
+|===
diff --git a/docs/modules/install/pages/quickstart.adoc b/docs/modules/install/pages/quickstart.adoc
index 887db4d9..62dd8011 100644
--- a/docs/modules/install/pages/quickstart.adoc
+++ b/docs/modules/install/pages/quickstart.adoc
@@ -637,6 +637,28 @@ xref:misc:troubleshooting.adoc#wsgi[troubleshooting guide] for help.
 * xref:conf:wsgi/index.adoc[Production HTTP server setup guide].
 ****
 
+== Metrics (optional)
+
+Slurm-web offers the possibility to
+xref:overview:overview.adoc#metrics[export Slurm metrics] in
+https://openmetrics.io/[OpenMetrics] format and integrate with
+https://prometheus.io/[Prometheus]. This feature can be used to store metrics in
+timeseries databases and draw diagrams of historical data.
+
+This feature is disabled by default. It can be enabled with the following lines
+in [.path]#`/etc/slurm-web/agent.ini`#:
+
+[source,ini]
+----
+[metrics]
+enabled=yes
+----
+
+.More details
+****
+* xref:conf:metrics.adoc[Metrics export configuration documentation].
+****
+
 == Multi-clusters
 
 Slurm-web is designed to support
diff --git a/docs/modules/overview/images/arch/slurm-web_integration.png b/docs/modules/overview/images/arch/slurm-web_integration.png
index 45296b19..98bc43c8 100644
Binary files a/docs/modules/overview/images/arch/slurm-web_integration.png and b/docs/modules/overview/images/arch/slurm-web_integration.png differ
diff --git a/docs/modules/overview/images/arch/slurm-web_integration.svg b/docs/modules/overview/images/arch/slurm-web_integration.svg
index 8ef99889..82bfdacb 100644
--- a/docs/modules/overview/images/arch/slurm-web_integration.svg
+++ b/docs/modules/overview/images/arch/slurm-web_integration.svg
@@ -3,18 +3,18 @@
[SVG markup omitted. Diagram labels: slurmrestd, slurmdbd, frontend, gateway, Slurm-web, Slurm, LDAP directory, Redis cache, Policy, RacksDB, agent, Prometheus, timeseries.]
diff --git a/docs/modules/overview/images/slurm-web_metrics.png b/docs/modules/overview/images/slurm-web_metrics.png
new file mode 100644
index 00000000..e9c2e42a
Binary files /dev/null and b/docs/modules/overview/images/slurm-web_metrics.png differ
diff --git a/docs/modules/overview/images/slurm-web_metrics.svg b/docs/modules/overview/images/slurm-web_metrics.svg
new file mode 100644
index 00000000..261c3063
--- /dev/null
+++ b/docs/modules/overview/images/slurm-web_metrics.svg
@@ -0,0 +1,175 @@
[SVG markup omitted. Diagram labels: Slurm-web, Slurm.]
diff --git a/docs/modules/overview/pages/architecture.adoc b/docs/modules/overview/pages/architecture.adoc
index b093f4cd..d3e4c949 100644
--- a/docs/modules/overview/pages/architecture.adoc
+++ b/docs/modules/overview/pages/architecture.adoc
@@ -47,8 +47,12 @@ permissions associated to roles and LDAP groups.
 The component also extracts cluster racking topology from
 xref:racksdb:overview:start.adoc[RacksDB] database to generate
 xref:overview.adoc#nodes-status[graphical representation of nodes status] in the
-racks. Finally, it connects to https://redis.io/[Redis] in-memory key/value
-database to save cached data from Slurm.
+racks.
+
+Optionally, the *agent* can connect to https://redis.io/[Redis] in-memory
+key/value database to save cached data from Slurm. It can also
+xref:overview.adoc#metrics[export metrics] to https://prometheus.io/[Prometheus]
+(or compatible) in order to store values in timeseries databases.
 
 [#protocols]
 == Protocols
diff --git a/docs/modules/overview/pages/overview.adoc b/docs/modules/overview/pages/overview.adoc
index 4129591d..f716f4d3 100644
--- a/docs/modules/overview/pages/overview.adoc
+++ b/docs/modules/overview/pages/overview.adoc
@@ -125,3 +125,20 @@ image::slurm-web_transparent_cache.png[]
 
 Users are able to track jobs list in near real-time very efficiently. Finally
 drop the load generated by infinite loops of `squeue`!
+
+[#metrics]
+== Metrics
+
+image::slurm-web_metrics.png[]
+
+Slurm-web can export many metrics about cluster status and jobs. These
+metrics are exported in standard https://openmetrics.io/[OpenMetrics] format,
+designed to be scraped by https://prometheus.io/[Prometheus] (or any compatible
+solution) to store in timeseries databases. Diagrams of these metrics provide
+historical views of your production.
+
+[sidebar]
+--
+.More links
+* xref:conf:metrics.adoc[Metrics export configuration documentation]
+--
diff --git a/docs/utils/build.yaml b/docs/utils/build.yaml
index 5561ec79..a10b67b2 100644
--- a/docs/utils/build.yaml
+++ b/docs/utils/build.yaml
@@ -3,6 +3,7 @@
 # generated.
 diagrams:
   modules/overview/images/slurm-web_transparent_cache.svg: medium
+  modules/overview/images/slurm-web_metrics.svg: large
   modules/overview/images/arch/slurm-web_architecture.svg: medium
   modules/overview/images/arch/slurm-web_distribution.svg: medium
   modules/overview/images/arch/slurm-web_integration.svg: medium
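
The `restrict` example in `metrics.adoc` lists networks in CIDR notation. As a rough illustration of what such a list permits, the following sketch checks client addresses against it with Python's standard `ipaddress` module. It is only an assumption about the matching semantics, not Slurm-web's actual implementation; the names `RESTRICT` and `is_allowed` are hypothetical.

```python
import ipaddress
from typing import Iterable

# Hypothetical values copied from the `restrict` example in metrics.adoc.
RESTRICT = ["192.168.1.0/24", "10.0.0.251/32"]


def is_allowed(client_ip: str, restrict: Iterable[str]) -> bool:
    """Return True if client_ip belongs to at least one permitted network.

    Illustrative helper only: this is an assumption about how a CIDR
    allow-list is evaluated, not code taken from Slurm-web.
    """
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in restrict)


if __name__ == "__main__":
    # A /24 covers every address from .0 to .255; a /32 matches one host.
    for ip in ("192.168.1.255", "192.168.2.1", "10.0.0.251", "10.0.0.250"):
        print(ip, is_allowed(ip, RESTRICT))
```

Under this reading, `192.168.1.0/24` covers the whole `192.168.1.0` to `192.168.1.255` range, while `10.0.0.251/32` admits a single host.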