Prometheus metrics endpoint #363

Merged
merged 10 commits into from
Oct 28, 2024
24 changes: 19 additions & 5 deletions CHANGELOG.md
@@ -8,26 +8,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [unreleased]

### Added
- agent: Return RacksDB infrastructure name in `/info` endpoint in complement of
the cluster name.
- agent:
- Return RacksDB infrastructure name in `/info` endpoint in complement of
the cluster name.
- Add optional `/metrics` endpoint with various Slurm metrics in OpenMetrics
format designed to be scraped by Prometheus or compatible (#274).
- gateway: Return RacksDB infrastructure name of every cluster in
`/clusters` endpoint.
- frontend:
- Request RacksDB with the infrastructure name provided by the gateway (#348).
- Display time limit of running jobs in job details page (#352).
- conf: Add `racksdb` > `infrastructure` parameter for the agent.
- conf:
- Add `racksdb` > `infrastructure` parameter for the agent.
- Add `metrics` > `enabled` parameter for the agent.
- Add `metrics` > `restrict` parameter for the agent.
- show-conf: Introduce `slurm-web-show-conf` utility to dump current
configuration settings of gateway and agent components with their origin,
which can either be configuration definition file or site override (#349).
- docs: Add manpage for `slurm-web-show-conf` command.
- docs:
- Add manpage for `slurm-web-show-conf` command.
- Add metrics export configuration documentation.
- Mention metrics export optional feature in quickstart guide.
- Mention metrics export feature in overview page.
- Mention possible Prometheus integration in architecture page.

### Changed
- docs: Update configuration reference documentation.
- conf:
- Convert `[cache]` > `password` agent parameter from string to password type.
- Convert `[ldap]` > `bind_password` gateway parameter from string to password
type.
- pkgs: Add requirement on RFL.settings and RFL.core >= 1.1.0.
- pkgs:
- Add requirement on RFL.core >= 1.1.0.
- Add requirement on RFL.settings >= 1.1.1.
- Add dependency on prometheus-client for the agent.

### Fixed
- agent:
15 changes: 15 additions & 0 deletions conf/vendor/agent.yml
@@ -354,3 +354,18 @@ cache:
type: int
default: 60
doc: Expiration delay in seconds for accounts in cache

metrics:
  enabled:
    type: bool
    default: false
    doc: |
      Determine if metrics feature and integration with Prometheus (or
      compatible) is enabled.
  restrict:
    type: list
    content: network
    default:
    - 127.0.0.0/24
    doc: |
      Restricted list of IP networks permitted to request metrics.
3 changes: 3 additions & 0 deletions dev/conf/agent.ini.j2
@@ -26,3 +26,6 @@ infrastructure={{ infrastructure }}
enabled={{ cache_enabled }}
port={{ redis_port }}
password={{ redis_password }}

[metrics]
enabled=yes
13 changes: 13 additions & 0 deletions docs/modules/conf/examples/agent.ini
@@ -457,3 +457,16 @@ reservations=60
#
# Default value: 60
accounts=60

[metrics]

# Determine if metrics feature and integration with Prometheus (or
# compatible) is enabled.
enabled=no

# Restricted list of IP networks permitted to request metrics.
#
# Default value:
# - 127.0.0.0/24
restrict=
  127.0.0.0/24
1 change: 1 addition & 0 deletions docs/modules/conf/nav.adoc
@@ -6,3 +6,4 @@
** xref:conf/gateway.adoc[Gateway]
** xref:conf/agent.adoc[Agent]
* xref:policy.adoc[]
* xref:metrics.adoc[]
121 changes: 121 additions & 0 deletions docs/modules/conf/pages/metrics.adoc
@@ -0,0 +1,121 @@
= Metrics Export

The Slurm-web agent can export metrics in the standard OpenMetrics format on
its `/metrics` endpoint. This endpoint is designed to be scraped by Prometheus
(or a compatible tool) in order to store metrics in a time-series database and
draw diagrams of historical data.

This page explains how to enable this feature, how to secure it by
<<restrict,restricting access>> to specific hosts and how to
<<prometheus,configure Prometheus>> to scrape these metrics. It also provides a
<<reference,reference list of all available metrics>>.

== Configuration

The metrics export feature is disabled by default. It can be enabled with the
following lines in [.path]#`/etc/slurm-web/agent.ini`#:

[source,ini]
----
[metrics]
enabled=yes
----

.More details
****
* xref:conf/agent.adoc#_metrics[Agent configuration metrics section reference documentation].
****

[#restrict]
== Host Restriction

For security reasons, the Slurm-web agent restricts access to the `/metrics`
endpoint to localhost by default. When Prometheus is running on an external
host, you must define the `restrict` parameter in
[.path]#`/etc/slurm-web/agent.ini`# to explicitly allow other networks. For
example:

[source,ini]
----
[metrics]
enabled=yes
restrict=
  192.168.1.0/24
  10.0.0.251/32
----

In this example, all IP addresses in the range `192.168.1.[0-255]` and the
single address `10.0.0.251` are permitted to request metrics.
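
The effect of such a `restrict` list can be checked offline. The following is a minimal sketch, assuming the value is parsed as one CIDR network per indented line. It uses Python's standard `configparser` and `ipaddress` modules for illustration only and is not the agent's actual implementation. The helper names `load_restricted_networks` and `is_permitted` are hypothetical.

```python
import configparser
import ipaddress

# Hypothetical agent.ini content mirroring the example above. Continuation
# lines are indented so that configparser reads them as part of the value.
AGENT_INI = """
[metrics]
enabled=yes
restrict=
  192.168.1.0/24
  10.0.0.251/32
"""


def load_restricted_networks(ini_text):
    """Return the networks declared in [metrics] > restrict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # One network per line; split() also discards the empty first line.
    return [
        ipaddress.ip_network(item)
        for item in parser.get("metrics", "restrict").split()
    ]


def is_permitted(client_ip, networks):
    """Tell whether client_ip belongs to at least one permitted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


networks = load_restricted_networks(AGENT_INI)
for ip in ("192.168.1.42", "10.0.0.251", "10.0.0.252"):
    print(ip, is_permitted(ip, networks))
```

With these networks, `192.168.1.42` and `10.0.0.251` are accepted while `10.0.0.252` is rejected, because the `/32` prefix matches a single address only.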

.More details
****
* xref:conf/agent.adoc#_metrics[Agent configuration reference documentation for metrics section].
****

[#prometheus]
== Prometheus Integration

Prometheus must be configured to request the `/metrics` endpoint of the
Slurm-web agent. Edit [.path]#`/etc/prometheus/prometheus.yml`# to add one of
the following configuration snippets depending on your setup:

* Slurm-web agent running as a native service (i.e. with
`slurm-web-agent.service`):

[source,yaml]
----
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:5012']
----

* Slurm-web agent running on a xref:wsgi/index.adoc[production HTTP server]:

[source,yaml]
----
scrape_configs:
  - job_name: slurm
    scrape_interval: 30s
    metrics_path: /agent/metrics
    static_configs:
      - targets: ['localhost:80']
----

NOTE: You may need to adjust the target hostname, typically if Prometheus is
running on a remote host, and the destination port (for example 443 for HTTPS).

.Reference
****
* https://prometheus.io/docs/prometheus/latest/configuration/configuration/[Prometheus Official Configuration Documentation].
****
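
Before pointing Prometheus at the agent, it can help to request the endpoint by hand from the host that will run Prometheus. The sketch below derives the two URLs implied by the snippets above (Prometheus defaults `metrics_path` to `/metrics`) and provides a small fetch helper based on the Python standard library. The helper names `metrics_url` and `fetch_metrics` are hypothetical, and the hosts and ports are the example values from this page, not guaranteed defaults of your deployment.

```python
import urllib.request


def metrics_url(host, port, path="/metrics", scheme="http"):
    """Build the URL that Prometheus would scrape for a given target."""
    return f"{scheme}://{host}:{port}{path}"


def fetch_metrics(url, timeout=5):
    """Return the raw metrics payload served at url (performs an HTTP GET)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Agent running as a native service: default metrics path on port 5012.
NATIVE_URL = metrics_url("localhost", 5012)
# Agent behind a production HTTP server: metrics_path set to /agent/metrics.
WSGI_URL = metrics_url("localhost", 80, path="/agent/metrics")

print(NATIVE_URL)
print(WSGI_URL)
```

Calling `fetch_metrics(NATIVE_URL)` from a host covered by the `restrict` parameter should return the metrics payload. Whether and how a request from any other host is rejected depends on the agent and is not shown here.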

[#reference]
== Available Metrics

This table describes all metrics exported by Slurm-web:

[cols="1l,3a"]
|===
|Metric|Description

|slurm_nodes[state]
|Number of compute nodes in a given state. Supported states are: _idle_,
_mixed_, _allocated_, _down_, _drain_ and _unknown_.

|slurm_nodes_total
|Total number of compute nodes managed by Slurm.

|slurm_cores[state]
|Number of cores of compute nodes in a given state. Supported states are:
_idle_, _mixed_, _allocated_, _down_, _drain_ and _unknown_.

|slurm_cores_total
|Total number of cores on compute nodes managed by Slurm.

|slurm_jobs[state]
|Number of jobs in a given state in Slurm controller queue. Supported states
are: _running_, _completed_, _completing_, _cancelled_, _pending_ and _unknown_.

|slurm_jobs_total
|Total number of jobs in Slurm controller queue.
|===
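
To illustrate how these metrics might be consumed outside Prometheus, here is a minimal sketch that parses a small text payload with the Python standard library. The payload is hypothetical: the `[state]` notation in the table does not specify the wire format, so the sketch assumes the state is exposed as a `state` label, and the values are invented for illustration. The `parse_samples` helper is also hypothetical and only handles this simple shape.

```python
import re

# Hypothetical payload: the state="..." label layout and the values are
# assumptions made for illustration, not output captured from Slurm-web.
SAMPLE = """\
# TYPE slurm_nodes gauge
slurm_nodes{state="idle"} 12.0
slurm_nodes{state="mixed"} 3.0
slurm_nodes{state="allocated"} 5.0
slurm_nodes_total 20.0
# EOF
"""

# Matches "name value" or "name{state="..."} value" sample lines only.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{state="(?P<state>[^"]*)"\})?'
    r'\s+(?P<value>\S+)$'
)


def parse_samples(text):
    """Map (metric name, state or None) to its float value."""
    samples = {}
    for line in text.splitlines():
        # Skip blank lines and comment/metadata lines such as "# TYPE".
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match:
            samples[(match["name"], match["state"])] = float(match["value"])
    return samples


samples = parse_samples(SAMPLE)
per_state = sum(
    value for (name, _state), value in samples.items() if name == "slurm_nodes"
)
print(per_state, samples[("slurm_nodes_total", None)])
```

For real payloads, a dedicated parser such as the one shipped with the `prometheus-client` package is a more robust choice than a hand-written regular expression.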
41 changes: 41 additions & 0 deletions docs/modules/conf/partials/conf-agent.adoc
@@ -839,3 +839,44 @@ _No default value_
|===



== `metrics`

[cols="2l,1,5a,^1"]
|===
|Parameter|Type|Description|Required


|enabled
|bool
|Determine if metrics feature and integration with Prometheus (or
compatible) is enabled.





*Default:* `False`

|-

|restrict
|list[network]
|Restricted list of IP networks permitted to request metrics.





*Default:*


* `127.0.0.0/24`


|-


|===


22 changes: 22 additions & 0 deletions docs/modules/install/pages/quickstart.adoc
@@ -637,6 +637,28 @@ xref:misc:troubleshooting.adoc#wsgi[troubleshooting guide] for help.
* xref:conf:wsgi/index.adoc[Production HTTP server setup guide].
****

== Metrics (optional)

Slurm-web offers the possibility to
xref:overview:overview.adoc#metrics[export Slurm metrics] in
https://openmetrics.io/[OpenMetrics] format and integrate with
https://prometheus.io/[Prometheus]. This feature can be used to store metrics in
timeseries databases and draw diagrams of historical data.

This feature is disabled by default. It can be enabled with the following lines
in [.path]#`/etc/slurm-web/agent.ini`#:

[source,ini]
----
[metrics]
enabled=yes
----

.More details
****
* xref:conf:metrics.adoc[Metrics export configuration documentation].
****

== Multi-clusters

Slurm-web is designed to support
Binary file modified docs/modules/overview/images/arch/slurm-web_integration.png