diff --git a/site/content/en/docs/installation/_index.md b/site/content/en/docs/installation/_index.md index a7f0176f7d..60485eb18e 100644 --- a/site/content/en/docs/installation/_index.md +++ b/site/content/en/docs/installation/_index.md @@ -264,6 +264,7 @@ The currently supported features are: | `KeepQuotaForProvReqRetry` | `false` | Deprecated | 0.9 | 0.9 | | `ManagedJobsNamespaceSelector` | `true` | Beta | 0.10 | | | `LocalQueueDefaulting` | `false` | Alpha | 0.10 | | +| `LocalQueueMetrics` | `false` | Alpha | 0.10 | | ## What's next diff --git a/site/content/en/docs/reference/metrics.md b/site/content/en/docs/reference/metrics.md index 0bd8973087..5be21f55a8 100644 --- a/site/content/en/docs/reference/metrics.md +++ b/site/content/en/docs/reference/metrics.md @@ -6,7 +6,8 @@ description: > Prometheus metrics exported by Kueue --- Kueue exposes [prometheus](https://prometheus.io) metrics to monitor the health -of the system and the status of [ClusterQueues](/docs/concepts/cluster_queue). +of the system and the status of [ClusterQueues](/docs/concepts/cluster_queue) +and [LocalQueues](/docs/concepts/local_queue). ## Kueue health @@ -15,7 +16,7 @@ Use the following metrics to monitor the health of the kueue controllers: | Metric name | Type | Description | Labels | | -------------------------------------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- | -| `kueue_admission_attempts_total` | Counter | The total number of attempts to[admit](/docs/concepts#admission) workloads. Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` | +| `kueue_admission_attempts_total` | Counter | The total number of attempts to [admit](/docs/concepts#admission) workloads. 
Each admission attempt might try to admit more than one workload. | `result`: possible values are `success` or `inadmissible` | | `kueue_admission_attempt_duration_seconds` | Histogram | The latency of an admission attempt. | `result`: possible values are `success` or `inadmissible` | ## ClusterQueue status @@ -34,22 +35,28 @@ Use the following metrics to monitor the status of your ClusterQueues: | `kueue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission. | `cluster_queue`: the name of the ClusterQueue | | `kueue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished) | `cluster_queue`: the name of the ClusterQueue | | `kueue_cluster_queue_status` | Gauge | Reports the status of the ClusterQueue | `cluster_queue`: The name of the ClusterQueue
`status`: Possible values are `pending`, `active` or `terminated`. For a ClusterQueue, the metric only reports a value of 1 for one of the statuses. | +| `kueue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per `cluster_queue`. | `cluster_queue`: the name of the ClusterQueue | +| `kueue_admission_cycle_preemption_skips` | Gauge | The number of Workloads in the ClusterQueue that got preemption candidates but had to be skipped because other ClusterQueues needed the same resources in the same cycle | `cluster_queue`: the name of the ClusterQueue | +| `kueue_preempted_workloads_total` | Counter | The number of preempted workloads per `preempting_cluster_queue` | `preempting_cluster_queue`: the name of the ClusterQueue
`reason`: possible values are `InClusterQueue` (the workload was preempted by a workload in the same ClusterQueue), `InCohortReclamation` (the workload was preempted by a workload in the same cohort due to reclamation of nominal quota), `InCohortFairSharing` (the workload was preempted by a workload in the same cohort due to fair sharing), and `InCohortReclaimWhileBorrowing` (the workload was preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing) | ## LocalQueue Status (alpha) +The following metrics are available only if the `LocalQueueMetrics` feature gate is enabled. See the [Change the feature gates configuration](/docs/installation/#change-the-feature-gates-configuration) section of the [Installation](/docs/installation/) page for details. -| Metric Name | Type | Description | Labels | -| ------------------------------------------------ | ----------- | ------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `local_queue_pending_workloads` | Gauge | The number of pending workloads, per 'local_queue' and 'status'. | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`status`: can be either `active` for the number of active pending workloads or `inadmissible` | -| `local_queue_quota_reserved_workloads_total` | Counter | The number of workloads with quota reserved in a LocalQueue | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_quota_reserved_wait_time_seconds` | Histogram | The time between a workload was created or requeued until it got quota reservation, per`local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_admitted_workloads_total` | Counter | The total number of admitted workloads per`local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_admission_wait_time_seconds` | Histogram | The time between a workload was created or requeued until admission, per`local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_evicted_workloads_total` | Counter | The number of evicted workloads per`local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`reason`: the reason the workload was pre-empted. It can have the following values ["Preempted", "PodsReadyTimeout", "AdmissionCheck", "ClusterQueueStopped", "Deactivated"] | -| `local_queue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per`localQueue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per`localQueue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | -| `local_queue_status` | Gauge | Reports a LocalQueue's`active` status (ability to schedule workloads) | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`active`: one of [`True`, `False`, `Unknown`] and exclusively one is positive at any given time | -| `local_queue_resource_usage` | Gauge | Reports the LocalQueue's total resource usage within all the`flavors` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`flavor`: the name of the flavor which resources are being consumed from
`resource`: the resource which is being consumed | +| Metric Name | Type | Description | Labels | +|--------------------------------------------------------|-----------|-------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `kueue_local_queue_pending_workloads` | Gauge | The number of pending workloads, per `local_queue` and `status`. | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`status`: can be either `active` for the number of active pending workloads or `inadmissible` | +| `kueue_local_queue_quota_reserved_workloads_total` | Counter | The number of workloads with quota reserved in a LocalQueue | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_quota_reserved_wait_time_seconds` | Histogram | The time from when a workload was created or requeued until it got quota reservation, per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_admitted_workloads_total` | Counter | The total number of admitted workloads per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_admission_checks_wait_time_seconds` | Histogram | The time from when a workload got the quota reservation until admission, per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_admission_wait_time_seconds` | Histogram | The time from when a workload was created or requeued until admission, per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_evicted_workloads_total` | Counter | The number of evicted workloads per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`reason`: the reason the workload was evicted. It can have the following values ["Preempted", "PodsReadyTimeout", "AdmissionCheck", "ClusterQueueStopped", "Deactivated"] | +| `kueue_local_queue_reserving_active_workloads` | Gauge | The number of Workloads that are reserving quota, per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_admitted_active_workloads` | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per `local_queue` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in | +| `kueue_local_queue_status` | Gauge | Reports a LocalQueue's `active` status (ability to schedule workloads) | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`active`: one of [`True`, `False`, `Unknown`] and exclusively one is positive at any given time | +| `kueue_local_queue_resource_reservation` | Gauge | Reports the LocalQueue's total resource reservation within all the `flavors` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`flavor`: the name of the flavor from which resources are being consumed
`resource`: the resource which is being consumed | +| `kueue_local_queue_resource_usage` | Gauge | Reports the LocalQueue's total resource usage within all the `flavors` | `name`: the name of the LocalQueue
`namespace`: the namespace that the LocalQueue resides in
`flavor`: the name of the flavor from which resources are being consumed
`resource`: the resource which is being consumed | ### Optional metrics @@ -58,7 +65,9 @@ The following metrics are available only if `metrics.enableClusterQueueResources | Metric name | Type | Description | Labels | | --------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `kueue_cluster_queue_resource_reservation` | Gauge | Reports the cluster_queue's total resource reservation within all the flavors | `cohort`: The cohort in which the queue belongs
`cluster_queue`: The name of the ClusterQueue
`flavor`: referenced flavor
`resource`: The resource name | | `kueue_cluster_queue_resource_usage` | Gauge | Reports the ClusterQueue's total resource usage | `cohort`: The cohort in which the queue belongs
`cluster_queue`: The name of the ClusterQueue
`flavor`: referenced flavor
`resource`: The resource name | | `kueue_cluster_queue_nominal_quota` | Gauge | Reports the ClusterQueue's resource quota | `cohort`: The cohort in which the queue belongs
`cluster_queue`: The name of the ClusterQueue
`flavor`: referenced flavor
`resource`: The resource name | | `kueue_cluster_queue_borrowing_limit` | Gauge | Reports the ClusterQueue's resource borrowing limit | `cohort`: The cohort in which the queue belongs
`cluster_queue`: The name of the ClusterQueue
`flavor`: referenced flavor
`resource`: The resource name | +| `kueue_cluster_queue_lending_limit` | Gauge | Reports the cluster_queue's resource lending limit within all the flavors | `cohort`: The cohort in which the queue belongs
`cluster_queue`: The name of the ClusterQueue
`flavor`: referenced flavor
`resource`: The resource name | | `kueue_cluster_queue_weighted_share` | Gauge | Reports a value that representing the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | `cluster_queue`: The name of the ClusterQueue |