Forked from jupyterhub/grafana-dashboards

Commit 3b85f7e (parent 3b645a7): Copy over existing panel descriptions into the docs

7 changed files with 194 additions and 21 deletions.

`docs/reference/cluster-dashboard.md` → `docs/reference/dashboards/cluster.md` (renamed; 30 additions, 20 deletions)

# Cluster Information

The cluster dashboard contains several panels that show relevant cluster-wide information.

```{warning}
This section is a Work in Progress!
```

## Cluster Stats

### Running Users

Count of running users, grouped by namespace.

### Memory commitment %

Percentage of total memory in the cluster currently requested by non-placeholder pods.
If autoscaling is efficient, this should be a fairly constant, high number (>70%).
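
As an illustration of the arithmetic behind this panel, here is a minimal Python sketch. It assumes pod records carry a memory request in bytes and that placeholder pods have a `component: user-placeholder` label (an assumption modelled on typical Zero to JupyterHub deployments; adjust for yours). The dashboard itself computes this from Prometheus metrics, not with code like this.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    labels: dict[str, str]
    memory_request_bytes: int


# Assumed label value identifying placeholder pods; adjust to match your deployment.
PLACEHOLDER_COMPONENT = "user-placeholder"


def memory_commitment_percent(pods: list[Pod], cluster_allocatable_bytes: int) -> float:
    """Percentage of cluster memory requested by non-placeholder pods."""
    if cluster_allocatable_bytes <= 0:
        raise ValueError("cluster_allocatable_bytes must be positive")
    requested = sum(
        pod.memory_request_bytes
        for pod in pods
        if pod.labels.get("component") != PLACEHOLDER_COMPONENT
    )
    return 100.0 * requested / cluster_allocatable_bytes


if __name__ == "__main__":
    GiB = 1024**3
    pods = [
        Pod("jupyter-alice", {"component": "singleuser-server"}, 2 * GiB),
        Pod("jupyter-bob", {"component": "singleuser-server"}, 4 * GiB),
        Pod("user-placeholder-0", {"component": "user-placeholder"}, 4 * GiB),
    ]
    print(memory_commitment_percent(pods, 8 * GiB))  # 75.0
```

The CPU commitment panel uses the same ratio, with CPU requests in place of memory requests.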

### CPU commitment %

Percentage of total CPU in the cluster currently requested by non-placeholder pods.
JupyterHub users are mostly constrained by memory rather than CPU, so this panel is usually less informative.

### Node count

### Pods not in Running state

Pods in states other than 'Running'.
In a functional cluster, pods should not stay in non-Running states for long.
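
A minimal sketch of the "not Running for long" check. The `PodStatus` type, its `phase_since` field and the 10-minute threshold are all hypothetical; Kubernetes does not report a single "entered this phase at" timestamp, so you would derive one yourself (for example from the pod's condition transition times).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PodStatus:
    name: str
    phase: str  # Kubernetes pod phase, e.g. "Pending", "Running", "Failed"
    phase_since: datetime  # hypothetical: when the pod entered its current phase


def stuck_pods(pods, now, max_age=timedelta(minutes=10)):
    """Names of pods that have been outside 'Running' for longer than max_age."""
    return [
        pod.name
        for pod in pods
        if pod.phase != "Running" and now - pod.phase_since > max_age
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    pods = [
        PodStatus("jupyter-alice", "Running", now - timedelta(hours=2)),
        PodStatus("jupyter-bob", "Pending", now - timedelta(minutes=2)),
        PodStatus("jupyter-carol", "Pending", now - timedelta(minutes=30)),
    ]
    print(stuck_pods(pods, now))  # ['jupyter-carol']
```

Completed pods (phase `Succeeded`) are also "not Running"; whether to exclude them is a judgment call for your cluster.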

## Node stats

### Node CPU Commit %

Percentage of each node's CPU capacity guaranteed (through resource requests) to the pods scheduled on it.
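
Kubernetes schedules pods based on their resource requests, so per-node commitment is the sum of the requests of pods on a node divided by that node's allocatable capacity. A minimal sketch, assuming you already have (node, request) pairs and per-node allocatable values in the same unit (CPU cores here); the input shapes are hypothetical.

```python
from collections import defaultdict


def node_commit_percent(pod_requests, node_allocatable):
    """Per-node percentage of allocatable capacity requested by pods on that node.

    pod_requests: iterable of (node_name, requested_amount) pairs.
    node_allocatable: mapping of node_name -> allocatable amount (same unit).
    """
    requested = defaultdict(float)
    for node, amount in pod_requests:
        requested[node] += amount
    return {
        node: 100.0 * requested[node] / capacity
        for node, capacity in node_allocatable.items()
    }


if __name__ == "__main__":
    # CPU requests in cores for pods on two hypothetical nodes.
    requests = [("node-a", 0.5), ("node-a", 1.5), ("node-b", 0.5)]
    allocatable = {"node-a": 4, "node-b": 2}
    print(node_commit_percent(requests, allocatable))  # {'node-a': 50.0, 'node-b': 25.0}
```

The same function works for the memory panel with requests and allocatable memory in bytes.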

### Node Memory Commit %

Percentage of each node's memory guaranteed (through resource requests) to the pods scheduled on it.

### Node Memory Utilization %

Percentage of each node's memory currently in use.
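
One common way to compute memory utilization on a Linux node is from `/proc/meminfo`, treating `MemTotal - MemAvailable` as memory in use. The sketch below parses meminfo-style text; it is an illustration only, and the panel's actual Prometheus query may differ.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numeric values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        values[key.strip()] = int(rest.split()[0])
    return values


def memory_utilization_percent(meminfo_text):
    """Percentage of total memory that is not available for new allocations."""
    info = parse_meminfo(meminfo_text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

if __name__ == "__main__":
    print(memory_utilization_percent(SAMPLE))  # 75.0
```

On a Linux machine you could pass `open("/proc/meminfo").read()` instead of the sample text.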

### Node CPU Utilization %

Percentage of each node's CPU capacity currently in use.
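
CPU utilization is usually derived from cumulative CPU-time counters sampled twice. A minimal sketch using the aggregate `cpu` line of Linux's `/proc/stat`; it treats I/O wait as idle time, which is a choice (other tools count it as busy), so numbers may differ slightly from the panel.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle, total) jiffies."""
    fields = [int(value) for value in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = fields[3] + fields[4]  # idle plus time waiting on I/O
    total = sum(fields[:8])  # guest time is already included in user/nice
    return idle, total


def cpu_utilization_percent(before, after):
    """CPU utilization between two /proc/stat 'cpu' line samples, in percent."""
    idle_before, total_before = parse_cpu_line(before)
    idle_after, total_after = parse_cpu_line(after)
    delta_total = total_after - total_before
    delta_idle = idle_after - idle_before
    if delta_total <= 0:
        raise ValueError("the second sample must be taken after the first")
    return 100.0 * (delta_total - delta_idle) / delta_total


BEFORE = "cpu  100 0 50 800 50 0 0 0 0 0"
AFTER = "cpu  160 0 70 900 70 0 0 0 0 0"

if __name__ == "__main__":
    print(cpu_utilization_percent(BEFORE, AFTER))  # 40.0
```

On a live system you could read the first line of `/proc/stat` twice, a few seconds apart.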

### Out of Memory kill count

Number of Out of Memory (OOM) kills on a given node.

When users use more memory than they are allowed, the notebook kernel they
were running usually gets killed and restarted. This graph shows the number of times
that happens on each node, and helps confirm whether a notebook kernel restart was
in fact caused by an OOM kill.
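
A node-level OOM count can be sketched from a cumulative kernel counter such as the `oom_kill` field in Linux's `/proc/vmstat`. A node-level counter suits this panel because a notebook kernel killed inside a user server that keeps running does not necessarily terminate the pod. The sketch computes how much the counter grew across ordered readings, treating any decrease as a reset (for example after a reboot); the input shape is hypothetical and the dashboard's actual query may differ.

```python
def counter_increase(readings):
    """Total growth of a cumulative counter across ordered readings.

    A reading lower than the previous one is treated as a counter reset
    (e.g. a node reboot), so the new value is counted from zero.
    """
    total = 0
    for previous, current in zip(readings, readings[1:]):
        total += current - previous if current >= previous else current
    return total


def oom_kills_per_node(readings_by_node):
    """Map node name -> number of OOM kills seen across its counter readings."""
    return {node: counter_increase(readings) for node, readings in readings_by_node.items()}


if __name__ == "__main__":
    # Hypothetical periodic readings of the oom_kill value on two nodes.
    readings = {
        "node-a": [3, 3, 5, 6],
        "node-b": [10, 1, 2],  # dropped from 10 to 1: the node rebooted
    }
    print(oom_kills_per_node(readings))  # {'node-a': 3, 'node-b': 2}
```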

New file (9 lines):

# Global Usage

Contains "global" dashboards with useful stats computed across all data sources.

```{warning}
This section is a Work in Progress!
```

## Active users (over 7 days)

New file (73 lines):

# JupyterHub Dashboard

The JupyterHub dashboard contains several panels with useful stats about usage and diagnostics.

```{warning}
This section is a Work in Progress!
```

## Currently Active Users

## Daily Active Users

Number of unique users who were active within the preceding 24h period.

Requires JupyterHub 3.1.

## Weekly Active Users

Number of unique users who were active within the preceding 7d period.

Requires JupyterHub 3.1.

## Monthly Active Users

Number of unique users who were active within the preceding 30d period.

Requires JupyterHub 3.1.
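
The active-user panels count distinct users whose most recent activity falls inside a trailing window. JupyterHub computes these numbers itself; the sketch below only illustrates the windowing logic, assuming a hypothetical mapping from username to last-activity timestamp.

```python
from datetime import datetime, timedelta, timezone


def active_users(last_activity, now, window):
    """Count users whose last activity is within `window` before `now`.

    last_activity: hypothetical mapping of username -> datetime of that
    user's most recent activity; one entry per user keeps the count unique.
    """
    cutoff = now - window
    return sum(1 for seen in last_activity.values() if cutoff <= seen <= now)


if __name__ == "__main__":
    now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
    last_activity = {
        "alice": now - timedelta(hours=2),
        "bob": now - timedelta(days=3),
        "carol": now - timedelta(days=20),
    }
    print(active_users(last_activity, now, timedelta(hours=24)))  # 1
    print(active_users(last_activity, now, timedelta(days=7)))  # 2
```

The monthly panel corresponds to the same call with a correspondingly longer window.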

## Hub DB Disk Space Availability %

Percentage of disk space left on the disk storing the JupyterHub SQLite database. If it reaches 0, the hub will fail.
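
A minimal sketch of the free-space percentage using Python's standard `shutil.disk_usage`. The path is an assumption: the example uses `/` so it runs anywhere, while in Zero to JupyterHub deployments the hub database commonly lives under `/srv/jupyterhub` (check your own configuration). The 10% alert threshold is also hypothetical.

```python
import shutil


def free_space_percent(path):
    """Percentage of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


if __name__ == "__main__":
    # Replace "/" with the directory holding jupyterhub.sqlite.
    percent = free_space_percent("/")
    print(f"{percent:.1f}% free")
    if percent < 10:  # hypothetical alert threshold
        print("warning: hub database disk is nearly full")
```

The shared-volume panel reports the same ratio for the volume holding users' home directories.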

## Server Start Times

## Server Start Failures

Failed attempts by users to start their servers.

## Users per node

## Non Running Pods

Pods in a non-running state in the hub's namespace.

Pods stuck in non-running states often indicate an error condition.

## Free space (%) in shared volume (Home directories, etc.)

Percentage of disk space left in a shared storage volume, typically used for users' home directories.

Requires an additional node_exporter deployment to work. If this graph is empty, see the README of jupyterhub/grafana-dashboards for the extra deployment that is needed.

## Very old user pods

User pods that have been running for a long time (>8h).

This often indicates problems with the idle culler.

## User Pods with high CPU usage (>0.5)

User pods using a lot of CPU.

This could indicate a runaway process consuming resources unnecessarily.

## User pods with high memory usage (>80% of limit)

User pods getting close to their memory limit.

Once they hit their memory limit, user kernels will start dying.
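
The three panels above flag user pods by age, CPU usage and memory usage relative to the limit. A minimal sketch of those checks, using the thresholds from the panel titles (8 hours, 0.5, 80% of the memory limit); reading "0.5" as CPU cores is an assumption, and the `UserPod` record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Thresholds from the panel titles; ">0.5" is assumed to mean CPU cores.
MAX_AGE = timedelta(hours=8)
CPU_CORES_THRESHOLD = 0.5
MEMORY_LIMIT_FRACTION = 0.8


@dataclass
class UserPod:  # hypothetical record of one user server pod
    name: str
    started_at: datetime
    cpu_cores: float  # current CPU usage, in cores
    memory_bytes: int  # current memory usage
    memory_limit_bytes: int  # 0 means "no limit set"


def flag_user_pods(pods, now):
    """Return {pod name: [reasons]} for pods that would appear on these panels."""
    flagged = {}
    for pod in pods:
        reasons = []
        if now - pod.started_at > MAX_AGE:
            reasons.append("very old")
        if pod.cpu_cores > CPU_CORES_THRESHOLD:
            reasons.append("high CPU")
        limit = pod.memory_limit_bytes
        if limit and pod.memory_bytes > MEMORY_LIMIT_FRACTION * limit:
            reasons.append("high memory")
        if reasons:
            flagged[pod.name] = reasons
    return flagged


if __name__ == "__main__":
    MiB = 1024**2
    now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
    pods = [
        UserPod("jupyter-alice", now - timedelta(hours=9), 0.1, 1024 * MiB, 2048 * MiB),
        UserPod("jupyter-bob", now - timedelta(hours=1), 0.9, 1900 * MiB, 2048 * MiB),
        UserPod("jupyter-carol", now - timedelta(hours=2), 0.2, 512 * MiB, 2048 * MiB),
    ]
    print(flag_user_pods(pods, now))
    # {'jupyter-alice': ['very old'], 'jupyter-bob': ['high CPU', 'high memory']}
```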

## Images used by user pods

Number of user servers using each container image.
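
Counting servers per image is a simple tally. A minimal sketch, assuming a hypothetical list of (pod name, image) pairs for running user servers; the image names are made up.

```python
from collections import Counter


def servers_per_image(pod_images):
    """Tally user servers by container image.

    pod_images: iterable of (pod_name, image) pairs, one per running user server.
    """
    return Counter(image for _pod_name, image in pod_images)


if __name__ == "__main__":
    pods = [
        ("jupyter-alice", "example.org/notebook:v1"),
        ("jupyter-bob", "example.org/notebook:v2"),
        ("jupyter-carol", "example.org/notebook:v1"),
    ]
    print(servers_per_image(pods).most_common())
    # [('example.org/notebook:v1', 2), ('example.org/notebook:v2', 1)]
```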

New file (28 lines):

# NFS and Support Information

The NFS and Support Information dashboard contains several panels with useful information about support resources.

```{warning}
This section is a Work in Progress!
```

## User Nodes NFS Ops

## NFS Operation Types on user nodes

## NFS Server CPU

## NFS Server Disk ops

## NFS Server disk write latency

## NFS Server disk write latency

## Prometheus Memory (Working Set)

## Prometheus CPU

## Prometheus Free Disk space

## Prometheus Network Usage

New file (13 lines):

# Usage Report

```{warning}
This section is a Work in Progress!
```

## User pod memory usage

## Dask-gateway worker pod memory usage

## Dask-gateway scheduler pod memory usage

## GPU pod memory usage

New file (35 lines):

# User Diagnostics

```{warning}
This section is a Work in Progress!
```

## Memory Usage

Per-user, per-server memory usage.

## CPU Usage

Per-user, per-server CPU usage.

## Home Directory Usage (on shared home directories)

Per-user home directory size, when using a shared home directory.

Requires https://github.com/yuvipanda/prometheus-dirsize-exporter to
be set up.
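
To illustrate what a per-user directory size means, here is a minimal sketch that sums file sizes under each top-level directory of a shared home volume. It is not how prometheus-dirsize-exporter is implemented, and the one-directory-per-user layout is an assumption.

```python
import os
import sys
import tempfile


def directory_size(path):
    """Total apparent size in bytes of regular files under `path`.

    Symlinks are skipped, and files that vanish mid-scan are ignored.
    """
    total = 0
    for root, _dirs, files in os.walk(path):  # does not follow symlinked dirs by default
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except FileNotFoundError:
                continue  # deleted while we were scanning
    return total


def home_directory_sizes(homes_root):
    """Map each top-level directory under `homes_root` (assumed one per user) to its size."""
    with os.scandir(homes_root) as entries:
        return {
            entry.name: directory_size(entry.path)
            for entry in entries
            if entry.is_dir(follow_symlinks=False)
        }


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for user, size in sorted(home_directory_sizes(sys.argv[1]).items()):
            print(f"{user}\t{size} bytes")
    else:
        # Demo on a throwaway tree with two hypothetical users.
        with tempfile.TemporaryDirectory() as root:
            os.makedirs(os.path.join(root, "alice"))
            with open(os.path.join(root, "alice", "notes.txt"), "wb") as f:
                f.write(b"x" * 10)
            os.makedirs(os.path.join(root, "bob", "data"))
            with open(os.path.join(root, "bob", "data", "a.bin"), "wb") as f:
                f.write(b"y" * 5)
            print(dict(sorted(home_directory_sizes(root).items())))  # {'alice': 10, 'bob': 5}
```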

As with server pod names, user names here are *encoded*
using the escapism Python library (https://github.com/minrk/escapism).
You can decode them with the following Python snippet:

    from escapism import unescape

    # '-' is the escape character used in these encoded names
    print(unescape('<escaped-username>', '-'))

## Memory Requests

Per-user, per-server memory requests.

## CPU Requests

Per-user, per-server CPU requests.