Copy over existing panel descriptions into the docs
GeorgianaElena committed Feb 8, 2024
1 parent 3b645a7 commit 3b85f7e
Showing 7 changed files with 194 additions and 21 deletions.
Original file line number Diff line number Diff line change
@@ -1,47 +1,57 @@
# The Cluster Information dashboard
# Cluster Information

The cluster dashboard contains several panels that show relevant cluster-wide information.

## Running Users
```{warning}
This section is a Work in Progress!
```

## Cluster Stats

### Running Users

Count of running users, grouped by namespace.

## Memory commitment %
### Memory commitment %

% of total memory in the cluster currently requested by non-placeholder pods.
Percentage of total memory in the cluster currently requested by non-placeholder pods.
If autoscaling is efficient, this should be a fairly constant, high number (>70%).
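
As a rough illustration of how a commitment percentage like this is derived, the sketch below divides the memory requested by pods by the total node capacity. The function name and the sample cluster are hypothetical; the dashboard itself computes this from Prometheus metrics, not from Python.

```python
def commitment_percent(requested_bytes, capacity_bytes):
    """Return requested memory as a percentage of total cluster capacity."""
    total_capacity = sum(capacity_bytes)
    if total_capacity == 0:
        raise ValueError("cluster has no allocatable memory")
    return 100.0 * sum(requested_bytes) / total_capacity


GiB = 1024 ** 3

# Hypothetical cluster: two 16 GiB nodes and three non-placeholder pods
# that each request 8 GiB.
node_capacity = [16 * GiB, 16 * GiB]
pod_requests = [8 * GiB, 8 * GiB, 8 * GiB]

commitment = commitment_percent(pod_requests, node_capacity)
print(f"{commitment:.1f}%")  # 24 GiB requested out of 32 GiB -> 75.0%
```

A value that stays above roughly 70% suggests the autoscaler is removing nodes that are no longer needed.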

## CPU commitment %
### CPU commitment %

% of total CPU in the cluster currently requested by non-placeholder pods.
Percentage of total CPU in the cluster currently requested by non-placeholder pods.
JupyterHub users are mostly capped by memory, so this number is usually less informative than memory commitment.

## Node CPU Commit %
### Node count

### Pods not in Running state

Pods in states other than 'Running'.
In a functional cluster, pods should not be in non-Running states for long.

## Node stats

### Node CPU Commit %

% of each node guaranteed to pods on it.
Percentage of each node's CPU guaranteed to the pods running on it.

## Node Memory Commit %
### Node Memory Commit %

% of each node guaranteed to pods on it.
Percentage of each node's memory guaranteed to the pods running on it.

## Node Memory Utilization %
### Node Memory Utilization %

% of available Memory currently in use.
Percentage of available Memory currently in use.

## Node CPU Utilization %
### Node CPU Utilization %

% of available CPUs currently in use.
Percentage of available CPUs currently in use.

## Out of Memory kill count
### Out of Memory kill count

Number of Out of Memory (OOM) kills in a given node.

When users use up more memory than they are allowed, the notebook kernel they
were running usually gets killed and restarted. This graph shows the number of times
that happens on any given node, and helps validate that a notebook kernel restart was
in fact caused by an OOM.

## Pods not in Running state

Pods in states other than 'Running'.
In a functional cluster, pods should not be in non-Running states for long.
9 changes: 9 additions & 0 deletions docs/reference/dashboards/global.md
@@ -0,0 +1,9 @@
# Global Usage

Contains "global" dashboards with useful stats computed across all datasources.

```{warning}
This section is a Work in Progress!
```

## Active users (over 7 days)
73 changes: 73 additions & 0 deletions docs/reference/dashboards/jupyterhub.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# JupyterHub Dashboard

The JupyterHub dashboard contains several panels with useful stats about usage & diagnostics.

```{warning}
This section is a Work in Progress!
```

## Currently Active Users

## Daily Active Users

Number of unique users who were active within the preceding 24h period.

Requires JupyterHub 3.1.

## Weekly Active Users

Number of unique users who were active within the preceding 7d period.

Requires JupyterHub 3.1.
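
To make the "active within the preceding period" definition concrete, here is a minimal sketch that counts users whose last activity falls inside a time window. The `last_activity` mapping and the user names are made up for illustration; the panels read the counts that JupyterHub reports, they do not run this code.

```python
from datetime import datetime, timedelta, timezone


def count_active_users(last_activity, window, now):
    """Count users whose most recent activity is within `window` before `now`."""
    cutoff = now - window
    # The mapping is keyed by user name, so each user is counted at most once.
    return sum(1 for seen in last_activity.values() if seen >= cutoff)


now = datetime(2024, 2, 8, 12, 0, tzinfo=timezone.utc)

# Hypothetical last-activity timestamps.
last_activity = {
    "alice": now - timedelta(hours=2),
    "bob": now - timedelta(days=3),
    "carol": now - timedelta(days=20),
}

daily = count_active_users(last_activity, timedelta(hours=24), now)
weekly = count_active_users(last_activity, timedelta(days=7), now)
print(daily, weekly)  # alice only, then alice and bob
```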

## Monthly Active Users

Number of unique users who were active within the preceding 30d period.

Requires JupyterHub 3.1.

## Hub DB Disk Space Availability %

Percentage of disk space left on the disk storing the JupyterHub sqlite database. If this reaches 0, the hub will fail.

## Server Start Times

## Server Start Failures

Attempts by users to start servers that failed.

## Users per node

## Non Running Pods

Pods in a non-running state in the hub's namespace.

Pods stuck in non-running states often indicate an error condition.

## Free space (%) in shared volume (Home directories, etc.)

% of disk space left in a shared storage volume, typically used for users' home directories.

Requires an additional node_exporter deployment to work. If this graph is empty, look at the README for jupyterhub/grafana-dashboards to see what extra deployment is needed.

## Very old user pods

User pods that have been running for a long time (>8h).

This often indicates problems with the idle culler.
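
The 8h threshold can be pictured as a simple age filter. The sketch below is illustrative only: the pod names and start times are hypothetical, and the dashboard derives pod age from cluster metrics.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=8)  # threshold named in the panel description


def very_old_pods(pod_start_times, now, max_age=MAX_AGE):
    """Return the names of pods that have been running longer than `max_age`."""
    return sorted(
        name
        for name, started in pod_start_times.items()
        if now - started > max_age
    )


now = datetime(2024, 2, 8, 18, 0, tzinfo=timezone.utc)

# Hypothetical user pods and their start times.
pod_start_times = {
    "jupyter-alice": now - timedelta(hours=1),
    "jupyter-bob": now - timedelta(hours=9),
    "jupyter-carol": now - timedelta(days=2),
}

old_pods = very_old_pods(pod_start_times, now)
print(old_pods)  # pods older than 8 hours
```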

## User Pods with high CPU usage (>0.5)

User pods using a lot of CPU.

This could indicate a runaway process consuming resources unnecessarily.

## User pods with high memory usage (>80% of limit)

User pods getting close to their memory limit.

Once they hit their memory limit, user kernels will start dying.
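
As a hedged illustration of the 80% threshold, the sketch below flags pods whose memory usage exceeds a given fraction of their limit. The input format, pod names, and numbers are assumptions made for the example, not the panel's actual query.

```python
def pods_near_memory_limit(pods, threshold=0.8):
    """Map pod name -> usage/limit ratio for pods above `threshold`.

    `pods` maps a pod name to a (usage_bytes, limit_bytes) tuple.
    Pods without a positive limit are skipped, since a ratio is undefined.
    """
    flagged = {}
    for name, (usage, limit) in pods.items():
        if limit <= 0:
            continue
        ratio = usage / limit
        if ratio > threshold:
            flagged[name] = ratio
    return flagged


MiB = 1024 ** 2

# Hypothetical user pods: (current usage, memory limit).
pods = {
    "jupyter-alice": (900 * MiB, 1024 * MiB),
    "jupyter-bob": (256 * MiB, 1024 * MiB),
    "jupyter-carol": (512 * MiB, 0),  # no limit set
}

flagged = pods_near_memory_limit(pods)
print(sorted(flagged))  # only pods above 80% of their limit
```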

## Images used by user pods

Number of user servers using a container image.
28 changes: 28 additions & 0 deletions docs/reference/dashboards/support.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# NFS and Support Information

The NFS and Support Information dashboard contains several panels with useful information about support resources.

```{warning}
This section is a Work in Progress!
```

## User Nodes NFS Ops

## NFS Operation Types on user nodes

## NFS Server CPU

## NFS Server Disk ops

## NFS Server disk write latency

## NFS Server disk read latency

## Prometheus Memory (Working Set)

## Prometheus CPU

## Prometheus Free Disk space

## Prometheus Network Usage

13 changes: 13 additions & 0 deletions docs/reference/dashboards/usage-report.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Usage Report

```{warning}
This section is a Work in Progress!
```

## User pod memory usage

## Dask-gateway worker pod memory usage

## Dask-gateway scheduler pod memory usage

## GPU pod memory usage
35 changes: 35 additions & 0 deletions docs/reference/dashboards/user.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# User Diagnostics

```{warning}
This section is a Work in Progress!
```

## Memory Usage

Per-user per-server memory usage

## CPU Usage

Per-user per-server CPU usage

## Home Directory Usage (on shared home directories)

Per-user home directory size, when using a shared home directory.

Requires https://github.com/yuvipanda/prometheus-dirsize-exporter to
be set up.

Similar to server pod names, user names will be *encoded* here
using the escapism Python library (https://github.com/minrk/escapism).
You can decode them with the following Python snippet:

from escapism import unescape
unescape('<escaped-username>', '-')
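
If escapism is not installed, the decoding step can be approximated with the standard library. The sketch below assumes the encoding replaces each unsafe UTF-8 byte with the escape character followed by two hexadecimal digits; `decode_escaped` is a hypothetical helper written for this example, not part of escapism, and the library's own `unescape` is the better choice whenever it is available.

```python
import re


def decode_escaped(escaped, escape_char="-"):
    """Reverse a '<escape_char><two hex digits>' byte-level encoding.

    Assumption: every escaped byte appears as the escape character followed
    by exactly two hexadecimal digits.
    """
    pattern = re.compile(re.escape(escape_char.encode("utf8")) + rb"([0-9a-fA-F]{2})")
    raw = pattern.sub(
        lambda match: bytes([int(match.group(1), 16)]),
        escaped.encode("utf8"),
    )
    return raw.decode("utf8")


# Hypothetical encoded user name: "-40" is "@" and "-2e" is ".".
decoded = decode_escaped("user-40example-2ecom")
print(decoded)  # user@example.com
```

Decoding works on bytes so that multi-byte characters, which are escaped one byte at a time, are reassembled before being converted back to text.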

## Memory Requests

Per-user per-server memory Requests

## CPU Requests

Per-user per-server CPU Requests
7 changes: 6 additions & 1 deletion docs/reference/index.md
@@ -17,5 +17,10 @@ Please see our [contributing guide](contributing) if you'd like to add to it.
% that they appear in the table of contents
```{toctree}
:maxdepth: 2
cluster-dashboard.md
dashboards/cluster.md
dashboards/jupyterhub.md
dashboards/support.md
dashboards/usage-report.md
dashboards/user.md
dashboards/global.md
```