Copy over existing panel descriptions into the docs

GeorgianaElena · Feb 8, 2024 · 3b85f7e
1 parent 3b645a7
commit 3b85f7e

Showing 7 changed files with 194 additions and 21 deletions.
diff --git a/docs/reference/cluster-dashboard.md b/docs/reference/dashboards/cluster.md
@@ -1,47 +1,57 @@
-# The Cluster Information dashboard
+# Cluster Information
 
 The cluster dashboard contains several panels that show relevant cluster-wide information.
 
-## Running Users
+```{warning}
+This section is a Work in Progress!
+```
+
+## Cluster Stats
+
+### Running Users
 
 Count of running users, grouped by namespace.
 
-## Memory commitment %
+### Memory commitment %
 
-% of total memory in the cluster currently requested by to non-placeholder pods.
+Percentage of total memory in the cluster currently requested by non-placeholder pods.
 If autoscaling is efficient, this should be a fairly constant, high number (>70%).
 
-## CPU commitment %
+### CPU commitment %
 
-% of total CPU in the cluster currently requested by to non-placeholder pods.
+Percentage of total CPU in the cluster currently requested by non-placeholder pods.
 JupyterHub users mostly are capped by memory, so this is not super useful.
 
-## Node CPU Commit %
+### Node count
+
+### Pods not in Running state
+
+Pods in states other than 'Running'.
+In a functional cluster, pods should not be in non-Running states for long.
+
+## Node stats
+
+### Node CPU Commit %
 
-% of each node guaranteed to pods on it.
+Percentage of each node guaranteed to pods on it.
 
-## Node Memory Commit %
+### Node Memory Commit %
 
-% of each node guaranteed to pods on it.
+Percentage of each node guaranteed to pods on it.
 
-## Node Memory Utilization %
+### Node Memory Utilization %
 
-% of available Memory currently in use.
+Percentage of available Memory currently in use.
 
-## Node CPU Utilization %
+### Node CPU Utilization %
 
-% of available CPUs currently in use.
+Percentage of available CPUs currently in use.
 
-## Out of Memory kill count
+### Out of Memory kill count
 
 Number of Out of Memory (OOM) kills in a given node.
 
 When users use up more memory than they are allowed, the notebook kernel they
 were running usually gets killed and restarted. This graph shows the number of times
 that happens on any given node, and helps validate that a notebook kernel restart was
 in fact caused by an OOM.
-
-## Pods not in Running state
-
-Pods in states other than 'Running'.
-In a functional clusters, pods should not be in non-Running states for long.
diff --git a/docs/reference/dashboards/global.md b/docs/reference/dashboards/global.md
@@ -0,0 +1,9 @@
+# Global Usage
+
+Contains "global" dashboards with useful stats computed across all datasources.
+
+```{warning}
+This section is a Work in Progress!
+```
+
+## Active users (over 7 days)
diff --git a/docs/reference/dashboards/jupyterhub.md b/docs/reference/dashboards/jupyterhub.md
@@ -0,0 +1,73 @@
+# JupyterHub Dashboard
+
+The JupyterHub dashboard contains several panels with useful stats about usage & diagnostics.
+
+```{warning}
+This section is a Work in Progress!
+```
+
+## Currently Active Users
+
+## Daily Active Users
+
+Number of unique users who were active within the preceding 24h period.
+
+Requires JupyterHub 3.1.
+
+## Weekly Active Users
+
+Number of unique users who were active within the preceding 7d period.
+
+Requires JupyterHub 3.1.
+
+## Monthly Active Users
+
+Number of unique users who were active within the preceding 30d period.
+
+Requires JupyterHub 3.1.
+
+## Hub DB Disk Space Availability %
+
+Percentage of disk space left on the disk storing the JupyterHub SQLite database. If this drops to 0, the hub will fail.
+
+## Server Start Times
+
+## Server Start Failures
+
+Attempts by users to start servers that failed.
+
+## Users per node
+
+## Non Running Pods
+
+Pods in a non-running state in the hub's namespace.
+
+Pods stuck in non-running states often indicate an error condition.
+
+## Free space (%) in shared volume (Home directories, etc.)
+
+Percentage of disk space left in a shared storage volume, typically used for users' home directories.
+
+Requires an additional node_exporter deployment to work. If this graph is empty, look at the README for jupyterhub/grafana-dashboards to see what extra deployment is needed.
+
+## Very old user pods
+
+User pods that have been running for a long time (>8h).
+
+This often indicates problems with the idle culler.
+
+## User Pods with high CPU usage (>0.5)
+
+User pods using a lot of CPU.
+
+This could indicate a runaway process consuming resources unnecessarily.
+
+## User pods with high memory usage (>80% of limit)
+
+User pods getting close to their memory limit.
+
+Once they hit their memory limit, user kernels will start dying.
+
+## Images used by user pods
+
+Number of user servers using a container image.
diff --git a/docs/reference/dashboards/support.md b/docs/reference/dashboards/support.md
@@ -0,0 +1,28 @@
+# NFS and Support Information
+
+The NFS and Support Information dashboard contains several panels with useful information about support resources.
+
+```{warning}
+This section is a Work in Progress!
+```
+
+## User Nodes NFS Ops
+
+## NFS Operation Types on user nodes
+
+## NFS Server CPU
+
+## NFS Server Disk ops
+
+## NFS Server disk write latency
+
+## NFS Server disk read latency
+
+## Prometheus Memory (Working Set)
+
+## Prometheus CPU
+
+## Prometheus Free Disk space
+
+## Prometheus Network Usage
+
diff --git a/docs/reference/dashboards/usage-report.md b/docs/reference/dashboards/usage-report.md
@@ -0,0 +1,13 @@
+# Usage Report
+
+```{warning}
+This section is a Work in Progress!
+```
+
+## User pod memory usage
+
+## Dask-gateway worker pod memory usage
+
+## Dask-gateway scheduler pod memory usage
+
+## GPU pod memory usage
diff --git a/docs/reference/dashboards/user.md b/docs/reference/dashboards/user.md
@@ -0,0 +1,35 @@
+# User Diagnostics
+
+```{warning}
+This section is a Work in Progress!
+```
+
+## Memory Usage
+
+Per-user per-server memory usage
+
+## CPU Usage
+
+Per-user per-server CPU usage
+
+## Home Directory Usage (on shared home directories)
+
+Per-user home directory size, when using a shared home directory.
+
+Requires https://github.com/yuvipanda/prometheus-dirsize-exporter to
+be set up.
+
+Similar to server pod names, user names will be *encoded* here
+using the escapism Python library (https://github.com/minrk/escapism).
+You can decode them with the following Python snippet:
+
+from escapism import unescape
+unescape('<escaped-username>', '-')
+
+## Memory Requests
+
+Per-user per-server memory requests.
+
+## CPU Requests
+
+Per-user per-server CPU requests.
diff --git a/docs/reference/index.md b/docs/reference/index.md
@@ -17,5 +17,10 @@ Please see our [contributing guide](contributing) if you'd like to add to it.
 % that they appear in the table of contents
 ```{toctree}
 :maxdepth: 2
-cluster-dashboard.md
+dashboards/cluster.md
+dashboards/jupyterhub.md
+dashboards/support.md
+dashboards/usage-report.md
+dashboards/user.md
+dashboards/global.md
 ```