When your group has many independent machines — for example, my group has 9 standalone machines with 30 GPUs in total (RTX 3090/4090) — and you want to know the GPU utilization of these machines over a period of time (commercial clusters usually provide this functionality), you can use the open-source tools described below.
- Use jupyterlab_nvdashboard
https://github.com/rapidsai/jupyterlab-nvdashboard
Run this on the node you want to monitor
python -m jupyterlab_nvdashboard.server 9988
Then you can open the page in a browser to see the log: http://10.30.8.195:9988/GPU-Resources
The log is only kept on the web page (no database is used), so you need to keep the page open.
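If the package is not installed yet, a minimal setup sketch (assuming the PyPI package name jupyterlab-nvdashboard and that port 9988 is free; running it under nohup so it survives logout):
# install the dashboard (package name assumed from the repo above)
pip install jupyterlab-nvdashboard
# start the standalone server on port 9988 and keep it running after you log out
nohup python -m jupyterlab_nvdashboard.server 9988 > nvdashboard.out 2>&1 &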
- Use rntop from run-ai
https://github.com/run-ai/rntop
This is better for monitoring multiple nodes. How it works: it runs ssh user@IP nvidia-smi against each node and records the GPU utilization to a file.
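For reference, a minimal sketch of this idea (not rntop's actual code; the user name, node IPs, interval, and output path below are placeholders):
#!/usr/bin/env bash
# Sketch: poll nvidia-smi over SSH on each node and append a timestamped
# utilization line to a local file (placeholder names, not rntop options).
SSH_USER=junweil
NODES="machine-1-IP machine-2-IP"
INTERVAL=60                          # seconds between polls
OUTFILE=$HOME/nv_monitoring/util.log
while true; do
  for node in $NODES; do
    # one utilization number per GPU, joined with commas
    util=$(ssh "$SSH_USER@$node" nvidia-smi \
             --query-gpu=utilization.gpu --format=csv,noheader,nounits \
           | paste -sd, -)
    echo "$(date -Iseconds) $node $util" >> "$OUTFILE"
  done
  sleep "$INTERVAL"
done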
Suppose you have a head node. You need to add the head node's id_rsa.pub to the authorized_keys file on every GPU node.
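One way to do this from the head node (assuming the key pair already exists; otherwise generate one with ssh-keygen first; the host names are placeholders):
# copy the head node's public key to each GPU node
ssh-copy-id junweil@machine-1-IP
ssh-copy-id junweil@machine-2-IP
# verify that password-less SSH works
ssh junweil@machine-1-IP nvidia-smi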
Run the following inside a screen session. The output file will be saved to /home/junweil/nv_monitoring/rntop.log:
(base) junweil@junwei-home-lab:~/nv_monitoring$ sudo docker run -it --rm -v $HOME/.ssh:/root/.ssh -v $HOME/nv_monitoring:/host runai/rntop --output /host/rntop.log junweil@machine-1-IP junweil@machine-2-IP ...
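If you have not used screen before, one possible workflow (the session name nv_monitoring is just an example):
screen -S nv_monitoring      # open a named screen session
# run the docker command above inside the session,
# then detach with Ctrl-A followed by D; re-attach later with:
screen -r nv_monitoring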
Then you can upload the log file to https://run-ai.github.io/rntop-board/ and see the utilization over time.