GPU energy usage counter for AMD/ROCm

Reads the current GPU energy counters for AMD GPU cards using the ROCm SMI library.

Installation

To compile, run make.

NOTE: on LUMI it is pre-installed in /appl/local/csc/soft/ai/bin/gpu-energy.
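A job script that should work both on LUMI and on a machine where you compiled the tool yourself could look up the binary first. The following is a minimal sketch, not part of gpu-energy itself: the function name find_gpu_energy and the variable LUMI_GPU_ENERGY are made up for illustration, and it assumes a self-built binary has been placed on your PATH.

```shell
# Pre-installed location on LUMI, as given in the note above.
LUMI_GPU_ENERGY=/appl/local/csc/soft/ai/bin/gpu-energy

# Print the path of a usable gpu-energy binary, or return 1 if none is found.
# (find_gpu_energy is a hypothetical helper, not shipped with the tool.)
find_gpu_energy() {
    if command -v gpu-energy >/dev/null 2>&1; then
        command -v gpu-energy               # found on PATH
    elif [ -x "$LUMI_GPU_ENERGY" ]; then
        printf '%s\n' "$LUMI_GPU_ENERGY"    # fall back to the LUMI install
    else
        echo "gpu-energy not found" >&2
        return 1
    fi
}
```

In a script this could be used as GPU_ENERGY=$(find_gpu_energy) followed by "$GPU_ENERGY" --save.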

Usage

Print current counter values for all visible devices:

gpu-energy

Save counters to a temporary file for later use:

gpu-energy --save [filename]

If no filename is given, gpu-energy tries to derive a suitable name from the Slurm environment.

Print energy usage difference since last save:

gpu-energy --diff [filename]

Typical usage in a Slurm script:

gpu-energy --save

# run job here

gpu-energy --diff
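If the job can fail, it may help to wrap the save/run/diff pattern in a small function so that the diff is still printed and the job's exit status is preserved. This is only a sketch under the assumption that gpu-energy is on PATH; the name with_gpu_energy and the workload train.py are hypothetical.

```shell
# Run a command between a counter snapshot and a diff, returning the
# command's own exit status. with_gpu_energy is a hypothetical helper.
with_gpu_energy() {
    gpu-energy --save    # snapshot the counters before the job
    "$@"                 # the job itself
    rc=$?                # remember the job's exit status
    gpu-energy --diff    # report energy used since the snapshot
    return "$rc"
}

# Usage inside a Slurm script (train.py is a hypothetical workload):
#   with_gpu_energy python3 train.py
```

The function deliberately runs the job even if the snapshot fails, on the assumption that a measurement problem should not stop the actual work.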

Multi-node job:

srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 gpu-energy --save

# run job here

srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 gpu-energy --diff

If you use a module that sets the SLURM_MPI_TYPE environment variable (such as CSC's pytorch module), pass --mpi=cray_shasta to srun as shown here. Otherwise gpu-energy will not detect MPI and will not sum the energy over nodes.

srun --mpi=cray_shasta --ntasks=$SLURM_NNODES --ntasks-per-node=1 gpu-energy --save

# run job here

srun --mpi=cray_shasta --ntasks=$SLURM_NNODES --ntasks-per-node=1 gpu-energy --diff
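To avoid maintaining two variants of the script, the choice between the two srun forms above could be made at run time. The sketch below assumes that --mpi=cray_shasta is needed exactly when SLURM_MPI_TYPE is set, as described above for LUMI; the function name gpu_energy_all_nodes is invented for illustration.

```shell
# Run gpu-energy once per node, adding --mpi=cray_shasta only when a
# loaded module has set SLURM_MPI_TYPE. gpu_energy_all_nodes is a
# hypothetical helper; extra arguments are passed on to gpu-energy.
gpu_energy_all_nodes() {
    if [ -n "${SLURM_MPI_TYPE:-}" ]; then
        srun --mpi=cray_shasta --ntasks="$SLURM_NNODES" --ntasks-per-node=1 gpu-energy "$@"
    else
        srun --ntasks="$SLURM_NNODES" --ntasks-per-node=1 gpu-energy "$@"
    fi
}

# Usage inside a multi-node Slurm script:
#   gpu_energy_all_nodes --save
#   # run job here
#   gpu_energy_all_nodes --diff
```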