Tool to profile usage of HPC resources by regularly probing processes using `ps`.

All it really does is run `ps -e --no-header -o pid,user:22,pcpu,pmem,size,comm` under the hood, filter and group the output, and print it to stdout, comma-separated.
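For illustration, here is a minimal Rust sketch of that underlying call. It is not sonar's actual implementation, just the mechanics of shelling out to `ps` and splitting its output:

```rust
use std::process::Command;

fn main() {
    // The same ps invocation that sonar runs under the hood.
    let output = Command::new("ps")
        .args(["-e", "--no-header", "-o", "pid,user:22,pcpu,pmem,size,comm"])
        .output()
        .expect("failed to run ps");

    for line in String::from_utf8_lossy(&output.stdout).lines() {
        // Each line carries: pid, user, %cpu, %mem, size in KiB, command.
        let fields: Vec<&str> = line.split_whitespace().collect();
        println!("{:?}", fields);
    }
}
```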
This tool focuses on how resources are used: what is actually running. It no longer checks whether and how resources are under-used compared to Slurm allocations, but this functionality can be re-inserted.
We have rewritten the tool from Python to Rust. The motivation was to have one self-contained binary, without any other dependencies or environments to load, so that a call executes in milliseconds and has minimal impact on the resources of a large computing cluster. You can find the Python version on the `python` branch.
Versions up to 0.5.0 are available on PyPI. You can find the old code on the `with-slurm-data` branch.
- Make sure you have Rust installed (I install Rust through `rustup`)
- Clone this project
- Build it: `cargo build --release`
- The binary is then located at `target/release/sonar`
- Copy it to wherever it needs to be
Available options:
```
$ sonar

Usage: sonar <COMMAND>

Commands:
  ps       Take a snapshot of the currently running processes
  analyze  Not yet implemented
  help     Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version
```
We run `sonar ps` every 5 minutes on every compute node.
```
$ sonar ps --help

Take a snapshot of the currently running processes

Usage: sonar ps [OPTIONS]

Options:
      --cpu-cutoff-percent <CPU_CUTOFF_PERCENT>  [default: 0.5]
      --mem-cutoff-percent <MEM_CUTOFF_PERCENT>  [default: 5]
  -h, --help                                     Print help
```
The code will list all processes that are above `--cpu-cutoff-percent` or `--mem-cutoff-percent`.
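The selection logic boils down to a simple either-or predicate. Here is a sketch; the function and parameter names are illustrative rather than sonar's internals, and the strict comparison is an assumption:

```rust
// Illustrative only: keep a process if it exceeds at least one cutoff.
// The defaults mirror the CLI: 0.5 %CPU and 5 %mem.
fn passes_cutoffs(pcpu: f64, pmem: f64, cpu_cutoff: f64, mem_cutoff: f64) -> bool {
    pcpu > cpu_cutoff || pmem > mem_cutoff
}
```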
Here is an example output:
```
$ sonar ps
2023-01-31T13:34:47.683582663+00:00,somehost,8,user,,alacritty,3.7,214932
2023-01-31T13:34:47.683582663+00:00,somehost,8,user,,slack,2.4,1328412
2023-01-31T13:34:47.683582663+00:00,somehost,8,user,,X,0.8,173148
2023-01-31T13:34:47.683582663+00:00,somehost,8,user,,brave,15.5,7085968
2023-01-31T13:34:47.683582663+00:00,somehost,8,user,,.zoom,37.8,1722564
```
The columns are:

- time stamp
- hostname
- number of cores on this node
- user
- Slurm job ID (empty if not applicable)
- process
- CPU percentage (as it comes out of `ps`)
- memory used in KiB
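If you consume this CSV downstream, parsing a line is straightforward. A minimal Rust sketch follows; the struct and its field names are illustrative, not part of sonar:

```rust
// Illustrative record for one line of sonar's CSV output.
struct Record {
    timestamp: String,
    hostname: String,
    num_cores: u32,
    user: String,
    slurm_job_id: Option<String>, // empty field becomes None
    command: String,
    cpu_percent: f64,
    mem_kib: u64,
}

fn parse_line(line: &str) -> Option<Record> {
    let f: Vec<&str> = line.split(',').collect();
    if f.len() != 8 {
        return None;
    }
    Some(Record {
        timestamp: f[0].to_string(),
        hostname: f[1].to_string(),
        num_cores: f[2].parse().ok()?,
        user: f[3].to_string(),
        slurm_job_id: if f[4].is_empty() { None } else { Some(f[4].to_string()) },
        command: f[5].to_string(),
        cpu_percent: f[6].parse().ok()?,
        mem_kib: f[7].parse().ok()?,
    })
}
```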
This part is work in progress. Currently we only collect the data, since we also use it in another tool. The mapping files can be found in the `data` folder.
- Henrik Rojas Nagel
- Mathias Bockwoldt
- Radovan Bast
- Easy installation
- Minimal overhead for recording
- Can be used as a health check tool
- Does not need root permissions
Use `ps` instead of `top`:
We started out using `top`, but it turned out that `top` is dependent on locale, so it displays floats with a comma instead of a decimal point in many non-English locales. `ps` always uses decimal points. In addition, `ps` is (arguably) more versatile/configurable and does not print the header that `top` prints. All these properties make the `ps` output easier to parse than the `top` output.
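A quick Rust illustration of why the decimal separator matters when parsing:

```rust
fn main() {
    // ps prints "37.8" regardless of locale, and this parses cleanly ...
    assert_eq!("37.8".parse::<f64>().unwrap(), 37.8);
    // ... while top in e.g. a German locale prints "37,8", which fails.
    assert!("37,8".parse::<f64>().is_err());
}
```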
Do not interact with the Slurm database at all:
The initial version correlated information gathered from `ps` (what is actually running) with information from Slurm (what was requested). This was useful and nice to have, but it became complicated to maintain since Slurm could become unresponsive and processes were then piling up.
Why not also record the `pid`?:
Because we sum over processes of the same name that may be running over many cores. This keeps the output small, so that we can keep logs in plain text (CSV) and don't have to maintain a database or such.
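The aggregation idea, sketched in Rust (types and names here are illustrative, not sonar's internals): sum CPU and memory over all processes that share a user and a command name, discarding the individual pids:

```rust
use std::collections::HashMap;

// Illustrative: rows of (user, command, %cpu, memory in KiB), as one might
// extract them from ps output, summed per (user, command) pair.
fn aggregate(rows: &[(String, String, f64, u64)]) -> HashMap<(String, String), (f64, u64)> {
    let mut totals: HashMap<(String, String), (f64, u64)> = HashMap::new();
    for (user, command, pcpu, mem_kib) in rows {
        let entry = totals
            .entry((user.clone(), command.clone()))
            .or_insert((0.0, 0));
        entry.0 += pcpu; // total CPU percentage across all matching processes
        entry.1 += mem_kib; // total memory across all matching processes
    }
    totals
}
```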
The tool does not need root permissions.
It does not modify anything and only writes to stdout.
The only external command called by `sonar ps` is `ps -e --no-header -o pid,user:22,pcpu,pmem,size,comm`, and the tool gives up and stops if this subprocess does not return within 2 seconds, to avoid a pile-up of processes.
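One way to implement such a timeout in Rust, sketched here with the `wait_timeout` crate (an assumption; sonar's actual mechanism may differ):

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::time::Duration;
use wait_timeout::ChildExt; // external crate: wait_timeout

fn main() -> std::io::Result<()> {
    let mut child = Command::new("ps")
        .args(["-e", "--no-header", "-o", "pid,user:22,pcpu,pmem,size,comm"])
        .stdout(Stdio::piped())
        .spawn()?;

    // Give ps two seconds; if it has not finished by then, kill it
    // rather than letting stuck invocations pile up.
    match child.wait_timeout(Duration::from_secs(2))? {
        Some(_status) => {
            // ps finished in time; collect what it wrote. (Caveat: output
            // larger than the pipe buffer would need concurrent reading.)
            let mut out = String::new();
            if let Some(mut stdout) = child.stdout.take() {
                stdout.read_to_string(&mut out)?;
            }
            print!("{out}");
        }
        None => {
            child.kill()?;
            let _ = child.wait(); // reap the killed process
        }
    }
    Ok(())
}
```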
We let cron execute the following script every 5 minutes on every compute node:
```bash
#!/usr/bin/env bash

set -euf -o pipefail

sonar_directory=/cluster/shared/sonar/data

year=$(date +'%Y')
month=$(date +'%m')
day=$(date +'%d')

output_directory="${sonar_directory}/${year}/${month}/${day}"

mkdir -p "${output_directory}"

/cluster/bin/sonar ps >> "${output_directory}/${HOSTNAME}.csv"
```
This produces ca. 10 MB of data per day.
- Reference implementation which serves as inspiration: https://github.com/UNINETTSigma2/appusage
- TACC Stats
- Ganglia Monitoring System