[Issue]: Srun failed at test #14

Open
billcsm opened this issue Sep 6, 2024 · 0 comments

billcsm commented Sep 6, 2024

Problem Description

We installed Slurm 21.08.8-2 and OpenMPI 4.1.6 on an AMD MI250X GPU system.
With MPI, we ran "mpirun_rochpl -P 2 -Q 4 -N 256000 --NB 512" on 8 GCDs, and it passed.
With Slurm, we ran "srun -N 1 -n 8 run_rochpl -P 2 -Q 4 -p 2 -q 4 -N 128000 --NB 512", which failed with the following output:

libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 96 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 112 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 64 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Logical CPU number 80 out of range

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 7713 64-Core Processor X 2

GPU

AMD Instinct MI250X, AMD Instinct MI250

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

Further debugging showed that the value of the environment variable OMP_PLACES differed between the OpenMPI run and the Slurm run.
For OpenMPI, its value was
OMP_PLACES = '{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63}'

For Slurm, its value was
OMP_PLACES = '{3}'
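
To narrow this down, it may help to compare OMP_PLACES with the CPUs each rank is actually allowed to run on. The following is a minimal sketch, not part of rocHPL: the script name check_places.py and the helper names are hypothetical, it handles only the explicit single-CPU form shown above (for example '{32},{33}'), and it assumes it is run in the same environment in which OMP_PLACES has been set. It uses os.sched_getaffinity, which is Linux-only, and reads the rank from SLURM_PROCID or OMPI_COMM_WORLD_RANK.

```python
import os
import re


def parse_places(value):
    """Return the CPU numbers in an explicit OMP_PLACES list such as '{0},{2}'.

    Only single-CPU places are handled; interval forms like '{0:4}' and
    abstract names like 'cores' are ignored by this sketch.
    """
    return [int(n) for n in re.findall(r"\{(\d+)\}", value)]


def unusable_cpus(places_value, allowed):
    """Return the CPUs named in places_value that are not in `allowed`."""
    return [cpu for cpu in parse_places(places_value) if cpu not in allowed]


if __name__ == "__main__":
    places = os.environ.get("OMP_PLACES", "")
    # CPUs the launcher (srun or mpirun) has bound this process to.
    allowed = os.sched_getaffinity(0)
    rank = os.environ.get(
        "SLURM_PROCID", os.environ.get("OMPI_COMM_WORLD_RANK", "?")
    )
    print(
        f"rank {rank}: allowed={sorted(allowed)} "
        f"OMP_PLACES={places!r} "
        f"outside_mask={unusable_cpus(places, allowed)}"
    )
```

Run once per rank under each launcher, a non-empty outside_mask for a rank would suggest that OMP_PLACES names CPUs outside the binding that launcher gave the task. That is only one possible reading of the libgomp messages above.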
