Problem Description
We installed Slurm 21.08.8-2 and OpenMPI 4.1.6 on an AMD MI250X GPU system.
With MPI, we ran "mpirun_rochpl -P 2 -Q 4 -N 256000 --NB 512" on 8 GCDs. It passed.
With Slurm, we ran "srun -N 1 -n 8 run_rochpl -P 2 -Q 4 -p 2 -q 4 -N 128000 --NB 512". It failed with the following output:
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 96 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 112 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Number of places reduced from 61 to 1 because some places didn't contain any usable logical CPUs
libgomp: Logical CPU number 64 out of range
libgomp: Invalid value for environment variable OMP_PLACES
libgomp: Logical CPU number 80 out of range
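The libgomp messages above suggest that the CPUs named in OMP_PLACES do not match the CPUs each rank is actually allowed to run on. A minimal diagnostic sketch for checking this per rank follows. It relies on `SLURM_PROCID` and `OMPI_COMM_WORLD_RANK`, the rank variables exported by Slurm and Open MPI respectively; the function name `rank_report` and the script name used below are hypothetical, and the issue does not say which part of the launch chain sets OMP_PLACES, so it may show as unset when the script is launched directly.

```python
import os


def rank_report(env=None):
    """Describe this process: launcher rank, allowed CPUs, and OMP_PLACES."""
    env = os.environ if env is None else env
    # Slurm exports SLURM_PROCID; Open MPI exports OMPI_COMM_WORLD_RANK.
    rank = env.get("SLURM_PROCID", env.get("OMPI_COMM_WORLD_RANK", "?"))
    # CPUs the kernel allows this process to run on (Linux only).
    cpus = sorted(os.sched_getaffinity(0))
    places = env.get("OMP_PLACES", "<unset>")
    return f"rank {rank}: allowed CPUs {cpus}; OMP_PLACES={places}"


if __name__ == "__main__":
    print(rank_report())
```

Saved as, say, show_affinity.py, it could be launched with "srun -N 1 -n 8 python3 show_affinity.py" and with the corresponding mpirun command, so the two launchers can be compared rank by rank.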
Operating System
Ubuntu 22.04.4 LTS (Jammy Jellyfish)
CPU
AMD EPYC 7713 64-Core Processor X 2
GPU
AMD Instinct MI250X, AMD Instinct MI250
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
Further debugging showed that the environment variable OMP_PLACES differed between the OpenMPI and Slurm runs.
For OpenMPI, its value was
OMP_PLACES = '{32},{33},{34},{35},{36},{37},{38},{39},{40},{41},{42},{43},{44},{45},{46},{47},{1},{2},{3},{4},{5},{6},{7},{8},{9},{10},{11},{12},{13},{14},{15},{17},{18},{19},{20},{21},{22},{23},{24},{25},{26},{27},{28},{29},{30},{31},{49},{50},{51},{52},{53},{54},{55},{56},{57},{58},{59},{60},{61},{62},{63}'
For Slurm, its value was
OMP_PLACES = '{3}'
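To illustrate how a 61-place list can collapse to a single place, here is a simplified model of the filtering that the libgomp message describes: places whose CPU is not in the process's allowed set are dropped. This is a sketch, not libgomp's actual implementation. It only handles explicit single-CPU places such as '{1},{2}', and the allowed set {3} is an assumption made for illustration; the issue does not report the actual affinity mask under Slurm. The helper names `parse_places` and `usable_places` are hypothetical.

```python
import re


def parse_places(value):
    """Parse an explicit OMP_PLACES list of single-CPU places, e.g. '{1},{2}'."""
    return [int(cpu) for cpu in re.findall(r"\{(\d+)\}", value)]


def usable_places(value, allowed_cpus):
    """Keep only the places whose CPU is in the allowed set."""
    return [cpu for cpu in parse_places(value) if cpu in allowed_cpus]


# Rebuild the 61-place list reported for the OpenMPI run.
cpus = (list(range(32, 48)) + list(range(1, 16))
        + list(range(17, 32)) + list(range(49, 64)))
openmpi_places = ",".join("{%d}" % c for c in cpus)

# Assumed affinity mask: the process may only run on CPU 3.
allowed = {3}

print(len(parse_places(openmpi_places)))       # 61
print(usable_places(openmpi_places, allowed))  # [3]
```

In a real run, the allowed set could be taken from `os.sched_getaffinity(0)` inside each rank instead of being hard-coded.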