Issue with MPS #12

Open
sjsprecious opened this issue Aug 25, 2022 · 3 comments

@sjsprecious

I built a simple MPI+OpenACC example on Gust. It compiled successfully, but it crashes at run time when MPS is turned on.

The error message is:
gu0017.hsn.gu.hpc.ucar.edu: rank 0 died from signal 11 and dumped core

With MPS turned off, the same program runs fine.

The run command is:
mpiexec --cpu-bind depth -n 1 -ppn 1 -d 1 ./mpi_mps.exe
I request only a single GPU. Running with 2 MPI ranks per node gives the same error.

My environment module list is:

  1) ncarenv/22.08 (S)   3) nvhpc/22.7            5) cray-mpich/8.1.18
  2) craype/2.7.17 (S)   4) ncarcompilers/0.6.2
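
When comparing the failing and working runs, it can help to record whether an MPS process is actually present on the node at launch time. The following is a minimal sketch, not part of the original report: `mps_status` is a hypothetical helper name, and it assumes a Linux node where process names are truncated to 15 characters, so that `nvidia-cuda-mps-control` and `nvidia-cuda-mps-server` both appear as `nvidia-cuda-mps`.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report): print "on" if an MPS
# process is visible on this node, "off" otherwise.
# Assumption: Linux truncates process names to 15 characters, so both
# nvidia-cuda-mps-control and nvidia-cuda-mps-server match "nvidia-cuda-mps".
mps_status() {
    if pgrep -x nvidia-cuda-mps >/dev/null 2>&1; then
        echo "on"
    else
        echo "off"
    fi
}

# Log the MPS state so the job output shows which configuration was tested.
echo "MPS daemon: $(mps_status)"
```

Placing this immediately before the `mpiexec` line in the job script would tag each run's output with the MPS state, which makes the two cases easier to tell apart in a bug report.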
@vanderwb added the bug (Something isn't working) label Aug 25, 2022
@vanderwb
Collaborator

This might be tied to an issue with MPI & GPUs and the OS version on the compute nodes. @jbaksta is looking at upgrading the OS to the supported version soon. If nothing else, that will allow us to report segfaults and get support.

@vanderwb
Collaborator

cray-mpich currently does not work with MPS, and we are working with HPE to resolve it. An OpenMPI build is now available; I recommend giving that a try.

@vanderwb
Collaborator

This does seem to be resolved on Derecho with newer network drivers. I am fairly confident MPS will work properly for you when we launch.

@vanderwb self-assigned this Apr 27, 2023