Install fail on external PMIx, I get "error: 'PMIX_SESSION_PROVISION' undeclared" #12915

Open
DaXor-0 opened this issue Nov 9, 2024 · 5 comments


DaXor-0 commented Nov 9, 2024

For a university project I'm trying to build a Raspberry Pi cluster with Slurm.

I've had quite a few issues running srun with MPI, so I've settled on installing Open MPI from the git repo, with external PMIx, hwloc and libevent for the PMIx/Slurm integration.

I'm building Open MPI version 5.1.0a1 on a Raspberry Pi 5 cluster managed with Slurm (the nodes run Raspberry Pi OS Lite).

What I've done so far:

HWLOC (v2.11 git clone)--->
./configure --disable-rsmi --prefix=/hwloc-install-prefix
make
make install

LIBEVENT (latest git clone)--->
./configure --prefix=/libevent-install-prefix
make
make install

OPENPMIX (latest git clone)--->
./configure --with-slurm --with-libevent=/libevent-install-prefix --with-hwloc=/hwloc-install-prefix --prefix=/pmix-install-prefix
make
make install

OPENMPI (latest git clone)--->
./configure --disable-sphinx --with-slurm --with-libevent=libevent-install-prefix --with-hwloc=hwloc-install-prefix --with-pmi=pmix-install-prefix --prefix=ompi-prefix
make --------> I fail here

(note that I'm disabling sphinx because I've not yet installed a python module on the cluster)

The output of pmix configure correctly indicates slurm support and the paths to external libevent and hwloc.
Also the output of ompi configure correctly indicates pmi, libevent and hwloc as external.

When I run make for Open MPI, the build fails with this error:

In file included from /clusterfs/apps/openpmix/include/pmix_common.h:2797,
                 from /clusterfs/apps/openpmix/include/pmix/src/class/pmix_list.h:78,
                 from /clusterfs/src/ompi/3rd-party/prrte/src/pmix/pmix-internal.h:26,
                 from prted/pmix/pmix_server_session.c:12:
prted/pmix/pmix_server_session.c: In function 'process_directive':
prted/pmix/pmix_server_session.c:145:50: error: 'PMIX_SESSION_PROVISION' undeclared (first use in this function); did you mean 'PMIX_SESSION_PROVISION_NODES'?
  145 |         } else if (PMIX_CHECK_KEY(&req->info[n], PMIX_SESSION_PROVISION) ||
      |                                                  ^~~~~~~~~~~~~~~~~~~~~~
/clusterfs/apps/openpmix/include/pmix_deprecated.h:497:30: note: in definition of macro 'PMIX_CHECK_KEY'
  497 |     PMIx_Check_key((a)->key, b)
      |                              ^
prted/pmix/pmix_server_session.c:145:50: note: each undeclared identifier is reported only once for each function it appears in
  145 |         } else if (PMIX_CHECK_KEY(&req->info[n], PMIX_SESSION_PROVISION) ||
      |                                                  ^~~~~~~~~~~~~~~~~~~~~~
/clusterfs/apps/openpmix/include/pmix_deprecated.h:497:30: note: in definition of macro 'PMIX_CHECK_KEY'
  497 |     PMIx_Check_key((a)->key, b)
      |                              ^
prted/pmix/pmix_server_session.c: At top level:
prted/pmix/pmix_server_session.c:416:1: fatal error: opening dependency file prted/pmix/.deps/libprrte_la-pmix_server_session.Tpo: Permission denied
  416 | }
      | ^
compilation terminated.
make[4]: *** [Makefile:1655: prted/pmix/libprrte_la-pmix_server_session.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte/src'
make[3]: *** [Makefile:1862: all-recursive] Error 1
make[3]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte/src'
make[2]: *** [Makefile:795: all-recursive] Error 1
make[2]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte'
make[1]: *** [Makefile:1385: all-recursive] Error 1
make[1]: Leaving directory '/clusterfs/src/ompi/3rd-party'
make: *** [Makefile:1512: all-recursive] Error 1
Contributor

rhc54 commented Nov 9, 2024

This option isn't correct: --with-pmi=pmix-install-prefix should be --with-pmix=pmix-install-prefix. The output indicates you picked up some other version of PMIx that doesn't include some of the definitions found in the upstream PMIx master branch.
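
One way to test that diagnosis is to look at the header the compiler actually reported (`/clusterfs/apps/openpmix/include/pmix_common.h` in the error output) and see whether it declares the missing name. The following is a minimal sketch, not part of Open MPI or PMIx: the function name `pmix_declares` is hypothetical, and it assumes that PMIx attributes appear as `#define` lines in `pmix_common.h` under `<prefix>/include`.

```shell
#!/bin/sh
# Hypothetical helper: report whether the pmix_common.h under an install
# prefix declares a given attribute name.
# Usage: pmix_declares <prefix> <symbol>
pmix_declares() {
    prefix=$1
    symbol=$2
    header="$prefix/include/pmix_common.h"
    if [ ! -f "$header" ]; then
        echo "no pmix_common.h under $prefix/include"
        return 2
    fi
    # Require whitespace or end of line after the name, so that
    # PMIX_SESSION_PROVISION does not match PMIX_SESSION_PROVISION_NODES.
    if grep -Eq "^[[:space:]]*#[[:space:]]*define[[:space:]]+${symbol}([[:space:]]|\$)" "$header"; then
        echo "$symbol: declared in $header"
    else
        echo "$symbol: NOT declared in $header"
        return 1
    fi
}
```

For the build in this issue it could be called as `pmix_declares /clusterfs/apps/openpmix PMIX_SESSION_PROVISION`. A "NOT declared" result would suggest that the install under that prefix is older than the PMIx that the bundled PRRTE expects.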

Author

DaXor-0 commented Nov 9, 2024

I get the same error.

This is the configure command I ran:

./configure --with-slurm --disable-sphinx --with-pmix=/clusterfs/apps/openpmix --with-hwloc=/clusterfs/apps/hwloc --with-libevent=/clusterfs/apps/libevent --prefix=/clusterfs/apps/openmpi

And this is the configure output:

Open MPI configuration:
-----------------------
Version: 5.1.0a1
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: no
Build MPI Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: no documentation available
hwloc: external
libevent: external
Open UCC: no
pmix: external
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Intel Omnipath (PSM2): no (not found)
Open UCX: no
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: no
Intel ZE support: no
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)

Contributor

rhc54 commented Nov 9, 2024

Afraid I cannot help you much - something is quite wrong here. You should not be able to configure with an external hwloc, libevent, and pmix - and then use an internal PRRTE. Configure is supposed to error out on that attempt as all must be either internal or all must be external.
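
The mismatch described above can be spotted mechanically in a saved configure summary. This is only a sketch under assumptions: the function name `check_bundling` is hypothetical, and it assumes the summary uses `name: value` lines like the ones posted earlier in this thread (`hwloc: external`, `PRRTE: internal`, and so on).

```shell
#!/bin/sh
# Hypothetical helper: read a configure summary on stdin and report whether
# hwloc, libevent, pmix and PRRTE are all internal or all external.
# Prints one name=value line per component, then "mixed" or "consistent".
# Returns 1 when internal and external components are mixed.
check_bundling() {
    awk -F':' '
        {
            name = tolower($1)
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
        }
        name ~ /^(hwloc|libevent|pmix|prrte)$/ {
            seen[value]++
            printf "%s=%s\n", name, value
        }
        END {
            if (seen["internal"] > 0 && seen["external"] > 0) {
                print "mixed"
                exit 1
            }
            print "consistent"
        }'
}
```

It could be fed the summary from a file, for example `check_bundling < summary.txt`, where `summary.txt` is a hypothetical file holding the text configure printed at the end of its run.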

Setting that weirdness aside, I can only tell you that you are not in fact building against a head of the PMIx master branch. I don't know if you incorrectly checked out some other branch, or have some older PMIx install on your system, or...? I only know that PRRTE is looking at an old version of PMIx, which is what is causing the error.
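
To see which PMIx an install prefix really contains, its version header can be read directly. A minimal sketch, assuming the install provides `include/pmix_version.h` with `PMIX_VERSION_MAJOR`, `PMIX_VERSION_MINOR` and `PMIX_VERSION_RELEASE` defined as long literals; the function name `pmix_header_version` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print MAJOR.MINOR.RELEASE as recorded in the
# pmix_version.h found under <prefix>/include.
# Usage: pmix_header_version <prefix>
pmix_header_version() {
    header="$1/include/pmix_version.h"
    if [ ! -f "$header" ]; then
        echo "no pmix_version.h under $1/include" >&2
        return 2
    fi
    awk '
        $1 == "#define" && $2 == "PMIX_VERSION_MAJOR"   { major = $3 }
        $1 == "#define" && $2 == "PMIX_VERSION_MINOR"   { minor = $3 }
        $1 == "#define" && $2 == "PMIX_VERSION_RELEASE" { release = $3 }
        END {
            # Strip the trailing L of long literals such as 5L.
            sub(/L$/, "", major)
            sub(/L$/, "", minor)
            sub(/L$/, "", release)
            print major "." minor "." release
        }' "$header"
}
```

For this issue that would be `pmix_header_version /clusterfs/apps/openpmix`. In the PMIx source checkout, `git rev-parse --abbrev-ref HEAD` and `git log -1` show which branch and commit were built. A version number alone does not prove the tree is at the head of master, but an unexpectedly old one would point to a stale install.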

Author

DaXor-0 commented Nov 9, 2024

Ok, thanks for the advice.

My guess is that something is breaking because I'm building on ARM.

Contributor

rhc54 commented Nov 9, 2024

Doubt that it has anything to do with ARM as many of us (myself included) operate regularly on that hardware. You should check to see if you have another PMIx install somewhere on the system that is causing the confusion. Try building everything internal (instead of using the external libs) and see if that works. Etc.
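
A quick way to look for a second PMIx install is to search the usual prefixes for copies of its main header. This is a sketch only: the function name `list_pmix_headers` is hypothetical, and the directories in the example call are guesses about where another copy might live (for instance one installed by a distribution package).

```shell
#!/bin/sh
# Hypothetical helper: print every pmix_common.h found under the given
# directories, one path per line, skipping directories that do not exist.
# Usage: list_pmix_headers <dir>...
list_pmix_headers() {
    for root in "$@"; do
        [ -d "$root" ] || continue
        find "$root" -type f -name pmix_common.h 2>/dev/null
    done | sort -u
}

# Example call (paths are assumptions, adjust to the system):
#   list_pmix_headers /usr /usr/local /opt /clusterfs/apps
```

More than one path in the output means the compiler could be seeing a different PMIx than the one passed to `--with-pmix`. For the all-internal build suggested above, Open MPI's configure accepts `internal` as the value of `--with-pmix`, `--with-hwloc` and `--with-libevent`.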
