Issue with Parallel Computation Across Two Compute Nodes #3836

nuomi68 · 2024-04-22T07:42:55Z

I am currently solving a problem with 2 million degrees of freedom on a hexahedral mesh, a 130x130x130 grid generated with MeshTools::Generation::build_cube, with the FEFamily set to MONOMIAL. I encountered an issue during the mesh partitioning phase. Here is the error log from the system:

2024-04-22 13:21:06.468 ( 34.072s) [ 6E1ECCC0] vtkMPICommunicator.cxx:64 WARN| MPI had an error
------------------------------------------------
Invalid MPI_Op, error stack:
internal_Test(91)...............: MPI_Test(request=0x55b87d993750, flag=0x7fff063bf5e4, status=0x1) failed
MPIR_Test(317)..................: 
MPIR_Test_state(277)............: 
MPIDI_CH3I_Progress(401)........: 
pkt_CTS_handler(324)............: 
MPID_nem_lmt_shm_start_send(254): 
MPID_nem_delete_shm_region(864).: 
(unknown)(): Invalid MPI_Op

2024-04-22 13:18:57.838 (  64.001s) [        5C212CC0] vtkMPICommunicator.cxx:64    WARN| MPI had an error
------------------------------------------------
Unknown error class, error stack:
internal_Irecv(123)........: MPI_Irecv(buf=0x55f0acaca7f0, count=14740, dtype=USER<resized>, 117, 134217736, MPI_COMM_WORLD, request=0x55f0a72e4498) failed
MPID_Irecv(64).............: 
MPIDI_CH3_EagerSyncAck(177): 
MPIDI_CH3_iStartMsg(30)....: Communication error with rank 117

I noticed the error involves VTK functions, so I recompiled libMesh without VTK and reran it. Although the previous error did not reappear, the code still hangs during mesh partitioning.

It could also be a hardware issue. While my two compute nodes are connected via an internal network, the network speed might not be sufficient. For instance, allocating 120 processes to node1 and 20 processes to node2 works without issue, but allocating 120 processes to node1 and 30 processes to node2 causes the above error.
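To relate a rank number in the error log (such as "rank 117") to a node, here is a minimal Python sketch of the rank-to-node mapping for the allocations described above. It assumes block placement, meaning ranks are handed out node by node in the order the nodes are listed. That is an assumption about the launcher configuration, not something stated in this thread, and the helper names `node_of_rank` and `ranks_per_node` are made up for illustration. Node index 0 stands for node1 and index 1 for node2.

```python
from bisect import bisect_right
from itertools import accumulate


def node_of_rank(rank, slots_per_node):
    """Return the index of the node hosting `rank`.

    Assumption: block placement, i.e. node 0 gets ranks
    0..slots[0]-1, node 1 gets the next slots[1] ranks, and so on.
    """
    total = sum(slots_per_node)
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} is outside 0..{total - 1}")
    # Cumulative slot counts give the first rank of each following node.
    boundaries = list(accumulate(slots_per_node))
    return bisect_right(boundaries, rank)


def ranks_per_node(slots_per_node):
    """Map each node index to the range of ranks it hosts."""
    starts = [0] + list(accumulate(slots_per_node))[:-1]
    return {
        i: range(start, start + count)
        for i, (start, count) in enumerate(zip(starts, slots_per_node))
    }


if __name__ == "__main__":
    # The two layouts from the post, plus the desired 120 + 120 layout.
    for layout in ([120, 20], [120, 30], [120, 120]):
        summary = {
            i: (r.start, r.stop - 1)
            for i, r in ranks_per_node(layout).items()
        }
        print(layout, summary)
    print("rank 117 is on node index", node_of_rank(117, [120, 30]))
```

If the launcher instead distributes ranks round-robin across nodes, the mapping is different and this sketch does not apply.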

Is it possible to resolve this issue under my current hardware conditions? I would like to allocate 120 processes on both node1 and node2.

Note: The version of libMesh I compiled is using DistributedMesh.
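For a sense of scale, the problem size in this post can be worked out with a few lines of Python. This is a rough sketch under two assumptions that the post does not state: a single variable, and a CONSTANT-order MONOMIAL basis (one degree of freedom per element), which is a guess that happens to land near the quoted 2 million. It also assumes the MONOMIAL basis on each element spans all monomials up to the chosen total degree. The function names are hypothetical.

```python
from math import comb


def monomial_dofs_per_element(order, dim=3):
    """Number of monomials of total degree <= order in `dim` variables."""
    return comb(order + dim, dim)


def problem_size(n_per_side, order=0, n_vars=1):
    """Return (elements, dofs) for an n x n x n hexahedral grid.

    Assumption: a discontinuous basis, so DOFs are not shared
    between neighbouring elements.
    """
    elements = n_per_side ** 3
    dofs = elements * monomial_dofs_per_element(order) * n_vars
    return elements, dofs


if __name__ == "__main__":
    elements, dofs = problem_size(130)
    print(f"elements: {elements}, dofs: {dofs}")
    # Average share per rank for the process counts mentioned in the post.
    for ranks in (120, 140, 150, 240):
        print(f"{ranks} ranks: about {dofs / ranks:.0f} dofs per rank")
```

Under these assumptions the 130x130x130 grid has 2,197,000 elements, and spreading it over 240 ranks leaves roughly nine thousand elements per rank.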

roystgnr · 2024-04-22T14:54:28Z

Maintainer

Interconnect speed problems can cause poor scalability (I wouldn't be surprised if that 140-rank run was slower than a 120-rank run on a single node, for instance), but they should never cause an MPI error.

I'd be looking for a software issue, but if you're seeing problems from both libMesh directly and from parallel VTK, I'd look at the MPI level. If MPICH is giving you trouble, see if you can get better results from OpenMPI, or from a newer MPICH version. (Or from an older MPICH version? I'm digging through logs to see which ones have had libMesh-affecting bugs, and one of the first results is from someone who had hangs with 4.2.0b1 that were resolved by downgrading to 4.1.1.) There was even a bug in MPICH 4.0, a default Ubuntu LTS version, that broke most of libMesh. That one couldn't be your problem, since it wasn't interconnect-specific and would break even small jobs, but it's a good example of the sort of rugpull you can still get hit by from the MPI stack these days.
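As a small aid for the version check suggested above, here is a Python sketch that pulls a version number out of whatever text your MPI installation reports and compares it against the two MPICH versions named in this reply. The table holds only those two anecdotes and is not a list of known-bad releases. The reply does not say which point releases are affected, so only exact version strings are matched. The function name and the sample input strings are hypothetical.

```python
import re

# Only the versions named in the reply above, with what was said about them.
REPORTED = {
    "4.0": "reported to have a bug that broke most of libMesh",
    "4.2.0b1": "reported to hang; downgrading to 4.1.1 resolved it",
}

# Matches e.g. 4.0, 4.1.1, 4.2.0b1, 4.2.0rc1.
VERSION_RE = re.compile(r"\b(\d+\.\d+(?:\.\d+)?(?:(?:a|b|rc)\d+)?)\b")


def check_mpich_version(version_text):
    """Return (version, note) for the first version found, else None.

    `note` is None when the version is not one of those mentioned above,
    which says nothing about whether that version is actually sound.
    """
    match = VERSION_RE.search(version_text)
    if match is None:
        return None
    version = match.group(1)
    return version, REPORTED.get(version)


if __name__ == "__main__":
    # Hypothetical sample strings; paste your own version output instead.
    for text in ("MPICH Version: 4.2.0b1", "MPICH Version: 4.1.1"):
        print(text, "->", check_mpich_version(text))
```

A version that is not flagged here may still have problems; the more reliable test is the one described above, rebuilding against a different MPI implementation or release and rerunning the same job.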
