Replies: 1 comment
Interconnect speed problems can cause poor scalability (I wouldn't be surprised if that 140-rank run was slower than a 120-rank run on a single node, for instance), but they should never cause an MPI error. I'd be looking for a software issue, but if you're seeing problems from both libMesh directly and from parallel VTK, I'd look at the MPI level. If MPICH is giving you trouble, see if you get better results from OpenMPI, or from a newer MPICH version. (Or from an older MPICH version? I went digging through logs to see which ones have had libMesh-affecting bugs, and one of the first results is from someone who had hangs with 4.2.0b1 that were resolved by downgrading to 4.1.1.) There was even a bug in MPICH 4.0, a default Ubuntu LTS version, that broke most of libMesh. That one couldn't be your problem, since it wasn't interconnect-specific and would break even small jobs, but it's a good example of the sort of rug pull you can still get hit by from the MPI stack these days.
I am currently solving a problem with about 2 million degrees of freedom on a hexahedral mesh divided into a 130x130x130 grid, with the FEFamily set to MONOMIAL. I encountered an issue during the mesh partitioning phase when using MeshTools::Generation::build_cube.
Here is the error log from the system:

I noticed the error involves VTK functions, so I recompiled libMesh without VTK and reran it. Although the previous error did not reappear, the code still hangs during mesh partitioning.
It could also be a hardware issue. While my two compute nodes are connected via an internal network, the network speed might not be sufficient. For instance, allocating 120 processes to node1 and 20 processes to node2 works without issue, but allocating 120 processes to node1 and 30 processes to node2 causes the above error.
Is it possible to resolve this issue under my current hardware conditions? I would like to allocate 120 processes on both node1 and node2.
Note: The version of libMesh I compiled is using DistributedMesh.
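
For scale, here is a rough back-of-the-envelope sketch of the job size described above. It assumes a single scalar variable and order-0 (constant) monomials, which is the only choice consistent with roughly 2 million degrees of freedom on 130x130x130 elements; both are assumptions, not something stated in the post. The helper names are hypothetical, and the per-element count uses the standard formula for monomials of total degree at most p in three variables.

```python
from math import comb


def monomial_dofs_per_element(order: int, dim: int = 3) -> int:
    """Count monomials of total degree <= order in `dim` variables."""
    return comb(order + dim, dim)


def job_size(n_per_side: int, order: int, ranks: int) -> dict:
    """Estimate element and DOF counts for an n x n x n hex grid.

    Assumes one scalar variable and an even split across ranks,
    which a real partitioner will only approximate.
    """
    elements = n_per_side ** 3
    dofs = elements * monomial_dofs_per_element(order)
    return {
        "elements": elements,
        "dofs": dofs,
        "elements_per_rank": elements / ranks,
        "dofs_per_rank": dofs / ranks,
    }


if __name__ == "__main__":
    # Rank counts taken from the post: 120+20, 120+30, and the 120+120 target.
    for ranks in (140, 150, 240):
        size = job_size(n_per_side=130, order=0, ranks=ranks)
        print(
            f"{ranks} ranks: {size['elements']} elements, {size['dofs']} DOFs, "
            f"{size['elements_per_rank']:.0f} elements per rank"
        )
```

Under these assumptions the total element count equals the total DOF count, and the per-rank share changes only modestly between the 140-rank run that works and the 150-rank run that fails.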