
Segmentation fault at PMPI_Iallreduce in distributed Galois (missing MPI_Init?) #425

Open
Barenya255 opened this issue Jan 20, 2025 · 0 comments


Hello,
I've been trying to run distributed Galois for quite some time. I've tried running all the provided apps and have been encountering a segmentation fault.

Command used:
./sssp-pull --startNode=0 $graphPath

Error observed:

[0] Master distribution time : 0.239983 seconds to read 168 bytes in 20 seeks (0.00070005 MBPS)
[0] Starting graph reading.
[0] Reading graph complete.
[0] Edge inspection time: 0.246308 seconds to read 148615096 bytes (603.371 MBPS)
Loading edge-data while creating edges
[0] Edge loading time: 0.529808 seconds to read 271105352 bytes (511.705 MBPS)
[0] Graph construction complete.
[0] InitializeGraph::go called
[0] SSSP::go run 0 called

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6dc8604 in PMPI_Iallreduce () from /lfs/sware/openmpi411/lib/libmpi.so.40
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.x86_64 numactl-libs-2.0.9-7.el7.x86_64

Here are some of my observations from loading the binary into gdb.

The segfault has been occurring in the library libdist/libgalois_dist_async.a, on a PMPI_Iallreduce call.

After observing the segfault, I opened gdb and noticed that a constant address outside the process's memory bounds is being accessed.

This address is loaded into %r9 in the preamble to the MPI_Iallreduce call and is then moved into %rbp inside the callee. The address never seems to be accessible.

As per the System V AMD64 ABI, %r9 carries the sixth argument passed to a function, which in our case is the MPI_COMM_WORLD communicator.
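
For reference, here is the MPI_Iallreduce prototype annotated with the register each argument lands in under that calling convention. This is just my own cross-check against the preamble shown below; the register assignments are the standard System V ones, not anything taken from the Galois sources:

    /* Integer/pointer arguments go in %rdi, %rsi, %rdx, %rcx, %r8, %r9;
       the seventh and later arguments are passed on the stack. */
    int MPI_Iallreduce(const void  *sendbuf,   /* arg 1: %rdi */
                       void        *recvbuf,   /* arg 2: %rsi */
                       int          count,     /* arg 3: %rdx (edx = 1 below) */
                       MPI_Datatype datatype,  /* arg 4: %rcx */
                       MPI_Op       op,        /* arg 5: %r8  */
                       MPI_Comm     comm,      /* arg 6: %r9  (0x44000000 here) */
                       MPI_Request *request);  /* arg 7: on the stack (the push %rax) */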

This could happen if MPI_COMM_WORLD was never initialised, which would suggest that the code path lacks an MPI_Init() call.

Also, gdb could only set a pending (future) breakpoint on MPI_Init, and the segfault in MPI_Iallreduce happens before that breakpoint is ever hit. I don't see any boost_mpi libraries involved.
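
To separate the MPI installation from Galois, I would try a minimal standalone program along the following lines. This is only a sketch of mine (the file name repro.cpp and the build line are assumptions, not anything from the Galois build):

    // repro.cpp -- minimal sketch, not Galois code: checks MPI_Initialized()
    // and then issues the same MPI_Iallreduce call after an explicit MPI_Init.
    // Assumed build/run: mpicxx repro.cpp -o repro && mpirun -np 2 ./repro
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int initialized = 0;
        MPI_Initialized(&initialized);      // legal to call before MPI_Init
        std::printf("MPI initialized yet? %d\n", initialized);

        MPI_Init(&argc, &argv);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local = 1, global = 0;
        MPI_Request req;
        // Non-blocking allreduce over MPI_COMM_WORLD, then wait for completion.
        MPI_Iallreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        std::printf("[%d] sum across ranks = %d\n", rank, global);
        MPI_Finalize();
        return 0;
    }

If this runs cleanly against the same libmpi.so.40, that would support the suspicion that the Galois code path simply never reaches MPI_Init before issuing the Iallreduce.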

This is the preamble to the PMPI_Iallreduce call:

   0x478441 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+145>:	mov    $0x44000000,%r9d
   0x478447 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+151>:	movb   $0x0,-0xe8(%rbp)
   0x47844e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+158>:	mov    $0x58000001,%r8d
   0x478454 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+164>:	mov    $0x4c000808,%ecx
   0x478459 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+169>:	mov    $0x1,%edx
   0x47845e <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+174>:	mov    %rsi,-0x3a8(%rbp)
   0x478465 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+181>:	mov    %rdi,-0x3a0(%rbp)
   0x47846c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+188>:	mov    %rax,-0x388(%rbp)
   0x478473 <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+195>:	vmovdqa %xmm3,-0x100(%rbp)
   0x47847b <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+203>:	push   %rax
   0x47847c <_ZN4SSSPILb1EE2goERN6galois6graphs9DistGraphI8NodeDatajEE+204>:	callq  0x413230 <MPI_Iallreduce@plt>

This is where the segfault is happening, inside PMPI_Iallreduce:

   0x00007ffff6dc85c8 <+24>:	mov    %r9,%rbp
   0x00007ffff6dc85cb <+27>:	push   %rbx
   0x00007ffff6dc85cc <+28>:	sub    $0x28,%rsp
   0x00007ffff6dc85d0 <+32>:	mov    0x29e041(%rip),%rax        # 0x7ffff7066618
   0x00007ffff6dc85d7 <+39>:	mov    0x60(%rsp),%rbx
   0x00007ffff6dc85dc <+44>:	cmpb   $0x0,(%rax)
   0x00007ffff6dc85df <+47>:	je     0x7ffff6dc8648 <PMPI_Iallreduce+152>
   0x00007ffff6dc85e1 <+49>:	mov    0x29e8d0(%rip),%rax        # 0x7ffff7066eb8
   0x00007ffff6dc85e8 <+56>:	mov    (%rax),%eax
   0x00007ffff6dc85ea <+58>:	sub    $0x2,%eax
   0x00007ffff6dc85ed <+61>:	cmp    $0x2,%eax
   0x00007ffff6dc85f0 <+64>:	ja     0x7ffff6dc8750 <PMPI_Iallreduce+416>
   0x00007ffff6dc85f6 <+70>:	test   %rbp,%rbp
   0x00007ffff6dc85f9 <+73>:	je     0x7ffff6dc8612 <PMPI_Iallreduce+98>
   0x00007ffff6dc85fb <+75>:	cmp    0x29e1c6(%rip),%rbp        # 0x7ffff70667c8
   0x00007ffff6dc8602 <+82>:	je     0x7ffff6dc8612 <PMPI_Iallreduce+98>
=> 0x00007ffff6dc8604 <+84>:	mov    0xe8(%rbp),%eax

Note the address 0x44000000; it does not seem to be accessible at all:

(gdb) p/x *0x44000000
Cannot access memory at address 0x44000000