OpenMPI v5.0.5 with the `sm` BTL is causing my workload to hang/crash.

- All code (application and MPI library) compiled with Clang 19.1.0.
- The MPI library was built from the official release tarballs.
- OS: Ubuntu 22.04.5 LTS.
- Hardware: AMD Threadripper Pro 5995WX.
- MPI is initialised in funneled mode (`MPI_THREAD_FUNNELED`).
- Parameter checking is enabled: `--mca mpi_param_check=1`.
At a high level, my workload consists of an array of requests that are progressed via calls to `MPI_Testsome`. This array grows, shrinks, and shuffles over time, and the requests don't necessarily belong to the same `MPI_Comm`. They are mainly related to point-to-point communication via `MPI_Irecv`, `MPI_Recv_init`, etc.; some requests are long-lived whereas others are short-lived.
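To make the pattern concrete, here is a minimal, self-contained sketch of that progress loop (illustrative only; buffer sizes, tags, and request counts are made up and do not come from the real application):

```c
#include <mpi.h>
#include <string.h>

enum { N = 256 };

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size, prev = (rank + size - 1) % size;
    char rbuf0[N], rbuf1[N], sbuf[N];
    memset(sbuf, 'x', N);

    /* A mix of persistent and non-persistent receives plus a nonblocking
     * send, all tracked in one request array (the real array grows,
     * shrinks, shuffles, and spans several communicators). */
    MPI_Request reqs[3];
    MPI_Recv_init(rbuf0, N, MPI_CHAR, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Start(&reqs[0]);
    MPI_Irecv(rbuf1, N, MPI_CHAR, prev, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(sbuf, N, MPI_CHAR, next, 1, MPI_COMM_WORLD, &reqs[2]);
    MPI_Send(sbuf, N, MPI_CHAR, next, 0, MPI_COMM_WORLD);

    /* Progress loop: poll the array with MPI_Testsome until everything
     * posted above has completed. */
    int remaining = 3;
    while (remaining > 0) {
        int outcount, indices[3];
        MPI_Testsome(3, reqs, &outcount, indices, MPI_STATUSES_IGNORE);
        if (outcount != MPI_UNDEFINED)
            remaining -= outcount;
    }

    MPI_Request_free(&reqs[0]); /* release the now-inactive persistent request */
    MPI_Finalize();
    return 0;
}
```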
- If I use `sm` (via `--mca btl self,sm`), the application starts hanging pretty quickly.
- If I make most, if not all, sends synchronous (i.e. `MPI_Issend`), the application hangs immediately (i.e. synchronous sends make things worse).
- To get rid of most of the hanging, I have to add an `MPI_Testall` before every call to `MPI_Testsome` (sketched below), but the application still usually hangs after a while.
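For clarity, the `MPI_Testall` workaround amounts to something like this (hypothetical helper; the real code also records which requests completed):

```c
#include <mpi.h>

/* Sketch of the workaround: an extra MPI_Testall pass is made immediately
 * before every MPI_Testsome call, to force additional progress. */
static void progress_with_workaround(int nreqs, MPI_Request reqs[],
                                     int *outcount, int indices[])
{
    int all_done;
    /* Extra progress pass; the flag itself is not otherwise used.
     * (Note: if this completes everything, the MPI_Testsome below reports
     * MPI_UNDEFINED, which the caller has to cope with.) */
    MPI_Testall(nreqs, reqs, &all_done, MPI_STATUSES_IGNORE);

    /* The original per-iteration completion check. */
    MPI_Testsome(nreqs, reqs, outcount, indices, MPI_STATUSES_IGNORE);
}
```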
If I do any of the following, everything works perfectly fine, regardless of whether sending is synchronous (e.g. `MPI_Issend`) and whether or not `MPI_Testall` is injected before `MPI_Testsome`:

- `tcp` is forced instead of `sm` (via `--mca btl self,tcp`; example command lines below),
- or OpenMPI v4.1.6 is used,
- or MPICH v4.2.2 is used.
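For reference, the failing and working runs differ only in the BTL selection, e.g. (process count and binary name are placeholders):

```sh
# hangs/crashes with the shared-memory BTL
mpirun --mca mpi_param_check 1 --mca btl self,sm  -np 8 ./app

# works when TCP is forced instead
mpirun --mca mpi_param_check 1 --mca btl self,tcp -np 8 ./app
```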
If I sanitise my application with ASan+UBSan (I did not sanitise OpenMPI itself, as in the past that has produced lots of false positives), I get `heap-buffer-overflow` errors originating from OpenMPI's `sm` implementation details.
Example 1:

```
#0 0x5de326394471 in memcpy sanitizer_common_interceptors_memintrinsics.inc:115:5
#1 0x75daae589067 in sm_prepare_src btl_sm_module.c
#2 0x75daaf20f6fa in mca_pml_ob1_send_request_schedule_once (/openmpi-5.0.5/lib/libmpi.so.40+0x20f6fa)
#3 0x75daaf2083a7 in mca_pml_ob1_recv_frag_callback_ack (/openmpi-5.0.5/lib/libmpi.so.40+0x2083a7)
#4 0x75daae58a231 in mca_btl_sm_component_progress btl_sm_component.c
#5 0x75daae50fdec in opal_progress (/openmpi-5.0.5/lib/libopen-pal.so.80+0x21dec)
#6 0x75daaf083a5d in ompi_request_default_test_all (/openmpi-5.0.5/lib/libmpi.so.40+0x83a5d)
#7 0x75daaf0c5062 in MPI_Testall (/openmpi-5.0.5/lib/libmpi.so.40+0xc5062)
SUMMARY: AddressSanitizer: heap-buffer-overflow btl_sm_module.c in sm_prepare_src
Shadow bytes around the buggy address:
0x5040000bb780: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 fa
0x5040000bb800: fa fa 00 00 00 00 06 fa fa fa 00 00 00 00 00 fa
0x5040000bb880: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 06 fa
0x5040000bb900: fa fa fa fa fa fa fa fa fa fa 00 00 00 00 00 fa
0x5040000bb980: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 00
=>0x5040000bba00:[fa]fa 00 00 00 00 03 fa fa fa 00 00 00 00 00 fa
0x5040000bba80: fa fa 00 00 00 00 00 fa fa fa fd fd fd fd fd fa
0x5040000bbb00: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 fa
0x5040000bbb80: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 00 fa
0x5040000bbc00: fa fa fd fd fd fd fd fd fa fa fd fd fd fd fd fa
0x5040000bbc80: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 00 fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
```
Example 2:

```
#0 0x589dde322471 in memcpy sanitizer_common_interceptors_memintrinsics.inc:115:5
#1 0x70e2eb5077e7 in mca_btl_sm_fbox_sendi btl_sm_sendi.c
#2 0x70e2eb50742c in mca_btl_sm_sendi (/openmpi-5.0.5/lib/libopen-pal.so.80+0x9d42c)
#3 0x70e2ec1fdebc in mca_pml_ob1_process_pending_packets (/openmpi-5.0.5/lib/libmpi.so.40+0x1fdebc)
#4 0x70e2ec20939f in mca_pml_ob1_rget_completion pml_ob1_recvreq.c
#5 0x70e2eb507ba5 in mca_btl_sm_get (/openmpi-5.0.5/lib/libopen-pal.so.80+0x9dba5)
#6 0x70e2ec20a013 in mca_pml_ob1_recv_request_progress_rget (/openmpi-5.0.5/lib/libmpi.so.40+0x20a013)
#7 0x70e2ec20c45e in mca_pml_ob1_recv_req_start (/openmpi-5.0.5/lib/libmpi.so.40+0x20c45e)
#8 0x70e2ec200a87 in mca_pml_ob1_irecv (/openmpi-5.0.5/lib/libmpi.so.40+0x200a87)
#9 0x70e2ec0b3a1b in PMPI_Irecv (/openmpi-5.0.5/lib/libmpi.so.40+0xb3a1b)
0x5290005092b8 is located 0 bytes after 16568-byte region [0x529000505200,0x5290005092b8)
allocated by thread T0 here:
#0 0x589dde324160 in malloc /clang-19.1.0/compiler-rt/lib/asan/asan_malloc_linux.cpp:68:3
#1 0x70e2eb48e5c9 in opal_free_list_grow_st (/openmpi-5.0.5/lib/libopen-pal.so.80+0x245c9)
SUMMARY: AddressSanitizer: heap-buffer-overflow btl_sm_sendi.c in mca_btl_sm_fbox_sendi
Shadow bytes around the buggy address:
0x529000509000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x529000509080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x529000509100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x529000509180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x529000509200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x529000509280: 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa fa
0x529000509300: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x529000509380: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x529000509400: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x529000509480: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x529000509500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
```