I have a similar problem. In my case, I need to place the completion queue buffer in GPU memory. I have a Tesla K40 and a ConnectX-4, and the nvidia_peermem module is loaded. With a GPU memory address (returned by cudaMalloc) I get a segmentation fault (bad address), but the problem does not occur with a CPU address (returned by malloc). Were you able to solve the issue you mentioned, and if so, how?
The GDB backtrace is:
#0 0x00007ffff6d16cb4 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007ffc805e7b16 in copy_to_scat (scat=0x7ff9bc18f6e0, buf=buf@entry=0x7ff9bc1894c0, size=size@entry=0x7ffa167fe2ec, max=max@entry=1, ctx=ctx@entry=0x1c1e8780) at ../providers/mlx5/qp.c:88
#2 0x00007ffc805e7e07 in copy_to_scat (ctx=0x1c1e8780, max=1, size=0x7ffa167fe2ec, buf=0x7ff9bc1894c0, scat=<optimized out>) at ../providers/mlx5/qp.c:78
#3 mlx5_copy_to_send_wqe (qp=qp@entry=0x7ff9bc18a230, idx=<optimized out>, buf=0x7ff9bc1894c0, size=<optimized out>) at ../providers/mlx5/qp.c:161
#4 0x00007ffc805e51a4 in mlx5_parse_cqe (lazy=0, cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=<optimized out>, cur_rsc=<optimized out>, cqe=<optimized out>, cqe64=<optimized out>, cq=<optimized out>) at ../providers/mlx5/cq.c:743
#5 mlx5_poll_one (cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=<optimized out>, cur_rsc=<optimized out>, cq=<optimized out>) at ../providers/mlx5/cq.c:904
#6 poll_cq (cqe_ver=1, wc=<optimized out>, ne=<optimized out>, ibcq=0x7ff9bc188d40) at ../providers/mlx5/cq.c:932
#7 mlx5_poll_cq_v1 (ibcq=0x7ff9bc188d40, ne=32, wc=<optimized out>) at ../providers/mlx5/cq.c:1306
#8 0x00007ffce1248ab2 in ibv_poll_cq (wc=0x7ffa167fe5a0, num_entries=32, cq=<optimized out>) /include/infiniband/verbs.h:2456
It seems that ibv_poll_cq is what fails. When I switch to a CPU address, the problem does not occur.
I wonder what is going on.
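One observation from the trace: the fault is in a CPU `memcpy` (frame #0) that `mlx5_parse_cqe` reaches through `mlx5_copy_to_send_wqe` and `copy_to_scat`, so the provider appears to be copying data into the buffer with the CPU while polling. That would plausibly fault when the destination is a `cudaMalloc` address that the CPU cannot dereference. A minimal sketch of one experiment, assuming the copy comes from the mlx5 provider's scatter-to-CQE feature: set the `MLX5_SCATTER_TO_CQE` environment variable to `0` before the QP is created. The helper name `disable_scatter_to_cqe` is hypothetical, and whether this avoids the crash in this setup is untested.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <stdlib.h>

/*
 * Hypothetical helper: ask the mlx5 provider not to scatter small
 * payloads into the CQE, so that it does not have to copy them into
 * the user buffer with the CPU during ibv_poll_cq().
 *
 * Assumption: the provider reads MLX5_SCATTER_TO_CQE when the QP is
 * created, so this has to run before ibv_create_qp().
 *
 * Returns 0 on success, -1 on failure (errno is set by setenv).
 */
int disable_scatter_to_cqe(void)
{
    return setenv("MLX5_SCATTER_TO_CQE", "0", 1 /* overwrite */);
}
```

Calling `disable_scatter_to_cqe()` at the start of `main`, or running the program with `MLX5_SCATTER_TO_CQE=0` set in the shell, should be equivalent. If the segmentation fault goes away, that would suggest the inline copy is what touches the GPU address.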