Modified interesting event of RDMA socket poll #13
In a High Availability (HA) environment with RDMA connections, after forcibly powering off one of the nodes, fio is unable to continue performing I/O operations.
Description:
Environment Details:
We use Pacemaker to set up a High Availability (HA) environment with properly ordered resource startup. The systems are configured as follows:
beegfs-oss1 with RHEL 9p1, kernel 5.14.0-162.6.1.el9_1.x86_64 (IP: 10.10.2.231 for graidstor1, floating VIP 10.10.2.233 for graidstor2)
beegfs-oss2 with RHEL 9p1, kernel 5.14.0-162.6.1.el9_1.x86_64
rocky9p1 with Rocky Linux 9p1, kernel 5.14.0-162.23.1.el9_1.x86_64
beegfs-mgt with RHEL 9p1, kernel 5.14.0-162.6.1.el9_1.x86_64
Issue Reproduction Steps:
graidstor1 (10.10.2.231) is running on beegfs-oss1.
graidstor2 (10.10.2.233) is running on beegfs-oss2.
On the client node, run the fio command.
On beegfs-oss2, execute: echo b > /proc/sysrq-trigger to simulate a power failure.
At this point, the fio process on the client temporarily becomes unresponsive.
graidstor2 (10.10.2.233) is migrated by Pacemaker to beegfs-oss1 and activated there.
However, even after the migration, fio on the client remains unresponsive.
Debug Information:
After enabling BEEGFS_DEBUG and checking the dmesg log, it appears that the process enters an infinite loop in IBVSocket_poll and cannot exit. The relevant dmesg log and code snippet are provided below.
/opt/beegfs/src/client/client_module_7/source/common/toolkit/SocketTk.c
We discovered that RDMA socket polling decides whether an "interesting event occurred" differently from standard sockets: RDMA sockets treat every poll event as interesting. This seems to cause the pollerr state to keep looping after a migration, preventing the socket from reconnecting to the new node.
To address this, we tested changing the condition for interesting events to be more similar to the one used for standard sockets. However, we kept only poll._events as interesting, which allows pollerr to exit the infinite loop and the connection to be re-established.
Do you have any better suggestions or alternative modifications for this issue?
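For reference, the difference between the two conditions can be sketched in plain user-space C. This is only an illustration of the idea described above, not the BeeGFS code: the helper names rdma_interesting_before and rdma_interesting_after are hypothetical, and the sketch assumes the mask returned by the poll call uses the usual POLLIN/POLLERR bit flags from poll.h.

```c
#include <poll.h>
#include <stdbool.h>

/* Hypothetical helper (not a BeeGFS function): the behavior described
 * above for RDMA sockets before the change. Any non-zero poll mask is
 * reported as an interesting event, so a socket that only signals
 * POLLERR still counts as "ready". */
static bool rdma_interesting_before(short mask, short requested)
{
   (void)requested; /* the requested events are not consulted */
   return mask != 0;
}

/* Hypothetical helper (not a BeeGFS function): the tested change.
 * Only bits that the caller actually requested (poll._events in the
 * description above) count as interesting. */
static bool rdma_interesting_after(short mask, short requested)
{
   return (mask & requested) != 0;
}
```

With a caller waiting for POLLIN and a socket that reports only POLLERR, the first helper reports an interesting event while the second does not, which is the case that matters after the storage target has migrated.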
infinite loop issue, dmesg log
infinite loop issue, beegfs-client.log