host: Fix RFNoC graph action queue lockup on action exceptions #730
Processing of the action queue locks up when any action executed in the send_action call throws an exception. Exceptions are not caught in the loop that handles the action queue, so the handling_ongoing queue-locking flag is never released. Any subsequent call to enqueue_action then takes the early exit on the assumption that the actions are already being handled, even though the previous handler exited with an exception. This fix uses an RAII wrapper rather than a manually claimed and released atomic flag to ensure that handling_ongoing is released even under exceptional conditions.
42a1756 to 0d68db5
@hannodewind Don't worry about the CLA checker bot, it's a misconfig on our end (I think). This is all we need for now:
@hannodewind your analysis is correct, and the solution is fine, too. I do want to double-check if we can have the atomic lock without adding this special-case class, but overall, this is an excellent solution.
@hannodewind I think I will modify this to use UHD's
@mbr0wn Thank you for the feedback, I am keen to see the
The change is now in master: 0f2007f
Pull Request Details
Description
Processing of the action queue locks up when any action executed in the send_action call throws an exception. Exceptions are not caught in the loop that handles the action queue, so the handling_ongoing queue-locking flag is never released. Any subsequent call to enqueue_action then takes the early exit on the assumption that the actions are already being handled, even though the previous handler exited with an exception.
This fix uses an RAII wrapper rather than a manually claimed and released atomic flag to ensure that handling_ongoing is released even under exceptional conditions.
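The idea behind the fix can be illustrated with a minimal, self-contained sketch. This is not the code from the PR or from UHD: the class names scoped_flag_claim and action_queue, their members, and the helper queue_recovers_after_exception are all hypothetical, and the queue here is simplified to single-threaded use (the flag only guards against re-entrant handling). It only shows how an RAII guard releases an atomic flag when an action throws.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical RAII guard (not the class from the PR): tries to claim an
// atomic flag on construction and, if it succeeded, releases the flag on
// destruction. The destructor also runs during stack unwinding.
class scoped_flag_claim
{
public:
    explicit scoped_flag_claim(std::atomic<bool>& flag)
        : _flag(flag), _owned(!flag.exchange(true))
    {
    }

    ~scoped_flag_claim()
    {
        if (_owned) {
            _flag.store(false);
        }
    }

    scoped_flag_claim(const scoped_flag_claim&) = delete;
    scoped_flag_claim& operator=(const scoped_flag_claim&) = delete;

    // True if this guard claimed the flag (it was not already set).
    bool owns() const { return _owned; }

private:
    std::atomic<bool>& _flag;
    const bool _owned;
};

// Hypothetical, simplified stand-in for a graph action queue.
class action_queue
{
public:
    using action_t = std::function<void()>;

    void enqueue_action(action_t action)
    {
        _queue.push_back(std::move(action));
        scoped_flag_claim claim(_handling_ongoing);
        if (!claim.owns()) {
            return; // Early exit: an outer call is already draining the queue.
        }
        while (!_queue.empty()) {
            action_t next = std::move(_queue.front());
            _queue.pop_front();
            next(); // May throw; the guard still releases the flag.
        }
    }

    bool handling_ongoing() const { return _handling_ongoing.load(); }

private:
    std::deque<action_t> _queue;
    std::atomic<bool> _handling_ongoing{false};
};

// Returns true if the queue still handles actions after one action threw.
bool queue_recovers_after_exception()
{
    action_queue q;
    bool threw = false;
    try {
        q.enqueue_action([] { throw std::runtime_error("ACK timeout"); });
    } catch (const std::runtime_error&) {
        threw = true;
    }
    int handled = 0;
    q.enqueue_action([&handled] { ++handled; });
    return threw && !q.handling_ongoing() && handled == 1;
}
```

With a manually set and cleared flag, the throwing action would skip the release, and the second enqueue_action call would return on the early exit without handling anything. With the guard, the destructor clears the flag as the exception propagates, so the second action runs.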
Related Issue
Relates to issue #611
Which devices/areas does this affect?
UHD hosts using RFNoC graph
Testing Done
X310 with dual 10GbE links to a server, running both RF inputs at a 200 MHz sample rate using 2x RX streamers.
Stress the server with CPU load (stress-ng can be used), inducing UDP packet drops. (This also relates to #611, which stressed the link using iperf, probably also causing UDP packet drops.)
At some point (difficult to reproduce, but it does happen every so often), one of the RX streamers will experience an overrun, which calls the _overrun_handler -> ACTION_KEY_RX_RESTART_REQ, which in turn calls get_time_now(), doing a peek64 to the device. This peek64 then throws an exception due to an ACK timeout.
This exception is caught all the way up in the thread that called recv on the RX streamer, but the stream is irrecoverable since the graph action queue is locked up.
Checklist