Infinite loop in vrpn_Endpoint_IP::handle_udp_messages #562

Open
tomm opened this issue Aug 20, 2017 · 9 comments

Comments

@tomm

tomm commented Aug 20, 2017

I'm not sure whether to report this bug against vrpn, OSVR-Core or OSVR-RenderManager.

Sometimes the RenderManagerOpenGL*Examples hang when the first call to osvrClientUpdate() is made.
I can reproduce this on Debian 9 if I build the examples like this:
g++ -g -Wall -lGLEW -lGL -losvrRenderManager -losvrClient -losvrClientKit -losvrCommon RenderManagerOpenGLCAPIExample.cpp -o bad

But they work normally if I build them like this:
g++ -g -Wall -lGLEW -lGL -losvrClient -losvrRenderManager -losvrClientKit -losvrCommon RenderManagerOpenGLCAPIExample.cpp -o good

The only difference is the link order. The 'bad' link order is the one used in the cmake build of OSVR-RenderManager, so all these RenderManagerOpenGL*Examples hang on Debian 9.

Debugging this a little: osvrClientUpdate() ultimately calls vrpn_Endpoint_IP::handle_udp_messages without a timeout. Handling each message appears to take long enough that more packets arrive before the previous one has been handled, so the loop never terminates.
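
To illustrate the failure mode described above, here is a minimal simulation sketch. It is not VRPN code: the function name, the struct, and the timing model (one packet arriving every fixed interval, a fixed cost per handled packet) are assumptions made up for this example. It only shows why a "drain until empty" loop stops when handling is faster than arrival and never stops otherwise.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch (not part of VRPN or OSVR).
struct DrainResult {
    std::uint64_t handled;  // packets handled before the loop exited
    bool drained;           // true if the queue emptied on its own
};

// Simulates a "keep reading while data is available" loop.
// Assumed model: packets arrive at t = 0, I, 2I, ... microseconds and each
// one costs a fixed time to handle. safety_cap exists only so the simulation
// itself terminates; the loop being modelled has no such cap.
DrainResult simulate_unbounded_drain(std::uint64_t arrival_interval_us,
                                     std::uint64_t handle_cost_us,
                                     std::uint64_t safety_cap) {
    assert(arrival_interval_us > 0);
    std::uint64_t now_us = 0;
    std::uint64_t handled = 0;
    while (handled < safety_cap) {
        // Number of packets that have arrived up to the current time.
        const std::uint64_t arrived = now_us / arrival_interval_us + 1;
        if (handled >= arrived) {
            return {handled, true};  // nothing pending: the loop exits
        }
        ++handled;                 // handle one packet...
        now_us += handle_cost_us;  // ...which takes time, so more may arrive
    }
    return {handled, false};  // still had pending data when the cap was hit
}
```

With a 1000 µs arrival interval, a 400 µs handling cost drains after a single packet, while a cost of 1000 µs or more always finds another packet waiting and only stops at the artificial cap.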

@russell-taylor
Contributor

The VRPN library used for transport has a function Jane_stop_this_crazy_thing() defined in vrpn_Connection.h that is meant to stop this behavior, which happens most often with video data where there is always more data coming. I was assuming that perhaps the client library or RenderManager was calling this, so that the linking order changed the value of that setting somehow, but neither of them seems to call this. I can't explain how the linking order changes things, but using this call to place a limit on the number of packets handled during each loop iteration should stop the infinite loop.

If the client application is taking longer than one inter-arrival time to handle the packets, this will convert the infinite loop into a bunch of dropped packets and latency (UDP drops more-recent packets). Changing the client code to not do heavyweight processing in the callback should remove the problem in a more robust way.
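
The trade-off described above can be sketched with a small simulation. This is an illustration under stated assumptions, not the VRPN implementation: the names, the fixed arrival interval, the fixed handling cost, and the fixed-size receive buffer that discards newly arriving packets when full are all hypothetical. It shows that a per-update cap makes every update call return, while a slow consumer still accumulates backlog and eventually drops packets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical statistics type for this sketch (not part of VRPN or OSVR).
struct BoundedStats {
    std::uint64_t handled;  // packets processed across all updates
    std::uint64_t dropped;  // packets lost because the buffer was full
    std::uint64_t backlog;  // packets still queued at the end
};

// Simulates `updates` calls to an update function that handles at most
// `max_per_update` packets per call. Assumed model: packets arrive at
// t = 0, I, 2I, ... microseconds; a full buffer discards new arrivals.
BoundedStats simulate_bounded_updates(std::uint64_t arrival_interval_us,
                                      std::uint64_t handle_cost_us,
                                      std::uint64_t max_per_update,
                                      std::uint64_t buffer_capacity,
                                      std::uint64_t updates) {
    assert(arrival_interval_us > 0);
    BoundedStats s{0, 0, 0};
    std::uint64_t now_us = 0;
    std::uint64_t next_arrival_us = 0;

    for (std::uint64_t u = 0; u < updates; ++u) {
        std::uint64_t handled_this_update = 0;
        while (true) {
            // Deliver every packet that has arrived by now.
            while (next_arrival_us <= now_us) {
                if (s.backlog < buffer_capacity) {
                    ++s.backlog;
                } else {
                    ++s.dropped;  // buffer full: the newer packet is lost
                }
                next_arrival_us += arrival_interval_us;
            }
            // The cap is what guarantees this update call returns.
            if (s.backlog == 0 || handled_this_update == max_per_update) {
                break;
            }
            --s.backlog;
            ++s.handled;
            ++handled_this_update;
            now_us += handle_cost_us;
        }
    }
    return s;
}
```

In this model, a 1000 µs arrival interval with a 1500 µs handling cost and a cap of 4 per update gains two queued packets per update, so an 8-packet buffer starts dropping during the fourth update. That matches the point that the cap turns the hang into latency and loss rather than removing the underlying slowness.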

@russell-taylor
Contributor

Given that this is one of the example programs and indeed the example program is using the state interface rather than the callback interface, this means that OSVR Core is taking too long to handle the update, which is surprising to me. I still can't make sense of how changing the linking order changes things, because VRPN is statically linked into all of the libraries that use it.

@russell-taylor
Contributor

It is interesting that although it needs the include directories to depend on osvrClient, neither the library code nor the example code requires linking against osvrClient (only osvrClientKit) on Windows.

@russell-taylor
Contributor

I modified the build in the fix-infinite-loop branch of OSVR-RenderManager to not link against osvrClient, which works on Windows. On Linux, though, I'm getting a link failure on boost/config.hpp, included by boost/units/quantity.hpp, included by osvr/Util/Angles.h, when I try to compile RenderManager, so I can't test it there. (Not sure how OSVR-Core is compiling; it must be defining -fext-numeric-literals along the way...) Okay, after defining that compiler flag I can now compile, and I will test as soon as I reboot to update my X server. Can you pull that branch and see if it fixes the problem on Debian? If so, I'll go ahead and issue a pull request.

@tomm
Author

tomm commented Aug 21, 2017

The examples still hang on Debian 9 when built from the fix-infinite-loop branch.

I'm thinking there might be an issue with libvrpn.a being linked in RenderManager, and also in Core.

A few more data points that might be useful. Here are the top 4 functions in the callgrind output from running the 'good' demo:
85,783,736 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Connection.C:vrpn_noint_select(int, fd_set*, fd_set*, fd_set*, timeval*) [/usr/local/lib/libosvrCommon.so.0.6]
55,006,938 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Shared.C:vrpn_htond(double) [/usr/local/lib/libosvrCommon.so.0.6]
51,649,424 /home/tom/work/deps/OSVR-Core/vendor/eigen/Eigen/src/Jacobi/Jacobi.h:Eigen::JacobiSVD<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 2>::compute(Eigen::Matrix<double, 3, 3, 0, 3, 3> const&, unsigned int)
46,212,986 /home/tom/work/deps/OSVR-Core/vendor/vrpn/vrpn_Connection.C:vrpn_Endpoint_IP::handle_udp_messages(timeval const*) [/usr/local/lib/libosvrCommon.so.0.6]

And here are the top 4 from 'bad':
24,278,260 /usr/include/eigen3/Eigen/src/Core/util/XprHelper.h:Eigen::internal::variable_if_dynamic<long, 0>::variable_if_dynamic(long) [/usr/local/lib/libosvrRenderManager.so]
20,941,376 /usr/include/eigen3/Eigen/src/Core/CoreEvaluators.h:Eigen::internal::evaluator<Eigen::PlainObjectBase<Eigen::Matrix<float, 2, 1, 0, 2, 1> > >::evaluator(Eigen::PlainObjectBase<Eigen::Matrix<float, 2, 1, 0, 2, 1> > const&) [/usr/local/lib/libosvrRenderManager.so]
19,140,285 /usr/include/eigen3/Eigen/src/Jacobi/Jacobi.h:void Eigen::internal::apply_rotation_in_the_plane<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true>, Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true>, double>(Eigen::DenseBase<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true> >&, Eigen::DenseBase<Eigen::Block<Eigen::Matrix<double, 3, 3, 0, 3, 3>, 3, 1, true> >&, Eigen::JacobiRotation const&) [/usr/local/lib/libosvrRenderManager.so]
18,650,966 /build/glibc-p3Km7c/glibc-2.24/elf/dl-lookup.c:do_lookup_x [/lib/x86_64-linux-gnu/ld-2.24.so]

@russell-taylor
Contributor

These traces are consistent with OSVR-Core doing a lot more work in response to incoming messages in one case than it is in the other, thus not being able to complete processing for one report before the next arrives. I'm assuming that you're running the same server in both cases and thus talking to the same hardware devices, so that the only difference is the linking order.

RenderManager and Core both use and link against VRPN, but they also both use Eigen. If Eigen is being called in one case but not the other, or taking more time in one case than the other, this would cause the behavior you're seeing. When I build RenderManager, I point it at the Eigen header files in Core -- are you pointing at them or at another set? (Considering whether this is a header/library mismatch issue.)

@tomm
Author

tomm commented Sep 12, 2017

I still have the issue when I build OSVR-RenderManager while pointing it at the Eigen headers in Core.

@rpavlik
Member

rpavlik commented Sep 19, 2017 via email

@russell-taylor
Contributor

We just made a merge in OSVR-RenderManager that may address this issue. It looks like the cause may have been overlinking. Have a look at the latest master, or at least fa2e92d5daa2f8c0ff79011cfa8d6a96a427c55b, and see if that fixes the issue.
