This fixes a deadlock that can rarely happen when trying to connect to a system.

It is a typical deadlock where two different mutexes are locked in opposite order:
- the first thread locks mutex A and then tries to lock mutex B, and
- the second thread locks mutex B and then tries to lock mutex A.

At this point both threads are waiting on the other thread and neither can continue.

In this case the two relevant locks are:
- _new_system_callback_mutex, and
- _systems_mutex

1. The user calls subscribe_on_new_system(), which locks _new_system_callback_mutex and then checks whether there is already a system that they should be notified about. Inside is_any_system_connected(), systems() is accessed, which requires _systems_mutex.
2. At the same time, a message from a system arrives, which grabs _systems_mutex and creates the System. It then forwards the first heartbeat to the new system, which in turn calls notify_on_discover(), which of course requires _new_system_callback_mutex, the lock that the first thread has already taken. We deadlock. (See the sketch below.)

The possible fixes I thought of were:

1. Always lock these mutexes in the same order. This can make sense for strictly hierarchical data but doesn't really seem right here.
2. Make the scope in which the locks are held smaller, potentially avoiding the need to hold multiple locks at once. I didn't find a way to do that in this case: for both locks the scope seemed correct, and making it smaller would introduce new problems (invalidating an iterator, or calling the callback twice by mistake).
3. Collapse/merge the two conflicting locks. This had the drawback that I had to switch to a recursive_mutex to prevent a stall when we try to lock the same mutex twice. We could have added methods with and without locking and then called the correct one; however, when calling the methods from outside SystemImpl the context is not so obvious and it is not clear which one to call. Basically, that makes it rather convoluted, so a recursive_mutex seems slightly cleaner. (A sketch of this approach follows the deadlock example below.)
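For illustration, here is a minimal, self-contained sketch of the lock-ordering (ABBA) deadlock described above. The function and mutex names are hypothetical stand-ins for the MAVSDK members, not the actual code:

```cpp
// Minimal sketch of the opposite-order (ABBA) deadlock; names are
// illustrative stand-ins, not the actual MAVSDK implementation.
#include <mutex>
#include <thread>

std::mutex new_system_callback_mutex; // stands in for _new_system_callback_mutex
std::mutex systems_mutex;             // stands in for _systems_mutex

void subscribe_on_new_system_path()
{
    // Thread 1: lock A, then need B (as in is_any_system_connected()).
    std::lock_guard<std::mutex> callback_lock(new_system_callback_mutex);
    std::lock_guard<std::mutex> systems_lock(systems_mutex);
}

void incoming_heartbeat_path()
{
    // Thread 2: lock B, then need A (as in notify_on_discover()).
    std::lock_guard<std::mutex> systems_lock(systems_mutex);
    std::lock_guard<std::mutex> callback_lock(new_system_callback_mutex);
}

int main()
{
    // If these two paths interleave, each thread holds one mutex and
    // blocks forever on the other: running this can hang.
    std::thread t1(subscribe_on_new_system_path);
    std::thread t2(incoming_heartbeat_path);
    t1.join();
    t2.join();
}
```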
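And a minimal sketch of the chosen fix (option 3), assuming the two mutexes are merged into a single std::recursive_mutex so re-entrant locking on the same thread cannot stall. The class and member names here are hypothetical, chosen only to mirror the flow described above:

```cpp
// Sketch of merging the two conflicting locks into one std::recursive_mutex;
// illustrative only, not the actual MAVSDK code.
#include <functional>
#include <mutex>
#include <vector>

class MavsdkImplSketch {
public:
    void subscribe_on_new_system(std::function<void()> callback)
    {
        std::lock_guard<std::recursive_mutex> lock(_mutex);
        _callbacks.push_back(callback);
        // Re-locks _mutex via is_any_system_connected(); fine, it's recursive.
        if (is_any_system_connected()) {
            callback();
        }
    }

    void on_heartbeat_from_new_system()
    {
        std::lock_guard<std::recursive_mutex> lock(_mutex);
        _systems.push_back(1); // placeholder for creating the new System
        notify_on_discover();  // re-locks _mutex on the same thread; no deadlock
    }

private:
    bool is_any_system_connected()
    {
        std::lock_guard<std::recursive_mutex> lock(_mutex);
        return !_systems.empty();
    }

    void notify_on_discover()
    {
        std::lock_guard<std::recursive_mutex> lock(_mutex);
        for (auto& cb : _callbacks) {
            cb();
        }
    }

    // Replaces both _new_system_callback_mutex and _systems_mutex.
    std::recursive_mutex _mutex;
    std::vector<std::function<void()>> _callbacks;
    std::vector<int> _systems;
};
```

With a single lock there is only one ordering, so the ABBA interleaving can no longer occur; the recursive_mutex merely tolerates the same thread re-acquiring the lock through the nested calls.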