core: fix deadlock on discovery

This fixes a rare deadlock that can happen when trying to connect to
a system.

It's a classic deadlock where two different mutexes are locked in
opposite order, so what happens is that:
- the first thread locks mutex A and then tries to lock mutex B, and
- the second thread locks mutex B and then tries to lock mutex A.
At this point each thread is waiting on the other and neither can
continue.
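
For illustration (this is not MAVSDK code, the names are made up), the
pattern boils down to:

    #include <mutex>
    #include <thread>

    std::mutex mutex_a;
    std::mutex mutex_b;

    void first_thread()
    {
        std::lock_guard<std::mutex> lock_a(mutex_a);
        // If second_thread() has taken mutex_b by now, both block forever.
        std::lock_guard<std::mutex> lock_b(mutex_b);
    }

    void second_thread()
    {
        std::lock_guard<std::mutex> lock_b(mutex_b);
        std::lock_guard<std::mutex> lock_a(mutex_a);
    }

    int main()
    {
        std::thread t1(first_thread);
        std::thread t2(second_thread);
        t1.join(); // may never return if the bad interleaving occurs
        t2.join();
        return 0;
    }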

In this case the two relevant locks are:
- _new_system_callback_mutex, and
- _systems_mutex

1. The user calls subscribe_on_new_system(), which locks
   _new_system_callback_mutex and then checks whether there is already
   a system that they should be notified about.
   Inside is_any_system_connected(), systems() is accessed, which
   requires the _systems_mutex.
2. At the same time, a message from a system arrives, which grabs the
   _systems_mutex and creates the System. It then forwards the first
   heartbeat to the new system, which in turn calls
   notify_on_discover() which, of course, requires the
   _new_system_callback_mutex, i.e. the lock that the first thread has
   already taken. We deadlock (see the sketch below).
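
Condensed as a stand-in (the method and mutex names mirror MavsdkImpl,
but the bodies are simplified down to just the locking):

    #include <mutex>

    std::mutex _new_system_callback_mutex; // "mutex A"
    std::mutex _systems_mutex;             // "mutex B"

    bool is_any_system_connected()
    {
        std::lock_guard<std::mutex> lock(_systems_mutex); // thread 1 blocks here
        return true;
    }

    // Thread 1: the user subscribes to discovery.
    void subscribe_on_new_system()
    {
        std::lock_guard<std::mutex> lock(_new_system_callback_mutex); // holds A
        if (is_any_system_connected()) { // tries to take B
            // notify the user right away
        }
    }

    void notify_on_discover()
    {
        std::lock_guard<std::mutex> lock(_new_system_callback_mutex); // thread 2 blocks here
    }

    // Thread 2: the first heartbeat of a new system arrives.
    void receive_message()
    {
        std::lock_guard<std::mutex> lock(_systems_mutex); // holds B
        notify_on_discover(); // tries to take A
    }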

The possible fixes I thought of were:
1. Always lock these locks in the same order. This can make sense for
   strictly hierarchical data but doesn't really seem right here.
2. Make the scope in which the locks are held smaller, potentially
   avoiding the need to hold multiple locks at once. I didn't find a
   way to do that in this case. For both locks the scope seemed
   correct, and making it smaller would introduce new problems
   (invalidating an iterator or calling the callback twice by
   mistake).
3. Collapse/merge the two conflicting locks. This had the drawback
   that I had to switch to a recursive_mutex to prevent a stall when
   the same thread tries to lock the mutex twice (see the sketch
   below). We could have added methods with and without locking and
   then called the correct one; however, when calling the methods from
   SystemImpl the context is not so obvious and it's not clear which
   one to call. Basically, that makes it rather convoluted, so
   recursive_mutex seems slightly cleaner.
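
To show why the merged lock has to be recursive (again a sketch, the
bodies are stand-ins): after the merge, subscribe_on_new_system()
holds _systems_mutex and then ends up in systems(), which locks the
same mutex again on the same thread. With a plain std::mutex that
second lock is undefined behavior (in practice a hang);
std::recursive_mutex just increments an ownership count:

    #include <mutex>

    std::recursive_mutex _systems_mutex;

    void systems()
    {
        // Nested lock on the same thread: fine for a recursive_mutex,
        // a deadlock for a plain std::mutex.
        std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
    }

    void subscribe_on_new_system()
    {
        std::lock_guard<std::recursive_mutex> lock(_systems_mutex); // first lock
        systems(); // re-enters without stalling
    }

    int main()
    {
        subscribe_on_new_system();
        return 0;
    }
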
julianoes committed Apr 17, 2022
1 parent dd039ac commit dcbf7b6
Showing 2 changed files with 8 additions and 11 deletions.
14 changes: 7 additions & 7 deletions src/mavsdk/core/mavsdk_impl.cpp
@@ -67,7 +67,7 @@ MavsdkImpl::~MavsdkImpl()
     }
 
     {
-        std::lock_guard<std::mutex> lock(_systems_mutex);
+        std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
         _systems.clear();
     }

@@ -109,7 +109,7 @@ std::vector<std::shared_ptr<System>> MavsdkImpl::systems() const
 {
     std::vector<std::shared_ptr<System>> systems_result{};
 
-    std::lock_guard<std::mutex> lock(_systems_mutex);
+    std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
     for (auto& system : _systems) {
         // We ignore the 0 entry because it's just a null system.
         // It's only created because the older, deprecated API needs a
@@ -212,7 +212,7 @@ void MavsdkImpl::receive_message(mavlink_message_t& message, Connection* connection)
         return;
     }
 
-    std::lock_guard<std::mutex> lock(_systems_mutex);
+    std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
 
     // The only situation where we create a system with sysid 0 is when we initialize the connection
     // to the remote.
@@ -371,7 +371,7 @@ ConnectionResult MavsdkImpl::setup_udp_remote(
     if (ret == ConnectionResult::Success) {
         new_conn->add_remote(remote_ip, remote_port);
         add_connection(new_conn);
-        std::lock_guard<std::mutex> lock(_systems_mutex);
+        std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
         make_system_with_component(0, 0, true);
     }
     return ret;
@@ -500,7 +500,7 @@ void MavsdkImpl::make_system_with_component(
 
 void MavsdkImpl::notify_on_discover()
 {
-    std::lock_guard<std::mutex> lock(_new_system_callback_mutex);
+    std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
     if (_new_system_callback) {
         auto temp_callback = _new_system_callback;
         call_user_callback([temp_callback]() { temp_callback(); });
@@ -509,7 +509,7 @@
 
 void MavsdkImpl::notify_on_timeout()
 {
-    std::lock_guard<std::mutex> lock(_new_system_callback_mutex);
+    std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
     if (_new_system_callback) {
         auto temp_callback = _new_system_callback;
         call_user_callback([temp_callback]() { temp_callback(); });
@@ -518,7 +518,7 @@
 
 void MavsdkImpl::subscribe_on_new_system(const Mavsdk::NewSystemCallback& callback)
 {
-    std::lock_guard<std::mutex> lock(_new_system_callback_mutex);
+    std::lock_guard<std::recursive_mutex> lock(_systems_mutex);
     _new_system_callback = callback;
 
     if (_new_system_callback != nullptr && is_any_system_connected()) {
5 changes: 1 addition & 4 deletions src/mavsdk/core/mavsdk_impl.h
@@ -111,11 +111,8 @@ class MavsdkImpl {
     std::mutex _connections_mutex{};
     std::vector<std::shared_ptr<Connection>> _connections{};
 
-    mutable std::mutex _systems_mutex{};
-
+    mutable std::recursive_mutex _systems_mutex{};
     std::vector<std::pair<uint8_t, std::shared_ptr<System>>> _systems{};
-
-    std::mutex _new_system_callback_mutex{};
     Mavsdk::NewSystemCallback _new_system_callback{nullptr};
 
     Time _time{};
