Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid dialog destroy while waiting for dialog lock #4064

Merged
merged 1 commit into from
Sep 9, 2024

Conversation

nanangizz
Copy link
Member

@nanangizz nanangizz commented Sep 4, 2024

A dialog may be destroyed while other thread is waiting for dialog lock. For example when caller & callee trying to disconnect the call simultaneously:

  1. Thread 1 is about to sending a BYE, but gets context-switched when waiting for dialog lock, which may happen in pjsip_dlg_create_request(), pjsip_inv_send_msg(), or pjsip_dlg_send_request().
  2. Thread 2 receives BYE, sends 200/BYE response. The BYE transaction termination leads to call disconnection and both dialog & invite-session get destroyed.
  3. Back to thread 1, the dialog lock being waited for has just been destroyed (may lead to undefined behavior).

This PR patch offers best effort to avoid the premature dialog destroy by:

  • incrementing the dialog lock reference before waiting for the dialog lock in pjsip_dlg_inc_lock(),
  • using invite-session reference when dialog lock is not used, note that dialog won't be destroyed when there is still an invite session using it.

This approach won't work if the context switch happens before the reference counter (of dialog or invite-session) is incremented, however that should be able to be handled in the app layer (see the PJSUA approach below).

Note that PJSUA uses acquire_call() to acquire the dialog lock before accessing the call's dialog. The acquire_call() calls pjsip_dlg_try_inc_lock() instead of pjsip_dlg_inc_lock() so it won't block when the dialog lock is not available (and perhaps safer when the lock is destroyed). The pattern is like:

pj_status_t pjsua_call_hangup(..)
{
  // try to lock dialog
  status = acquire_call(..., &dlg);
  if (status != PJ_SUCCESS) {
    // failed to lock dialog
    return status;
  }

  // create & send BYE
  pjsip_inv_end_session(..);
  pjsip_inv_send_msg(..);

  // release dialog
  pjsip_dlg_dec_lock(dlg);
}

@nanangizz nanangizz merged commit cbccd49 into master Sep 9, 2024
36 checks passed
@nanangizz nanangizz deleted the race-dlg-destroy branch September 9, 2024 03:14
@pwinckles
Copy link

I am using the java pjsua2 bindings and built off head today. Placing a call results in a segfault, and I believe it may be related to this PR. I tried again, building off the 2.14.1 tag, and had no issues. I think it is related to this PR because I set the log level to 6, and the last thing that's logged is "UAC dialog created" and "Entering pjsip_dlg_inc_lock(), sess_count=" is never logged.

@nanangizz
Copy link
Member Author

I am using the java pjsua2 bindings and built off head today. Placing a call results in a segfault, and I believe it may be related to this PR. I tried again, building off the 2.14.1 tag, and had no issues. I think it is related to this PR because I set the log level to 6, and the last thing that's logged is "UAC dialog created" and "Entering pjsip_dlg_inc_lock(), sess_count=" is never logged.

I've just retested both Java & pjsua app sample from the latest master, outgoing & incoming calls seem to work just fine. Actually the Github CI include many tests with calls. Also, Java/PJSUA2 or C/PJSUA should behave the same for changes in this PR (which are done to PJSIP layer). So the crash may be caused by something else.

@pwinckles
Copy link

I've just retested both Java & pjsua app sample from the latest master, outgoing & incoming calls seem to work just fine. Actually the Github CI include many tests with calls. Also, Java/PJSUA2 or C/PJSUA should behave the same for changes in this PR (which are done to PJSIP layer). So the crash may be caused by something else.

Could very well be caused by something else. I have no proof other than that code that works with 2.13.1 and 2.14.1 consistently segfaults when placing a call off head and that the segfault occurs immediately after "UAC dialog created" is logged.

In the setup that produced the error, it was running two accounts and one of the accounts was attempting to place a call to the other, if that matters.

@nanangizz
Copy link
Member Author

One of my repro scenario was calling self on the same account. Considering the PJSIP layer does not really recognize account, I don't think different accounts will behave differently, especially when the problem seems to occur in the early stage of making call, i.e: creating the UAC dialog.

If you have the call stack when the crash happens, perhaps it can provide a bit more hints. Also, if it is possible for you to reproduce it using a sample app (e.g: sample.java), perhaps with some slight modifaction to trigger the problematic scenario, it may help the investigation.

@sauwming
Copy link
Member

Do we need to change our group lock implementation as well?
i.e. in grp_lock_acquire(), do we need to call pj_grp_lock_add_ref(glock) first before trying to acquire the locks?

@nanangizz
Copy link
Member Author

Do we need to change our group lock implementation as well? i.e. in grp_lock_acquire(), do we need to call pj_grp_lock_add_ref(glock) first before trying to acquire the locks?

Not sure, haven't heard any other issue, perhaps not for now?

@pwinckles
Copy link

pwinckles commented Sep 12, 2024

@nanangizz You're correct that it wasn't this PR that is responsible. I'm in the process of doing a bisect to identify the correct commit.

[edit] #4073

I'm sorry for spamming this merged PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants