
New session mutex in Streaming plugin (see #2106) #2115

Merged: 1 commit from switch-hangup-race into master on Apr 29, 2020

Conversation

lminiero
Member

First attempt at addressing the race condition described in #2106. Apparently, the cause is always the session->mountpoint changing from when we check it to when we use it, which is problematic when we use that info to decrease references. This patch adds a new mutex to the janus_streaming_session structure, that we can use when updating or checking the mountpoint property in critical sections.
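
For context, here is a rough sketch of what the patch does (the struct layout is abbreviated and illustrative, not the actual code):

typedef struct janus_streaming_session {
	janus_plugin_session *handle;
	janus_streaming_mountpoint *mountpoint;
	/* ... other fields ... */
	janus_mutex mutex;	/* New: protects reads/writes of mountpoint */
} janus_streaming_session;

/* Wherever we check or update session->mountpoint in a critical section: */
janus_mutex_lock(&session->mutex);
janus_streaming_mountpoint *mp = session->mountpoint;
if(mp != NULL) {
	/* mp can't change under us here, so it's safe to use it,
	 * e.g., to decrease its references */
}
janus_mutex_unlock(&session->mutex);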

I didn't test this extensively: I'll let @atoppi do his magic with the script he had to replicate the issue. Hopefully that will solve the problem AND avoid regressions (e.g., deadlocks that may occur due to the way the new mutex works). Until then, consider this WIP.

@atoppi
Member

atoppi commented Apr 28, 2020

Kudos to @TomFFF for the script.
I've been running 4 of them in the last 30 minutes, no locks or crashes detected so far.
So lgtm 👍

@TomFFF
Contributor

TomFFF commented Apr 28, 2020

Thanks @atoppi and @lminiero for the quick fix!

I'm also testing it and I don't see any problems so far.

@lminiero
Member Author

Good to know, thanks for the feedback! I'll let you test it for a bit longer, then, and if everything is still ok by tomorrow morning I'll merge 👍

@lminiero
Member Author

@TomFFF any update with your tests in the past few hours? Can we merge this?

@TomFFF
Contributor

TomFFF commented Apr 29, 2020

For me it looks good! I couldn't reproduce it anymore, so I think it can be merged and the issue closed.

@lminiero
Member Author

Ack, merging then 👍

@lminiero lminiero merged commit 09391e8 into master Apr 29, 2020
@lminiero lminiero deleted the switch-hangup-race branch April 29, 2020 10:13
@TomFFF
Contributor

TomFFF commented May 5, 2020

@lminiero, not sure if we should continue here or open a new issue?

But we found a deadlock in the streaming plugin caused by this change.

When there is no more data in janus_streaming_relay_thread, there is some "Notify users this mountpoint is done" code.

This takes a lock on the mountpoint mutex first, and then a lock on each viewer's session (L7893):

janus_mutex_lock(&mountpoint->mutex);
GList *viewer = g_list_first(mountpoint->viewers);
/* Prepare JSON event */
json_t *event = json_object();
json_object_set_new(event, "streaming", json_string("event"));
json_t *result = json_object();
json_object_set_new(result, "status", json_string("stopped"));
json_object_set_new(event, "result", result);
while(viewer) {
	janus_streaming_session *session = (janus_streaming_session *)viewer->data;
	if(session == NULL) {
		mountpoint->viewers = g_list_remove_all(mountpoint->viewers, session);
		viewer = g_list_first(mountpoint->viewers);
		continue;
	}
	janus_mutex_lock(&session->mutex);

At the same time, janus_streaming_hangup_media is executed:

First it takes a lock on sessions_mutex (so new requests that need it will wait):

janus_mutex_lock(&sessions_mutex);
janus_streaming_hangup_media_internal(handle);

And in janus_streaming_hangup_media_internal, first a lock on session->mutex and after that on mp->mutex:

janus_mutex_lock(&session->mutex);
janus_streaming_mountpoint *mp = session->mountpoint;
if(mp) {
	janus_mutex_lock(&mp->mutex);

So if you are very unlucky, the two threads will wait on each other, and because one of them is holding sessions_mutex the entire server is more or less 'broken'.
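
To make the inversion concrete, here is a minimal standalone sketch of the two lock orders (plain pthreads, nothing Janus-specific; the sleep() calls just widen the race window, like the sleep I mentioned below):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for mountpoint->mutex and session->mutex */
static pthread_mutex_t mp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t session_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Relay-thread order: mountpoint first, then session */
static void *relay_thread(void *arg) {
	(void)arg;
	pthread_mutex_lock(&mp_mutex);
	sleep(1);	/* Widen the race window */
	pthread_mutex_lock(&session_mutex);	/* Blocks forever once hangup holds it */
	pthread_mutex_unlock(&session_mutex);
	pthread_mutex_unlock(&mp_mutex);
	return NULL;
}

/* Hangup order: session first, then mountpoint */
static void *hangup_thread(void *arg) {
	(void)arg;
	pthread_mutex_lock(&session_mutex);
	sleep(1);
	pthread_mutex_lock(&mp_mutex);	/* Blocks forever once relay holds it */
	pthread_mutex_unlock(&mp_mutex);
	pthread_mutex_unlock(&session_mutex);
	return NULL;
}

int main(void) {
	pthread_t t1, t2;
	pthread_create(&t1, NULL, relay_thread, NULL);
	pthread_create(&t2, NULL, hangup_thread, NULL);
	pthread_join(t1, NULL);	/* Never returns: classic AB/BA deadlock */
	pthread_join(t2, NULL);
	printf("done\n");	/* Never reached */
	return 0;
}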

Stacktrace: https://pastebin.com/i7JCr5EU

I don't have a capture with lock debugging enabled, but if you prefer I can try to reproduce it that way (I guess I would need to add a sleep between the locks to trigger the deadlock, though, otherwise it looks nearly impossible to hit).

I would open a PR if I knew what the desired solution was.

@lminiero
Member Author

lminiero commented May 5, 2020

Ok, I'll look into it.

@lminiero
Member Author

lminiero commented May 5, 2020

Unfortunately hangup_media is not the only place where we lock the session first and the mountpoint after that: the same happens when handling many requests (watch, switch, etc.). As such, the right fix is probably to change the only place where we lock the mountpoint first, which seems to be the notification you mentioned.

That said, it might not be that simple, considering we work on the mountpoint's list of viewers and that's what gives us the sessions we need. Locking the mountpoint just to clone the list of viewers, and then unlocking to iterate on the copy, might work, since that copy wouldn't need to be protected by the mountpoint lock; at the same time, it might cause reference issues for sessions, unless we make sure the copy adds its own references too (see the sketch below). I'll prepare a PR for you to test: not sure it will be ready today, more likely tomorrow.
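
Something along these lines, as a sketch of the idea (assuming we grab a reference on each session while still holding the mountpoint lock, and assuming the janus_refcount helpers and the session's ref field as used elsewhere in the plugin; the actual PR may look different):

/* Hold the mountpoint lock only long enough to copy the viewers list,
 * taking a reference on each session so it can't go away while we iterate */
janus_mutex_lock(&mountpoint->mutex);
GList *viewers_copy = g_list_copy(mountpoint->viewers);
GList *item = viewers_copy;
while(item) {
	janus_streaming_session *session = (janus_streaming_session *)item->data;
	if(session != NULL)
		janus_refcount_increase(&session->ref);
	item = item->next;
}
janus_mutex_unlock(&mountpoint->mutex);

/* Now iterate on the copy without the mountpoint lock: taking
 * session->mutex here can no longer deadlock with hangup_media */
GList *viewer = viewers_copy;
while(viewer) {
	janus_streaming_session *session = (janus_streaming_session *)viewer->data;
	if(session != NULL) {
		janus_mutex_lock(&session->mutex);
		/* ... notify this viewer that the mountpoint is done ... */
		janus_mutex_unlock(&session->mutex);
		janus_refcount_decrease(&session->ref);
	}
	viewer = viewer->next;
}
g_list_free(viewers_copy);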

@lminiero
Member Author

lminiero commented May 5, 2020

I was wrong: the same inverted lock order happens in "destroy" too. I'll have to think about how to fix things there as well.

@lminiero
Member Author

lminiero commented May 5, 2020

@TomFFF please test the PR above. I'm of course interested in fixing the new deadlock, but even more in making sure we don't reintroduce the original issue you opened. The patch should ensure we always lock the mountpoint first and then the session, but since I didn't have time to test it, there may still be issues. Looking forward to your feedback.

@TomFFF
Contributor

TomFFF commented May 5, 2020

Thanks again for the fast reply! I'll start testing it now and give some feedback tomorrow.

@TomFFF
Contributor

TomFFF commented May 6, 2020

Did some 'stress' tests during the night on a test server and couldn't reproduce any issue.

I will start using this version on our real servers.

@lminiero
Member Author

lminiero commented May 7, 2020

@TomFFF any updates on the production tests? Thanks!

@TomFFF
Contributor

TomFFF commented May 7, 2020

Still running without any problems so far, so it looks OK to me.

@lminiero
Member Author

lminiero commented May 7, 2020

Ack, I'll merge then 👍 Thanks for spotting the new issue and for the help testing the fix!
