
New session mutex in Streaming plugin (see #2106) #2115

Merged: 1 commit from switch-hangup-race into master on Apr 29, 2020

Conversation

lminiero
Member

First attempt at addressing the race condition described in #2106. Apparently, the cause is always the session->mountpoint changing from when we check it to when we use it, which is problematic when we use that info to decrease references. This patch adds a new mutex to the janus_streaming_session structure, that we can use when updating or checking the mountpoint property in critical sections.
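
For context, here is a rough sketch of what the patch does (the struct layout is abbreviated and illustrative, not the actual code):

typedef struct janus_streaming_session {
	janus_plugin_session *handle;
	janus_streaming_mountpoint *mountpoint;
	/* ... other fields ... */
	janus_mutex mutex;	/* New: protects reads/writes of mountpoint */
} janus_streaming_session;

/* Wherever we check or update session->mountpoint in a critical section: */
janus_mutex_lock(&session->mutex);
janus_streaming_mountpoint *mp = session->mountpoint;
if(mp != NULL) {
	/* mp can't change under us here, so it's safe to use it,
	 * e.g., to decrease its references */
}
janus_mutex_unlock(&session->mutex);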

I didn't test this extensively: I'll let @atoppi do his magic with the script he had to replicate the issue. Hopefully that will solve the problem AND avoid regressions (e.g., deadlocks that may occur due to the way the new mutex works). Until then, consider this WIP.

@atoppi
Member

atoppi commented Apr 28, 2020

Kudos to @TomFFF for the script.
I've been running 4 of them in the last 30 minutes, no locks or crashes detected so far.
So lgtm 👍

@TomFFF
Contributor

TomFFF commented Apr 28, 2020

Thanks @atoppi and @lminiero for the quick fix!

I'm also testing it and I don't see any problems so far.

@lminiero
Member Author

Good to know, thanks for the feedback! I'll let you test it for a bit longer, then, and if everything is still ok by tomorrow morning I'll merge 👍

@lminiero
Member Author

@TomFFF any update with your tests in the past few hours? Can we merge this?

@TomFFF
Contributor

TomFFF commented Apr 29, 2020

For me it looks good! I couldn't reproduce it anymore, so I think it can be merged and the issue closed.

@lminiero
Member Author

Ack, merging then 👍

@lminiero lminiero merged commit 09391e8 into master Apr 29, 2020
@lminiero lminiero deleted the switch-hangup-race branch April 29, 2020 10:13
@TomFFF
Contributor

TomFFF commented May 5, 2020

@lminiero, not sure if we should continue here or open a new issue?

But we found a deadlock in the streaming plugin caused by this change.

When there is no more data in janus_streaming_relay_thread, there is some "Notify users this mountpoint is done" code.

This takes a lock on the mountpoint mutex first, and then a lock on each viewer's session (L7893):

janus_mutex_lock(&mountpoint->mutex);
GList *viewer = g_list_first(mountpoint->viewers);
/* Prepare JSON event */
json_t *event = json_object();
json_object_set_new(event, "streaming", json_string("event"));
json_t *result = json_object();
json_object_set_new(result, "status", json_string("stopped"));
json_object_set_new(event, "result", result);
while(viewer) {
	janus_streaming_session *session = (janus_streaming_session *)viewer->data;
	if(session == NULL) {
		mountpoint->viewers = g_list_remove_all(mountpoint->viewers, session);
		viewer = g_list_first(mountpoint->viewers);
		continue;
	}
	janus_mutex_lock(&session->mutex);

At the same time, janus_streaming_hangup_media is executed:

First it takes a lock on sessions_mutex (so new requests that need it will wait):

janus_mutex_lock(&sessions_mutex);
janus_streaming_hangup_media_internal(handle);

And in janus_streaming_hangup_media_internal, first a lock on session->mutex and after that on mp->mutex:

janus_mutex_lock(&session->mutex);
janus_streaming_mountpoint *mp = session->mountpoint;
if(mp) {
	janus_mutex_lock(&mp->mutex);

So if you are very unlucky, the two threads will wait on each other, and because one of them is holding sessions_mutex the entire server is more or less 'broken'.
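
To make the inversion concrete, here is a minimal standalone sketch of the two lock orders (plain pthreads, nothing Janus-specific; the sleep() calls just widen the race window, like the sleep I mentioned below):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins for mountpoint->mutex and session->mutex */
static pthread_mutex_t mp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t session_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Relay-thread order: mountpoint first, then session */
static void *relay_thread(void *arg) {
	(void)arg;
	pthread_mutex_lock(&mp_mutex);
	sleep(1);	/* Widen the race window */
	pthread_mutex_lock(&session_mutex);	/* Blocks forever once hangup holds it */
	pthread_mutex_unlock(&session_mutex);
	pthread_mutex_unlock(&mp_mutex);
	return NULL;
}

/* Hangup order: session first, then mountpoint */
static void *hangup_thread(void *arg) {
	(void)arg;
	pthread_mutex_lock(&session_mutex);
	sleep(1);
	pthread_mutex_lock(&mp_mutex);	/* Blocks forever once relay holds it */
	pthread_mutex_unlock(&mp_mutex);
	pthread_mutex_unlock(&session_mutex);
	return NULL;
}

int main(void) {
	pthread_t t1, t2;
	pthread_create(&t1, NULL, relay_thread, NULL);
	pthread_create(&t2, NULL, hangup_thread, NULL);
	pthread_join(t1, NULL);	/* Never returns: classic AB/BA deadlock */
	pthread_join(t2, NULL);
	printf("done\n");	/* Never reached */
	return 0;
}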

Stacktrace: https://pastebin.com/i7JCr5EU

I don't have a capture with lock debugging enabled, but if you prefer I can try to reproduce it that way (I guess I would need to add a sleep between the locks to trigger the deadlock, though, otherwise it looks nearly impossible to hit).

I would open a PR if I knew what the desired solution was.

@lminiero
Member Author

lminiero commented May 5, 2020

Ok, I'll look into it.

@lminiero
Member Author

lminiero commented May 5, 2020

Unfortunately hangup_media is not the only place where we lock the session first and the mountpoint after that: the same happens when handling many requests (watch, switch, etc.). As such, the right fix is probably to change the only place where we lock the mountpoint first, which seems to be the notification you mentioned.

That said, it might not be that simple, considering we work on the mountpoint's list of viewers and that's what gives us the sessions we need. Locking the mountpoint just to clone the list of viewers, and then unlocking to iterate on the copy, might work, since that copy wouldn't need to be protected by the mountpoint lock; at the same time, it might cause reference issues for sessions, unless we make sure the copy adds its own references too (see the sketch below). I'll prepare a PR for you to test: not sure it will be ready today, more likely tomorrow.
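
Something along these lines, as a sketch of the idea (assuming we grab a reference on each session while still holding the mountpoint lock, and assuming the janus_refcount helpers and the session's ref field as used elsewhere in the plugin; the actual PR may look different):

/* Hold the mountpoint lock only long enough to copy the viewers list,
 * taking a reference on each session so it can't go away while we iterate */
janus_mutex_lock(&mountpoint->mutex);
GList *viewers_copy = g_list_copy(mountpoint->viewers);
GList *item = viewers_copy;
while(item) {
	janus_streaming_session *session = (janus_streaming_session *)item->data;
	if(session != NULL)
		janus_refcount_increase(&session->ref);
	item = item->next;
}
janus_mutex_unlock(&mountpoint->mutex);

/* Now iterate on the copy without the mountpoint lock: taking
 * session->mutex here can no longer deadlock with hangup_media */
GList *viewer = viewers_copy;
while(viewer) {
	janus_streaming_session *session = (janus_streaming_session *)viewer->data;
	if(session != NULL) {
		janus_mutex_lock(&session->mutex);
		/* ... notify this viewer that the mountpoint is done ... */
		janus_mutex_unlock(&session->mutex);
		janus_refcount_decrease(&session->ref);
	}
	viewer = viewer->next;
}
g_list_free(viewers_copy);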

@lminiero
Member Author

lminiero commented May 5, 2020

I was wrong: the same inverted lock order happens in "destroy" too. I'll have to think about how to fix things there as well.

@lminiero
Member Author

lminiero commented May 5, 2020

@TomFFF please test the PR above. I'm of course interested in fixing the new deadlock, but even more in making sure we don't reintroduce the original issue you opened. The patch should ensure we always lock the mountpoint first and then the session, but since I didn't have time to test it, there may still be issues. Looking forward to your feedback.

@TomFFF
Contributor

TomFFF commented May 5, 2020

Thanks again for the fast reply! I'll start testing it now and give some feedback tomorrow.

@TomFFF
Contributor

TomFFF commented May 6, 2020

Did some 'stress' tests during the night on a test server and couldn't reproduce any issue.

I will start using this version on our real servers.

@lminiero
Member Author

lminiero commented May 7, 2020

@TomFFF any updates on the production tests? Thanks!

@TomFFF
Contributor

TomFFF commented May 7, 2020

Still running without any problems so far, so it looks OK to me.

@lminiero
Member Author

lminiero commented May 7, 2020

Ack, I'll merge then 👍 Thanks for spotting the new issue and for the help testing the fix!
