New session mutex in Streaming plugin (see #2106) #2115
Conversation
Kudos to @TomFFF for the script.
Good to know, thanks for the feedback! I'll let you test it for a bit longer, then, and if everything is still ok by tomorrow morning I'll merge 👍
@TomFFF any update with your tests in the past few hours? Can we merge this?
For me it looks good! Couldn't reproduce it anymore, so I think it can be merged.
Ack, merging then 👍
@lminiero, not sure if I should continue here or open a new issue? But we found a deadlock in the streaming plugin caused by this change. When there is no more data in the mountpoint, the plugin notifies all viewers: this takes a lock on the mountpoint mutex first, and then a lock on each viewer's session (janus-gateway/plugins/janus_streaming.c, lines 7878 to 7893 in 0f6849e, with the session lock at line 7893).
At the same time, hangup_media first takes a lock on sessions->mutex, so new requests that need it will wait (janus-gateway/plugins/janus_streaming.c, lines 4375 to 4376 in 0f6849e), and then a lock on the mountpoint mutex (janus-gateway/plugins/janus_streaming.c, lines 4405 to 4408 in 0f6849e).
So if you are very unlucky both threads will wait for each other, and because the lock on sessions->mutex is still held, new requests will block as well. Stacktrace: https://pastebin.com/i7JCr5EU. I don't have a capture with lock debugging on, but if you prefer one I can try to reproduce it (though I guess I'd need to add a sleep between the locks to trigger the deadlock, otherwise it looks impossible to hit). I would open a PR if I knew what the desired solution was.
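To make the inverted ordering concrete, here is a minimal, self-contained sketch of the two-lock inversion described above; the thread and mutex names are illustrative, not the actual plugin code:

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t mountpoint_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t session_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Mimics the notification path: mountpoint lock first, then session lock */
static void *notify_viewers(void *arg) {
	(void)arg;
	pthread_mutex_lock(&mountpoint_mutex);
	usleep(1000);	/* widen the race window, like the suggested sleep */
	pthread_mutex_lock(&session_mutex);	/* blocks if hangup holds it */
	printf("notify: got both locks\n");
	pthread_mutex_unlock(&session_mutex);
	pthread_mutex_unlock(&mountpoint_mutex);
	return NULL;
}

/* Mimics hangup_media: session lock first, then mountpoint lock */
static void *hangup_media(void *arg) {
	(void)arg;
	pthread_mutex_lock(&session_mutex);
	usleep(1000);
	pthread_mutex_lock(&mountpoint_mutex);	/* blocks if notify holds it */
	printf("hangup: got both locks\n");
	pthread_mutex_unlock(&mountpoint_mutex);
	pthread_mutex_unlock(&session_mutex);
	return NULL;
}

int main(void) {
	pthread_t t1, t2;
	pthread_create(&t1, NULL, notify_viewers, NULL);
	pthread_create(&t2, NULL, hangup_media, NULL);
	/* With the sleeps in place, these joins typically never return */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
```

With the sleeps widening the race window, each thread ends up holding the lock the other one needs, which is exactly the hang captured in the stacktrace.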
Ok, I'll look into it.
Unfortunately hangup_media is not the only place where we lock the session first and the mountpoint after that: it also happens in the management of many requests (watch, switch, etc.). As such, the right fix is probably to change the only place where we lock the mountpoint first, which seems to be the notification you mentioned. That said, it might not be that simple, considering we work on a mountpoint list and that gives us the sessions we need. Locking the mountpoint just to clone the list of viewers, and then unlocking to iterate on the copy, might work, since that copy wouldn't need to be protected by the mountpoint lock; at the same time, it might cause reference issues for sessions unless we make sure the copy adds its own references too (see the sketch below). I'll prepare a PR for you to test: not sure it will be ready today, more likely tomorrow.
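A minimal sketch of that copy-then-iterate idea, assuming GLib lists and an illustrative per-session reference count (the names mirror the plugin's style but are not the actual code):

```c
#include <glib.h>

/* Hypothetical session type with a reference count (illustrative only) */
typedef struct session {
	gint ref;
	/* ... */
} session;

static void session_ref(session *s)   { g_atomic_int_inc(&s->ref); }
static void session_unref(session *s) {
	if(g_atomic_int_dec_and_test(&s->ref))
		g_free(s);
}

static void notify_session(session *s) {
	/* Send the event: no mountpoint lock held at this point */
	g_print("notifying session %p\n", (gpointer)s);
}

void notify_viewers(GList *viewers, GMutex *mountpoint_mutex) {
	/* Clone the viewers list under the mountpoint lock, taking a
	 * reference on each session so none is freed while we iterate */
	g_mutex_lock(mountpoint_mutex);
	GList *copy = NULL;
	for(GList *l = viewers; l != NULL; l = l->next) {
		session *s = (session *)l->data;
		session_ref(s);
		copy = g_list_prepend(copy, s);
	}
	g_mutex_unlock(mountpoint_mutex);
	/* Iterate on the private copy without holding the mountpoint lock,
	 * so the session-first lock order elsewhere can't deadlock us */
	for(GList *l = copy; l != NULL; l = l->next) {
		session *s = (session *)l->data;
		notify_session(s);
		session_unref(s);
	}
	g_list_free(copy);
}
```

The key design point is that the private copy owns its own references, so sessions can't disappear mid-iteration even though the mountpoint lock has already been released.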
I was wrong: the same swapped logic happens in "destroy" too. I'll have to think about how to fix things there too.
@TomFFF please test the PR above. I'm of course interested in fixing the new deadlock, but even more in making sure we don't reintroduce the original issue you opened. The patch should ensure we always lock the mountpoint first, and then the session, but since I didn't have time to test there may still be issues. Looking forward to your feedback.
Thanks again for the fast reply!
Did some 'stress' tests during the night on a test server and couldn't reproduce any issue. I will start using this version on our real servers.
@TomFFF any updates on the production tests? Thanks!
Still running without any problems so far, so it looks OK to me.
Ack, I'll merge then 👍 Thanks for spotting the new issue and for the help testing the fix!
First attempt at addressing the race condition described in #2106. Apparently, the cause is always the `session->mountpoint` changing from when we check it to when we use it, which is problematic when we use that info to decrease references. This patch adds a new mutex to the `janus_streaming_session` structure, which we can use when updating or checking the `mountpoint` property in critical sections. I didn't test this extensively: I'll let @atoppi do his magic with the script he had to replicate the issue. Hopefully that will solve the problem AND avoid regressions (e.g., deadlocks that may occur due to the way the new mutex works). Until then, consider this WIP.