Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MM-53994] Fix potential stuck session due to ws client leak #492

Merged
merged 1 commit into from
Aug 21, 2023

Conversation

streamer45
Copy link
Collaborator

Summary

This was a fun one. A session was stuck in the call state after the client explicitly left the call and the RTC connection was closed, only to reconnect on the ws channel every now and then hours after the fact.

They key issue here was that we were improperly using the state of the WebSocket object field (callsClient.ws.ws) as a way to differentiate between an explicit disconnect (e.g. user leaving call) vs an event triggered disconnect (e.g. network issue). Unfortunately the logic didn't account for the potential edge case of an explicit disconnect happening right after a network disconnect.

The sequence of events was roughly the following:

  1. A websocket disconnect happens (network/service related).
    a. this.ws is set to null in the ws.onclose handler.
    b. close() gets called and initializes the reconnect handler.
  2. User explicitly leaves call before the ws client is able to reconnect.
    a. callsClient.disconnect() gets called.
    b. callsClient.ws.close() gets called.
    c. callsClients.ws.ws is null due to step 1a so we try to reconnect again instead of terminating the ws connection.
  3. callsClient object is cleared from window but essentially leaked given its ws client was never properly closed and will endlessly try to reconnect on any future disconnect.

Note: I am aware the above sequence was likely more useful to me, happy to explain anything unclear.

To fix the above we move the reconnect logic into its own method and solely rely on callsClient.ws.closed to decide whether we should reconnect or not.

In addition to the above, we also fix a related issue on the server side which would have prevented the stuck session (not the client side leak though). The problem here was that a reconnected session may have have changed its websocket connection ID which would then differ from the original session ID used to track the RTC session.

Big thanks to @wiggin77 for providing client logs which turned out to be key into figure out what was going on.

Ticket Link

https://mattermost.atlassian.net/browse/MM-53994

@streamer45 streamer45 added the 2: Dev Review Requires review by a core committer label Aug 9, 2023
@streamer45 streamer45 requested a review from cpoile August 9, 2023 22:40
@streamer45 streamer45 self-assigned this Aug 9, 2023
@streamer45 streamer45 added this to the v0.19.0 - MM 9.0 milestone Aug 9, 2023
@codecov-commenter
Copy link

Codecov Report

Patch coverage has no change and project coverage change: -0.03% ⚠️

Comparison is base (5c059da) 5.62% compared to head (099ef84) 5.60%.

Additional details and impacted files
@@           Coverage Diff            @@
##            main    #492      +/-   ##
========================================
- Coverage   5.62%   5.60%   -0.03%     
========================================
  Files         21      21              
  Lines       4178    4193      +15     
========================================
  Hits         235     235              
- Misses      3926    3941      +15     
  Partials      17      17              
Files Changed Coverage Δ
server/rtcd.go 6.82% <0.00%> (ø)
server/session.go 0.00% <0.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +357 to +359
// If we can't find one, we resort to looping through all the sessions to
// check against the originalConnID field. This would be necessary only if
// the session reconnected throughout the call with a new ws connection ID.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes sense. Good comments.

this.removeAllListeners('message');
} else {
this.emit('close', code);
close() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we would we have the same problem on mobile?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check that, thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sent mattermost/mattermost-mobile#7513, great catch :)

@streamer45 streamer45 added 3: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Aug 21, 2023
@streamer45 streamer45 merged commit 10f9f4b into main Aug 21, 2023
@streamer45 streamer45 deleted the MM-53994 branch August 21, 2023 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants