
Transitioning to offline does not overtly fail reconnect attempt, causing reconnect to delay by up to 22 seconds #10325

Closed
ChumpChief opened this issue May 18, 2022 · 6 comments · Fixed by #10378

@ChumpChief
Contributor

This issue repros against FRS specifically, and probably against anything using the R11s driver (the service must be remote for this repro -- a locally hosted service like Tinylicious won't repro it since no network connection is required). I can repro it using the Brainstorm demo, updated to Fluid 0.59.2001.

In the dev tools Network tab, change throttling to "Offline", then change it back to "No throttling". The client remains disconnected for about 22 seconds.

Root cause:
Upon transitioning to offline, we immediately attempt a reconnect (which is obviously going to fail, since we have no network). Upon receiving the "connect_error" from socket.io, we permit socket.io to try its built-in reconnection once. We expect a "reconnect_attempt" event when this happens, so that on a second "connect_error" we call fail() and overtly error out. However, for some reason the "reconnect_attempt" handler is not firing, so the second call to the "connect_error" handler just quietly returns without calling fail(), since this.reconnectAttempts is still 0. From the documentation it looks like "reconnect_attempt" may be expected to fire on the Manager rather than the Socket, but I don't know for sure that this is the issue.
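
For illustration only (this is not the driver's actual code), here is a minimal socket.io-client v4 sketch of the suspected Manager-vs-Socket distinction: a "reconnect_attempt" listener registered on the Socket never fires, while the same listener on the Manager (socket.io) does. The URL is a placeholder.

```ts
import { io } from "socket.io-client";

// Placeholder endpoint, single built-in reconnect attempt (mirrors the scenario above).
const socket = io("https://example.invalid", { reconnectionAttempts: 1 });

// Fires on the Socket for every failed connection attempt.
socket.on("connect_error", (err) => {
    console.log("connect_error:", err.message);
});

// In socket.io-client v4, reconnection lifecycle events moved to the Manager,
// so a Socket-level listener like this never runs and a counter it increments stays at 0.
socket.on("reconnect_attempt", () => {
    console.log("socket-level reconnect_attempt (not reached in v4)");
});

// The event fires here instead.
socket.io.on("reconnect_attempt", (attempt) => {
    console.log("manager-level reconnect_attempt:", attempt);
});
```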

Because the connection attempt is stuck in limbo, it doesn't fail until the orderingServiceHandshakeTimeout fires 22 seconds later, at which point the retry loop finally operates as expected.

There are a few improvements that could be considered for this flow, but probably the most straightforward fix would be to ensure the "connect_error" handler fails loudly after the second try.

A good spot for a breakpoint -- observe that it hits twice while this.reconnectAttempts is still 0:
https://github.com/microsoft/FluidFramework/pull/8820/files#diff-2a14ac91cb7e71182f4e83e6ab2f9c98d76eec15bd23f62376c441e87f31cdd0R399

@ghost ghost added the triage label May 18, 2022
@ChumpChief
Contributor Author

@znewton would you be able to take a look at this please?

@ChumpChief ChumpChief changed the title Transitioning to offline does not overtly fail, causing reconnect to delay by up to 22 seconds Transitioning to offline does not overtly fail reconnect attempt, causing reconnect to delay by up to 22 seconds May 18, 2022
@gamishra434 gamishra434 self-assigned this May 20, 2022
@gamishra434
Contributor

This issue has existed since the inception of R11s/FRS. It will be taken care of as part of the disaster recovery work.

@ChumpChief
Contributor Author

> This issue has existed since the inception of R11s/FRS. It will be taken care of as part of the disaster recovery work.

The PR I linked that introduced the issue was merged this past February, so it's more recent than that.

@znewton znewton self-assigned this May 20, 2022
@znewton
Contributor

znewton commented May 20, 2022

Thanks for clarifying @ChumpChief. I misread this at first glance. I'm working on a fix now. I believe the root of the issue is that a TransportError is treated as always resolvable via long-polling failover. However, your example shows this is clearly not true, since a TransportError is also thrown when "Offline". The easy fix is to add a check using the isOnline() method from driver-utils. I'm trying a couple of other approaches first, but I've confirmed that isOnline() works locally.
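
As a rough sketch of that guard (not the merged change), and assuming the isOnline()/OnlineStatus exports from @fluidframework/driver-utils behave as described here (they consult navigator.onLine), the check could look roughly like this:

```ts
import { isOnline, OnlineStatus } from "@fluidframework/driver-utils";

// Hypothetical helper name, for illustration only: before treating a TransportError
// as recoverable via long-polling fallback, check whether the browser even reports
// a network connection.
function shouldTryPollingFallback(): boolean {
    if (isOnline() === OnlineStatus.Offline) {
        // No network at all -- falling back to long polling cannot help,
        // so fail fast and let the caller's retry/backoff logic run.
        return false;
    }
    // Otherwise the transport error may be a websocket-specific problem
    // (proxy, firewall, etc.) that long polling can work around.
    return true;
}
```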

@ChumpChief
Contributor Author

Thanks Zach! Do you have thoughts on why the "reconnect_attempt" handler isn't running in the above repro? I'm concerned that the reconnect counting logic isn't working as expected in this case -- better isOnline() checking would let us avoid running that logic, but in cases where the reconnect counting logic does need to run, I think this would still repro.

@znewton
Contributor

znewton commented May 20, 2022

Yep, I had the same concern, so I moved away from using "reconnect_attempt" and started tracking the "connect_error" count instead. It works for both the offline/online scenario and the polling fallback. Opening a PR now.
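
A hedged sketch of that approach (illustrative names and endpoint, not the exact code from the PR): count "connect_error" events directly and fail loudly once socket.io's single built-in retry has also failed, rather than relying on a "reconnect_attempt" listener to bump a counter.

```ts
import { io } from "socket.io-client";

// Placeholder endpoint and fail() handler, purely for illustration.
const socket = io("https://ordering.example.invalid", { reconnectionAttempts: 1 });

const fail = (error: Error): void => {
    socket.disconnect();
    console.error("Connection failed, surfacing error to the retry loop:", error.message);
};

let connectErrorCount = 0;

socket.on("connect_error", (error) => {
    connectErrorCount++;
    if (connectErrorCount === 1) {
        // First failure: allow socket.io's single built-in reconnect attempt
        // (which may fall back to long polling) to run.
        return;
    }
    // Second failure: the fallback attempt also failed (or the client is simply
    // offline), so fail loudly instead of waiting ~22 seconds for the
    // ordering-service handshake timeout to fire.
    fail(error);
});
```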
