Transitioning to offline does not overtly fail reconnect attempt, causing reconnect to delay by up to 22 seconds #10325
Comments
@znewton would you be able to take a look at this please?
This issue has existed since the inception of R11s/FRS. It will be taken care of as part of the disaster recovery work.
The PR I linked that introduced the issue was merged this past February, so the issue is more recent than that.
Thanks for clarifying @ChumpChief. I misread this at first glance. I'm working on a fix now. I believe the root of the issue is that TransportError is being treated as always resolvable via long-polling failover. However, this is clearly not true per your example, since TransportError is also thrown when "Offline". The easy fix is just adding a check using the method
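The specific method referenced in that comment is not shown here; as a purely hypothetical sketch of the kind of check described, one could refuse the long-polling fallback when the browser reports that it is offline (the function name and the use of navigator.onLine are illustrative assumptions, not the actual fix):

```typescript
// Hypothetical sketch only -- not the actual driver code or the method referenced above.
// Idea: a TransportError should only be treated as "recoverable via long-polling
// failover" when the network is actually up; if the browser reports offline,
// a long-polling retry cannot help either, so the error should surface instead.
function shouldFailoverToLongPolling(error: { type?: string }): boolean {
    if (typeof navigator !== "undefined" && navigator.onLine === false) {
        return false;
    }
    // Otherwise a transport-level error may just mean websockets are blocked,
    // and retrying over long-polling is worth attempting.
    return error.type === "TransportError";
}
```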
Thanks Zach! Do you have thoughts on the "reconnect_attempt" handler not running in the above repro? I'm concerned that the reconnect counting logic isn't working as expected in this case -- we could avoid running that logic with better
Yep, I did have concerns, so I moved away from using
This issue repros against FRS specifically, and probably anything using the R11s driver (though the service must be remote for this repro -- a locally hosted service like Tinylicious won't repro since it doesn't require a network connection). I can repro using the Brainstorm demo, updated to Fluid 0.59.2001.
In the dev tools network tab, change throttling to "Offline", then change it back to "No throttling". The client remains disconnected for about 22 seconds.
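Not part of the original report, but one minimal way to see the gap is to log the container's connection state transitions with timestamps. This sketch assumes an already-loaded Fluid container object that emits "connected" and "disconnected" events (as in the 0.59-era container APIs):

```typescript
// Minimal sketch for observing the ~22 second reconnect gap in the console.
// `container` is assumed to be a loaded Fluid container exposing connection events.
declare const container: {
    on(event: "connected" | "disconnected", listener: () => void): void;
};

let disconnectedAt: number | undefined;

container.on("disconnected", () => {
    disconnectedAt = Date.now();
    console.log("Container disconnected");
});

container.on("connected", () => {
    if (disconnectedAt !== undefined) {
        console.log(`Reconnected after ${(Date.now() - disconnectedAt) / 1000}s`);
        disconnectedAt = undefined;
    }
});
```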
Root cause:
Upon transitioning to offline, we immediately attempt a reconnect (which is obviously going to fail, since we have no network). Upon receiving the "connect_error" from socket.io, we permit socket.io to try its built-in reconnection once. We expect a "reconnect_attempt" event when this happens, so that on a second "connect_error" we call fail() and overtly error out. However, for some reason the "reconnect_attempt" handler is not firing, so the second call to the "connect_error" handler just quietly returns without calling fail(), since this.reconnectAttempts is still 0. From the documentation it looks like "reconnect_attempt" may be expected to fire on the Manager rather than the Socket, but I don't know for sure that this is the issue.
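For reference (not from the original report): in socket.io-client v3/v4, reconnection lifecycle events such as "reconnect_attempt" are emitted by the Manager (accessible as socket.io), not by the Socket itself, so a handler registered on the Socket would indeed never fire. A small sketch of the difference, with a placeholder URL:

```typescript
import { io } from "socket.io-client";

// Sketch only, illustrating where socket.io-client (v3/v4) emits its events.
// The URL is a placeholder, not the actual ordering service endpoint.
const socket = io("https://example.com");

// Connection errors for the current attempt are emitted on the Socket:
socket.on("connect_error", (err) => {
    console.log("connect_error on Socket:", err.message);
});

// Reconnection lifecycle events are emitted on the Manager (socket.io),
// so a "reconnect_attempt" handler registered on the Socket never fires:
socket.io.on("reconnect_attempt", (attempt) => {
    console.log("reconnect_attempt on Manager, attempt #", attempt);
});
```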
Because the connection attempt is stuck in limbo, it doesn't fail until the orderingServiceHandshakeTimeout fires 22 seconds later, at which point the retry loop operates as expected.
There are a few improvements that could be considered for this flow, but probably the most straightforward fix would be to ensure the "connect_error" handler fails loudly after the second try.
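A rough sketch of that shape (illustrative only, not the actual driver code -- the function name, options, and the fail callback are stand-ins for whatever the R11s driver uses): count failed attempts directly in the "connect_error" handler instead of depending on a "reconnect_attempt" event that may never fire, and fail loudly once the budget is exhausted.

```typescript
import { io, Socket } from "socket.io-client";

// Illustrative sketch of "fail loudly after the second connect_error".
function connectWithBoundedRetries(
    url: string,
    fail: (error: Error) => void,
): Socket {
    const maxConnectAttempts = 2;
    let connectAttempts = 0;

    // Allow socket.io one built-in reconnection attempt of its own.
    const socket = io(url, { reconnection: true, reconnectionAttempts: 1 });

    socket.on("connect_error", (error: Error) => {
        connectAttempts++;
        if (connectAttempts >= maxConnectAttempts) {
            // Surface the failure immediately rather than waiting ~22 seconds
            // for the ordering service handshake timeout to fire.
            socket.disconnect();
            fail(error);
        }
        // Otherwise, let socket.io's built-in reconnection try once more.
    });

    return socket;
}
```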
Good spot for a breakpoint; observe that it hits twice and this.reconnectAttempts is still 0:
https://github.com/microsoft/FluidFramework/pull/8820/files#diff-2a14ac91cb7e71182f4e83e6ab2f9c98d76eec15bd23f62376c441e87f31cdd0R399