
Transitioning to offline does not overtly fail reconnect attempt, causing reconnect to delay by up to 22 seconds #10325

Closed
ChumpChief opened this issue May 18, 2022 · 6 comments · Fixed by #10378

@ChumpChief
Contributor

This issue repros against FRS specifically, and probably against anything using the R11s driver (the service must be remote for this repro -- a locally hosted service like Tinylicious won't repro it since no network connection is required). I can repro it using the Brainstorm demo, updated to Fluid 0.59.2001.

In the dev tools Network tab, change throttling to "Offline", then change it back to "No throttling". The client remains disconnected for about 22 seconds.

Root cause:
Upon transitioning to offline, we immediately attempt a reconnect (which is obviously going to fail, since we have no network). Upon receiving the "connect_error" from socket.io, we permit socket.io to try its built-in reconnection once. We expect a "reconnect_attempt" event when this happens, so that on a second "connect_error" we call fail() and overtly error out. However, for some reason the "reconnect_attempt" handler is not firing, so the second call to the "connect_error" handler just quietly returns without calling fail(), since this.reconnectAttempts is still 0. From the documentation it looks like "reconnect_attempt" may be expected to fire on the Manager rather than the Socket, but I don't know for sure that this is the issue.
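
For illustration only (this is not the driver's actual code), here is a minimal socket.io-client v4 sketch of the suspected Manager-vs-Socket distinction: a "reconnect_attempt" listener registered on the Socket never fires, while the same listener on the Manager (socket.io) does. The URL is a placeholder.

```ts
import { io } from "socket.io-client";

// Placeholder endpoint, single built-in reconnect attempt (mirrors the scenario above).
const socket = io("https://example.invalid", { reconnectionAttempts: 1 });

// Fires on the Socket for every failed connection attempt.
socket.on("connect_error", (err) => {
    console.log("connect_error:", err.message);
});

// In socket.io-client v4, reconnection lifecycle events moved to the Manager,
// so a Socket-level listener like this never runs and a counter it increments stays at 0.
socket.on("reconnect_attempt", () => {
    console.log("socket-level reconnect_attempt (not reached in v4)");
});

// The event fires here instead.
socket.io.on("reconnect_attempt", (attempt) => {
    console.log("manager-level reconnect_attempt:", attempt);
});
```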

Because the connection attempt is stuck in limbo, it doesn't fail until the orderingServiceHandshakeTimeout fires 22 seconds later, at which point the retry loop finally operates as expected.

There are a few improvements that could be considered for this flow, but probably the most straightforward fix would be to ensure the "connect_error" handler fails loudly after the second try.

A good spot for a breakpoint -- observe that it hits twice while this.reconnectAttempts is still 0:
https://github.com/microsoft/FluidFramework/pull/8820/files#diff-2a14ac91cb7e71182f4e83e6ab2f9c98d76eec15bd23f62376c441e87f31cdd0R399

@ghost ghost added the triage label May 18, 2022
@ChumpChief
Contributor Author

@znewton would you be able to take a look at this please?

@ChumpChief ChumpChief changed the title Transitioning to offline does not overtly fail, causing reconnect to delay by up to 22 seconds Transitioning to offline does not overtly fail reconnect attempt, causing reconnect to delay by up to 22 seconds May 18, 2022
@gamishra434 gamishra434 self-assigned this May 20, 2022
@gamishra434
Contributor

This issue has existed since the inception of R11s/FRS. It will be taken care of as part of the disaster recovery work.

@ChumpChief
Contributor Author

> This issue has existed since the inception of R11s/FRS. It will be taken care of as part of the disaster recovery work.

The PR I linked that introduced the issue was merged this past February, so it's more recent than that.

@znewton znewton self-assigned this May 20, 2022
@znewton
Contributor

znewton commented May 20, 2022

Thanks for clarifying @ChumpChief. I misread this at first glance. I'm working on a fix now. I believe the root of the issue is that a TransportError is treated as always resolvable via long-polling failover. However, your example shows this is clearly not true, since a TransportError is also thrown when "Offline". The easy fix is to add a check using the isOnline() method from driver-utils. I'm trying a couple of other approaches first, but I've confirmed that isOnline() works locally.
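
As a rough sketch of that guard (not the merged change), and assuming the isOnline()/OnlineStatus exports from @fluidframework/driver-utils behave as described here (they consult navigator.onLine), the check could look roughly like this:

```ts
import { isOnline, OnlineStatus } from "@fluidframework/driver-utils";

// Hypothetical helper name, for illustration only: before treating a TransportError
// as recoverable via long-polling fallback, check whether the browser even reports
// a network connection.
function shouldTryPollingFallback(): boolean {
    if (isOnline() === OnlineStatus.Offline) {
        // No network at all -- falling back to long polling cannot help,
        // so fail fast and let the caller's retry/backoff logic run.
        return false;
    }
    // Otherwise the transport error may be a websocket-specific problem
    // (proxy, firewall, etc.) that long polling can work around.
    return true;
}
```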

@ChumpChief
Contributor Author

Thanks Zach! Do you have thoughts on why the "reconnect_attempt" handler isn't running in the above repro? I'm concerned that the reconnect counting logic isn't working as expected in this case -- better isOnline() checking would let us avoid running that logic, but in cases where the reconnect counting logic does need to run, I think this would still repro.

@znewton
Contributor

znewton commented May 20, 2022

Yep, I had the same concern, so I moved away from using "reconnect_attempt" and started tracking the "connect_error" count instead. It works for both the offline/online scenario and the polling fallback. Opening a PR now.
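
A hedged sketch of that approach (illustrative names and endpoint, not the exact code from the PR): count "connect_error" events directly and fail loudly once socket.io's single built-in retry has also failed, rather than relying on a "reconnect_attempt" listener to bump a counter.

```ts
import { io } from "socket.io-client";

// Placeholder endpoint and fail() handler, purely for illustration.
const socket = io("https://ordering.example.invalid", { reconnectionAttempts: 1 });

const fail = (error: Error): void => {
    socket.disconnect();
    console.error("Connection failed, surfacing error to the retry loop:", error.message);
};

let connectErrorCount = 0;

socket.on("connect_error", (error) => {
    connectErrorCount++;
    if (connectErrorCount === 1) {
        // First failure: allow socket.io's single built-in reconnect attempt
        // (which may fall back to long polling) to run.
        return;
    }
    // Second failure: the fallback attempt also failed (or the client is simply
    // offline), so fail loudly instead of waiting ~22 seconds for the
    // ordering-service handshake timeout to fire.
    fail(error);
});
```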
