Infinite reconnect loop can occur due to fatal error that should really close the container #8411

Closed
Tracked by #7995
markfields opened this issue Nov 23, 2021 · 14 comments

@markfields
Member

markfields commented Nov 23, 2021

Some examples we've seen in telemetry:

  • Attempting to transmit a message larger than socket.io permits results in the connection closing; we then reconnect and try the same op again, with the same result.
  • The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem.
    • This shouldn't happen, and I don't have a record or memory of what was going on here, so I may have been mistaken to include this example.
  • A server outage results in 500s on the JoinSession call, and we never bail out, leaving the user with the Reconnecting banner but no real hope of reconnecting.

One fix would be a mechanism that watches for us hitting the same error N times in a row and, if so, closes the container.
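
A minimal sketch of what that watchdog could look like, in TypeScript. The names here (RepeatedErrorWatchdog, the closeContainer callback) are illustrative only, not existing Fluid APIs:

```ts
// Illustrative sketch only -- the class and callback names are hypothetical,
// not part of the Fluid Framework API.
interface ReconnectError {
    errorType: string;
    message: string;
}

class RepeatedErrorWatchdog {
    private lastErrorKey: string | undefined;
    private repeatCount = 0;

    constructor(
        private readonly maxRepeats: number,
        private readonly closeContainer: (error: ReconnectError) => void,
    ) {}

    /** Call this on every failed connection attempt. */
    public onReconnectError(error: ReconnectError): void {
        const key = `${error.errorType}:${error.message}`;
        this.repeatCount = key === this.lastErrorKey ? this.repeatCount + 1 : 1;
        this.lastErrorKey = key;

        // If the same error keeps recurring, give up and close the container
        // instead of reconnecting forever.
        if (this.repeatCount >= this.maxRepeats) {
            this.closeContainer(error);
        }
    }
}
```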

@ghost ghost added the triage label Nov 23, 2021
@markfields markfields added the area: loader Loader related issues label Nov 23, 2021
@markfields markfields self-assigned this Nov 23, 2021
@markfields markfields added this to the Next milestone Nov 23, 2021
@ghost ghost removed the triage label Nov 23, 2021
@markfields
Member Author

The general symptom in a session is a DeltaConnectionFailureToConnect event (including the error that occurred) followed by repeated JoinSession_end events, as we try again and again to establish a connection.

Another error we're seeing hit this pattern is very generic: simply a "TransportError" received via the connect_error event on the socket. See #8416 and, internally to Msft, incident 275888262.

@andre4i
Contributor

andre4i commented Dec 3, 2021

Thanks, @markfields.

This is tracked by #8179
#7599
#7545

(See some investigations here: #7599 (comment))

I have a PR open to track the message size, but I'd prefer we had feature gates to enable it.

@markfields
Member Author

@andre4i - It seems you're speaking of just the one example?

Any concerns with a world where (until that's fixed) the container closes after some time, rather than being stuck in a disconnected state?

@jvoels
Contributor

jvoels commented Dec 15, 2021

For the odsp-related consideration: "The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem."

The service should perhaps convey "non-retriable" for this type of issue. Or the driver, if it runs under the assumption that things are retriable by default, may need to handle this specific error scenario differently.

Outwardly, I imagine UX experiences will usually aspire to a "seamless reconnect", so the disconnected-but-open container state supports this.

So I would be concerned about pushing retry logic up the stack when detailed knowledge of what should be considered fatal sits closer to the driver.
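
To illustrate that split, here is a rough sketch in which the driver classifies the failure and the layer above simply respects a canRetry flag. The interfaces and callbacks below are stand-ins assumed for illustration, not the actual Fluid driver contracts:

```ts
// Sketch only: driver-owned classification, loader-level handling.
interface DriverError extends Error {
    readonly errorType: string;
    readonly canRetry: boolean;
}

function handleConnectionFailure(
    error: DriverError,
    reconnect: () => void,
    closeContainer: (error: DriverError) => void,
): void {
    if (error.canRetry) {
        // Transient failure: keep the container open and attempt a seamless reconnect.
        reconnect();
    } else {
        // The driver has the detailed knowledge that this is fatal (e.g. a token
        // with the wrong audience); surface that by closing the container.
        closeContainer(error);
    }
}
```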

@markfields
Member Author

@jvoels good comment; yes, we should take a look at whether these errors are correctly marked as retryable, and whether that's being respected (spoiler alert: in some cases it's not, see #8570).

@vladsud
Contributor

vladsud commented Mar 1, 2022

I believe this issue is fully addressed by @andre4i and should be closed with his PR.

@andre4i
Contributor

andre4i commented Mar 1, 2022

#9243

@andre4i andre4i closed this as completed Mar 1, 2022
@markfields
Member Author

I think it's ok to close, but for the record -- Andrei's fix only addresses the case where the connection can't take hold because of a pending op in the runtime. The case where the host is consistently giving a bad token would not be addressed by that fix.

I'm ok to be data-driven and see what kinds of issues come up in livesite for Loops - or issue reports from others using the ecosystem.

cc @vladsud @andre4i

@vladsud
Contributor

vladsud commented Mar 7, 2022

I think a wrong token will not result in infinite reconnects, as we will get a 403/401. These errors are retried once (with a new token) and then result in a non-recoverable error that closes the container. At least that's how it worked and should work.

Note that it may take one cycle of reconnects to materialize, as we do not treat 403/401 errors as critical when received on disconnect; we always try to reconnect, and errors in the connection flow are critical.
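
A rough sketch of that retry-once-with-a-fresh-token flow. The getToken / connect / closeContainer names are stand-ins for illustration, not the real driver code:

```ts
// Sketch: retry a 401/403 exactly once with a refreshed token, then give up.
async function connectWithAuthRetry(
    getToken: (refresh: boolean) => Promise<string>,
    connect: (token: string) => Promise<void>,
    closeContainer: (error: unknown) => void,
): Promise<void> {
    try {
        await connect(await getToken(/* refresh */ false));
    } catch (firstError) {
        if (!isAuthError(firstError)) {
            throw firstError;
        }
        try {
            // 401/403: retry exactly once with a freshly requested token.
            await connect(await getToken(/* refresh */ true));
        } catch (secondError) {
            // The fresh token failed too (e.g. wrong audience) -- treat as
            // non-recoverable and close rather than loop.
            closeContainer(secondError);
        }
    }
}

function isAuthError(error: unknown): boolean {
    const status = (error as { statusCode?: number }).statusCode;
    return status === 401 || status === 403;
}
```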

@markfields
Member Author

Hmmm, I think you're right. I wonder what I saw in telemetry; too bad I didn't link to the logs or provide more details.

@markfields
Member Author

The other case recently was around a service outage where JoinSession returned 500 every time. We may want to consider a loader-level mitigation that would force the container to be read-only if we get stuck in this way.
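
A hypothetical sketch of such a mitigation; forceReadonly and the failure-count threshold are illustrative assumptions, not documented loader APIs:

```ts
// Sketch only: after repeated JoinSession failures, force read-only mode
// instead of retrying forever behind a "Reconnecting" banner.
function makeJoinSessionFailureHandler(
    maxConsecutiveFailures: number,
    forceReadonly: () => void, // hypothetical hook
): (succeeded: boolean) => void {
    let consecutiveFailures = 0;
    return (succeeded: boolean) => {
        consecutiveFailures = succeeded ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            // The user keeps their local view of the document but stops
            // seeing a misleading "Reconnecting" treatment.
            forceReadonly();
        }
    };
}
```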

@vladsud
Contributor

vladsud commented Mar 7, 2022

Yes, the 500s might be worth following up on. I think we still want to keep reconnecting, but maybe with a longer back-off period, and maybe eventually (after a couple of minutes of retries) bail out. Another way to say it: I think the policy here should be slightly different from what we do in other cases, but I'm not sure we have (today) enough data at the container / container runtime layer to differentiate.
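
Sketching that policy (illustrative only, not what the loader does today): keep retrying with a growing back-off, but give up once a total retry budget is exhausted:

```ts
// Sketch: exponential back-off with an overall retry deadline (~2 minutes here).
async function retryWithBackoff(
    attempt: () => Promise<void>,
    maxTotalMs = 2 * 60 * 1000,
    initialDelayMs = 1000,
): Promise<void> {
    const start = Date.now();
    let delayMs = initialDelayMs;
    for (;;) {
        try {
            await attempt();
            return;
        } catch (error) {
            if (Date.now() - start > maxTotalMs) {
                // Retry budget exhausted -- surface the error so the caller can bail out.
                throw error;
            }
            await new Promise((resolve) => setTimeout(resolve, delayMs));
            delayMs = Math.min(delayMs * 2, 30_000); // grow the back-off, capped at 30s
        }
    }
}
```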

@markfields
Member Author

The 500s case is tracked in this internal PBI: https://dev.azure.com/office/OC/_workitems/edit/5828431/

@markfields
Member Author

Opened #9560 as a way to better address the 500s case.
