Infinite reconnect loop can occur due to fatal error that should really close the container #8411

Closed
Tracked by #7995
markfields opened this issue Nov 23, 2021 · 14 comments

@markfields
Member

markfields commented Nov 23, 2021

Some examples we've seen in telemetry:

  • Attempting to transmit a message larger than socket.io permits results in the connection closing; we then reconnect and try the same op again, with the same result.
  • The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem.
    • This shouldn't happen, and I don't have a record or memory of what was going on here, so I may have been mistaken to include this example.
  • A server outage results in 500s on the JoinSession call, and we never bail out, leaving the user with the Reconnecting banner but no real hope of reconnecting.

One fix would be a mechanism that watches for us hitting the same error N times in a row and, if so, closes the container.
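
A minimal sketch of what that watchdog could look like, in TypeScript. The names here (RepeatedErrorWatchdog, the closeContainer callback) are illustrative only, not existing Fluid APIs:

```ts
// Illustrative sketch only -- the class and callback names are hypothetical,
// not part of the Fluid Framework API.
interface ReconnectError {
    errorType: string;
    message: string;
}

class RepeatedErrorWatchdog {
    private lastErrorKey: string | undefined;
    private repeatCount = 0;

    constructor(
        private readonly maxRepeats: number,
        private readonly closeContainer: (error: ReconnectError) => void,
    ) {}

    /** Call this on every failed connection attempt. */
    public onReconnectError(error: ReconnectError): void {
        const key = `${error.errorType}:${error.message}`;
        this.repeatCount = key === this.lastErrorKey ? this.repeatCount + 1 : 1;
        this.lastErrorKey = key;

        // If the same error keeps recurring, give up and close the container
        // instead of reconnecting forever.
        if (this.repeatCount >= this.maxRepeats) {
            this.closeContainer(error);
        }
    }
}
```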

@ghost ghost added the triage label Nov 23, 2021
@markfields markfields added the area: loader Loader related issues label Nov 23, 2021
@markfields markfields self-assigned this Nov 23, 2021
@markfields markfields added this to the Next milestone Nov 23, 2021
@ghost ghost removed the triage label Nov 23, 2021
@markfields
Member Author

The general symptom in a session is a DeltaConnectionFailureToConnect event (including the error that occurred) followed by repeated JoinSession_end events, as we try again and again to establish a connection.

Another error we're seeing hit this pattern is very generic: simply a "TransportError" received via the connect_error event on the socket. See #8416 and, internally to Msft, incident 275888262.

@andre4i
Contributor

andre4i commented Dec 3, 2021

Thanks, @markfields.

This is tracked by #8179
#7599
#7545

(See some investigations here: #7599 (comment))

I have a PR open to track the message size, but I'd prefer we had feature gates to enable it.

@markfields
Member Author

@andre4i - It seems you're speaking of just the one example?

Any concerns with a world where (until that's fixed) the container closes after some time, rather than being stuck in a disconnected state?

@jvoels
Contributor

jvoels commented Dec 15, 2021

For the odsp-related consideration: "The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem."

The service should perhaps convey "non-retriable" for this type of issue. Or the driver, if it runs under the assumption that things are retriable by default, may need to handle this specific error scenario differently.

Outwardly, I imagine UX experiences will usually aspire to a "seamless reconnect", so the disconnected-but-open container state supports this.

So I would be concerned about pushing retry logic up the stack when detailed knowledge of what should be considered fatal sits closer to the driver.
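
To illustrate that split, here is a rough sketch in which the driver classifies the failure and the layer above simply respects a canRetry flag. The interfaces and callbacks below are stand-ins assumed for illustration, not the actual Fluid driver contracts:

```ts
// Sketch only: driver-owned classification, loader-level handling.
interface DriverError extends Error {
    readonly errorType: string;
    readonly canRetry: boolean;
}

function handleConnectionFailure(
    error: DriverError,
    reconnect: () => void,
    closeContainer: (error: DriverError) => void,
): void {
    if (error.canRetry) {
        // Transient failure: keep the container open and attempt a seamless reconnect.
        reconnect();
    } else {
        // The driver has the detailed knowledge that this is fatal (e.g. a token
        // with the wrong audience); surface that by closing the container.
        closeContainer(error);
    }
}
```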

@markfields
Member Author

@jvoels good comment; yes, we should take a look at whether these errors are correctly marked as retryable, and whether that's being respected (spoiler alert: in some cases it's not, see #8570).

@vladsud
Contributor

vladsud commented Mar 1, 2022

I believe this issue is fully addressed by @andre4i and should be closed with his PR.

@andre4i
Contributor

andre4i commented Mar 1, 2022

#9243

@andre4i andre4i closed this as completed Mar 1, 2022
@markfields
Member Author

I think it's ok to close, but for the record -- Andrei's fix only addresses the case where the connection can't take hold because of a pending op in the runtime. The case where the host is consistently giving a bad token would not be addressed by that fix.

I'm ok to be data-driven and see what kinds of issues come up in livesite for Loops - or issue reports from others using the ecosystem.

cc @vladsud @andre4i

@vladsud
Contributor

vladsud commented Mar 7, 2022

I think a wrong token will not result in infinite reconnects, as we will get a 403/401. These errors are retried once (with a new token) and then result in a non-recoverable error that closes the container. At least that's how it worked and should work.

Note that it may take one cycle of reconnects to materialize, as we do not treat 403/401 errors as critical when received on disconnect; we always try to reconnect, and errors in the connection flow are critical.
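
A rough sketch of that retry-once-with-a-fresh-token flow. The getToken / connect / closeContainer names are stand-ins for illustration, not the real driver code:

```ts
// Sketch: retry a 401/403 exactly once with a refreshed token, then give up.
async function connectWithAuthRetry(
    getToken: (refresh: boolean) => Promise<string>,
    connect: (token: string) => Promise<void>,
    closeContainer: (error: unknown) => void,
): Promise<void> {
    try {
        await connect(await getToken(/* refresh */ false));
    } catch (firstError) {
        if (!isAuthError(firstError)) {
            throw firstError;
        }
        try {
            // 401/403: retry exactly once with a freshly requested token.
            await connect(await getToken(/* refresh */ true));
        } catch (secondError) {
            // The fresh token failed too (e.g. wrong audience) -- treat as
            // non-recoverable and close rather than loop.
            closeContainer(secondError);
        }
    }
}

function isAuthError(error: unknown): boolean {
    const status = (error as { statusCode?: number }).statusCode;
    return status === 401 || status === 403;
}
```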

@markfields
Member Author

Hmmm, I think you're right. I wonder what I saw in telemetry; too bad I didn't link to the logs or provide more details.

@markfields
Member Author

The other case recently was around a service outage where JoinSession returned 500 every time. We may want to consider a loader-level mitigation that would force the container to be read-only if we get stuck in this way.
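
A hypothetical sketch of such a mitigation; forceReadonly and the failure-count threshold are illustrative assumptions, not documented loader APIs:

```ts
// Sketch only: after repeated JoinSession failures, force read-only mode
// instead of retrying forever behind a "Reconnecting" banner.
function makeJoinSessionFailureHandler(
    maxConsecutiveFailures: number,
    forceReadonly: () => void, // hypothetical hook
): (succeeded: boolean) => void {
    let consecutiveFailures = 0;
    return (succeeded: boolean) => {
        consecutiveFailures = succeeded ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            // The user keeps their local view of the document but stops
            // seeing a misleading "Reconnecting" treatment.
            forceReadonly();
        }
    };
}
```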

@vladsud
Contributor

vladsud commented Mar 7, 2022

Yes, the 500s might be worth following up on. I think we still want to keep reconnecting, but maybe with a longer back-off period, and maybe eventually (after a couple of minutes of retries) bail out. Another way to say it: I think the policy here should be slightly different from what we do in other cases, but I'm not sure we have (today) enough data at the container / container runtime layer to differentiate.
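
Sketching that policy (illustrative only, not what the loader does today): keep retrying with a growing back-off, but give up once a total retry budget is exhausted:

```ts
// Sketch: exponential back-off with an overall retry deadline (~2 minutes here).
async function retryWithBackoff(
    attempt: () => Promise<void>,
    maxTotalMs = 2 * 60 * 1000,
    initialDelayMs = 1000,
): Promise<void> {
    const start = Date.now();
    let delayMs = initialDelayMs;
    for (;;) {
        try {
            await attempt();
            return;
        } catch (error) {
            if (Date.now() - start > maxTotalMs) {
                // Retry budget exhausted -- surface the error so the caller can bail out.
                throw error;
            }
            await new Promise((resolve) => setTimeout(resolve, delayMs));
            delayMs = Math.min(delayMs * 2, 30_000); // grow the back-off, capped at 30s
        }
    }
}
```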

@markfields
Member Author

The 500s case is tracked in this internal PBI: https://dev.azure.com/office/OC/_workitems/edit/5828431/

@markfields
Member Author

Opened #9560 as a way to better address the 500s case.
