Infinite reconnect loop can occur due to fatal error that should really close the container #8411
Comments
General symptom in a session is to see […]. Another error we're seeing hit this pattern is very generic: simply a "TransportError" received via the […]
Thanks, @markfields. This is tracked by #8179 (see some investigations here: #7599 (comment)). I have a PR open to track the message size, but I'd prefer we have feature gates to enable it.
@andre4i - You're speaking of just the one example, it seems? Any concerns with a world where (until that's fixed) the container closes after some time, rather than being stuck in a disconnected state?
For the ODSP-related consideration "The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem.": the service perhaps should be conveying "non-retriable" for this type of issue. Or the driver, if it runs under the assumption that things are retriable by default, may need to deal with this specific error scenario differently. Outwardly, I imagine the UX will usually aspire to a "seamless reconnect", so the disconnected-but-open container state supports this. So I would be concerned about pushing retry logic up the stack, when detailed knowledge of what should be considered fatal or not lives closer to the driver.
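This kind of decision could live in the driver's error classification. Below is a minimal TypeScript sketch of what that might look like; the `DriverError` shape and `classifyConnectionError` helper are illustrative assumptions, not the actual Fluid Framework driver API.

```typescript
// Hypothetical sketch of driver-side error classification; names are
// illustrative, not the real Fluid driver contract.
interface DriverError {
    readonly message: string;
    readonly statusCode?: number;
    // When false, upper layers should close the container instead of retrying.
    readonly canRetry: boolean;
}

function classifyConnectionError(statusCode: number | undefined, message: string): DriverError {
    switch (statusCode) {
        case 401:
        case 403:
            // Auth errors: retriable once with a fresh token; the caller tracks
            // whether a refreshed token was already attempted.
            return { message, statusCode, canRetry: true };
        case 404:
            // Document no longer exists -- nothing to retry against.
            return { message, statusCode, canRetry: false };
        default:
            // Unknown/transport errors default to retriable, which is exactly
            // what makes the "same fatal error forever" loop possible.
            return { message, statusCode, canRetry: true };
    }
}
```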
I believe this issue is fully addressed by @andre4i and should be closed with his PR.
I think it's ok to close, but for the record -- Andrei's fix only addresses the case where the connection can't take hold because of a pending op in the runtime. The case where the host is consistently giving a bad token would not be addressed by that fix. I'm ok to be data-driven and see what kinds of issues come up in livesite for Loops - or issue reports from others using the ecosystem.
I think a wrong token will not result in infinite reconnects, as we will get a 403/401. These errors are retried once (with a new token), and then a non-recoverable error closes the container. At least that's how it worked and should work. Note that it may take one cycle of reconnects to materialize, as we do not treat 403/401 errors as critical when received on disconnect - we always try to reconnect, and errors on the connection flow are critical.
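For illustration, here is a sketch of that retry-once-with-a-fresh-token policy, using assumed `connect`/`getToken`/`closeContainer` callbacks rather than the real loader internals:

```typescript
// Sketch of "retry auth errors once with a refreshed token, then close".
// All parameter names are assumptions for illustration.
async function connectWithTokenRetry(
    connect: (token: string) => Promise<void>,
    getToken: (refresh: boolean) => Promise<string>,
    closeContainer: (error: Error) => void,
): Promise<void> {
    try {
        await connect(await getToken(/* refresh */ false));
    } catch (firstError: any) {
        if (firstError?.statusCode !== 401 && firstError?.statusCode !== 403) {
            throw firstError; // non-auth errors follow the normal reconnect path
        }
        try {
            // One retry with a forcibly refreshed token.
            await connect(await getToken(/* refresh */ true));
        } catch (secondError: any) {
            // Same auth failure even with a fresh token: treat as non-recoverable.
            closeContainer(secondError);
        }
    }
}
```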
Hmmm, I think you're right. I wonder what I saw in telemetry; too bad I didn't link to the logs or provide more details.
The other case recently was around a service outage where JoinSession returned 500 every time. We may want to consider a loader-level mitigation that would force the container to be read-only if we get stuck in this way.
Yes, the 500 case might be worth following up on. I think we still want to keep reconnecting, but maybe with a longer back-off period, and maybe eventually (after a couple of minutes of retries) bail out. Maybe another way to say it: I think the policy here should be slightly different from what we do in other cases, but I'm not sure we have (today) enough data at the container / container runtime layer to differentiate.
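A rough sketch of what that "longer back-off, then bail out" policy could look like; all names and thresholds here are assumptions for illustration, not existing Fluid APIs:

```typescript
// Reconnect loop with growing back-off and an overall deadline, roughly
// matching "retry for a couple of minutes, then give up" for persistent 500s.
async function reconnectWithBackoff(
    tryConnect: () => Promise<void>,
    closeContainer: (error: Error) => void,
    maxTotalMs = 2 * 60 * 1000, // assumed overall budget of ~2 minutes
): Promise<void> {
    const start = Date.now();
    let delayMs = 1000;
    for (;;) {
        try {
            await tryConnect();
            return; // connected
        } catch (error: any) {
            if (Date.now() - start > maxTotalMs) {
                closeContainer(error); // or switch the container to read-only
                return;
            }
            await new Promise((resolve) => setTimeout(resolve, delayMs));
            delayMs = Math.min(delayMs * 2, 30_000); // exponential back-off, capped
        }
    }
}
```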
The 500s case is tracked in this internal PBI: https://dev.azure.com/office/OC/_workitems/edit/5828431/
Opened #9560 as a way to better address the 500s case.
Some examples we've seen in telemetry:
The host is providing tokens with the incorrect audience. The server disconnects every time, but the new token we get on reconnect has the same problem.

One fix would be a mechanism to watch and see if we're hitting the same error N times in a row and close the container.
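A minimal sketch of such a "same error N times in a row" watchdog; the class and the error-bucketing key are hypothetical, not part of the existing codebase:

```typescript
// Tracks consecutive identical connection errors and closes the container
// once the same error has been seen maxRepeats times in a row.
class RepeatedErrorWatchdog {
    private lastKey: string | undefined;
    private count = 0;

    constructor(
        private readonly maxRepeats: number,
        private readonly close: (error: Error) => void,
    ) {}

    public onConnectionError(error: Error & { errorType?: string }): void {
        // Bucket errors by type + message; a long run of identical errors
        // suggests a fatal condition that reconnecting will not fix.
        const key = `${error.errorType ?? "unknown"}:${error.message}`;
        this.count = key === this.lastKey ? this.count + 1 : 1;
        this.lastKey = key;
        if (this.count >= this.maxRepeats) {
            this.close(error);
        }
    }
}
```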