
A lot of disconnects / summary behind: need a message size limit to avoid hitting socket.io limits? #8179

Closed
vladsud opened this issue Nov 9, 2021 · 7 comments
Labels: bug (Something isn't working)

@vladsud (Contributor) commented Nov 9, 2021

union Office_Fluid_FluidRuntime_*
| where Data_loaderVersion contains "0.49"
| where Event_Time > ago(3d)
| where Data_eventName == "fluid:telemetry:SummaryStatus:Behind"
| summarize count() by Data_docId, Session_Id

This shows 3 docs that hit a lot of these errors, pointing to these 3 Session_Ids:
fe1b81e7-dcd1-4acd-9b6e-368733ff45cb
246d8bdd-b804-4509-929b-c6ce707925fb
1e5c288e-ef76-4822-8640-327f50fee582

In each of the cases we see rapid disconnects with "Disconnect: socket.io (disconnect): transport close" reason - example:

union Office_Fluid_FluidRuntime_*
| where Session_Id == "246d8bdd-b804-4509-929b-c6ce707925fb"
| where Data_eventName contains "ConnectionState"
| project Event_Sequence, Event_Time, Data_eventName, Data_error, Data_reason, Data_duration, Data_pendingClientId, Data_socketDocumentId

I assume this is mostly you testing socket.io limits.
If so, should we put in a simple fix for now (while we get to the bottom of the long-term plan) to basically fail fast when the message payload is above 90% of socket.io's current / default limit?
That would give us much cleaner telemetry and seems like the right thing to have no matter what we do next in this area.
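
Roughly, such a guard could look like the sketch below (TypeScript; the constant and function names are illustrative, not actual Fluid Framework APIs, and it assumes socket.io's v3+ default maxHttpBufferSize of 1 MB):

// Sketch of the proposed fail-fast guard; names are hypothetical.
// Assumes socket.io's v3+ default maxHttpBufferSize of 1e6 bytes.
const SOCKET_IO_DEFAULT_MAX_PAYLOAD = 1_000_000; // bytes
const SAFETY_FACTOR = 0.9;

function assertPayloadWithinLimit(message: unknown): void {
    // Measuring the size requires stringifying the payload (perf cost discussed below).
    const sizeInBytes = new TextEncoder().encode(JSON.stringify(message)).length;
    if (sizeInBytes > SOCKET_IO_DEFAULT_MAX_PAYLOAD * SAFETY_FACTOR) {
        // Fail fast with a descriptive error instead of letting the server
        // silently drop the transport ("transport close" loops in telemetry).
        throw new Error(`Payload of ${sizeInBytes} bytes exceeds 90% of the socket.io message size limit`);
    }
}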

@pleath - FYI, in case you hear reports of summary errors.

@vladsud vladsud added the bug Something isn't working label Nov 9, 2021
@vladsud vladsud added this to the November 2021 milestone Nov 9, 2021
@andre4i (Contributor) commented Nov 9, 2021

Thanks, @vladsud

I had a feeling that some of the "summary behind" issues I saw during my OCE shift were due to payload size, as inspecting the logs showed a similar reconnect loop. Yes, this is tied to #7599 and #7545, and indeed I did test this against ODSP.

should we put in a simple fix for now (while we get to the bottom of the long-term plan) to basically fail fast when the message payload is above 90% of socket.io's current / default limit?

Let's chat about this. #7987 would track the payload size; however, I'd do this in stages and behind a localStorage flag first, mostly due to the unknown perf implications of stringifying/bufferizing the payload.
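
A sketch of what that staging could look like (the flag name is hypothetical; any feature-gating mechanism would do), reusing the guard sketched above:

function isPayloadSizeCheckEnabled(): boolean {
    try {
        // Hypothetical flag name, for illustration only.
        return localStorage.getItem("Fluid.enablePayloadSizeCheck") === "1";
    } catch {
        return false; // localStorage unavailable (Node, privacy modes, etc.)
    }
}

function maybeAssertPayloadWithinLimit(message: unknown): void {
    if (isPayloadSizeCheckEnabled()) {
        assertPayloadWithinLimit(message); // from the earlier sketch
    }
}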

@vladsud (Contributor, Author) commented Nov 9, 2021

Might be worth taking a look at how this is handled by socket.io.
It looks like, in Node at least, there is a limit on the websocket itself, and it generates a nice error:
https://github.com/websockets/ws/blob/26a46f33ffc74ad91d4adc7662489c79e64032c0/lib/receiver.js#L372
It's not clear whether socket.io sets the websocket limit (or uses its own heuristic), or whether the websocket error is available anywhere.
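
For reference, socket.io does expose a server-side limit as the maxHttpBufferSize option (1 MB by default in v3+); whether that value is forwarded down to the ws maxPayload linked above is part of what would need confirming. A minimal config sketch:

import { Server } from "socket.io";

// maxHttpBufferSize caps how many bytes a single message may contain
// before the server tears down the connection.
const io = new Server(3000, {
    maxHttpBufferSize: 1_000_000, // bytes; socket.io v3+ default is 1e6
});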

But I confirmed with Gary (for one of the sessions) that PUSH sees the same thing: just transport disconnects.

@andre4i (Contributor) commented Nov 9, 2021

My hypothesis is that the limit is indeed at the websocket level, but somewhere downstream the error is discarded. I could not wire into any exposed error handler that would give us the proper error. I will attempt another debugging session on the server side to see where (or if) the error gets lost.

@andre4i (Contributor) commented Nov 9, 2021

I would prefer to find a way to hook into that error directly and fail the op. That way, we don't even need to inspect the message size and can defer this validation entirely to the socket.

@andre4i (Contributor) commented Nov 9, 2021

@vladsud

When the limit is exceeded, the server socket gets a proper error:

message:'RangeError: Max payload size exceeded'
stack:'Error: RangeError: Max payload size exceeded\n    at WebSocket.onError (...)
type:'TransportError'

at https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L170

In turn, this calls onclose (https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L379), which further propagates the error to the close event of the socket. The server binds the close event for the socket as follows: https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/server.ts#L388.

What's puzzling to me is why a close event handler for the socket, bound inside tinylicious, does not fire in this case. My guess is that either something along the way unbinds the event handlers, or there is some sort of race condition in which the transport close (which happens before the close event gets emitted) manages to send the transport close message first (see https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L215).
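
One thing worth trying in that debugging session: listening on the raw engine.io socket to see whether the close reason and the RangeError make it that far. An untested sketch (event signatures taken from the engine.io source linked above):

import { Server } from "socket.io";

const io = new Server(3000);

// Debugging hook: log the close reason/description on the raw engine.io
// connection to see where the "Max payload size exceeded" error gets
// lost on its way to the application-level handlers.
io.engine.on("connection", (rawSocket: any) => {
    rawSocket.on("close", (reason: string, description?: Error) => {
        console.log("engine.io socket closed:", reason, description?.message);
    });
});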

@andre4i (Contributor) commented Dec 27, 2021

See socketio/socket.io-client#1518

@andre4i (Contributor) commented Mar 1, 2022

#9243 should serve as a fix for this. Considering we have multiple other issues tracking the same thing, this one is safe to close.

@andre4i andre4i closed this as completed Mar 1, 2022