
A lot of disconnects / summary behind: need a message size limit to avoid hitting socket.io limits? #8179

Closed
vladsud opened this issue Nov 9, 2021 · 7 comments
Labels: bug (Something isn't working)

@vladsud (Contributor) commented Nov 9, 2021

union Office_Fluid_FluidRuntime_*
| where Data_loaderVersion contains "0.49"
| where Event_Time > ago(3d)
| where Data_eventName == "fluid:telemetry:SummaryStatus:Behind"
| summarize count() by Data_docId, Session_Id

This shows 3 docs that hit a lot of these errors, pointing to these 3 Session_Ids:
fe1b81e7-dcd1-4acd-9b6e-368733ff45cb
246d8bdd-b804-4509-929b-c6ce707925fb
1e5c288e-ef76-4822-8640-327f50fee582

In each of the cases we see rapid disconnects with "Disconnect: socket.io (disconnect): transport close" reason - example:

union Office_Fluid_FluidRuntime_*
| where Session_Id == "246d8bdd-b804-4509-929b-c6ce707925fb"
| where Data_eventName contains "ConnectionState"
| project Event_Sequence, Event_Time, Data_eventName, Data_error, Data_reason, Data_duration, Data_pendingClientId, Data_socketDocumentId

I assume this is mostly you testing socket.io limits.
If so, should we put in a simple fix for now (while we get to the bottom of the long-term plan) to basically fail fast when the message payload is above 90% of socket.io's current / default limit?
That would give us much cleaner telemetry and seems like the right thing to have no matter what we do next in this area.
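
Roughly, such a guard could look like the sketch below (TypeScript; the constant and function names are illustrative, not actual Fluid Framework APIs, and it assumes socket.io's v3+ default maxHttpBufferSize of 1 MB):

// Sketch of the proposed fail-fast guard; names are hypothetical.
// Assumes socket.io's v3+ default maxHttpBufferSize of 1e6 bytes.
const SOCKET_IO_DEFAULT_MAX_PAYLOAD = 1_000_000; // bytes
const SAFETY_FACTOR = 0.9;

function assertPayloadWithinLimit(message: unknown): void {
    // Measuring the size requires stringifying the payload (perf cost discussed below).
    const sizeInBytes = new TextEncoder().encode(JSON.stringify(message)).length;
    if (sizeInBytes > SOCKET_IO_DEFAULT_MAX_PAYLOAD * SAFETY_FACTOR) {
        // Fail fast with a descriptive error instead of letting the server
        // silently drop the transport ("transport close" loops in telemetry).
        throw new Error(`Payload of ${sizeInBytes} bytes exceeds 90% of the socket.io message size limit`);
    }
}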

@pleath - FYI, in case you hear reports of summary errors.

@vladsud vladsud added the bug Something isn't working label Nov 9, 2021
@vladsud vladsud added this to the November 2021 milestone Nov 9, 2021
@andre4i (Contributor) commented Nov 9, 2021

Thanks, @vladsud

I had a feeling that some of the "summary behind" issues I saw during my OCE shift were due to payload size, as inspecting the logs showed a similar reconnect loop. Yes, this is tied to #7599 and #7545, and indeed I did test this against ODSP.

should we put in a simple fix for now (while we get to the bottom of the long-term plan) to basically fail fast when the message payload is above 90% of socket.io's current / default limit?

Let's chat about this. #7987 would track the payload size; however, I'd do this in stages and behind a localStorage flag first, mostly due to the unknown perf implications of stringifying/bufferizing the payload.
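
A sketch of what that staging could look like (the flag name is hypothetical; any feature-gating mechanism would do), reusing the guard sketched above:

function isPayloadSizeCheckEnabled(): boolean {
    try {
        // Hypothetical flag name, for illustration only.
        return localStorage.getItem("Fluid.enablePayloadSizeCheck") === "1";
    } catch {
        return false; // localStorage unavailable (Node, privacy modes, etc.)
    }
}

function maybeAssertPayloadWithinLimit(message: unknown): void {
    if (isPayloadSizeCheckEnabled()) {
        assertPayloadWithinLimit(message); // from the earlier sketch
    }
}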

@vladsud (Contributor, Author) commented Nov 9, 2021

Might be worth taking a look at how this is handled by socket.io.
It looks like, in Node at least, there is a limit on the websocket itself, and it generates a nice error:
https://github.com/websockets/ws/blob/26a46f33ffc74ad91d4adc7662489c79e64032c0/lib/receiver.js#L372
It's not clear whether socket.io sets the websocket limit (or uses its own heuristic), or whether the websocket error is available anywhere.
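
For reference, socket.io does expose a server-side limit as the maxHttpBufferSize option (1 MB by default in v3+); whether that value is forwarded down to the ws maxPayload linked above is part of what would need confirming. A minimal config sketch:

import { Server } from "socket.io";

// maxHttpBufferSize caps how many bytes a single message may contain
// before the server tears down the connection.
const io = new Server(3000, {
    maxHttpBufferSize: 1_000_000, // bytes; socket.io v3+ default is 1e6
});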

But I confirmed with Gary (for one of the sessions) that PUSH sees the same thing: just transport disconnects.

@andre4i (Contributor) commented Nov 9, 2021

My hypothesis is that the limit is indeed at the websocket level, but somewhere downstream the error is discarded. I could not wire into any exposed error handler that would give us the proper error. I will attempt another debugging session on the server side to see where (or if) the error gets lost.

@andre4i (Contributor) commented Nov 9, 2021

I would prefer to find a way to hook into that error directly and fail the op. That way, we don't even need to inspect the message size and can defer this validation entirely to the socket.

@andre4i (Contributor) commented Nov 9, 2021

@vladsud

When the limit is exceeded, the server socket gets a proper error:

message:'RangeError: Max payload size exceeded'
stack:'Error: RangeError: Max payload size exceeded\n    at WebSocket.onError (...)
type:'TransportError'

at https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L170

In turn, this calls onclose (https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L379), which further propagates the error to the close event of the socket. The server binds the close event for the socket as follows: https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/server.ts#L388.

What's puzzling to me is why a close event handler for the socket, bound inside tinylicious, does not fire in this case. My guess is that either something along the way unbinds the event handlers, or there is some sort of race condition in which the transport close (which happens before the close event gets emitted) manages to send the transport close message first (see https://github.com/socketio/engine.io/blob/271e2df94d39bbd13c33cab98cdd5915f9d28536/lib/socket.ts#L215).
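
One thing worth trying in that debugging session: listening on the raw engine.io socket to see whether the close reason and the RangeError make it that far. An untested sketch (event signatures taken from the engine.io source linked above):

import { Server } from "socket.io";

const io = new Server(3000);

// Debugging hook: log the close reason/description on the raw engine.io
// connection to see where the "Max payload size exceeded" error gets
// lost on its way to the application-level handlers.
io.engine.on("connection", (rawSocket: any) => {
    rawSocket.on("close", (reason: string, description?: Error) => {
        console.log("engine.io socket closed:", reason, description?.message);
    });
});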

@andre4i (Contributor) commented Dec 27, 2021

See socketio/socket.io-client#1518

@andre4i (Contributor) commented Mar 1, 2022

#9243 should serve as a fix for this. Considering we have multiple other issues tracking the same thing, this one is safe to close.

@andre4i andre4i closed this as completed Mar 1, 2022