Block large payloads inside DocumentDeltaConnection, size due to the 1MB Kafka limit
#7987
…rom threshold counter, also send the max messagesize
⯅ @fluid-example/bundle-size-tests: +3.03 KB
Baseline commit: 646a987
…k into track-message-size
@markfields @vladsud please take a look at this. There is a bit of a behavior change with regard to retrying on error in the deltamanager.
(errorMessage: string) => new GenericNetworkError(
    fluidErrorCode,
    errorMessage,
    err?.canRetry === true || err?.canRetry === undefined, // unless explicitly specified, this will retry
^^ We would always retry, regardless of the type of error. Not sure whether this has always been intentional.
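To make the default concrete, here is a minimal sketch of the retry expression from the snippet above, pulled into a standalone function. `ErrorLike` and `defaultsToRetry` are hypothetical names used only for illustration; they are not part of the PR.

```typescript
// Hypothetical error shape: only the optional `canRetry` flag matters here.
interface ErrorLike {
    canRetry?: boolean;
}

// Same expression as in the diff: anything without an explicit
// `canRetry: false` is treated as retryable, including a missing error.
function defaultsToRetry(err?: ErrorLike): boolean {
    return err?.canRetry === true || err?.canRetry === undefined;
}

console.log(defaultsToRetry(undefined));           // true
console.log(defaultsToRetry({}));                  // true
console.log(defaultsToRetry({ canRetry: true }));  // true
console.log(defaultsToRetry({ canRetry: false })); // false
```

Only an explicit `canRetry: false` turns retrying off, which is why errors that never set the flag end up being retried.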
Here is the story :)
In ODSP, we definitely want to always reconnect, because we may get disconnected with a 403 due to token expiration. A 403 is a critical error (i.e., in general it's a game-over event), but across layers we always do one retry with a refreshed token to ensure the host has a chance to provide a new token.
With that said, I think that's the wrong layer to participate in this game. I.e., ODSP has to project such errors as recoverable. I'd need to look at the code to say whether that's the case. And I'm not sure about FRS.
Also, worth noting that for a long period of time we considered errors without canRetry to be recoverable. We changed that (maybe 6 months back): any exception, whether it's a bug in our code or in host code (like the token callback), is catastrophic.
So, to summarize, this code is likely wrong. But I do not think you are making it any more right :)
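As a rough illustration of the "one retry with a refreshed token" pattern described in this comment, the sketch below retries a connection exactly once when it fails with a 403. Every name here (`connectWithOneTokenRefresh`, `getToken`, `connect`, the `statusCode` property) is an assumption for illustration and does not reflect the actual ODSP driver code.

```typescript
// Hypothetical check: does the thrown value carry the given status code?
function hasStatusCode(error: unknown, code: number): boolean {
    return typeof error === "object"
        && error !== null
        && (error as { statusCode?: unknown }).statusCode === code;
}

// Hypothetical callbacks: `getToken(refresh)` returns a token, forcing a
// fresh one when `refresh` is true; `connect(token)` opens the connection.
async function connectWithOneTokenRefresh<T>(
    getToken: (refresh: boolean) => Promise<string>,
    connect: (token: string) => Promise<T>,
): Promise<T> {
    try {
        return await connect(await getToken(false));
    } catch (error) {
        // Only a 403 earns a second attempt, and only one. If the retry
        // fails as well, its error propagates to the caller.
        if (hasStatusCode(error, 403)) {
            return connect(await getToken(true));
        }
        throw error;
    }
}
```

In this shape the layer that owns the token decides that a 403 is recoverable once, so the layers above it do not need a blanket retry default.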
I don't believe there is a viable way right now for propagating non-retriable errors from the driver to the container. Maybe a TTL on the error itself? What are your thoughts?
I know you looked into it before, but I want to poke more.
None that I can find :( #8179 (comment) was the farthest I've gotten investigating/debugging this particular issue. I think the best way is to add the limit to the server (our server) based on my notes here: #7599 (comment), and push that value down using the client config. The DocumentDeltaConnection should then read that config and block; alternatively, if we fix the 1MB Kafka limit, do the fix in the runtime layer to support larger batches. The current PR is about being explicit about a limit that is 'de facto' breaking things silently.
public static sizeInBytes(message: IDocumentMessage): number {
    const { contents, ...restOfObject } = message;
    // `contents` is already stringified. Re-stringifying the whole message will
    // lead to additional escape characters which will increase the size artificially.
I think this comment suggests that what we measure is not what actually gets counted by socket.io. I.e., if socket.io stringifies the payload, then it will add all these escape characters and they will count against the limit, right?
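The gap between the two measurements can be seen in a small sketch. `MessageLike` is a simplified, hypothetical stand-in for `IDocumentMessage`, and both helper functions are illustrative only; the PR's actual `sizeInBytes` body is not shown in full above.

```typescript
// Hypothetical, simplified message shape; the real IDocumentMessage has more fields.
interface MessageLike {
    type: string;
    clientSequenceNumber: number;
    contents: string; // already JSON-stringified by the sender
}

const encoder = new TextEncoder();
const utf8Bytes = (text: string): number => encoder.encode(text).length;

// Counts `contents` as-is and stringifies only the remaining fields.
function sizeWithoutReescaping(message: MessageLike): number {
    const { contents, ...rest } = message;
    return utf8Bytes(JSON.stringify(rest)) + utf8Bytes(contents);
}

// Stringifies the whole message, so every quote inside `contents`
// gains a backslash and the string itself gains surrounding quotes.
function sizeWhenRestringified(message: MessageLike): number {
    return utf8Bytes(JSON.stringify(message));
}

const sampleMessage: MessageLike = {
    type: "op",
    clientSequenceNumber: 1,
    contents: JSON.stringify({ key: "value" }),
};

console.log(sizeWithoutReescaping(sampleMessage), sizeWhenRestringified(sampleMessage));
```

The second number is always larger, and the gap grows with the number of quotes and backslashes in `contents`. Which of the two matches the bytes on the wire depends on how the transport serializes the payload, which is the question raised in the comment above.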
This approach is currently not preferred, as the per-op overhead risks making this a performance bottleneck. We'll be exploring alternative solutions, such as a socket.io limit along with an improved error retry mechanism; the latter would also fix the current reconnect loop on socket.io errors.
See #7599
It would block message transmission and close the container, but only with an explicit feature gate.
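A minimal sketch of such a gated check, assuming a limit supplied through configuration. `SizeGuardConfig`, `blockLargeMessages`, and `checkBeforeSend` are hypothetical names, and the 1 MB value simply mirrors the Kafka limit from the PR title; none of this is the PR's actual code.

```typescript
// Hypothetical configuration; in the discussion above the limit would
// ideally be pushed down from the server through the client config.
interface SizeGuardConfig {
    maxMessageSizeInBytes: number;
    blockLargeMessages: boolean; // hypothetical feature gate
}

// Returns true when the message is oversized but allowed through (gate off),
// false when it is within the limit, and throws a non-retryable error when
// the gate is on and the limit is exceeded.
function checkBeforeSend(sizeInBytes: number, config: SizeGuardConfig): boolean {
    if (sizeInBytes <= config.maxMessageSizeInBytes) {
        return false;
    }
    if (!config.blockLargeMessages) {
        return true; // caller can still report the size, e.g. via telemetry
    }
    throw Object.assign(
        new Error(`Message of ${sizeInBytes} bytes exceeds limit of ${config.maxMessageSizeInBytes} bytes`),
        { canRetry: false }, // resending the same payload would fail again
    );
}

const gatedConfig: SizeGuardConfig = {
    maxMessageSizeInBytes: 1024 * 1024,
    blockLargeMessages: true,
};

console.log(checkBeforeSend(512, gatedConfig)); // false
```

Marking the error with `canRetry: false` matters here: with a retry-by-default policy, an oversized message would otherwise be resent on every reconnect.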