Block large payloads inside DocumentDeltaConnection, size due to the 1MB Kafka limit #7987

Closed
wants to merge 25 commits into from

Conversation

andre4i
Contributor

@andre4i andre4i commented Oct 25, 2021

See #7599

This change blocks message transmission and closes the container, but only behind an explicit feature gate.

@andre4i andre4i requested a review from a team as a code owner October 25, 2021 22:36
@github-actions github-actions bot requested review from vladsud, jatgarg, tanviraumi, znewton, anthony-murphy, markfields and wes-carlson and removed request for a team October 25, 2021 22:36
@github-actions github-actions bot added area: driver Driver related issues area: loader Loader related issues public api change Changes to a public API labels Oct 25, 2021
@msfluid-bot
Collaborator

msfluid-bot commented Oct 25, 2021

@fluid-example/bundle-size-tests: +3.03 KB
| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| container.js | 169.6 KB | 169.82 KB | +219 Bytes |
| map.js | 47.21 KB | 47.21 KB | No change |
| matrix.js | 143.43 KB | 143.43 KB | +1 Bytes |
| odspDriver.js | 186.55 KB | 189.29 KB | +2.74 KB |
| odspPrefetchSnapshot.js | 41.28 KB | 41.36 KB | +76 Bytes |
| sharedString.js | 164.24 KB | 164.24 KB | +1 Bytes |
| Total Size | 784.99 KB | 788.02 KB | +3.03 KB |

Baseline commit: 646a987

Generated by 🚫 dangerJS against b66f6e0

@andre4i andre4i marked this pull request as draft October 26, 2021 17:04
@github-actions github-actions bot requested a review from vladsud October 27, 2021 20:08
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label Nov 1, 2021
@andre4i andre4i marked this pull request as ready for review November 2, 2021 00:48
@andre4i andre4i changed the title Track payload inside DocumentDeltaConnection, size due to the 1MB Kafka limit Block large payloads inside DocumentDeltaConnection, size due to the 1MB Kafka limit Dec 8, 2021
@andre4i
Contributor Author

andre4i commented Dec 9, 2021

@markfields @vladsud please take a look at this. There is a bit of a behavior change with regard to retrying on error in the DeltaManager.

(errorMessage: string) => new GenericNetworkError(
fluidErrorCode,
errorMessage,
err?.canRetry === true || err?.canRetry === undefined, // unless explicitly specified, this will retry
Contributor Author

^^ We would always retry, regardless of the type of error. Not sure whether this has always been intentional.

Contributor

Here is the story :)
In ODSP, we definitely want to always reconnect, because we may get disconnected with a 403 due to token expiration. A 403 is a critical error (i.e., in general it's a game-over event), but across layers we always do one retry with a refreshed token to ensure the host has a chance to provide a new token.

With that said, I think that's the wrong layer to participate in this game. I.e., ODSP has to project such errors as recoverable. I'd need to look at the code to say whether that's the case. And I'm not sure about FRS.

Also worth noting: for a long time we considered errors without canRetry to be recoverable. We changed that (maybe 6 months back) - any exception, whether it's a bug in our code or in host code (like a token callback), is catastrophic.

So, to summarize, this code is likely wrong. But I do not think you are making it any more right :)
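
To make the two retry policies in this thread concrete, here is a minimal sketch. The `SketchDriverError` shape and both function names are hypothetical and only for illustration; the first function mirrors the `canRetry` expression quoted in the diff, and the second shows the stricter reading where a missing `canRetry` is treated as non-recoverable.

```typescript
// Hypothetical error shape: only the optional `canRetry` flag matters here.
interface SketchDriverError {
    message: string;
    canRetry?: boolean;
}

// Mirrors the expression in the diff: retry unless `canRetry` is explicitly false.
function retryUnlessForbidden(err?: SketchDriverError): boolean {
    return err?.canRetry === true || err?.canRetry === undefined;
}

// Stricter alternative: retry only when the error is explicitly marked retriable.
function retryOnlyIfAllowed(err?: SketchDriverError): boolean {
    return err?.canRetry === true;
}

// Print both decisions for each possible state of the flag.
const samples: Array<SketchDriverError | undefined> = [
    undefined,
    { message: "no flag" },
    { message: "retriable", canRetry: true },
    { message: "fatal", canRetry: false },
];

for (const sample of samples) {
    console.log(
        JSON.stringify(sample),
        "lenient:", retryUnlessForbidden(sample),
        "strict:", retryOnlyIfAllowed(sample),
    );
}
```

The two policies differ only when `canRetry` is absent, which is exactly the case debated above.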

Contributor Author

I don't believe there is a viable way right now for propagating non-retriable errors from the driver to the container. Maybe a TTL on the error itself? What are your thoughts?

@vladsud
Contributor

vladsud commented Dec 9, 2021

I know you looked into it before, but I want to poke more.
Is there really no way to determine that socket.io disconnect happened due to size violation?
Is there anything we can do to observe lower-level traffic to figure it out? That would obviously be a much better direction: convert such errors to catastrophic errors (not sure about the 0.9MB limit).

@andre4i
Contributor Author

andre4i commented Dec 9, 2021

> Is there really no way to determine that socket.io disconnect happened due to size violation?

None that I can find :( #8179 (comment) was the farthest I've gotten investigating/debugging this particular issue. I think the best way is to add the limit to the server (our server) based on my notes here: #7599 (comment), and push that value down through the client config. The DocumentDeltaConnection should then read that config and either block oversized messages or, if we fix the 1MB Kafka limit, we do the fix in the runtime layer to support larger batches.

The current PR is to be explicit about a limit that is 'de facto' breaking things silently.
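
The idea of a server-advertised limit enforced on the client behind a feature gate could look roughly like the sketch below. Everything here is an assumption for illustration: `SketchClientConfig`, its `maxMessageSize` field, the `blockLargePayloads` gate, and the error helper are hypothetical names, not the actual Fluid Framework API.

```typescript
// Hypothetical client configuration pushed down by the server.
// `maxMessageSize` is an assumed field name for this sketch.
interface SketchClientConfig {
    maxMessageSize?: number; // bytes; undefined means no limit was advertised
}

// Error carrying an explicit non-retriable marker so a retry loop can stop.
type PayloadTooLargeError = Error & { canRetry: false; size: number; limit: number };

function createPayloadTooLargeError(size: number, limit: number): PayloadTooLargeError {
    const error = new Error(`Payload of ${size} bytes exceeds the ${limit} byte limit`);
    return Object.assign(error, { canRetry: false as const, size, limit });
}

// Throws before transmission when the gate is on and the payload is over the limit.
function assertPayloadAllowed(
    serializedPayload: string,
    config: SketchClientConfig,
    blockLargePayloads: boolean, // hypothetical feature gate
): void {
    if (!blockLargePayloads || config.maxMessageSize === undefined) {
        return; // gate off or no limit advertised: keep the existing behavior
    }
    const size = new TextEncoder().encode(serializedPayload).length;
    if (size > config.maxMessageSize) {
        throw createPayloadTooLargeError(size, config.maxMessageSize);
    }
}

const sketchConfig: SketchClientConfig = { maxMessageSize: 1024 };
const bigPayload = "x".repeat(2048);

// Gate off: the oversized payload passes through unchanged.
assertPayloadAllowed(bigPayload, sketchConfig, false);

// Gate on: the oversized payload is rejected with a non-retriable error.
try {
    assertPayloadAllowed(bigPayload, sketchConfig, true);
} catch (error) {
    const typed = error as PayloadTooLargeError;
    console.log(typed.message, "canRetry:", typed.canRetry);
}
```

Marking the error as non-retriable matters here because a reconnect would only resend the same oversized payload.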

public static sizeInBytes(message: IDocumentMessage): number {
const { contents, ...restOfObject } = message;
// `contents` is already stringified. Re-stringifying the whole message will
// lead to additional escape characters which will increase the size artificially.
Contributor

I think this comment suggests that what we measure is not what actually gets counted by socket.io. I.e., if socket.io stringifies the payload, it will add all these escape characters and they will count against the limit, right?
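
The gap raised in this comment can be illustrated with a small sketch. The `SketchMessage` type is a simplified, hypothetical stand-in for `IDocumentMessage`, and whether socket.io actually re-stringifies the payload this way is an assumption, not something confirmed in this thread.

```typescript
// Hypothetical, simplified message shape; the real IDocumentMessage has more fields.
interface SketchMessage {
    type: string;
    clientSequenceNumber: number;
    referenceSequenceNumber: number;
    contents: string; // already JSON-stringified by an upper layer
}

// UTF-8 byte length of a string.
function utf8Length(text: string): number {
    return new TextEncoder().encode(text).length;
}

// Approach from the snippet under review: count `contents` as-is and
// stringify only the rest of the message.
function sizeWithoutReescaping(message: SketchMessage): number {
    const { contents, ...restOfObject } = message;
    return utf8Length(contents) + utf8Length(JSON.stringify(restOfObject));
}

// Size if a transport stringified the whole message again (assumed behavior).
function sizeOfWholeStringify(message: SketchMessage): number {
    return utf8Length(JSON.stringify(message));
}

// Bytes added purely by escaping an already-stringified `contents`
// (the two surrounding quotes added by JSON.stringify are excluded).
function escapeOverhead(contents: string): number {
    return utf8Length(JSON.stringify(contents)) - 2 - utf8Length(contents);
}

const sketchMessage: SketchMessage = {
    type: "op",
    clientSequenceNumber: 1,
    referenceSequenceNumber: 0,
    contents: JSON.stringify({ key: "value", text: 'say "hi"' }),
};

console.log({
    measured: sizeWithoutReescaping(sketchMessage),
    reStringified: sizeOfWholeStringify(sketchMessage),
    escapeOverhead: escapeOverhead(sketchMessage.contents),
});
```

Every quote and backslash inside `contents` costs one extra byte when escaped, so quote-heavy JSON payloads widen the gap between the measured size and the re-stringified size.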

@andre4i
Contributor Author

andre4i commented Jan 12, 2022

This approach is currently not preferred, as the per-op overhead risks making it a performance bottleneck. We'll be exploring alternative solutions, such as a socket.io limit along with an improved error retry mechanism; the latter would also fix the current reconnect loop on socket.io errors.

@andre4i andre4i closed this Jan 12, 2022
@vladsud vladsud mentioned this pull request Feb 4, 2022