
Restart summarizer after 10 seconds #15140

Conversation

tyler-cai-microsoft (Contributor) commented Apr 17, 2023

AB#3898

To prevent the summarizer from restarting in rapid succession, producing many network calls that would get throttled, we delay 10 seconds before closing the summarizer.

If the summarizer restarted too frequently, it would slow down the client, which would be problematic and a degradation of the user experience.
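The delayed-close idea described above can be sketched as follows. This is an illustrative sketch, not the actual Fluid Framework implementation; the names `closeSummarizerWithDelay` and `delay` are assumptions.

```typescript
// Instead of closing the summarizer immediately on failure, wait a fixed
// delay first, so a rapid failure loop cannot hammer the service with
// throttled network calls.
const defaultCloseSummarizerDelayMs = 10_000; // 10 seconds

const delay = async (ms: number): Promise<void> =>
	new Promise((resolve) => setTimeout(resolve, ms));

async function closeSummarizerWithDelay(
	closeFn: () => void,
	delayMs: number = defaultCloseSummarizerDelayMs,
): Promise<void> {
	// Waiting *before* closing bounds the worst case to one restart per
	// delayMs window.
	await delay(delayMs);
	closeFn();
}
```

In practice the close would be driven by the container runtime; this sketch only shows the ordering of the wait and the close.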

@tyler-cai-microsoft tyler-cai-microsoft requested a review from a team as a code owner April 17, 2023 20:15
@github-actions github-actions bot added area: runtime Runtime related issues base: main PRs targeted against main branch labels Apr 17, 2023
@tyler-cai-microsoft tyler-cai-microsoft requested a review from a team as a code owner April 17, 2023 20:37
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label Apr 17, 2023
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultRestartDelayMs;

// Delay 10 seconds before restarting summarizer to prevent the summarizer from restarting too frequently.
Contributor:
10 seconds seems like an arbitrary and long time. Any reason for picking 10 seconds? There should be some way of telling from telemetry whether this is too high or too low and adjusting it accordingly. You should consider adding some.

Contributor Author:

Why is 10 seconds problematic?

Contributor:

It may not be. But we need data to know that. For example, a busy doc can have thousands of ops in 10 seconds, which makes the next summary that much more expensive. Or it may turn out that 10 seconds is too little and clients keep reloading. It's fine as a starting point, but I was interested in how you came up with 10 seconds. Plus there should be telemetry that tells us whether that number is good enough or not.

Contributor Author:

Added telemetry; we will not know if 10 seconds is enough or not until we see the telemetry. It's an arbitrary number we came up with offline. If the summarizer cache is 60 seconds, we would restart at most 5 times, which is not that bad a worst-case scenario.
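As a back-of-envelope check of the worst case described above (all numbers taken from the comment, and whether it reads as 5 or 6 restarts depends on whether the initial attempt is counted):

```typescript
// Rough bound on restarts within the summarizer-cache window: with a
// fixed restart delay, restarts are bounded by window / delay.
const restartDelayMs = 10_000; // the 10-second delay in this PR
const summarizerCacheMs = 60_000; // the 60-second cache from the comment
const worstCaseRestartIntervals = summarizerCacheMs / restartDelayMs; // 6 intervals
```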

Contributor:

It's safety against tight loops that could happen if we have some state (that we can't envision) that results in infinite reloads (or possibly tight reloads for 60 seconds, or something like that). It's arbitrary, and we should reevaluate it once we have more data from PROD. In the ideal case we should never hit it :)

Contributor:

Not familiar with the summarizer lifecycle, but wondering if waiting 10 seconds here before stopping it, vs. after, makes a difference.

tyler-cai-microsoft (Contributor Author) commented Apr 25, 2023:

I don't think so, since this would be part of the entire submit-summary call; but potentially, since we are stopping the summarizer, a different promise may resolve. It's much safer to wait before we call stop.

@@ -3225,6 +3228,14 @@ export class ContainerRuntime
},
error,
);

const testDelayOverrideMs =
Contributor:

If the error thrown contains a retryAfter field, that should be honored. If it's more than 10 seconds, this delay won't help very much.

Contributor Author:

So this delay will fire after retryAfter when we try to summarize a second time; in the trySummarize method we do wait for the delay. If we receive an unknown ack, we will wait 10 seconds to restart as well.
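One way a service-provided retryAfter hint could compose with a fixed restart delay is to take the larger of the two. Note this is not what the PR does (the PR applies the fixed delay after retryAfter has already been honored upstream); the names `ThrottlingError` and `computeRestartDelayMs` here are illustrative assumptions.

```typescript
// Sketch: honor whichever is larger, so neither the service hint nor the
// local throttle floor is violated.
interface ThrottlingError {
	retryAfterSeconds?: number;
}

const defaultRestartDelayMs = 10_000;

function computeRestartDelayMs(error: ThrottlingError | undefined): number {
	const retryAfterMs = (error?.retryAfterSeconds ?? 0) * 1000;
	return Math.max(retryAfterMs, defaultRestartDelayMs);
}
```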

Contributor:

Oh wait, I totally missed that this is in the container runtime. This code is especially important in SummaryGenerator, where the summary fails. It's not needed here because the summary didn't fail.

Contributor:

Also, with restart there is no second attempt. There is just the one attempt.

Contributor Author:

In this scenario, we don't know if the summarizer failed. It's probably best to make sure we do not burn summarizer cycles in the worst-case scenario. Waiting 10 seconds should not be an issue unless a large number of ops occurs in a very short timespan, which we should have a mitigation for when virtual sequencing rolls out.

Contributor Author:

Made it so the retry timer will fire regardless.

agarwal-navin (Contributor):
> To enable turning on the feature flag of restarting safely AB#3898

I am not sure I understand this statement from the purpose of this PR.

msfluid-bot (Collaborator) commented Apr 17, 2023:

@fluid-example/bundle-size-tests: -94 Bytes

| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 437.01 KB | 436.96 KB | -56 Bytes |
| connectionState.js | 680 Bytes | 680 Bytes | No change |
| containerRuntime.js | 230.27 KB | 230.21 KB | -62 Bytes |
| loader.js | 160.81 KB | 160.82 KB | +6 Bytes |
| map.js | 44.07 KB | 44.07 KB | +2 Bytes |
| matrix.js | 137.1 KB | 137.1 KB | +2 Bytes |
| odspDriver.js | 92.35 KB | 92.36 KB | +6 Bytes |
| odspPrefetchSnapshot.js | 43.58 KB | 43.59 KB | +4 Bytes |
| sharedString.js | 157.67 KB | 157.67 KB | +2 Bytes |
| sharedTree2.js | 240.65 KB | 240.65 KB | No change |
| Total Size | 1.61 MB | 1.61 MB | -94 Bytes |

Baseline commit: d69ccba

Generated by 🚫 dangerJS against 2e037d8

tyler-cai-microsoft (Contributor Author):
> > To enable turning on the feature flag of restarting safely AB#3898
>
> I am not sure I understand this statement from the purpose of this PR.

Updated the description.

const closeSummarizerDelayOverrideMs =
this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultCloseSummarizerDelayMs;
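The override-with-default pattern in the fragment above can be exercised in isolation. Note that `??` (unlike `||`) preserves an explicit `0` override, which is exactly what the e2e tests later rely on when they set the delay to 0. The `ConfigProvider` shape here is a minimal stand-in for the runtime's config object, not the actual Fluid API.

```typescript
// Minimal sketch of the test-only config override: a config key can
// override the close delay, falling back to the production default.
type ConfigProvider = { getNumber(key: string): number | undefined };

const defaultCloseSummarizerDelayMs = 10_000;

function resolveCloseDelayMs(config: ConfigProvider): number {
	return (
		config.getNumber(
			"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
		) ?? defaultCloseSummarizerDelayMs
	);
}
```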
Contributor:

It would be great to open a bug, say two sprints out, to review telemetry and start cleaning up the code. I.e., if data suggests we do not need flights here, or delays, we should clean them up.


this.runtime.closeFn();
return;
this.stopSummarizerCallback("latestSummaryStateStale");
this.runtime.closeFn();
Contributor:

this.runtime.closeFn(); looks fishy to me. I'm not sure why this block should have slightly different behavior than the code a couple of lines below that also does this.stopSummarizerCallback().

Contributor Author:

In the test, the runtime doesn't close unless I call this here; when I trace the code, there doesn't seem to be any call to close the runtime, which is why I have this code here. It's different, but I am not sure the summarizer behaves the way I expect it to.

Contributor Author:

Removed this code and left it centralized in the container runtime.


this.mc.logger.sendTelemetryEvent(
{
eventName: "ClosingSummarizerOnSummaryStale",
Contributor:

I think terms "ClosingSummarizerOnSummaryStale" & "latestSummaryStateStale" do not reflect reality. Nothing is really stale. Summary failed, and without cracking error messages we do not know why, so I feel like code should not come up with guesses. It could be 400/404/429 from service.

Contributor Author:

At this point, we have already sent a Summarize_cancel telemetry event which contains all that information. We do not need to send multiple error messages, which we do not read, to propagate that information. At this point, we just want to know if the feature is engaged, hence the telemetry event.

Contributor Author:

Removed this unnecessary telemetry; it will fire from the runtime, and the running summarizer will still encapsulate the errors in "FailToSummarize". We will see two Summarize_cancel events: one where the original attempt went badly, and another where we chose to restart after some delay.

this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultCloseSummarizerDelayMs;
await delay(closeSummarizerDelayOverrideMs);
Contributor:

Any reason not to set delaySeconds above and let the code above do the wait?
It feels like there is a lot of special-casing / duplication of existing code that we could avoid by plugging into the existing workflow that already does all we need: delays and closing the container.

Contributor Author:

The more I'm advised to make modifications, the more I want to gut this code and just keep it simple. I don't really want to maintain the same code in two different places; I'd rather go back to the implementation where all the logic was in one class.

// There's no need to delay here as we already have summarizeAttemptDelay given to us by the server.
if (delaySeconds === 0) {
const closeSummarizerDelayOverrideMs =
this.mc.config.getNumber(
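The branch shown in the fragment above can be reduced to a small decision: apply the extra close delay only when the server did not already impose one via summarizeAttemptDelay. The function name `extraCloseDelayMs` is an illustrative assumption, not the runtime's actual code.

```typescript
// Only add the local close delay when delaySeconds === 0, i.e. when the
// server-provided summarizeAttemptDelay has not already throttled us.
function extraCloseDelayMs(
	serverDelaySeconds: number,
	closeSummarizerDelayMs: number,
): number {
	return serverDelaySeconds === 0 ? closeSummarizerDelayMs : 0;
}
```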
Contributor:

Can you please add all the FGs we are using / adding to "ContainerLoadStats" event in containerRuntime.ts? Thanks!

Contributor Author:

Done

@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

export const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
agarwal-navin (Contributor) commented Apr 18, 2023:

Looks like this belongs in the runningSummarizer and should be imported here.

Contributor Author:

Not exactly sure of the difference between one and the other; we already import the containerRuntime file in runningSummarizer, so I was concerned about build breaks that might potentially happen because of that. Also, I don't know if I should be doing tech-debt cleanup in this PR.

Contributor:

What tech debt? Does the build fail when you do what I suggested? If so, by all means, don't fix it.
I was suggesting that defaultCloseSummarizerDelayMs belongs in runningSummarizer or one of the other files under the summarizer folder; it is not a container-runtime-specific property, so it does not belong here.
I'm not asking you to fix existing issues; it's merely a suggestion based on how these classes are designed.

? numberOfAttemptsOnRestartAsRecovery
: this.mc.config.getNumber("Fluid.Summarizer.Attempts") ??
defaultNumberSummarizationAttempts;
let totalAttempts =
Contributor:

Not doing a second attempt here is the correct approach since we already know the second attempt is going to fail. There is no point in doing the extra work.
It's fine to not do it that way for now, but create an issue to do that as a follow up.


@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
Contributor:

Add a comment there saying what this is used for.
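The requested documentation comment could look something like this (wording is illustrative, not the comment that actually landed):

```typescript
/**
 * Time to wait before closing the summarizer after a failed summary, so
 * that a misbehaving client cannot restart the summarizer in a tight loop
 * and flood the service with network calls that would be throttled.
 */
const defaultCloseSummarizerDelayMs = 10_000; // 10 seconds
```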

@@ -1283,6 +1288,14 @@ export class ContainerRuntime
this.remoteMessageProcessor.clearPartialMessagesFor(clientId);
});

this.abnormalAckRecoveryMethod = this.mc.config.getString(
Contributor:

abnormalAckRecoveryMethod does not sound right. Maybe summaryStateUpdateMethod? Then add comments to clearly state what it means, because the name alone does not make it clear.
Also, update the feature flag name accordingly.

"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod",
);
const closeSummarizerDelayOverride = this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
Contributor:

Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs -> Fluid.ContainerRuntime.Test.CloseSummarizerDelayOverrideMs?

const closeSummarizerDelayOverride = this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
);
this.closeSummarizerDelayMs = closeSummarizerDelayOverride ?? defaultCloseSummarizerDelayMs;
Contributor:

nit: Combine this with the above statement. The extra variable is not needed.

Contributor Author:

I need this to be logged in telemetry.

Contributor:

I meant remove the const closeSummarizerDelayOverride

@@ -33,6 +33,7 @@ describeNoCompat("Summarizer closes instead of refreshing", (getTestObjectProvid
beforeEach(async () => {
provider = getTestObjectProvider({ syncSummarizer: true });
settings["Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod"] = "restart";
settings["Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs"] = 0; // so that we don't have to wait for the timeout
Contributor:

You should add tests with the timeout to validate it works as expected. Maybe a unit test so timers can be mocked.
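The kind of unit test suggested above can avoid real timers entirely by injecting the delay function, so the 10-second wait is observable without actually waiting. All names here (`restartSummarizer`, `makeFakeDelay`) are illustrative sketches, not the test suite's actual API; a real test might instead use a mocked-clock library.

```typescript
// Inject the delay so a test can verify both that the delay was requested
// with the expected duration and that stop runs only after it resolves.
type DelayFn = (ms: number) => Promise<void>;

async function restartSummarizer(
	stop: () => void,
	delayMs: number,
	delayFn: DelayFn,
): Promise<void> {
	await delayFn(delayMs);
	stop();
}

// A fake delay that records the requested wait and resolves immediately.
function makeFakeDelay(): { fn: DelayFn; calls: number[] } {
	const calls: number[] = [];
	return {
		calls,
		fn: async (ms: number) => {
			calls.push(ms);
		},
	};
}
```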

@@ -669,61 +669,6 @@ describe("Runtime", () => {
assert.strictEqual(stopCall, 1);
});

it("Should not retry on failure when stopping instead of restarting", async () => {
agarwal-navin (Contributor) commented Apr 25, 2023:

Move this test to container runtime?

tyler-cai-microsoft (Contributor Author) commented Apr 25, 2023:

The running summarizer is responsible for doing multiple retries. I don't think we can move this test to the container runtime

Contributor:

Sure, not this exact test, but a test of similar functionality: the container runtime should close instead of fetching a snapshot. Unless such a test already exists.

"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod",
);
if (recoveryMethod === "restart") {
if (this.abnormalAckRecoveryMethod === "restart") {
vladsud (Contributor) commented Apr 25, 2023:

If we want to go ON by default, this FG should be reversed

Contributor Author:

This will need to be a separate PR. I think there will be some failing e2e tests that I will need to fix/adjust.

vladsud (Contributor) left a review:

There are some comments that are worth addressing before merging

@tyler-cai-microsoft tyler-cai-microsoft merged commit 153806f into microsoft:main Apr 25, 2023
@@ -1423,6 +1441,8 @@ export class ContainerRuntime
disableChunking,
disableAttachReorder: this.disableAttachReorder,
disablePartialFlush,
abnormalAckRecoveryMethod: this.summaryStateUpdateMethod,
Contributor:

@tyler-cai-microsoft I believe you missed updating this property name in the log.


tyler-cai-microsoft added a commit that referenced this pull request Apr 25, 2023
PR:
#15140 (comment)

Forgot to rename this telemetry property. Fixing that here
tyler-cai-microsoft added a commit that referenced this pull request Aug 1, 2023
[AB#5129](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5129)

In light of #15140, this
moves the close container logic to automatically close the container by
default instead of hiding it behind a feature flag.

The default behavior should be to close the container after 5 seconds

The new behavior is that instead of making a network refresh, we will
close the container by default. I ran into an interesting piece of
behavior, we will still need a lot of the refresh code, but it's only
for updating state on the current container.

---------

Co-authored-by: Alex Villarreal <[email protected]>
pradeepvairamani pushed a commit that referenced this pull request Aug 9, 2023
tyler-cai-microsoft added a commit that referenced this pull request Aug 30, 2023
[AB#5118](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5118)

Remove "refresh from server" logic. We are making this PR into `next`
instead of `main` as we want to wait a bit before we commit to removing
"refresh from server" logic. Instead of refreshing we are closing the
container and restarting. More details here:
#15140

This requires updating the `summarizerNode` logic and `containerRuntime`
logic. Work to move when we decide to close the container sits in
[AB#5152](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5152)

---------

Co-authored-by: Navin Agarwal <[email protected]>
Co-authored-by: Mark Fields <[email protected]>
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
5 participants