
Restart summarizer after 10 seconds #15140

Conversation

tyler-cai-microsoft (Contributor) commented Apr 17, 2023

AB#3898

To prevent the summarizer from restarting in rapid succession, producing many network calls that would get throttled, we delay 10 seconds before closing the summarizer.

If the summarizer restarted too frequently, it would slow down the client, which would be problematic and a degradation of the user experience.
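The delayed-close idea described above can be sketched as follows. This is an illustrative sketch, not the actual Fluid Framework implementation; the names `closeSummarizerWithDelay` and `delay` are assumptions.

```typescript
// Instead of closing the summarizer immediately on failure, wait a fixed
// delay first, so a rapid failure loop cannot hammer the service with
// throttled network calls.
const defaultCloseSummarizerDelayMs = 10_000; // 10 seconds

const delay = async (ms: number): Promise<void> =>
	new Promise((resolve) => setTimeout(resolve, ms));

async function closeSummarizerWithDelay(
	closeFn: () => void,
	delayMs: number = defaultCloseSummarizerDelayMs,
): Promise<void> {
	// Waiting *before* closing bounds the worst case to one restart per
	// delayMs window.
	await delay(delayMs);
	closeFn();
}
```

In practice the close would be driven by the container runtime; this sketch only shows the ordering of the wait and the close.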

@tyler-cai-microsoft tyler-cai-microsoft requested a review from a team as a code owner April 17, 2023 20:15
@github-actions github-actions bot added area: runtime Runtime related issues base: main PRs targeted against main branch labels Apr 17, 2023
@tyler-cai-microsoft tyler-cai-microsoft requested a review from a team as a code owner April 17, 2023 20:37
@github-actions github-actions bot added the area: tests Tests to add, test infrastructure improvements, etc label Apr 17, 2023
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultRestartDelayMs;

// Delay 10 seconds before restarting summarizer to prevent the summarizer from restarting too frequently.
Contributor:
10 seconds seems like an arbitrary and long time. Any reason for picking 10 seconds? There should be some way of telling from telemetry whether this is too high or too low and adjusting it accordingly. You should consider adding some.

Contributor Author:

Why is 10 seconds problematic?

Contributor:

It may not be. But we need data to know that. For example, a busy doc can have thousands of ops in 10 seconds, which makes the next summary that much more expensive. Or it may turn out that 10 seconds is too little and clients keep reloading. It's fine as a starting point, but I was interested in how you came up with 10 seconds. Plus there should be telemetry that tells us whether that number is good enough or not.

Contributor Author:

Added telemetry; we will not know if 10 seconds is enough or not until we see the telemetry. It's an arbitrary number we came up with offline. If the summarizer cache is 60 seconds, we would restart at most 5 times, which is not that bad a worst-case scenario.
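As a back-of-envelope check of the worst case described above (all numbers taken from the comment, and whether it reads as 5 or 6 restarts depends on whether the initial attempt is counted):

```typescript
// Rough bound on restarts within the summarizer-cache window: with a
// fixed restart delay, restarts are bounded by window / delay.
const restartDelayMs = 10_000; // the 10-second delay in this PR
const summarizerCacheMs = 60_000; // the 60-second cache from the comment
const worstCaseRestartIntervals = summarizerCacheMs / restartDelayMs; // 6 intervals
```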

Contributor:

It's safety against tight loops that could happen if we have some state (that we can't envision) that results in infinite reloads (or possibly tight reloads for 60 seconds, or something like that). It's arbitrary, and we should reevaluate it once we have more data from PROD. In the ideal case we should never hit it :)

Contributor:

Not familiar with the summarizer lifecycle, but wondering if waiting 10 seconds here before stopping it, vs. after, makes a difference.

tyler-cai-microsoft (Contributor Author) commented Apr 25, 2023:

I don't think so, since this would be part of the entire submit-summary call; but potentially, since we are stopping the summarizer, a different promise may resolve. It's much safer to wait before we call stop.

@@ -3225,6 +3228,14 @@ export class ContainerRuntime
},
error,
);

const testDelayOverrideMs =
Contributor:

If the error thrown contains a retryAfter field, that should be honored. If it's more than 10 seconds, this delay won't help very much.

Contributor Author:

So this delay will fire after retryAfter when we try to summarize a second time; in the trySummarize method we do wait for the delay. If we receive an unknown ack, we will wait 10 seconds to restart as well.
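One way a service-provided retryAfter hint could compose with a fixed restart delay is to take the larger of the two. Note this is not what the PR does (the PR applies the fixed delay after retryAfter has already been honored upstream); the names `ThrottlingError` and `computeRestartDelayMs` here are illustrative assumptions.

```typescript
// Sketch: honor whichever is larger, so neither the service hint nor the
// local throttle floor is violated.
interface ThrottlingError {
	retryAfterSeconds?: number;
}

const defaultRestartDelayMs = 10_000;

function computeRestartDelayMs(error: ThrottlingError | undefined): number {
	const retryAfterMs = (error?.retryAfterSeconds ?? 0) * 1000;
	return Math.max(retryAfterMs, defaultRestartDelayMs);
}
```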

Contributor:

Oh wait, I totally missed that this is in the container runtime. This code is especially important in SummaryGenerator, where the summary fails. It's not needed here because the summary didn't fail.

Contributor:

Also, with restart there is no second attempt. There is just the one attempt.

Contributor Author:

In this scenario, we don't know if the summarizer failed. It's probably best to make sure we do not burn summarizer cycles in the worst-case scenario. Waiting 10 seconds should not be an issue unless a large number of ops occurs in a very short timespan, which we should have a mitigation for when virtual sequencing rolls out.

Contributor Author:

Made it so the retry timer will fire regardless.

agarwal-navin (Contributor):
> To enable turning on the feature flag of restarting safely AB#3898

I am not sure I understand this statement from the purpose of this PR.

msfluid-bot (Collaborator) commented Apr 17, 2023:

@fluid-example/bundle-size-tests: -94 Bytes

| Metric Name | Baseline Size | Compare Size | Size Diff |
| --- | --- | --- | --- |
| aqueduct.js | 437.01 KB | 436.96 KB | -56 Bytes |
| connectionState.js | 680 Bytes | 680 Bytes | No change |
| containerRuntime.js | 230.27 KB | 230.21 KB | -62 Bytes |
| loader.js | 160.81 KB | 160.82 KB | +6 Bytes |
| map.js | 44.07 KB | 44.07 KB | +2 Bytes |
| matrix.js | 137.1 KB | 137.1 KB | +2 Bytes |
| odspDriver.js | 92.35 KB | 92.36 KB | +6 Bytes |
| odspPrefetchSnapshot.js | 43.58 KB | 43.59 KB | +4 Bytes |
| sharedString.js | 157.67 KB | 157.67 KB | +2 Bytes |
| sharedTree2.js | 240.65 KB | 240.65 KB | No change |
| Total Size | 1.61 MB | 1.61 MB | -94 Bytes |

Baseline commit: d69ccba

Generated by 🚫 dangerJS against 2e037d8

tyler-cai-microsoft (Contributor Author):
> > To enable turning on the feature flag of restarting safely AB#3898
>
> I am not sure I understand this statement from the purpose of this PR.

Updated the description.

const closeSummarizerDelayOverrideMs =
this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultCloseSummarizerDelayMs;
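The override-with-default pattern in the fragment above can be exercised in isolation. Note that `??` (unlike `||`) preserves an explicit `0` override, which is exactly what the e2e tests later rely on when they set the delay to 0. The `ConfigProvider` shape here is a minimal stand-in for the runtime's config object, not the actual Fluid API.

```typescript
// Minimal sketch of the test-only config override: a config key can
// override the close delay, falling back to the production default.
type ConfigProvider = { getNumber(key: string): number | undefined };

const defaultCloseSummarizerDelayMs = 10_000;

function resolveCloseDelayMs(config: ConfigProvider): number {
	return (
		config.getNumber(
			"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
		) ?? defaultCloseSummarizerDelayMs
	);
}
```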
Contributor:

It would be great to open a bug, say two sprints out, to review telemetry and start cleaning up the code. I.e., if data suggests we do not need flights here, or delays, we should clean them up.


this.runtime.closeFn();
return;
this.stopSummarizerCallback("latestSummaryStateStale");
this.runtime.closeFn();
Contributor:

this.runtime.closeFn(); looks fishy to me. I'm not sure why this block should have slightly different behavior than the code a couple of lines below that also does this.stopSummarizerCallback().

Contributor Author:

In the test, the runtime doesn't close unless I call this here; when I trace the code, there doesn't seem to be any call to close the runtime, which is why I have this code here. It's different, but I am not sure the summarizer behaves the way I expect it to.

Contributor Author:

Removed this code and left it centralized in the container runtime.


this.mc.logger.sendTelemetryEvent(
{
eventName: "ClosingSummarizerOnSummaryStale",
Contributor:

I think terms "ClosingSummarizerOnSummaryStale" & "latestSummaryStateStale" do not reflect reality. Nothing is really stale. Summary failed, and without cracking error messages we do not know why, so I feel like code should not come up with guesses. It could be 400/404/429 from service.

Contributor Author:

At this point, we have already sent a Summarize_cancel telemetry event which contains all that information. We do not need to send multiple error messages, which we do not read, to propagate that information. At this point, we just want to know if the feature is engaged, hence the telemetry event.

Contributor Author:

Removed this unnecessary telemetry; it will fire from the runtime, and the running summarizer will still encapsulate the errors in "FailToSummarize". We will see two Summarize_cancel events: one where the original attempt went badly, and another where we chose to restart after some delay.

this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
) ?? defaultCloseSummarizerDelayMs;
await delay(closeSummarizerDelayOverrideMs);
Contributor:

Any reason not to set delaySeconds above and let the code above do the wait?
It feels like there is a lot of special-casing / duplication of existing code that we could avoid by plugging into the existing workflow that already does all we need: delays and closing the container.

Contributor Author:

The more I'm advised to make modifications, the more I want to gut this code and just keep it simple. I don't really want to maintain the same code in two different places; I'd rather go back to the implementation where all the logic was in one class.

// There's no need to delay here as we already have summarizeAttemptDelay given to us by the server.
if (delaySeconds === 0) {
const closeSummarizerDelayOverrideMs =
this.mc.config.getNumber(
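The branch shown in the fragment above can be reduced to a small decision: apply the extra close delay only when the server did not already impose one via summarizeAttemptDelay. The function name `extraCloseDelayMs` is an illustrative assumption, not the runtime's actual code.

```typescript
// Only add the local close delay when delaySeconds === 0, i.e. when the
// server-provided summarizeAttemptDelay has not already throttled us.
function extraCloseDelayMs(
	serverDelaySeconds: number,
	closeSummarizerDelayMs: number,
): number {
	return serverDelaySeconds === 0 ? closeSummarizerDelayMs : 0;
}
```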
Contributor:

Can you please add all the FGs we are using / adding to "ContainerLoadStats" event in containerRuntime.ts? Thanks!

Contributor Author:

Done

@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

export const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
agarwal-navin (Contributor) commented Apr 18, 2023:

Looks like this belongs in the runningSummarizer and should be imported here.

Contributor Author:

Not exactly sure of the difference between one and the other; we already import the containerRuntime file in runningSummarizer, so I was concerned about build breaks that might potentially happen because of that. Also, I don't know if I should be doing tech-debt cleanup in this PR.

Contributor:

What tech debt? Does the build fail when you do what I suggested? If so, by all means, don't fix it.
I was suggesting that defaultCloseSummarizerDelayMs belongs in runningSummarizer or one of the other files under the summarizer folder; it is not a container-runtime-specific property, so it does not belong here.
I'm not asking you to fix existing issues; it's merely a suggestion based on how these classes are designed.

? numberOfAttemptsOnRestartAsRecovery
: this.mc.config.getNumber("Fluid.Summarizer.Attempts") ??
defaultNumberSummarizationAttempts;
let totalAttempts =
Contributor:

Not doing a second attempt here is the correct approach since we already know the second attempt is going to fail. There is no point in doing the extra work.
It's fine to not do it that way for now, but create an issue to do that as a follow up.


@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
Contributor:

Add a comment there saying what this is used for.
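The requested documentation comment could look something like this (wording is illustrative, not the comment that actually landed):

```typescript
/**
 * Time to wait before closing the summarizer after a failed summary, so
 * that a misbehaving client cannot restart the summarizer in a tight loop
 * and flood the service with network calls that would be throttled.
 */
const defaultCloseSummarizerDelayMs = 10_000; // 10 seconds
```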

@@ -1283,6 +1288,14 @@ export class ContainerRuntime
this.remoteMessageProcessor.clearPartialMessagesFor(clientId);
});

this.abnormalAckRecoveryMethod = this.mc.config.getString(
Contributor:

abnormalAckRecoveryMethod does not sound right. Maybe summaryStateUpdateMethod? Then add comments to clearly state what it means, because the name alone does not make it clear.
Also, update the feature flag name accordingly.

"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod",
);
const closeSummarizerDelayOverride = this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
Contributor:

Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs -> Fluid.ContainerRuntime.Test.CloseSummarizerDelayOverrideMs?

const closeSummarizerDelayOverride = this.mc.config.getNumber(
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
);
this.closeSummarizerDelayMs = closeSummarizerDelayOverride ?? defaultCloseSummarizerDelayMs;
Contributor:

nit: Combine this with the above statement. The extra variable is not needed.

Contributor Author:

I need this to be logged in telemetry.

Contributor:

I meant remove the const closeSummarizerDelayOverride

@@ -33,6 +33,7 @@ describeNoCompat("Summarizer closes instead of refreshing", (getTestObjectProvid
beforeEach(async () => {
provider = getTestObjectProvider({ syncSummarizer: true });
settings["Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod"] = "restart";
settings["Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs"] = 0; // so that we don't have to wait for the timeout
Contributor:

You should add tests with the timeout to validate it works as expected. Maybe a unit test so timers can be mocked.
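The kind of unit test suggested above can avoid real timers entirely by injecting the delay function, so the 10-second wait is observable without actually waiting. All names here (`restartSummarizer`, `makeFakeDelay`) are illustrative sketches, not the test suite's actual API; a real test might instead use a mocked-clock library.

```typescript
// Inject the delay so a test can verify both that the delay was requested
// with the expected duration and that stop runs only after it resolves.
type DelayFn = (ms: number) => Promise<void>;

async function restartSummarizer(
	stop: () => void,
	delayMs: number,
	delayFn: DelayFn,
): Promise<void> {
	await delayFn(delayMs);
	stop();
}

// A fake delay that records the requested wait and resolves immediately.
function makeFakeDelay(): { fn: DelayFn; calls: number[] } {
	const calls: number[] = [];
	return {
		calls,
		fn: async (ms: number) => {
			calls.push(ms);
		},
	};
}
```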

@@ -669,61 +669,6 @@ describe("Runtime", () => {
assert.strictEqual(stopCall, 1);
});

it("Should not retry on failure when stopping instead of restarting", async () => {
agarwal-navin (Contributor) commented Apr 25, 2023:

Move this test to container runtime?

tyler-cai-microsoft (Contributor Author) commented Apr 25, 2023:

The running summarizer is responsible for doing multiple retries. I don't think we can move this test to the container runtime

Contributor:

Sure, not this exact test, but a test of similar functionality: the container runtime should close instead of fetching a snapshot. Unless such a test already exists.

"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod",
);
if (recoveryMethod === "restart") {
if (this.abnormalAckRecoveryMethod === "restart") {
vladsud (Contributor) commented Apr 25, 2023:

If we want to go ON by default, this FG should be reversed

Contributor Author:

This will need to be a separate PR. I think there will be some failing e2e tests that I will need to fix/adjust.

vladsud (Contributor) left a review:

There are some comments that are worth addressing before merging

@tyler-cai-microsoft tyler-cai-microsoft merged commit 153806f into microsoft:main Apr 25, 2023
@@ -1423,6 +1441,8 @@ export class ContainerRuntime
disableChunking,
disableAttachReorder: this.disableAttachReorder,
disablePartialFlush,
abnormalAckRecoveryMethod: this.summaryStateUpdateMethod,
Contributor:

@tyler-cai-microsoft I believe you missed updating this property name in the log.


tyler-cai-microsoft added a commit that referenced this pull request Apr 25, 2023
PR:
#15140 (comment)

Forgot to rename this telemetry property. Fixing that here
tyler-cai-microsoft added a commit that referenced this pull request Aug 1, 2023
[AB#5129](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5129)

In light of #15140, this
moves the close container logic to automatically close the container by
default instead of hiding it behind a feature flag.

The default behavior should be to close the container after 5 seconds

The new behavior is that instead of making a network refresh, we will
close the container by default. I ran into an interesting piece of
behavior, we will still need a lot of the refresh code, but it's only
for updating state on the current container.

---------

Co-authored-by: Alex Villarreal <[email protected]>
pradeepvairamani pushed a commit that referenced this pull request Aug 9, 2023
tyler-cai-microsoft added a commit that referenced this pull request Aug 30, 2023
[AB#5118](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5118)

Remove "refresh from server" logic. We are making this PR into `next`
instead of `main` as we want to wait a bit before we commit to removing
"refresh from server" logic. Instead of refreshing we are closing the
container and restarting. More details here:
#15140

This requires updating the `summarizerNode` logic and `containerRuntime`
logic. Work to move when we decide to close the container sits in
[AB#5152](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5152)

---------

Co-authored-by: Navin Agarwal <[email protected]>
Co-authored-by: Mark Fields <[email protected]>
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
5 participants