Restart summarizer after 10 seconds #15140
Conversation
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs", | ||
) ?? defaultRestartDelayMs; | ||
|
||
// Delay 10 seconds before restarting summarizer to prevent the summarizer from restarting too frequently. |
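For context, a minimal self-contained sketch of the override pattern in this hunk. It assumes the `delay` helper from `@fluidframework/common-utils` and a config reader with the `getNumber` shape used above; the wrapper function and its name are illustrative, not the actual runtime code.

```ts
import { delay } from "@fluidframework/common-utils";

const defaultRestartDelayMs = 10000; // 10 seconds

// `config` stands in for the runtime's monitoring-context config reader.
async function waitBeforeRestart(config: {
	getNumber(name: string): number | undefined;
}): Promise<void> {
	const restartDelayMs =
		config.getNumber(
			"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
		) ?? defaultRestartDelayMs;
	// Wait before restarting the summarizer so it cannot restart too frequently.
	await delay(restartDelayMs);
}
```

The test-only flag lets tests shrink the wait to zero, while production falls back to the 10-second default.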
10 seconds seems like an arbitrary and long time. Any reason for picking 10 seconds? There should be some way of telling from telemetry whether this is too high or too low and adjusting it accordingly. You should consider adding some.
Why is 10 seconds problematic?
It may not be. But we need data to know that. For example, a busy doc can have thousands of ops in 10 seconds, which makes the next summary that much more expensive. Or it may turn out that 10 seconds is too short and clients keep reloading. It's fine as a starting point, but I was interested in how you came up with 10 seconds. Plus, there should be telemetry that tells us whether that number is good enough or not.
Added telemetry; we will not know whether 10 seconds is enough until we see it. It's an arbitrary number we came up with offline. If the summarizer cache is 60 seconds, we would restart at most 5 times, which is not that bad a worst-case scenario.
It's safety against tight loops that could happen if we have some state (that we can't envision) that results in infinite reloads (or possibly tight reloads for 60 seconds, or something like that). It's arbitrary, and we should reevaluate it once we have more data from PROD. In the ideal case we should not hit it :)
I'm not familiar with the summarizer lifecycle, but I'm wondering whether waiting 10 seconds before stopping it vs. after makes a difference.
I don't think so, since this would be part of the entire submit-summary call. But potentially, since we are stopping the summarizer, a different promise may resolve. It's much safer to wait before we call stop.
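To make the ordering question concrete, here is a hedged sketch of the two options (both function names and the `stopSummarizer` callback are hypothetical; `delay` is assumed from `@fluidframework/common-utils`):

```ts
import { delay } from "@fluidframework/common-utils";

// Option discussed in the thread: wait first, then stop. The delay completes
// inside the same control flow, so nothing observes a stopped summarizer early.
async function delayThenStop(stopSummarizer: () => void, delayMs: number): Promise<void> {
	await delay(delayMs);
	stopSummarizer();
}

// The alternative: stop first, then wait. Once stop is called, a different
// promise tied to the summarizer's lifecycle may resolve during the wait.
async function stopThenDelay(stopSummarizer: () => void, delayMs: number): Promise<void> {
	stopSummarizer();
	await delay(delayMs);
}
```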
```
@@ -3225,6 +3228,14 @@ export class ContainerRuntime
		},
		error,
	);

	const testDelayOverrideMs =
```
If the error thrown contains a `retryAfter` field, that should be honored. If it's more than 10 seconds, this delay won't help very much.
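A sketch of what honoring such a field could look like, assuming a millisecond-valued `retryAfter` on the error object (the field name and units come from this comment, not from a confirmed API):

```ts
import { delay } from "@fluidframework/common-utils";

// Hypothetical error shape; the real error may expose retry info differently.
interface MaybeRetriableError {
	retryAfter?: number; // assumed milliseconds
}

async function waitHonoringRetryAfter(
	error: MaybeRetriableError | undefined,
	defaultDelayMs: number,
): Promise<void> {
	// Honor the server-provided backoff when it exceeds our fixed delay.
	await delay(Math.max(error?.retryAfter ?? 0, defaultDelayMs));
}
```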
So this delay will fire after the retryAfter when we try to summarize a second time. In the `trySummarize` method we do wait for the delay seconds. If we receive an unknown ack, we will wait 10 seconds to restart as well.
Oh wait, I totally missed that this is in the container runtime. This code is especially important in `SummaryGenerator`, where the summary fails. It's not needed here because the summary didn't fail.
Also, with restart there is no second attempt. There is just the one attempt.
In this scenario, we don't know whether the summarizer failed. It's probably best to make sure we do not burn summarizer cycles in the worst case. Waiting 10 seconds should not be an issue unless a large number of ops occurs in a very short timespan, which we should have a mitigation for once virtual sequencing rolls out.
> Made it so the retry timer will fire regardless

I am not sure I understand this statement from the purpose of this PR.
⯆ @fluid-example/bundle-size-tests: -94 Bytes
Baseline commit: d69ccba
Updated the description
```ts
const closeSummarizerDelayOverrideMs =
	this.mc.config.getNumber(
		"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
	) ?? defaultCloseSummarizerDelayMs;
```
It would be great to open a bug, say two sprints out, to review telemetry and start cleaning up code. I.e., if the data suggests we do not need flights or delays here, we should clean them up.
```ts
	this.runtime.closeFn();
	return;
	this.stopSummarizerCallback("latestSummaryStateStale");
	this.runtime.closeFn();
```
`this.runtime.closeFn();` looks fishy to me. I'm not sure why this block should have slightly different behavior than the code a couple of lines below that also does `this.stopSummarizerCallback()`.
In the test, the runtime doesn't close unless I call this here. When I trace the code, there doesn't seem to be any other call that closes the runtime, which is why I have this code here. It's different, but I am not sure the behavior of the summarizer does what I expect it to do.
Removed this code and left it centralized in the container runtime
```ts
this.mc.logger.sendTelemetryEvent(
	{
		eventName: "ClosingSummarizerOnSummaryStale",
```
I think the terms "ClosingSummarizerOnSummaryStale" & "latestSummaryStateStale" do not reflect reality. Nothing is really stale. The summary failed, and without cracking error messages we do not know why, so I feel the code should not come up with guesses. It could be a 400/404/429 from the service.
At this point, we have already sent a telemetry event, `Summarize_cancel`, which contains all that information. We do not need to send multiple error messages, which we do not read, to propagate that information. At this point, we just want to know whether the feature is engaged, hence the telemetry event.
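For reference, a minimal feature-engagement event of the kind described might look like the sketch below, using Fluid's `ITelemetryLogger` interface; the helper function is illustrative:

```ts
import { ITelemetryLogger } from "@fluidframework/common-definitions";

// Emit a bare "feature engaged" signal. Error details already went out in the
// earlier Summarize_cancel event, so nothing else is attached here.
function logRestartEngaged(logger: ITelemetryLogger): void {
	logger.sendTelemetryEvent({ eventName: "ClosingSummarizerOnSummaryStale" });
}
```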
Removed this unnecessary telemetry; it will fire from the runtime, and the running summarizer will still encapsulate the errors in "FailToSummarize". We will see two Summarize_cancel events: one where the original attempt went badly, and another where we chose to restart after some delay.
```ts
	this.mc.config.getNumber(
		"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
	) ?? defaultCloseSummarizerDelayMs;
await delay(closeSummarizerDelayOverrideMs);
```
Any reason not to set `delaySeconds` above and let the code above do the wait?
It feels like there is a lot of special-casing / duplication of existing code that we could avoid by plugging into the existing workflow, which already does all we need: delays and closing the container.
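A sketch of that suggestion, under the assumption that the surrounding workflow already awaits a `delaySeconds` value (names mirror the hunks above, but the function itself is hypothetical):

```ts
// Hypothetical helper: fold the close-summarizer delay into the delaySeconds
// value the existing retry workflow already awaits, instead of adding a
// second `await delay(...)` inside this branch.
function computeDelaySeconds(
	serverDelaySeconds: number,
	closeSummarizerDelayOverrideMs: number,
): number {
	// The existing workflow already waits for the server-provided delay; only
	// when it imposes none do we substitute the close-summarizer delay.
	return serverDelaySeconds > 0
		? serverDelaySeconds
		: closeSummarizerDelayOverrideMs / 1000;
}
```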
The more I'm advised to make modifications, the more I want to gut this code and just keep it simple. I don't really want to maintain the same code in two different places. I'd rather go back to the implementation where all the logic was in one class.
```ts
// There's no need to delay here as we already have summarizeAttemptDelay given to us by the server.
if (delaySeconds === 0) {
	const closeSummarizerDelayOverrideMs =
		this.mc.config.getNumber(
```
Can you please add all the FGs we are using / adding to the "ContainerLoadStats" event in containerRuntime.ts? Thanks!
Done
```
@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

export const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
```
Looks like this belongs in the `runningSummarizer` and should be imported here.
I'm not exactly sure about the difference between one and the other; we already import the containerRuntime file in runningSummarizer, so I was concerned about build breaks that might happen because of that. I don't know if I should be doing tech-debt work in this PR.
What tech debt? Does the build fail when you do what I suggested? If so, by all means, don't fix it.
I was suggesting that `defaultCloseSummarizerDelayMs` belongs in runningSummarizer or one of the other files under the `summarizer` folder. It is not a container-runtime-specific property, so it does not belong here.
I'm not asking you to fix existing issues; it's merely a suggestion based on how these classes are designed.
```ts
	? numberOfAttemptsOnRestartAsRecovery
	: this.mc.config.getNumber("Fluid.Summarizer.Attempts") ??
	  defaultNumberSummarizationAttempts;
let totalAttempts =
```
Not doing a second attempt here is the correct approach, since we already know the second attempt is going to fail; there is no point in doing the extra work.
It's fine not to do it that way for now, but create an issue to do that as a follow-up.
```
@@ -508,6 +509,8 @@ const defaultCompressionConfig = {

const defaultChunkSizeInBytes = 204800;

const defaultCloseSummarizerDelayMs = 10000; // 10 seconds
```
Add a comment there saying what this is used for.
```
@@ -1283,6 +1288,14 @@ export class ContainerRuntime
	this.remoteMessageProcessor.clearPartialMessagesFor(clientId);
});

this.abnormalAckRecoveryMethod = this.mc.config.getString(
```
`abnormalAckRecoveryMethod` does not sound right. Maybe `summaryStateUpdateMethod`? And then add comments to clearly state what it means, because the name does not make it clear.
Also, update the feature flag name accordingly.
"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod", | ||
); | ||
const closeSummarizerDelayOverride = this.mc.config.getNumber( | ||
"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs", |
`Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs` -> `Fluid.ContainerRuntime.Test.CloseSummarizerDelayOverrideMs`?
```ts
const closeSummarizerDelayOverride = this.mc.config.getNumber(
	"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
);
this.closeSummarizerDelayMs = closeSummarizerDelayOverride ?? defaultCloseSummarizerDelayMs;
```
nit: Combine this with the above statement. The extra variable is not needed.
I need this to be logged in telemetry.
I meant remove the `closeSummarizerDelayOverride` const.
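That is, something like this sketch, where the override read is folded straight into the member assignment and the resulting value can still be logged later from `closeSummarizerDelayMs` (the class is a minimal, hypothetical stand-in for ContainerRuntime):

```ts
const defaultCloseSummarizerDelayMs = 10000; // 10 seconds

// Minimal stand-in for ContainerRuntime: the override read is folded straight
// into the member assignment; telemetry can still log `closeSummarizerDelayMs`.
class RuntimeSketch {
	public readonly closeSummarizerDelayMs: number;

	constructor(config: { getNumber(name: string): number | undefined }) {
		this.closeSummarizerDelayMs =
			config.getNumber(
				"Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs",
			) ?? defaultCloseSummarizerDelayMs;
	}
}
```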
```
@@ -33,6 +33,7 @@ describeNoCompat("Summarizer closes instead of refreshing", (getTestObjectProvider) => {
beforeEach(async () => {
	provider = getTestObjectProvider({ syncSummarizer: true });
	settings["Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod"] = "restart";
	settings["Fluid.ContainerRuntime.Test.CloseSummarizerOnSummaryStaleDelayOverrideMs"] = 0; // so that we don't have to wait for the timeout
```
You should add tests with the timeout to validate it works as expected. Maybe a unit test so timers can be mocked.
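A hedged sketch of such a unit test, assuming mocha plus sinon fake timers; the delayed-close function here is a stand-in for the real code path, not the actual runtime API:

```ts
import { strict as assert } from "assert";
import { SinonFakeTimers, useFakeTimers } from "sinon";

describe("close summarizer delay", () => {
	let clock: SinonFakeTimers;
	beforeEach(() => {
		clock = useFakeTimers();
	});
	afterEach(() => clock.restore());

	it("does not close before the delay elapses", async () => {
		const delayMs = 10000;
		let closed = false;
		// Stand-in for the runtime's delayed-close path.
		const closeAfterDelay = async (): Promise<void> => {
			await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
			closed = true;
		};
		const pending = closeAfterDelay();
		await clock.tickAsync(delayMs - 1);
		assert.equal(closed, false, "should still be waiting");
		await clock.tickAsync(1);
		await pending;
		assert.equal(closed, true, "should close once the delay elapses");
	});
});
```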
```
@@ -669,61 +669,6 @@ describe("Runtime", () => {
	assert.strictEqual(stopCall, 1);
});

it("Should not retry on failure when stopping instead of restarting", async () => {
```
Move this test to container runtime?
The running summarizer is responsible for doing multiple retries. I don't think we can move this test to the container runtime
Sure, not this exact test. But testing of similar functionality: the container runtime should close instead of fetching a snapshot. Unless it already exists.
"Fluid.ContainerRuntime.Test.SummarizationRecoveryMethod", | ||
); | ||
if (recoveryMethod === "restart") { | ||
if (this.abnormalAckRecoveryMethod === "restart") { |
If we want to go ON by default, this FG should be reversed
This will need to be a separate PR. I think there will be some failing e2e tests that I will need to fix/adjust.
packages/test/test-end-to-end-tests/src/test/summarizeRestart.spec.ts
There are some comments that are worth addressing before merging
```
@@ -1423,6 +1441,8 @@ export class ContainerRuntime
	disableChunking,
	disableAttachReorder: this.disableAttachReorder,
	disablePartialFlush,
	abnormalAckRecoveryMethod: this.summaryStateUpdateMethod,
```
@tyler-cai-microsoft I believe you missed updating this property name in the log.
PR: #15140 (comment). Forgot to rename this telemetry property; fixing that here.
[AB#5129](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5129) In light of #15140, this moves the close-container logic to automatically close the container by default instead of hiding it behind a feature flag. The default behavior is to close the container after 5 seconds. The new behavior is that, instead of making a network refresh, we will close the container by default. I ran into an interesting piece of behavior: we will still need a lot of the refresh code, but only for updating state on the current container. --------- Co-authored-by: Alex Villarreal <[email protected]>
[AB#5118](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5118) Remove "refresh from server" logic. We are making this PR into `next` instead of `main`, as we want to wait a bit before we commit to removing the "refresh from server" logic. Instead of refreshing, we are closing the container and restarting. More details here: #15140. This requires updating the `summarizerNode` logic and `containerRuntime` logic. Work to change when we decide to close the container is tracked in [AB#5152](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5152) --------- Co-authored-by: Navin Agarwal <[email protected]> Co-authored-by: Mark Fields <[email protected]>
AB#3898
To prevent the summarizer from restarting in rapid succession and making lots of network calls that would be throttled, we delay 10 seconds before closing the summarizer.
If the summarizer restarted too frequently, it would slow down the client, which would be problematic and a degradation of the user experience.
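Putting the description together, a minimal sketch of the intended behavior; `closeFn` stands in for the runtime's close callback, and `delay` is assumed from `@fluidframework/common-utils`:

```ts
import { delay } from "@fluidframework/common-utils";

const defaultCloseSummarizerDelayMs = 10000; // 10 seconds

// Wait, then close; the normal container lifecycle restarts the summarizer.
// The wait caps restarts at roughly one per 10 seconds, so a pathological
// state cannot hammer the service with calls that would be throttled.
async function closeSummarizerAfterDelay(
	closeFn: () => void,
	delayMs: number = defaultCloseSummarizerDelayMs,
): Promise<void> {
	await delay(delayMs);
	closeFn();
}
```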