-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update summarization retry logic to be based on failure params and error #16885
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
github-actions
bot
added
area: runtime
Runtime related issues
area: tests
Tests to add, test infrastructure improvements, etc
public api change
Changes to a public API
base: main
PRs targeted against main branch
labels
Aug 16, 2023
agarwal-navin
commented
Aug 16, 2023
agarwal-navin
commented
Aug 16, 2023
} | ||
|
||
this.stopSummarizerCallback("failToSummarize"); | ||
return this.mc.config.getBoolean("Fluid.Summarizer.TryDynamicRetries") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change in logic from #16694 is here - this.stopSummarizerCallback
is now only called in case of failures.
⯅ @fluid-example/bundle-size-tests: +2.32 KB
Baseline commit: 0cf2b6d |
agarwal-navin
force-pushed
the
summaryRetries
branch
from
August 16, 2023 22:12
58f7063
to
fd7ea92
Compare
NicholasCouri
approved these changes
Aug 17, 2023
jatgarg
pushed a commit
to jatgarg/FluidFramework-1
that referenced
this pull request
Aug 18, 2023
…ror (microsoft#16885) ## Current retry logic Currently, summarization retries are done statically: - Two attempts are done with different params. - In the second attempt, refreshFromLatest option is true meaning that the latest snapshot is downloaded, summary state updated from it before attempting summarization . - In addition, if there is a server error with retryAfterSeconds param set, that particular attempt is tried again once. - So, in total summarization can be tried max 4 times - two attempts with a retry possible in each attempt. Note: The second attempt will always fail because of the recent changes where if refreshFromLatest option is set, container runtime closes the summarizer instead. ## New retry logic This PR adds new retry logic which is based on the failure params received from a summarization attempt. The retry logic is in RunningSummarizer. Here is how it works: - Summarization attempts will only be retried if the failure params has retryAfterSeconds set. - If summarization fails before it is submitted, summarization will be retried 4 times (5 total attempts). - The total attempts can be overridden via a feature flag. The idea is to look at telemetry and tweak it until we can determine a stable value. - If summarization fails after it is submitted, summarization will be retried 1 time (2 total attempts). This only happens today when summary is nacked by server with retryAfterSeconds set. The idea behind this approach is that some kind of failures are intermittent and can go away after retries. The failure site is the best place to know which failures can be retried and how many times before giving up. For example, when summarizer node validation fails because GC did not run on a given node, this failure is transient and a retry will most likely fix it and so, it sets retryAfterSeconds. Other failues such as registry not found for a package won't be fixed on retry so these properties are not set on the error. [AB#4708](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/4708) [AB#5199](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5199)
agarwal-navin
added a commit
that referenced
this pull request
May 14, 2024
…0851) #16885 added a new retry mechanism in case of summarization failures. This was rolled out with A/B testing and the experiment was successful. This PR makes the new retry mechanism default and removes the old one. **Notes** - Added retry for these two additional scenarios - When a summary op or summary ack is not received within the timeout, retry. This is because these failures are mostly transient and a retry may fix them. - Removed `refreshLatestAck` summary option because its not used anymore. A lot of the code changes are because of this. [AB#7605](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/7605)
kekachmar
pushed a commit
to kekachmar/FluidFramework
that referenced
this pull request
May 21, 2024
…crosoft#20851) microsoft#16885 added a new retry mechanism in case of summarization failures. This was rolled out with A/B testing and the experiment was successful. This PR makes the new retry mechanism default and removes the old one. **Notes** - Added retry for these two additional scenarios - When a summary op or summary ack is not received within the timeout, retry. This is because these failures are mostly transient and a retry may fix them. - Removed `refreshLatestAck` summary option because its not used anymore. A lot of the code changes are because of this. [AB#7605](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/7605)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area: runtime
Runtime related issues
area: tests
Tests to add, test infrastructure improvements, etc
base: main
PRs targeted against main branch
public api change
Changes to a public API
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Reviewer Guidance
This change was merged before via #16694 but it introduced a regression and was backed out. This change fixed the bug and updated tests to ensure that the bug does not happen again.
The regressions was that after attempting summarization in
RunningSummarize::trySummarize
, it always calledthis.stopSummarizerCallback
instead of only doing it for failures.The CI pipeline passes with this - https://dev.azure.com/fluidframework/internal/_build/results?buildId=183045&view=results
Current retry logic
Currently, summarization retries are done statically:
Note: The second attempt will always fail because of the recent changes where if refreshFromLatest option is set, container runtime closes the summarizer instead.
New retry logic
This PR adds new retry logic which is based on the failure params received from a summarization attempt. The retry logic is in RunningSummarizer. Here is how it works:
The idea behind this approach is that some kind of failures are intermittent and can go away after retries. The failure site is the best place to know which failures can be retried and how many times before giving up. For example, when summarizer node validation fails because GC did not run on a given node, this failure is transient and a retry will most likely fix it and so, it sets retryAfterSeconds. Other failues such as registry not found for a package won't be fixed on retry so these properties are not set on the error.
AB#4708
AB#5199