Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update summarization retry logic to be based on failure params and error #16885

Merged
merged 2 commits into from
Aug 17, 2023

Conversation

agarwal-navin
Copy link
Contributor

@agarwal-navin agarwal-navin commented Aug 16, 2023

Reviewer Guidance

This change was merged before via #16694 but it introduced a regression and was backed out. This change fixed the bug and updated tests to ensure that the bug does not happen again.
The regressions was that after attempting summarization in RunningSummarize::trySummarize, it always called this.stopSummarizerCallback instead of only doing it for failures.

The CI pipeline passes with this - https://dev.azure.com/fluidframework/internal/_build/results?buildId=183045&view=results

Current retry logic

Currently, summarization retries are done statically:

  • Two attempts are done with different params.
  • In the second attempt, refreshFromLatest option is true meaning that the latest snapshot is downloaded, summary state updated from it before attempting summarization .
  • In addition, if there is a server error with retryAfterSeconds param set, that particular attempt is tried again once.
  • So, in total summarization can be tried max 4 times - two attempts with a retry possible in each attempt.

Note: The second attempt will always fail because of the recent changes where if refreshFromLatest option is set, container runtime closes the summarizer instead.

New retry logic

This PR adds new retry logic which is based on the failure params received from a summarization attempt. The retry logic is in RunningSummarizer. Here is how it works:

  • Summarization attempts will only be retried if the failure params has retryAfterSeconds set.
  • If summarization fails before it is submitted, summarization will be retried 4 times (5 total attempts).
    • The total attempts can be overridden via a feature flag. The idea is to look at telemetry and tweak it until we can determine a stable value.
  • If summarization fails after it is submitted, summarization will be retried 1 time (2 total attempts). This only happens today when summary is nacked by server with retryAfterSeconds set.

The idea behind this approach is that some kind of failures are intermittent and can go away after retries. The failure site is the best place to know which failures can be retried and how many times before giving up. For example, when summarizer node validation fails because GC did not run on a given node, this failure is transient and a retry will most likely fix it and so, it sets retryAfterSeconds. Other failues such as registry not found for a package won't be fixed on retry so these properties are not set on the error.

AB#4708
AB#5199

@agarwal-navin agarwal-navin requested review from msfluid-bot and a team as code owners August 16, 2023 20:21
@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc public api change Changes to a public API base: main PRs targeted against main branch labels Aug 16, 2023
}

this.stopSummarizerCallback("failToSummarize");
return this.mc.config.getBoolean("Fluid.Summarizer.TryDynamicRetries")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change in logic from #16694 is here - this.stopSummarizerCallback is now only called in case of failures.

@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Aug 16, 2023

@fluid-example/bundle-size-tests: +2.32 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 436.39 KB 437.55 KB +1.16 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 236.6 KB 237.76 KB +1.16 KB
loader.js 146.35 KB 146.35 KB No change
map.js 45.67 KB 45.67 KB No change
matrix.js 138.04 KB 138.04 KB No change
odspDriver.js 88.63 KB 88.63 KB No change
odspPrefetchSnapshot.js 41.48 KB 41.48 KB No change
sharedString.js 154.02 KB 154.02 KB No change
sharedTree2.js 229.53 KB 229.53 KB No change
Total Size 1.63 MB 1.64 MB +2.32 KB

Baseline commit: 0cf2b6d

Generated by 🚫 dangerJS against fd7ea92

@agarwal-navin agarwal-navin merged commit 50e09ab into microsoft:main Aug 17, 2023
jatgarg pushed a commit to jatgarg/FluidFramework-1 that referenced this pull request Aug 18, 2023
…ror (microsoft#16885)

## Current retry logic
Currently, summarization retries are done statically:
- Two attempts are done with different params.
- In the second attempt, refreshFromLatest option is true meaning that
the latest snapshot is downloaded, summary state updated from it before
attempting summarization .
- In addition, if there is a server error with retryAfterSeconds param
set, that particular attempt is tried again once.
- So, in total summarization can be tried max 4 times - two attempts
with a retry possible in each attempt.

Note: The second attempt will always fail because of the recent changes
where if refreshFromLatest option is set, container runtime closes the
summarizer instead.

## New retry logic
This PR adds new retry logic which is based on the failure params
received from a summarization attempt. The retry logic is in
RunningSummarizer. Here is how it works:

- Summarization attempts will only be retried if the failure params has
retryAfterSeconds set.
- If summarization fails before it is submitted, summarization will be
retried 4 times (5 total attempts).
- The total attempts can be overridden via a feature flag. The idea is
to look at telemetry and tweak it until we can determine a stable value.
- If summarization fails after it is submitted, summarization will be
retried 1 time (2 total attempts). This only happens today when summary
is nacked by server with retryAfterSeconds set.

The idea behind this approach is that some kind of failures are
intermittent and can go away after retries. The failure site is the best
place to know which failures can be retried and how many times before
giving up. For example, when summarizer node validation fails because GC
did not run on a given node, this failure is transient and a retry will
most likely fix it and so, it sets retryAfterSeconds. Other failues such
as registry not found for a package won't be fixed on retry so these
properties are not set on the error.


[AB#4708](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/4708)

[AB#5199](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5199)
@agarwal-navin agarwal-navin deleted the summaryRetries branch August 21, 2023 18:42
agarwal-navin added a commit that referenced this pull request May 14, 2024
…0851)

#16885 added a new retry
mechanism in case of summarization failures. This was rolled out with
A/B testing and the experiment was successful. This PR makes the new
retry mechanism default and removes the old one.

**Notes**
- Added retry for these two additional scenarios - When a summary op or
summary ack is not received within the timeout, retry. This is because
these failures are mostly transient and a retry may fix them.
- Removed `refreshLatestAck` summary option because its not used
anymore. A lot of the code changes are because of this.


[AB#7605](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/7605)
kekachmar pushed a commit to kekachmar/FluidFramework that referenced this pull request May 21, 2024
…crosoft#20851)

microsoft#16885 added a new retry
mechanism in case of summarization failures. This was rolled out with
A/B testing and the experiment was successful. This PR makes the new
retry mechanism default and removes the old one.

**Notes**
- Added retry for these two additional scenarios - When a summary op or
summary ack is not received within the timeout, retry. This is because
these failures are mostly transient and a retry may fix them.
- Removed `refreshLatestAck` summary option because its not used
anymore. A lot of the code changes are because of this.


[AB#7605](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/7605)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants