Update summarization retry logic to be based on failure params and error (#16885)

## Current retry logic

Currently, summarization retries are done statically:

- Two attempts are made with different params.
- In the second attempt, the refreshFromLatest option is true, meaning that the latest snapshot is downloaded and the summary state is updated from it before attempting summarization.
- In addition, if there is a server error with the retryAfterSeconds param set, that particular attempt is tried again once.
- So, in total, summarization can be tried at most 4 times: two attempts, with one retry possible in each attempt.

Note: The second attempt will always fail because of the recent changes where, if the refreshFromLatest option is set, the container runtime closes the summarizer instead.

## New retry logic

This PR adds new retry logic that is based on the failure params received from a summarization attempt. The retry logic is in RunningSummarizer. Here is how it works:

- Summarization attempts will only be retried if the failure params have retryAfterSeconds set.
- If summarization fails before it is submitted, summarization will be retried 4 times (5 total attempts).
- The total attempts can be overridden via a feature flag. The idea is to look at telemetry and tweak it until we can determine a stable value.
- If summarization fails after it is submitted, summarization will be retried 1 time (2 total attempts). This only happens today when the summary is nacked by the server with retryAfterSeconds set.

The idea behind this approach is that some kinds of failures are intermittent and can go away after retries. The failure site is the best place to know which failures can be retried and how many times before giving up. For example, when summarizer node validation fails because GC did not run on a given node, the failure is transient and a retry will most likely fix it, so the failure site sets retryAfterSeconds. Other failures, such as a registry not being found for a package, won't be fixed by a retry, so these properties are not set on the error.

[AB#4708](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/4708) [AB#5199](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5199)
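
The retry rules described above can be sketched as a small decision function. This is a minimal illustration only, not the code in RunningSummarizer: the names `FailureStage`, `SummarizeFailure`, `RetryLimits`, `nextRetryDelaySeconds`, and `attemptsUntilGiveUp` are hypothetical, and the sketch assumes that attempts are counted across the whole run and compared against the limit for the stage where the latest failure occurred. The feature-flag override is modeled as an optional `limits` argument.

```typescript
// Hypothetical types for illustration; these are not the Fluid Framework API.
type FailureStage = "beforeSubmit" | "afterSubmit";

interface SummarizeFailure {
    // Where the attempt failed relative to submitting the summary.
    stage: FailureStage;
    // Set by the failure site only when it considers the failure transient.
    retryAfterSeconds?: number;
}

interface RetryLimits {
    maxAttemptsBeforeSubmit: number;
    maxAttemptsAfterSubmit: number;
}

// Defaults taken from the description: 5 total attempts for failures before
// submit, 2 total attempts for failures after submit.
const defaultLimits: RetryLimits = {
    maxAttemptsBeforeSubmit: 5,
    maxAttemptsAfterSubmit: 2,
};

// Returns how long to wait before the next attempt, or undefined to give up.
function nextRetryDelaySeconds(
    failure: SummarizeFailure,
    attemptsSoFar: number,
    limits: RetryLimits = defaultLimits,
): number | undefined {
    // Failures without retryAfterSeconds are never retried.
    if (failure.retryAfterSeconds === undefined) {
        return undefined;
    }
    const maxAttempts =
        failure.stage === "beforeSubmit"
            ? limits.maxAttemptsBeforeSubmit
            : limits.maxAttemptsAfterSubmit;
    return attemptsSoFar < maxAttempts ? failure.retryAfterSeconds : undefined;
}

// Simulates the same failure repeating on every attempt and returns the
// total number of attempts made before giving up.
function attemptsUntilGiveUp(
    failure: SummarizeFailure,
    limits: RetryLimits = defaultLimits,
): number {
    let attempts = 1;
    while (nextRetryDelaySeconds(failure, attempts, limits) !== undefined) {
        attempts++;
    }
    return attempts;
}
```

Under these assumptions, a failure before submit that keeps setting retryAfterSeconds is attempted 5 times in total, a nack after submit is attempted 2 times, and a failure without retryAfterSeconds is attempted once. Passing a different `limits` object stands in for the feature-flag override.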
1 parent 69527bc, commit 50e09ab
Showing 9 changed files with 840 additions and 208 deletions.