Skip to content

Commit

Permalink
Update summarization retry logic to be based on failure params and er…
Browse files Browse the repository at this point in the history
…ror (#16885)

## Current retry logic
Currently, summarization retries are done statically:
- Two attempts are done with different params.
- In the second attempt, refreshFromLatest option is true meaning that
the latest snapshot is downloaded, summary state updated from it before
attempting summarization .
- In addition, if there is a server error with retryAfterSeconds param
set, that particular attempt is tried again once.
- So, in total summarization can be tried max 4 times - two attempts
with a retry possible in each attempt.

Note: The second attempt will always fail because of the recent changes
where if refreshFromLatest option is set, container runtime closes the
summarizer instead.

## New retry logic
This PR adds new retry logic which is based on the failure params
received from a summarization attempt. The retry logic is in
RunningSummarizer. Here is how it works:

- Summarization attempts will only be retried if the failure params has
retryAfterSeconds set.
- If summarization fails before it is submitted, summarization will be
retried 4 times (5 total attempts).
- The total attempts can be overridden via a feature flag. The idea is
to look at telemetry and tweak it until we can determine a stable value.
- If summarization fails after it is submitted, summarization will be
retried 1 time (2 total attempts). This only happens today when summary
is nacked by server with retryAfterSeconds set.

The idea behind this approach is that some kind of failures are
intermittent and can go away after retries. The failure site is the best
place to know which failures can be retried and how many times before
giving up. For example, when summarizer node validation fails because GC
did not run on a given node, this failure is transient and a retry will
most likely fix it and so, it sets retryAfterSeconds. Other failues such
as registry not found for a package won't be fixed on retry so these
properties are not set on the error.


[AB#4708](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/4708)

[AB#5199](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/5199)
  • Loading branch information
agarwal-navin authored Aug 17, 2023
1 parent 69527bc commit 50e09ab
Show file tree
Hide file tree
Showing 9 changed files with 840 additions and 208 deletions.
11 changes: 8 additions & 3 deletions api-report/container-runtime.api.md
Original file line number Diff line number Diff line change
Expand Up @@ -432,7 +432,7 @@ export interface IGenerateSummaryTreeResult extends Omit<IBaseSummarizeResult, "
}

// @public (undocumented)
export interface INackSummaryResult {
export interface INackSummaryResult extends IRetriableFailureResult {
// (undocumented)
readonly ackNackDuration: number;
// (undocumented)
Expand All @@ -452,6 +452,12 @@ export interface IRefreshSummaryAckOptions {
readonly summaryRefSeq: number;
}

// @public
export interface IRetriableFailureResult {
// (undocumented)
readonly retryAfterSeconds?: number;
}

// @public @deprecated (undocumented)
export function isRuntimeMessage(message: ISequencedDocumentMessage): boolean;

Expand Down Expand Up @@ -664,7 +670,7 @@ export enum RuntimeMessage {
}

// @public
export interface SubmitSummaryFailureData {
export interface SubmitSummaryFailureData extends IRetriableFailureResult {
// (undocumented)
stage: SummaryStage;
}
Expand Down Expand Up @@ -706,7 +712,6 @@ export type SummarizeResultPart<TSuccess, TFailure = undefined> = {
data: TFailure | undefined;
message: string;
error: any;
retryAfterSeconds?: number;
};

// @public (undocumented)
Expand Down
1 change: 1 addition & 0 deletions packages/runtime/container-runtime/src/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ export {
ICancellableSummarizerController,
SubmitSummaryFailureData,
SummaryStage,
IRetriableFailureResult,
} from "./summary";
export { IChunkedOp, unpackRuntimeMessage } from "./opLifecycle";
export { generateStableId, isStableId, assertIsStableId } from "./id-compressor";
1 change: 1 addition & 0 deletions packages/runtime/container-runtime/src/summary/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ export {
SummarizeResultPart,
SubmitSummaryFailureData,
SummaryStage,
IRetriableFailureResult,
} from "./summarizerTypes";
export {
IAckedSummary,
Expand Down
251 changes: 189 additions & 62 deletions packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,17 @@ import {

const maxSummarizeAckWaitTime = 10 * 60 * 1000; // 10 minutes

/**
* The maximum number of summarization attempts that will be done by default in case of failures
* that can be retried.
*/
export const defaultMaxAttempts = 2;
/**
* The default value for maximum number of summarization attempts that will be done for summarization failures where
* submit fails and the failure can be retried.
*/
export const defaultMaxAttemptsForSubmitFailures = 5;

/**
* An instance of RunningSummarizer manages the heuristics for summarizing.
* Until disposed, the instance of RunningSummarizer can assume that it is
Expand Down Expand Up @@ -147,6 +158,9 @@ export class RunningSummarizer implements IDisposable {

private readonly runtimeListener;

/** The maximum number of summary attempts to do when submit summary fails. */
private readonly maxAttemptsForSubmitFailures: number;

private constructor(
baseLogger: ITelemetryBaseLogger,
private readonly summaryWatcher: IClientSummaryWatcher,
Expand Down Expand Up @@ -240,6 +254,17 @@ export class RunningSummarizer implements IDisposable {
this.handleOp(op, runtimeMessage === true);
};
this.runtime.on("op", this.runtimeListener);

// The max attempts for submit failures can be overridden via a feature flag. This allows us to
// tweak this as per telemetry data until we arrive at a stable number.
// If its set to a number higher than `defaultMaxAttemptsForSubmitFailures`, it will be ignored.
const overrideMaxAttempts = this.mc.config.getNumber(
"Fluid.Summarizer.AttemptsForSubmitFailures",
);
this.maxAttemptsForSubmitFailures =
overrideMaxAttempts && overrideMaxAttempts < defaultMaxAttemptsForSubmitFailures
? overrideMaxAttempts
: defaultMaxAttemptsForSubmitFailures;
}

private async handleSummaryAck(): Promise<number> {
Expand Down Expand Up @@ -571,68 +596,9 @@ export class RunningSummarizer implements IDisposable {
this.beforeSummaryAction();
},
async () => {
const attempts: ISummarizeOptions[] = [
{ refreshLatestAck: false, fullTree: false },
{ refreshLatestAck: true, fullTree: false },
];
let overrideDelaySeconds: number | undefined;
let summaryAttempts = 0;
let summaryAttemptsPerPhase = 0;
let summaryAttemptPhase = 0;
while (summaryAttemptPhase < attempts.length) {
if (this.cancellationToken.cancelled) {
return;
}

// We only want to attempt 1 summary when reason is "lastSummary"
if (++summaryAttempts > 1 && reason === "lastSummary") {
return;
}

summaryAttemptsPerPhase++;

const summarizeOptions = attempts[summaryAttemptPhase];
const summarizeProps: ISummarizeTelemetryProperties = {
reason,
summaryAttempts,
summaryAttemptsPerPhase,
summaryAttemptPhase: summaryAttemptPhase + 1, // make everything 1-based
...summarizeOptions,
};

// Note: no need to account for cancellationToken.waitCancelled here, as
// this is accounted SummaryGenerator.summarizeCore that controls receivedSummaryAckOrNack.
const resultSummarize = this.generator.summarize(
summarizeProps,
summarizeOptions,
cancellationToken,
);
const result = await resultSummarize.receivedSummaryAckOrNack;

if (result.success) {
return;
}

// Check for retryDelay that can come from summaryNack or upload summary flow.
// Retry the same step only once per retryAfter response.
const delaySeconds = result.retryAfterSeconds;
if (delaySeconds === undefined || summaryAttemptsPerPhase > 1) {
summaryAttemptPhase++;
summaryAttemptsPerPhase = 0;
}

if (delaySeconds !== undefined) {
this.mc.logger.sendPerformanceEvent({
eventName: "SummarizeAttemptDelay",
duration: delaySeconds,
summaryNackDelay: overrideDelaySeconds !== undefined,
...summarizeProps,
});
await delay(delaySeconds * 1000);
}
}

this.stopSummarizerCallback("failToSummarize");
return this.mc.config.getBoolean("Fluid.Summarizer.TryDynamicRetries")
? this.trySummarizeWithRetries(reason, cancellationToken)
: this.trySummarizeWithStaticAttempts(reason, cancellationToken);
},
() => {
this.afterSummaryAction();
Expand All @@ -642,6 +608,167 @@ export class RunningSummarizer implements IDisposable {
});
}

/**
* Tries to summarize 2 times with pre-defined summary options. If an attempt fails with "retryAfterSeconds"
* param, that attempt is tried once more.
*/
private async trySummarizeWithStaticAttempts(
reason: SummarizeReason,
cancellationToken: ISummaryCancellationToken,
) {
const attempts: ISummarizeOptions[] = [
{ refreshLatestAck: false, fullTree: false },
{ refreshLatestAck: true, fullTree: false },
];
let summaryAttempts = 0;
let summaryAttemptsPerPhase = 0;
let summaryAttemptPhase = 0;
while (summaryAttemptPhase < attempts.length) {
if (cancellationToken.cancelled) {
return;
}

// We only want to attempt 1 summary when reason is "lastSummary"
if (++summaryAttempts > 1 && reason === "lastSummary") {
return;
}

summaryAttemptsPerPhase++;

const summarizeOptions = attempts[summaryAttemptPhase];
const summarizeProps: ISummarizeTelemetryProperties = {
reason,
summaryAttempts,
summaryAttemptsPerPhase,
summaryAttemptPhase: summaryAttemptPhase + 1, // make everything 1-based
...summarizeOptions,
};

// Note: no need to account for cancellationToken.waitCancelled here, as
// this is accounted SummaryGenerator.summarizeCore that controls receivedSummaryAckOrNack.
const resultSummarize = this.generator.summarize(
summarizeProps,
summarizeOptions,
cancellationToken,
);
const ackNackResult = await resultSummarize.receivedSummaryAckOrNack;
if (ackNackResult.success) {
return;
}

// Check for retryDelay that can come from summaryNack, upload summary or submit summary flows.
// Retry the same step only once per retryAfter response.
const submitResult = await resultSummarize.summarySubmitted;
const delaySeconds = !submitResult.success
? submitResult.data?.retryAfterSeconds
: ackNackResult.data?.retryAfterSeconds;
if (delaySeconds === undefined || summaryAttemptsPerPhase > 1) {
summaryAttemptPhase++;
summaryAttemptsPerPhase = 0;
}

if (delaySeconds !== undefined) {
this.mc.logger.sendPerformanceEvent({
eventName: "SummarizeAttemptDelay",
duration: delaySeconds,
summaryNackDelay: ackNackResult.data?.retryAfterSeconds !== undefined,
...summarizeProps,
});
await delay(delaySeconds * 1000);
}
}
this.stopSummarizerCallback("failToSummarize");
}

/**
* Tries to summarize with retries where retry is based on the failure params.
* For example, summarization may be retried for failures with "retryAfterSeconds" param.
*/
private async trySummarizeWithRetries(
reason: SummarizeReason,
cancellationToken: ISummaryCancellationToken,
) {
// The max number of attempts are based on the stage at which summarization failed. If it fails before it is
// submitted, a different value is used compared to if it fails after submission. Usually, in the former case,
// we would retry more often as its cheaper and retries are likely to succeed.
// This makes it harder to predict how many attempts would actually happen as that depends on how far an attempt
// made. To keep things simple, the max attempts is reset after every attempt based on where it failed. This may
// result in some failures not being retried depending on what happened before this attempt. That's fine because
// such scenarios are very unlikely and even if it happens, it would resolve when a new summarizer starts over.
let maxAttempts = defaultMaxAttempts;
let currentAttempt = 0;
let success = false;
let done = false;
do {
if (this.cancellationToken.cancelled) {
success = true;
done = true;
break;
}

currentAttempt++;
const summarizeOptions: ISummarizeOptions = {
refreshLatestAck: false,
fullTree: false,
};
const summarizeProps: ISummarizeTelemetryProperties = {
reason,
summaryAttempts: currentAttempt,
...summarizeOptions,
};
const summarizeResult = this.generator.summarize(
summarizeProps,
summarizeOptions,
cancellationToken,
);

// Ack / nack is the final step, so if it succeeds we're done.
const ackNackResult = await summarizeResult.receivedSummaryAckOrNack;
if (ackNackResult.success) {
success = true;
done = true;
break;
}

const submitSummaryResult = await summarizeResult.summarySubmitted;
let retryAfterSeconds: number | undefined;

// Update max attempts and retry params from the failure result.
// If submit summary failed, use the params from "summarySubmitted" result. Else, use the params
// from "receivedSummaryAckOrNack" result.
// Note: Check "summarySubmitted" result first because if it fails, ack nack would fail as well.
if (!submitSummaryResult.success) {
maxAttempts = this.maxAttemptsForSubmitFailures;
retryAfterSeconds = submitSummaryResult.data?.retryAfterSeconds;
} else {
maxAttempts = defaultMaxAttempts;
retryAfterSeconds = ackNackResult.data?.retryAfterSeconds;
}

// If the failure doesn't have "retryAfterSeconds" or the max number of attempts have been done, we're done.
if (retryAfterSeconds === undefined || currentAttempt >= maxAttempts) {
success = false;
done = true;
break;
}

this.mc.logger.sendPerformanceEvent({
eventName: "SummarizeAttemptDelay",
duration: retryAfterSeconds,
summaryNackDelay: ackNackResult.data?.retryAfterSeconds !== undefined,
stage: submitSummaryResult.data?.stage,
dynamicRetries: true, // To differentiate this telemetry from regular retry logic
...summarizeProps,
});
await delay(retryAfterSeconds * 1000);
} while (!done);

// If summarization isn't successful, stop the summarizer.
if (!success) {
this.stopSummarizerCallback("failToSummarize");
}
}

/** {@inheritdoc (ISummarizer:interface).summarizeOnDemand} */
public summarizeOnDemand(
resultsBuilder: SummarizeResultBuilder = new SummarizeResultBuilder(),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -216,8 +216,14 @@ export type SubmitSummaryResult =

/** The stages of Summarize, used to describe how far progress succeeded in case of a failure at a later stage. */
export type SummaryStage = SubmitSummaryResult["stage"] | "unknown";

/** Type for summarization failures that are retriable. */
export interface IRetriableFailureResult {
readonly retryAfterSeconds?: number;
}

/** The data in summarizer result when submit summary stage fails. */
export interface SubmitSummaryFailureData {
export interface SubmitSummaryFailureData extends IRetriableFailureResult {
stage: SummaryStage;
}

Expand All @@ -231,7 +237,7 @@ export interface IAckSummaryResult {
readonly ackNackDuration: number;
}

export interface INackSummaryResult {
export interface INackSummaryResult extends IRetriableFailureResult {
readonly summaryNackOp: ISummaryNackMessage;
readonly ackNackDuration: number;
}
Expand All @@ -246,7 +252,6 @@ export type SummarizeResultPart<TSuccess, TFailure = undefined> =
data: TFailure | undefined;
message: string;
error: any;
retryAfterSeconds?: number;
};

export interface ISummarizeResults {
Expand Down
Loading

0 comments on commit 50e09ab

Please sign in to comment.