Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

vladsud · 2021-08-27T20:38:38Z

In #7247, I've tried to reduce number of cases where we have two summarizer clients running at the same time by ensuring that newly starting client always waits a bit (under normal circumstances) before actually starting to summarize, giving some amount of time for the outgoing client to produce last summary.

This does work, but it's probabilistic, and I do see in stress test that we hit summary nacks due to two summarizer clients running. This creates extra noise that humans need to look into, but also consumes resources from client and server for nothing.

Much better flow should be something like this:

Today, summarizer election only considers interactive (not summarizer container) clients. I propose we change that and always given preference to running summarizer client and elect it as summarizer. That means that first some parent (interactive) client is elected, but then election moves to its child (summarizing container).
- Please see SummaryManager.delayBeforeCreatingSummarizer() logic for details and comments in there.
Parent recognizes that and does not cancel summarizer client, it keeps going as today.
When parent disconnects or closes, it would (as today) notify summarizing client that it needs to move to exit. But no new summarizer starts and that allows last Summary to work without any concerns.
(no change required) once last summary is over, summarizing client will move to exit and new summarizer would spawn.

If we go with this route, we can change commented out checks in RunWhileConnectedCoordinator.cancelled to be more assertive about this.onBehalfOfClientId checks.

vladsud · 2021-08-27T20:38:57Z

@agarwal-navin, @anthony-murphy - FYI

Closes issue #6896: Summarizer keep going when summarizer container is closed Key changes: 1. Introduce ICancellationToken interface to represent cancellation token. It's similar to AbortSignal, but easier to work with as promises compose easier. 2. Consolidate all cancellation logic in coordinator object (RunWhileConnectedCoordinator) that implements ICancellationToken. Before this change various layers checked for being connected or other conditions, and in many cases - we did not check it at all as proper object was not available at given layer. - While it currently does not change a fact that summarizing client needs to be connected to ordering service to proceed with summaries, this might change in the future (with single-commit summaries), so having one source of truth for decision making on continue or not makes design more flexible for future changes. 3. Propagate ICancellationToken through all summarization layers and start checking cancellation (including using Promise.race wherever we wait for promises) to ensure client moves to exit immediately on cancellation. This creates proper infrastructure to cancel summary (using right trigger) and avoid in the future condition of two clients submitting summaries and stepping on each other toes. This change is especially important in closing gaps like - RunningSummarizer.trySummarize() doing 3 summary attempts with delays and not checking for cancellation - Waiting for summary op, summary ack/nack. Later results in summary timeouts from a client that was closed 10 minutes ago, creating a lot of noise in scalability tests (and I assume in production as well). - Better detection of disposed / disconnected summarizer client before it even started the work. 4. Stop checking for onBehalfOfClient as it does not really work. - Current design is that onBehalfOfClient is updated whenever parent container clientId changes. This makes this check not very useful, as it creates a condition for lastSummary to be cancelled right away, and it indeed happens very often as we observe with documents that have 10K of unsummarized ops and have really hard time to do such summary (in presence of main client continuously getting nacked and disconnected). - Ideal solution is tracked in Issue #7279

vladsud added the bug Something isn't working label Aug 27, 2021

vladsud added this to the October 2021 milestone Aug 27, 2021

vladsud assigned pleath Aug 27, 2021

vladsud added the area: runtime: summarizer label Sep 28, 2021

vladsud modified the milestones: October 2021, November 2021 Sep 28, 2021

vladsud mentioned this issue Oct 7, 2021

Add UT coverage for lastSummary successfully happening when main client disconnects due to nack (10K ops) #7272

Closed

pleath modified the milestones: November 2021, December 2021 Oct 26, 2021

curtisman mentioned this issue Dec 1, 2021

Summarizer Improvements #8030

Closed

11 tasks

pleath modified the milestones: December 2021, January 2022 Dec 1, 2021

vladsud modified the milestones: January 2022, February 2022 Dec 13, 2021

pleath linked a pull request Feb 1, 2022 that will close this issue

Let OrderedClientCollection track the summarizer client separately #8951

Merged

anthony-murphy modified the milestones: February 2022, March 2022 Mar 1, 2022

pleath modified the milestones: March 2022, April 2022 Apr 4, 2022

pleath closed this as completed in #8951 Apr 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

vladsud commented Aug 27, 2021

vladsud commented Aug 27, 2021

Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

Comments

vladsud commented Aug 27, 2021

vladsud commented Aug 27, 2021