Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider changing summarization election to avoid 2 summarizer stepping on each other (last Summary) #7279

Closed
Tracked by #8030
vladsud opened this issue Aug 27, 2021 · 1 comment · Fixed by #8951
Closed
Tracked by #8030
Assignees
Labels
area: runtime: summarizer bug Something isn't working
Milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Aug 27, 2021

In #7247, I've tried to reduce number of cases where we have two summarizer clients running at the same time by ensuring that newly starting client always waits a bit (under normal circumstances) before actually starting to summarize, giving some amount of time for the outgoing client to produce last summary.

This does work, but it's probabilistic, and I do see in stress test that we hit summary nacks due to two summarizer clients running. This creates extra noise that humans need to look into, but also consumes resources from client and server for nothing.

Much better flow should be something like this:

  1. Today, summarizer election only considers interactive (not summarizer container) clients. I propose we change that and always given preference to running summarizer client and elect it as summarizer. That means that first some parent (interactive) client is elected, but then election moves to its child (summarizing container).
    • Please see SummaryManager.delayBeforeCreatingSummarizer() logic for details and comments in there.
  2. Parent recognizes that and does not cancel summarizer client, it keeps going as today.
  3. When parent disconnects or closes, it would (as today) notify summarizing client that it needs to move to exit. But no new summarizer starts and that allows last Summary to work without any concerns.
  4. (no change required) once last summary is over, summarizing client will move to exit and new summarizer would spawn.

If we go with this route, we can change commented out checks in RunWhileConnectedCoordinator.cancelled to be more assertive about this.onBehalfOfClientId checks.

@vladsud vladsud added the bug Something isn't working label Aug 27, 2021
@vladsud vladsud added this to the October 2021 milestone Aug 27, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Aug 27, 2021

@agarwal-navin, @anthony-murphy - FYI

vladsud added a commit that referenced this issue Aug 30, 2021
Closes issue #6896: Summarizer keep going when summarizer container is closed

Key changes:

1. Introduce ICancellationToken interface to represent cancellation token. It's similar to AbortSignal, but easier to work with as promises compose easier.
2. Consolidate all cancellation logic in coordinator object (RunWhileConnectedCoordinator) that implements ICancellationToken. Before this change various layers checked for being connected or other conditions, and in many cases - we did not check it at all as proper object was not available at given layer.
   - While it currently does not change a fact that summarizing client needs to be connected to ordering service to proceed with summaries, this might change in the future (with single-commit summaries), so having one source of truth for decision making on continue or not makes design more flexible for future changes.
3. Propagate ICancellationToken through all summarization layers and start checking cancellation (including using Promise.race wherever we wait for promises) to ensure client moves to exit immediately on cancellation. This creates proper infrastructure to cancel summary (using right trigger) and avoid in the future condition of two clients submitting summaries and stepping on each other toes. This change is especially important in closing gaps like
   - RunningSummarizer.trySummarize() doing 3 summary attempts with delays and not checking for cancellation
   - Waiting for summary op, summary ack/nack. Later results in summary timeouts from a client that was closed 10 minutes ago, creating a lot of noise in scalability tests (and I assume in production as well).
   - Better detection of disposed / disconnected summarizer client before it even started the work.
4. Stop checking for onBehalfOfClient as it does not really work.
   - Current design is that onBehalfOfClient is updated whenever parent container clientId changes. This makes this check not very useful, as it creates a condition for lastSummary to be cancelled right away, and it indeed happens very often as we observe with documents that have 10K of unsummarized ops and have really hard time to do such summary (in presence of main client continuously getting nacked and disconnected).
   - Ideal solution is tracked in Issue #7279
@vladsud vladsud modified the milestones: October 2021, November 2021 Sep 28, 2021
@pleath pleath modified the milestones: November 2021, December 2021 Oct 26, 2021
@curtisman curtisman mentioned this issue Dec 1, 2021
11 tasks
@pleath pleath modified the milestones: December 2021, January 2022 Dec 1, 2021
@vladsud vladsud modified the milestones: January 2022, February 2022 Dec 13, 2021
@pleath pleath modified the milestones: March 2022, April 2022 Apr 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime: summarizer bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants