In #7247, I tried to reduce the number of cases where two summarizer clients run at the same time by ensuring that a newly starting client always waits a bit (under normal circumstances) before actually starting to summarize, giving the outgoing client some time to produce its last summary.
This does work, but it is probabilistic: in stress tests I still see summary nacks caused by two summarizer clients running. That creates extra noise that humans need to look into, and it also consumes client and server resources for nothing.
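For context, here is a minimal sketch of that delay-based mitigation; the helper signature and delay value are illustrative assumptions, not the actual SummaryManager code:

```typescript
// Illustrative sketch of the #7247 mitigation: a newly elected summarizer
// waits before starting, hoping the outgoing client finishes its last
// summary first. The signature and delay value below are assumptions.
const OUTGOING_SUMMARIZER_GRACE_MS = 5000; // hypothetical grace period

async function delayBeforeCreatingSummarizer(
    outgoingSummarizerPresent: boolean,
): Promise<void> {
    if (outgoingSummarizerPresent) {
        await new Promise<void>((resolve) =>
            setTimeout(resolve, OUTGOING_SUMMARIZER_GRACE_MS));
    }
    // Probabilistic by nature: if the outgoing client is still working after
    // the grace period, two summarizers run concurrently and one gets nacked.
}
```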
A much better flow would be something like this:
1. Today, summarizer election only considers interactive clients (not summarizer containers). I propose we change that and always give preference to a running summarizer client, electing it as the summarizer. That means some parent (interactive) client is elected first, but the election then moves to its child (the summarizing container). Please see the SummaryManager.delayBeforeCreatingSummarizer() logic and the comments in there for details. (A sketch of this preference follows the list.)
2. The parent recognizes that and does not cancel the summarizer client; it keeps going as it does today.
3. When the parent disconnects or closes, it notifies (as today) the summarizing client that it needs to move to exit. But no new summarizer starts, which allows the last summary to complete without any concerns.
4. (no change required) Once the last summary is over, the summarizing client moves to exit and a new summarizer spawns.
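A minimal sketch of the proposed election preference; the IClientSummary shape and the electSummarizer helper are hypothetical, not the actual election API:

```typescript
// Hypothetical, simplified quorum-member shape for illustration only.
interface IClientSummary {
    clientId: string;
    isInteractive: boolean; // parent (interactive) container
    isSummarizer: boolean;  // child summarizing container
}

// Proposed rule: a running summarizer client always wins the election;
// otherwise fall back to today's behavior of electing an interactive
// client (here, the first one in join order).
function electSummarizer(clients: IClientSummary[]): IClientSummary | undefined {
    return clients.find((c) => c.isSummarizer) ??
        clients.find((c) => c.isInteractive);
}
```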
If we go with this route, we can change the commented-out checks in RunWhileConnectedCoordinator.cancelled to be more assertive about the this.onBehalfOfClientId checks, as sketched below.
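One possible shape for a more assertive check; apart from cancelled and onBehalfOfClientId, the member names here are assumptions about the coordinator's internals:

```typescript
// Hypothetical coordinator internals; only `cancelled` and
// `onBehalfOfClientId` mirror names from this issue.
class RunWhileConnectedCoordinator {
    constructor(
        private readonly isConnected: () => boolean,
        // clientId of the parent client this summarizer runs on behalf of.
        private readonly onBehalfOfClientId: string,
        // Whether a given clientId is still present in the quorum.
        private readonly quorumHasClient: (clientId: string) => boolean,
    ) {}

    // With the election change above, this check could be assertive instead
    // of commented out: stop summarizing as soon as the parent client we run
    // on behalf of has left the quorum.
    public get cancelled(): boolean {
        return !this.isConnected() ||
            !this.quorumHasClient(this.onBehalfOfClientId);
    }
}
```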
Closes issue #6896: Summarizer keep going when summarizer container is closed
Key changes:
1. Introduce an ICancellationToken interface to represent a cancellation token. It is similar to AbortSignal, but easier to work with because promises compose more easily. (See the sketch after this list.)
2. Consolidate all cancellation logic in the coordinator object (RunWhileConnectedCoordinator), which implements ICancellationToken. Before this change, various layers checked for being connected or for other conditions, and in many cases we did not check at all because the proper object was not available at a given layer.
   - While this does not currently change the fact that a summarizing client needs to be connected to the ordering service to proceed with summaries, that might change in the future (with single-commit summaries), so having one source of truth for the continue-or-not decision makes the design more flexible for future changes.
3. Propagate ICancellationToken through all summarization layers and start checking for cancellation (including using Promise.race wherever we wait on promises) to ensure the client moves to exit immediately on cancellation. This creates the proper infrastructure to cancel a summary (using the right trigger) and, in the future, to avoid the condition where two clients submit summaries and step on each other's toes. This change is especially important for closing gaps like:
   - RunningSummarizer.trySummarize() making 3 summary attempts with delays and not checking for cancellation.
   - Waiting for the summary op and the summary ack/nack. The latter results in summary timeouts from a client that was closed 10 minutes ago, creating a lot of noise in scalability tests (and, I assume, in production as well).
   - Better detection of a disposed / disconnected summarizer client before it even starts the work.
4. Stop checking onBehalfOfClient, as it does not really work.
   - The current design updates onBehalfOfClient whenever the parent container's clientId changes. That makes the check not very useful, as it creates a condition for the last summary to be cancelled right away, and this indeed happens very often, as we observe with documents that have 10K unsummarized ops and a really hard time producing such a summary (with the main client continuously getting nacked and disconnected).
   - The ideal solution is tracked in Issue #7279.
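A minimal sketch of items 1 and 3 together, assuming a deferred-style token; the actual interface and helper names in the PR may differ:

```typescript
// Sketch of a promise-friendly cancellation token (item 1); illustrative.
interface ICancellationToken {
    readonly cancelled: boolean;
    // Resolves when cancellation is requested; composes with Promise.race.
    readonly waitCancelled: Promise<void>;
}

class CancellationTokenSource implements ICancellationToken {
    private _cancelled = false;
    private resolveCancelled!: () => void;
    public readonly waitCancelled = new Promise<void>(
        (resolve) => (this.resolveCancelled = resolve),
    );

    public get cancelled(): boolean {
        return this._cancelled;
    }

    public cancel(): void {
        this._cancelled = true;
        this.resolveCancelled();
    }
}

// Item 3: wherever we wait on a promise (summary op, ack/nack, retry
// delays), race it against cancellation so the client exits immediately.
async function raceAgainstCancellation<T>(
    work: Promise<T>,
    token: ICancellationToken,
): Promise<T | "cancelled"> {
    return Promise.race([
        work,
        token.waitCancelled.then(() => "cancelled" as const),
    ]);
}
```

Racing every long wait against waitCancelled is what turns a ten-minute summary-ack timeout from an already-closed client into an immediate exit.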