
Add UT coverage for lastSummary successfully happening when main client disconnects due to nack (10K ops) #7272

Closed
Tracked by #8030
vladsud opened this issue Aug 27, 2021 · 9 comments · Fixed by #7762
Labels: area: runtime: summarizer, area: tests (Tests to add, test infrastructure improvements, etc.)


vladsud commented Aug 27, 2021

Tracking work mentioned in #7247 (comment)

Navin: Is it possible to re-create this scenario in tests? Basically, making sure that docs can get out of a broken state when they go through the cycle of reconnects.

Vlad: I'll look into it.

This is RE this comment:

this.delayBeforeCreatingSummarizer().then(async (isDelayed: boolean) => {
// Re-validate that it needs to be running. Due to asynchrony, it may no longer be the case,
// but only if creation was delayed. If it was not, then we want to ensure we always create
// a summarizer to kick off lastSummary. Without that, we would not be able to summarize and get
// the document out of a broken state if it has too many ops and the ordering service keeps nacking the main
// container (and thus it goes into a cycle of reconnects)

@vladsud vladsud added the bug Something isn't working label Aug 27, 2021
@vladsud vladsud added this to the September 2021 milestone Aug 27, 2021

vladsud commented Aug 30, 2021

Note: that's a special case of #7292

@vladsud vladsud removed their assignment Aug 31, 2021

pleath commented Oct 2, 2021

@vladsud, @agarwal-navin, do we know that this works correctly today? In my attempts to test a case where the SummaryManager disconnects while we're in the delay before creating the summarizer, we don't generate the last summary before exiting. The last summary gets generated correctly if we're stopping a running summarizer but not if we're still in the Starting state.

agarwal-navin commented:

It may not work as expected, hence the ask to have tests for this scenario :-). Can you share the test to give an idea of the exact scenario?


vladsud commented Oct 5, 2021

@pleath, you are correct here. (All from my memory, without looking at the code. :)) Basically, if we have fewer than 4K unsummarized ops when we start the process, then we will wait some time, recheck whether the given client is still a summarizer after the wait, and exit if it's not, even if we have over 4K ops by that time. I'm not sure that's an interesting case.

The more important case is when we start with over 4K ops. Then there is no wait, and we skip the check of whether the current client is a summarizing client. Thus we always create a summarizer, and it will do the last summary immediately (as it is instructed to exit right away). That's the main thing we should test (that it stays that way).
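The delay decision described above can be sketched as a tiny predicate. This is illustrative only: the function and constant names are hypothetical stand-ins, not the real Fluid Framework API; the 4K threshold is taken from the discussion.

```typescript
// Hypothetical sketch of the delay decision: below the threshold of
// unsummarized ops we can afford the startup delay; at or above it, we skip
// the delay so a summarizer is always created and can run lastSummary
// immediately, even if the main client is stuck in a reconnect cycle.
const maxOpsSinceLastSummary = 4000; // threshold mentioned in this thread

function shouldDelayBeforeCreatingSummarizer(unsummarizedOps: number): boolean {
    return unsummarizedOps < maxOpsSinceLastSummary;
}
```

The key invariant to test is the second branch: with too many pending ops, summarizer creation must never be delayed or skipped.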


pleath commented Oct 5, 2021

Agreed about the relative importance of the two cases. So you want to cover the specific case where we create the summarizer without a delay, but the client still manages to disconnect while we're creating it? That is, this early return in SummaryManager::startSummarization:

        const summarizer = await this.requestSummarizerFn();

        // Re-validate that it needs to be running. Due to asynchrony, it may no longer be the case
        const shouldSummarizeState = this.getShouldSummarizeState();
        if (shouldSummarizeState.shouldSummarize === false) {
            summarizer.stop(shouldSummarizeState.stopReason);
            return;
        }


vladsud commented Oct 5, 2021

I think the main things to validate (in the presence of too many ops) are:

  • we do not exit early due to this check
  • we do not exit anywhere else on similar checks, and actually produce the last summary that gets the client out of its misery.

I.e., the moment this client connects successfully to the socket, if we queue a disconnect (as a microtask, or setTimeout(0)), then this single client in a document still successfully produces a summary.
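The scenario above can be simulated with a minimal mock. Everything here is a hypothetical stand-in (the MockSummarizer shape, the simulate function, the 4K threshold carried over from the discussion), not the real SummaryManager or Summarizer classes; it only demonstrates the assertion the test would make: a disconnect queued via setTimeout(0) must still leave one last summary behind when too many ops are pending.

```typescript
// Hypothetical mock: a summarizer that may run one last summary when stopped.
interface MockSummarizer {
    summaryCount: number;
    stop(runLastSummary: boolean): void;
}

function createMockSummarizer(): MockSummarizer {
    return {
        summaryCount: 0,
        stop(runLastSummary: boolean) {
            // When stopping with too many pending ops, produce the last summary.
            if (runLastSummary) {
                this.summaryCount++;
            }
        },
    };
}

// Simulate: client connects, then a disconnect is queued with setTimeout(0),
// mirroring the comment above. Returns how many summaries were produced.
async function simulateConnectThenImmediateDisconnect(pendingOps: number): Promise<number> {
    const tooManyOps = pendingOps >= 4000; // threshold from this thread
    const summarizer = createMockSummarizer();
    await new Promise<void>((resolve) => {
        setTimeout(() => {
            summarizer.stop(tooManyOps); // lastSummary must run despite disconnect
            resolve();
        }, 0);
    });
    return summarizer.summaryCount;
}
```

A real UT would replace the mock with the actual SummaryManager wiring, but the assertion stays the same: exactly one summary in the >4K case, none below the threshold.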


pleath commented Oct 6, 2021

Thanks. It seems like the relevant code is in SummaryManager (the delay logic on summarizer creation) and RunningSummarizer (running the last summary on summarizer exit). So I think this could be accomplished by using a real RunningSummarizer in the SummaryManager UT, which doesn't use a real Summarizer.


pleath commented Oct 7, 2021

I'm finding that, in the >4K ops case, I can hit that early exit, and so fail to generate a last summary, if the client disconnects before requestSummarizerFn() is able to complete. The SummaryManager never gets out of the Starting state, and we never check whether a last summary should be done. That seems to suggest we have a bug. It seems like we need the ability to transition, after disconnect, from the Starting state to either Running (if we need a last summary) or Off.
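The transition being proposed can be sketched as a small state function. The state names mirror the ones used in this thread, but the code is a hypothetical illustration of the desired behavior, not the actual SummaryManager state machine; the 4K threshold is again the value from the discussion.

```typescript
// Hypothetical sketch of the proposed disconnect handling: from Starting (or
// Running), a disconnect should move us to Running just long enough to produce
// the last summary when enough ops are pending, otherwise straight to Off.
type SummaryManagerState = "Off" | "Starting" | "Running";

function nextStateOnDisconnect(state: SummaryManagerState, pendingOps: number): SummaryManagerState {
    const needsLastSummary = pendingOps >= 4000; // threshold from this thread
    if (state === "Off") {
        return "Off"; // nothing to stop
    }
    // Starting or Running: keep (or start) the summarizer only for lastSummary.
    return needsLastSummary ? "Running" : "Off";
}
```

The bug described above corresponds to the Starting case never taking the Running branch, so the last summary is silently skipped.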


vladsud commented Oct 7, 2021

Looking at the code, there are two checks in place, and the second check will indeed bail out. I think for the "normal" case that's expected and assumed - we want to reduce the chances of two summarizer instances running at the same time (we do not know for sure, but if the current client is not a summarizer, then very likely another one has already been elected).

But for the "too many unsummarized ops" case, we have to ensure that the client produces a last summary, as not doing so has dire consequences.

At least that's the current state.
We can do better - I've opened #7279 to consider a change in design where summarizer election not only operates on "main" (non-summarizer / parent) clients, but also takes summarizer clients into account in some form. If done properly, that may allow the summarizer client to stay "chosen" even if its parent is disconnected, and thus potentially remove all these checks / move them to the point where it connects and can re-evaluate whether it's still chosen or not.

This is my memory dump, please critique / propose improvements :)

@pleath pleath modified the milestones: October 2021, November 2021 Oct 18, 2021
@curtisman curtisman mentioned this issue Oct 28, 2021
11 tasks
@pleath pleath modified the milestones: November 2021, December 2021 Nov 8, 2021
@pleath pleath modified the milestones: December 2021, January 2022 Dec 1, 2021
@pleath pleath modified the milestones: January 2022, February 2022 Jan 19, 2022
@pleath pleath modified the milestones: February 2022, March 2022 Feb 7, 2022
@pleath pleath modified the milestones: March 2022, April 2022 Mar 10, 2022
@curtisman curtisman added area: runtime: summarizer area: tests Tests to add, test infrastructure improvements, etc and removed bug Something isn't working labels Apr 4, 2022
@pleath pleath modified the milestones: April 2022, Future Apr 4, 2022