
Add UT coverage for lastSummary successfully happening when main client disconnects due to nack (10K ops) #7272

Closed
Tracked by #8030
vladsud opened this issue Aug 27, 2021 · 9 comments · Fixed by #7762
Labels: area: runtime: summarizer, area: tests (Tests to add, test infrastructure improvements, etc.)


vladsud commented Aug 27, 2021

Tracking work mentioned in #7247 (comment)

Navin: Is it possible to re-create this scenario in tests? Basically, making sure that docs can get out of a broken state when they go through the cycle of reconnects.

Vlad: I'll look into it.

This is RE this comment:

this.delayBeforeCreatingSummarizer().then(async (isDelayed: boolean) => {
// Re-validate that it needs to be running. Due to asynchrony, it may no longer be the case,
// but only if creation was delayed. If it was not, then we want to ensure we always create
// a summarizer to kick off lastSummary. Without that, we would not be able to summarize and get
// the document out of a broken state if it has too many ops and the ordering service keeps nacking the main
// container (and thus it goes into a cycle of reconnects)

@vladsud vladsud added the bug Something isn't working label Aug 27, 2021
@vladsud vladsud added this to the September 2021 milestone Aug 27, 2021

vladsud commented Aug 30, 2021

Note: that's a special case of #7292

@vladsud vladsud removed their assignment Aug 31, 2021

pleath commented Oct 2, 2021

@vladsud, @agarwal-navin, do we know that this works correctly today? In my attempts to test a case where the SummaryManager disconnects while we're in the delay before creating the summarizer, we don't generate the last summary before exiting. The last summary gets generated correctly if we're stopping a running summarizer but not if we're still in the Starting state.

agarwal-navin commented:

It may not work as expected, hence the ask to have tests for this scenario :-). Can you share the test to give an idea of the exact scenario?


vladsud commented Oct 5, 2021

@pleath, you are correct here. (All from my memory, without looking at the code. :)) Basically, if we have fewer than 4K unsummarized ops when we start the process, then we will wait some time, recheck whether the given client is still a summarizer after the wait, and exit if it's not, even if we have over 4K ops by that time. I'm not sure that's an interesting case.

The more important case is when we start with over 4K ops. Then there is no wait, and we skip the check of whether the current client is a summarizing client. Thus we always create a summarizer, and it will do the last summary immediately (as it is instructed to exit right away). That's the main thing we should test (that it stays that way).
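The delay decision described above can be sketched as a tiny predicate. This is illustrative only: the function and constant names are hypothetical stand-ins, not the real Fluid Framework API; the 4K threshold is taken from the discussion.

```typescript
// Hypothetical sketch of the delay decision: below the threshold of
// unsummarized ops we can afford the startup delay; at or above it, we skip
// the delay so a summarizer is always created and can run lastSummary
// immediately, even if the main client is stuck in a reconnect cycle.
const maxOpsSinceLastSummary = 4000; // threshold mentioned in this thread

function shouldDelayBeforeCreatingSummarizer(unsummarizedOps: number): boolean {
    return unsummarizedOps < maxOpsSinceLastSummary;
}
```

The key invariant to test is the second branch: with too many pending ops, summarizer creation must never be delayed or skipped.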


pleath commented Oct 5, 2021

Agreed about the relative importance of the two cases. So you want to cover the specific case where we create the summarizer without a delay, but the client still manages to disconnect while we're creating it? That is, this early return in SummaryManager::startSummarization:

        const summarizer = await this.requestSummarizerFn();

        // Re-validate that it needs to be running. Due to asynchrony, it may no longer be the case
        const shouldSummarizeState = this.getShouldSummarizeState();
        if (shouldSummarizeState.shouldSummarize === false) {
            summarizer.stop(shouldSummarizeState.stopReason);
            return;
        }


vladsud commented Oct 5, 2021

I think the main things to validate (in the presence of too many ops) are:

  • we do not exit early due to this check
  • we do not exit anywhere else on similar checks, and actually produce the last summary that gets the client out of its misery.

I.e., the moment this client connects successfully to the socket, if we queue a disconnect (as a microtask, or setTimeout(0)), then this single client in a document still successfully produces a summary.
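The scenario above can be simulated with a minimal mock. Everything here is a hypothetical stand-in (the MockSummarizer shape, the simulate function, the 4K threshold carried over from the discussion), not the real SummaryManager or Summarizer classes; it only demonstrates the assertion the test would make: a disconnect queued via setTimeout(0) must still leave one last summary behind when too many ops are pending.

```typescript
// Hypothetical mock: a summarizer that may run one last summary when stopped.
interface MockSummarizer {
    summaryCount: number;
    stop(runLastSummary: boolean): void;
}

function createMockSummarizer(): MockSummarizer {
    return {
        summaryCount: 0,
        stop(runLastSummary: boolean) {
            // When stopping with too many pending ops, produce the last summary.
            if (runLastSummary) {
                this.summaryCount++;
            }
        },
    };
}

// Simulate: client connects, then a disconnect is queued with setTimeout(0),
// mirroring the comment above. Returns how many summaries were produced.
async function simulateConnectThenImmediateDisconnect(pendingOps: number): Promise<number> {
    const tooManyOps = pendingOps >= 4000; // threshold from this thread
    const summarizer = createMockSummarizer();
    await new Promise<void>((resolve) => {
        setTimeout(() => {
            summarizer.stop(tooManyOps); // lastSummary must run despite disconnect
            resolve();
        }, 0);
    });
    return summarizer.summaryCount;
}
```

A real UT would replace the mock with the actual SummaryManager wiring, but the assertion stays the same: exactly one summary in the >4K case, none below the threshold.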


pleath commented Oct 6, 2021

Thanks. It seems like the relevant code is in SummaryManager (the delay logic on summarizer creation) and RunningSummarizer (running the last summary on summarizer exit). So I think this could be accomplished by using a real RunningSummarizer in the SummaryManager UT, which doesn't use a real Summarizer.


pleath commented Oct 7, 2021

I'm finding that, in the >4K ops case, I can hit that early exit, and so fail to generate a last summary, if the client disconnects before requestSummarizerFn() is able to complete. The SummaryManager never gets out of the Starting state, and we never check whether a last summary should be done. That seems to suggest we have a bug. It seems like we need the ability to transition, after disconnect, from the Starting state to either Running (if we need a last summary) or Off.
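The transition being proposed can be sketched as a small state function. The state names mirror the ones used in this thread, but the code is a hypothetical illustration of the desired behavior, not the actual SummaryManager state machine; the 4K threshold is again the value from the discussion.

```typescript
// Hypothetical sketch of the proposed disconnect handling: from Starting (or
// Running), a disconnect should move us to Running just long enough to produce
// the last summary when enough ops are pending, otherwise straight to Off.
type SummaryManagerState = "Off" | "Starting" | "Running";

function nextStateOnDisconnect(state: SummaryManagerState, pendingOps: number): SummaryManagerState {
    const needsLastSummary = pendingOps >= 4000; // threshold from this thread
    if (state === "Off") {
        return "Off"; // nothing to stop
    }
    // Starting or Running: keep (or start) the summarizer only for lastSummary.
    return needsLastSummary ? "Running" : "Off";
}
```

The bug described above corresponds to the Starting case never taking the Running branch, so the last summary is silently skipped.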


vladsud commented Oct 7, 2021

Looking at the code, there are two checks in place, and the second check will indeed bail out. I think for the "normal" case that's expected and assumed - we want to reduce the chances of two summarizer instances running at the same time (we do not know for sure, but if the current client is not a summarizer, then very likely another one has already been elected).

But for the "too many unsummarized ops" case, we have to ensure that the client produces a last summary, as not doing so has dire consequences.

At least that's the current state.
We can do better - I've opened #7279 to consider a change in design where summarizer election not only operates on "main" (non-summarizer / parent) clients, but also takes summarizer clients into account in some form. If done properly, that may allow the summarizer client to stay "chosen" even if its parent is disconnected, and thus potentially remove all these checks / move them to the point where it connects and can re-evaluate whether it's still chosen or not.

This is my memory dump, please critique / propose improvements :)

@pleath pleath modified the milestones: October 2021, November 2021 Oct 18, 2021
@curtisman curtisman mentioned this issue Oct 28, 2021
11 tasks
@pleath pleath modified the milestones: November 2021, December 2021 Nov 8, 2021
@pleath pleath modified the milestones: December 2021, January 2022 Dec 1, 2021
@pleath pleath modified the milestones: January 2022, February 2022 Jan 19, 2022
@pleath pleath modified the milestones: February 2022, March 2022 Feb 7, 2022
@pleath pleath modified the milestones: March 2022, April 2022 Mar 10, 2022
@curtisman curtisman added area: runtime: summarizer area: tests Tests to add, test infrastructure improvements, etc and removed bug Something isn't working labels Apr 4, 2022
@pleath pleath modified the milestones: April 2022, Future Apr 4, 2022