Add UT coverage for lastSummary successfully happening when main client disconnects due to nack (10K ops) #7272
Note: that's a special case of #7292
@vladsud, @agarwal-navin, do we know that this works correctly today? In my attempts to test a case where the SummaryManager disconnects while we're in the delay before creating the summarizer, we don't generate the last summary before exiting. The last summary gets generated correctly if we're stopping a running summarizer, but not if we're still in the Starting state.
It may not work as expected, hence the ask to have tests for this scenario :-). Can you share the test to give an idea of the exact scenario?
@pleath, you are correct here. (All from my memory, without looking at the code :)) Basically, if we have fewer than 4K unsummarized ops when we start the process, then we will wait some time, re-check whether the given client is still a summarizer after the wait, and exit if it's not, even if by that time we have over 4K ops. I'm not sure that's an interesting case. The more important case is when we start with over 4K ops. Then there is no wait: we skip the check of whether the current client is a summarizing client. Thus we always create a summarizer, and it will do the last summary immediately (as it is instructed to exit right away). That's the main thing we should test (that it stays that way).
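A minimal TypeScript sketch of the two paths described above; the names (`opsSinceLastSummary`, `isElectedSummarizer`, `createSummarizer`) and the threshold/delay values are illustrative assumptions, not the actual SummaryManager API:

```typescript
// Illustrative sketch of the two paths described above, not the actual
// SummaryManager code. Names and thresholds are assumptions.
const maxOpsSinceLastSummary = 4000;
const startDelayMs = 5000; // hypothetical wait before creating a summarizer

const delay = async (ms: number) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

interface SummarizationContext {
    opsSinceLastSummary: number;
    isElectedSummarizer(): boolean;
    // The created summarizer runs the last summary immediately when told to exit.
    createSummarizer(): Promise<void>;
}

async function startSummarizationSketch(ctx: SummarizationContext): Promise<void> {
    if (ctx.opsSinceLastSummary <= maxOpsSinceLastSummary) {
        // Normal case: wait, then re-validate the election. Bailing out here
        // is fine (another elected client will summarize), even if ops
        // crossed the threshold during the wait.
        await delay(startDelayMs);
        if (!ctx.isElectedSummarizer()) {
            return;
        }
    }
    // Too many unsummarized ops: skip the election re-check and always create
    // the summarizer, so it can produce the last summary right away.
    await ctx.createSummarizer();
}
```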
Agreed about the relative importance of the two cases. So you want to cover the specific case where we create the summarizer without a delay, but the client still manages to disconnect while we're creating the summarizer? That is, this early return in SummaryManager::startSummarization:
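(The code snippet quoted here did not survive in this copy; below is a hedged reconstruction of the kind of early return being discussed, with assumed names, not the verbatim source.)

```typescript
// Hypothetical reconstruction of the early return under discussion; the
// actual snippet was not preserved here, and the name is an assumption.
if (!this.shouldSummarize()) {
    // This client lost the summarizer election (e.g. it disconnected) while
    // still starting up; exit without creating a summarizer, which means no
    // last summary runs.
    return;
}
```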
I think the main thing to validate (in the presence of too many ops) is this: the moment this client connects successfully to the socket, if we queue a disconnect (as a microtask, or setTimeout(0)), then this single client in the document still successfully produces a summary.
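A hedged sketch of such a test; `createTestContainer()` and `waitForSummary()` are hypothetical helpers standing in for whatever harness the real UT would use, not actual @fluidframework APIs (mocha's `it` is assumed, since that is the repo's test framework):

```typescript
import { strict as assert } from "assert";

// Hypothetical harness: these declarations are placeholders, not real APIs.
declare function createTestContainer(options: { unsummarizedOps: number }): Promise<{
    once(event: "connected", listener: () => void): void;
    disconnect(): void;
}>;
declare function waitForSummary(container: unknown): Promise<object | undefined>;

it("single client still produces a summary when disconnect is queued right after connect", async () => {
    const container = await createTestContainer({ unsummarizedOps: 10_000 });

    container.once("connected", () => {
        // Queue the disconnect so it lands immediately after the connection
        // is established, per the scenario described above.
        setTimeout(() => container.disconnect(), 0);
    });

    const summary = await waitForSummary(container);
    assert(summary !== undefined, "expected the last summary to be produced");
});
```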
Thanks. It seems like the relevant code is in SummaryManager (the delay logic on summarizer creation) and RunningSummarizer (running the last summary on summarizer exit). So I think this could be accomplished by using a real RunningSummarizer in the SummaryManager UT, which doesn't use a real Summarizer.
I'm finding that, in the >4K ops case, I can hit that early exit, and so fail to generate a last summary, if the client disconnects before requestSummarizerFn() is able to complete. The SummaryManager never gets out of the Starting state, and we never check to see whether a last summary should be done. That seems to suggest we have a bug. It seems like we need the ability to transition, after disconnect, from the Starting state to either Running (if we need a last summary) or Off.
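A sketch of the kind of transition being proposed; the state names follow the comment above, but the handler shape is an assumption, not the actual SummaryManager code:

```typescript
// Illustrative sketch of the proposed disconnect handling, not actual code.
type SummaryManagerState = "Off" | "Starting" | "Running" | "Stopping";

function onDisconnectedSketch(
    current: SummaryManagerState,
    needsLastSummary: boolean,
): SummaryManagerState {
    if (current === "Starting") {
        // Today the manager can get stuck here and never run the last-summary
        // check; instead, move to Running when a last summary is still owed,
        // or straight to Off otherwise.
        return needsLastSummary ? "Running" : "Off";
    }
    return current;
}
```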
Looking at the code, there are two checks in place, and the second check indeed will bail out. I think for the "normal" case that's expected and assumed: we want to reduce the chances of two summarizer instances running at the same time in most cases (we do not know for sure, but if the current client is not a summarizer, then very likely there is already another one elected). But for the "too many unsummarized ops" case, we have to ensure that the client produces a last summary, as not doing so has dire consequences. At least that's the current state. This is my memory dump, please critique / propose improvements :)
Tracking work mentioned in #7247 (comment)
Navin: Is it possible to re-create this scenario in tests? Basically, making sure that docs can get out of a broken state when they go through the cycle of reconnects.
Vlad: I'll look into it.
This is RE this comment:

```typescript
this.delayBeforeCreatingSummarizer().then(async (isDelayed: boolean) => {
    // Re-validate that it need to be running. Due to asynchrony, it may be not the case anymore
    // but only if creation was delayed. If it was not, then we want to ensure we always create
    // a summarizer to kick off lastSummary. Without that, we would not be able to summarize and get
    // document out of broken state if it has too many ops and ordering service keeps nacking main
    // container (and thus it goes into cycle of reconnects)
```