-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(replica): support multi shard commands new flow #645
Conversation
Signed-off-by: adi_holden <[email protected]>
src/server/replica.cc
Outdated
// TODO: Vlad, shouldnt we call here Replica::JoinAllFlows()? | ||
// can it be that sync_fb_ of execution_fb_ are valid? Stop is called from server family | ||
// not for replica flows | ||
if (sync_fb_.joinable()) { | ||
sync_fb_.join(); | ||
} | ||
if (execution_fb_.joinable()) { | ||
execution_fb_.join(); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is as follows (and its kind of bad the replica and sub-replica are on class - but thats what we have for now).
- We cancel the context
- It unblocks
- The current task (i.e. full/stable sync) must do cleanup when it exists and returns an error - i.e. join the flows
- We wait for sync_fb_ to join on the main replica class - so when the loop exited, the task must have done cleanup
We can join all flows - but that shouldn't be needed here - it happens either way. Besides, we can only join if the flows are not in a transitioning state, so we have to lock by the transition mutex... so it makes it all more complicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that in the main replica sync_fb_ is set to MainReplicationFb. I thought that it is only used in the sub-replica instances and this is why I thought that there was a bug here that you wanted to join the sub flows here. Never mind.
I will remove the execution_fb_.join() from here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can leave a comment there, that sync_fb_ is the MainReplicationFb and is responsible for stopping everything
src/server/replica.cc
Outdated
// TODO: Vlad I want to make sure we dont need cleanup here of shared map and flow queues | ||
// If I understand the reason we dont have cleanup is because after we exit here replica class | ||
// is distroyed? | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of cleanup? The current cleanup for a flow done by the stable sync function is just joining it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the replica is holding state than I would think that after each step we will have a cleanup for the step, f.e in here clear the shared transaction map and queues in each flow. The reason for it is if the step failed to and you want to do recover I would want to have a clean data in replica class.
Unless we are destroying the class after the step fails, as currently done in full sync if we start it again you recreate the replica flow classes.
So what is the flow if the replica connection fails while we are in stable state? the error handler is invoked right? and after that if we are able to connect again today we will start the full sync step again right? this will recreate the replica flow classes. But what if we want to continue from where we failed if possible ?
Maybe I should not think about this now and we will get to it once we need to do the partial sync
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would suggest thinking about this when we do partial sync. We can just trivially drop all the data we have inside the transaction map, etc...
What's the problem you currently have? Maybe I can give you some suggestions or assumptions. Making it pass the pytests should be a good goal. If it survives the disconnect ones - then it should survive almost all real world use cases |
I see now, Replica::Stop doesn't make it stop |
} | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You forgot to unblock all the wakers, thats why its stuck (stuck on joining execution fibers). I've added it and it seems to work
Signed-off-by: adi_holden <[email protected]>
@dranikpg
I have some problem with the cancellation in this which I will continue investigate.