Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(replica): support multi shard commands new flow #645

Merged
merged 2 commits into from
Jan 8, 2023

Conversation

adiholden
Copy link
Collaborator

@dranikpg
I have some problem with the cancellation in this which I will continue investigate.

@adiholden adiholden requested a review from dranikpg January 5, 2023 13:44
Comment on lines 168 to 177
// TODO: Vlad, shouldnt we call here Replica::JoinAllFlows()?
// can it be that sync_fb_ of execution_fb_ are valid? Stop is called from server family
// not for replica flows
if (sync_fb_.joinable()) {
sync_fb_.join();
}
if (execution_fb_.joinable()) {
execution_fb_.join();
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is as follows (and its kind of bad the replica and sub-replica are on class - but thats what we have for now).

  1. We cancel the context
  2. It unblocks
  3. The current task (i.e. full/stable sync) must do cleanup when it exists and returns an error - i.e. join the flows
  4. We wait for sync_fb_ to join on the main replica class - so when the loop exited, the task must have done cleanup

We can join all flows - but that shouldn't be needed here - it happens either way. Besides, we can only join if the flows are not in a transitioning state, so we have to lock by the transition mutex... so it makes it all more complicated

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that in the main replica sync_fb_ is set to MainReplicationFb. I thought that it is only used in the sub-replica instances and this is why I thought that there was a bug here that you wanted to join the sub flows here. Never mind.
I will remove the execution_fb_.join() from here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave a comment there, that sync_fb_ is the MainReplicationFb and is responsible for stopping everything

Comment on lines 609 to 612
// TODO: Vlad I want to make sure we dont need cleanup here of shared map and flow queues
// If I understand the reason we dont have cleanup is because after we exit here replica class
// is distroyed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of cleanup? The current cleanup for a flow done by the stable sync function is just joining it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the replica is holding state than I would think that after each step we will have a cleanup for the step, f.e in here clear the shared transaction map and queues in each flow. The reason for it is if the step failed to and you want to do recover I would want to have a clean data in replica class.
Unless we are destroying the class after the step fails, as currently done in full sync if we start it again you recreate the replica flow classes.
So what is the flow if the replica connection fails while we are in stable state? the error handler is invoked right? and after that if we are able to connect again today we will start the full sync step again right? this will recreate the replica flow classes. But what if we want to continue from where we failed if possible ?
Maybe I should not think about this now and we will get to it once we need to do the partial sync

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would suggest thinking about this when we do partial sync. We can just trivially drop all the data we have inside the transaction map, etc...

@dranikpg
Copy link
Contributor

dranikpg commented Jan 5, 2023

@adiholden

I have some problem with the cancellation in this which I will continue investigate.

What's the problem you currently have? Maybe I can give you some suggestions or assumptions.

Making it pass the pytests should be a good goal. If it survives the disconnect ones - then it should survive almost all real world use cases

@dranikpg
Copy link
Contributor

dranikpg commented Jan 5, 2023

I see now, Replica::Stop doesn't make it stop

Comment on lines +585 to 587
}
}
};
Copy link
Contributor

@dranikpg dranikpg Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to unblock all the wakers, thats why its stuck (stuck on joining execution fibers). I've added it and it seems to work

@adiholden adiholden requested a review from dranikpg January 8, 2023 09:17
@dranikpg dranikpg merged commit 77bd5f5 into main Jan 8, 2023
@romange romange deleted the replica_multi_shard_cmd branch April 12, 2023 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants