Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: test_replication_all failure #4155

Merged
merged 1 commit into from
Nov 20, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions src/server/journal/executor.cc
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,11 @@ void JournalExecutor::SelectDb(DbIndex dbid) {
auto cmd = BuildFromParts("SELECT", dbid);
Execute(cmd);
ensured_dbs_[dbid] = true;

// TODO: This is a temporary fix for #4146.
// For some reason without this the replication breaks in regtests.
auto cb = [](EngineShard* shard) { return OpStatus::OK; };
shard_set->RunBriefInParallel(std::move(cb));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand this. #4146 you allow select in multi if it's noop (that is if the selected index is the same as the active index). However int https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-0bbb915a716c77068c12141f79435ca7315f74c50657a040894f178837d6f58cR87 you only change where you set the conn_context_.conn_state.db_index. Can you plz explain (not why this barrier as you say fixes it but where was the barrier in the first place and why is it neeced) ? What am I missing here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff you mention is unrelated. What's important is that before GenericFamily::Select always did a hop and my pr changed that:
https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-f8f9387edb04c03ce7333ad3b69778d7b379fcf8c8765543f62b34d883e187c5R1611-R1614

So for some reason, the replication logic relied on this implementation detail for its "correctness". And by correctness I mean - we have not observed failures there before. I can not think of a reason why we need the barrier, maybe @adiholden knows, she fixed quite a few bugs there . What I know is that once I keep the barrier (i could also keep it inside GenericFamily::Select ) the bug disappears. I verified this experimentally. I asked @dranikpg to deep dive into it but this PR provides a patch to preserve the current behaviour for now. I justify this PR by stating that it preserves the previous logic that had such a barrier before. It does not mean the code is good, hence the TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's what you meant. Gotcha

} else {
conn_context_.conn_state.db_index = dbid;
}
Expand Down
Loading