-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: test_replication_all failure #4155
Conversation
Fixes #4150. The failure can be reproduced with high probability on ARM via `pytest dragonfly/replication_test.py -k test_replication_all[df_factory0-mode0-8-t_replicas3-seeder_config3-2000-False]` Not sure why this barrier is needed but #4146 removes the barrier which breaks a gentle balance in the code in unexpected way. Signed-off-by: Roman Gershman <[email protected]>
// TODO: This is a temporary fix for #4146. | ||
// For some reason without this the replication breaks in regtests. | ||
auto cb = [](EngineShard* shard) { return OpStatus::OK; }; | ||
shard_set->RunBriefInParallel(std::move(cb)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand this. #4146 you allow select in multi
if it's noop
(that is if the selected index is the same as the active index). However int https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-0bbb915a716c77068c12141f79435ca7315f74c50657a040894f178837d6f58cR87 you only change where you set the conn_context_.conn_state.db_index
. Can you plz explain (not why this barrier as you say fixes it but where was the barrier in the first place and why is it neeced) ? What am I missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff you mention is unrelated. What's important is that before GenericFamily::Select always did a hop and my pr changed that:
https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-f8f9387edb04c03ce7333ad3b69778d7b379fcf8c8765543f62b34d883e187c5R1611-R1614
So for some reason, the replication logic relied on this implementation detail for its "correctness". And by correctness I mean - we have not observed failures there before. I can not think of a reason why we need the barrier, maybe @adiholden knows, she fixed quite a few bugs there . What I know is that once I keep the barrier (i could also keep it inside GenericFamily::Select
) the bug disappears. I verified this experimentally. I asked @dranikpg to deep dive into it but this PR provides a patch to preserve the current behaviour for now. I justify this PR by stating that it preserves the previous logic that had such a barrier before. It does not mean the code is good, hence the TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's what you meant. Gotcha
Regression tests pass: |
Fixes #4150. The failure can be reproduced with high probability on ARM via
pytest dragonfly/replication_test.py -k test_replication_all[df_factory0-mode0-8-t_replicas3-seeder_config3-2000-False]
Not sure why this barrier is needed but #4146 removes the barrier which breaks a gentle balance in the code in unexpected way.