Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: test_replication_all failure #4155

Merged
merged 1 commit into from
Nov 20, 2024
Merged

fix: test_replication_all failure #4155

merged 1 commit into from
Nov 20, 2024

Conversation

romange
Copy link
Collaborator

@romange romange commented Nov 19, 2024

Fixes #4150. The failure can be reproduced with high probability on ARM via pytest dragonfly/replication_test.py -k test_replication_all[df_factory0-mode0-8-t_replicas3-seeder_config3-2000-False]

Not sure why this barrier is needed but #4146 removes the barrier which breaks a gentle balance in the code in unexpected way.

Fixes #4150. The failure can be reproduced with high probability on ARM via
`pytest dragonfly/replication_test.py -k test_replication_all[df_factory0-mode0-8-t_replicas3-seeder_config3-2000-False]`

Not sure why this barrier is needed but #4146 removes the barrier
which breaks a gentle balance in the code in unexpected way.

Signed-off-by: Roman Gershman <[email protected]>
// TODO: This is a temporary fix for #4146.
// For some reason without this the replication breaks in regtests.
auto cb = [](EngineShard* shard) { return OpStatus::OK; };
shard_set->RunBriefInParallel(std::move(cb));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand this. #4146 you allow select in multi if it's noop (that is if the selected index is the same as the active index). However int https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-0bbb915a716c77068c12141f79435ca7315f74c50657a040894f178837d6f58cR87 you only change where you set the conn_context_.conn_state.db_index. Can you plz explain (not why this barrier as you say fixes it but where was the barrier in the first place and why is it neeced) ? What am I missing here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff you mention is unrelated. What's important is that before GenericFamily::Select always did a hop and my pr changed that:
https://github.com/dragonflydb/dragonfly/pull/4146/files#diff-f8f9387edb04c03ce7333ad3b69778d7b379fcf8c8765543f62b34d883e187c5R1611-R1614

So for some reason, the replication logic relied on this implementation detail for its "correctness". And by correctness I mean - we have not observed failures there before. I can not think of a reason why we need the barrier, maybe @adiholden knows, she fixed quite a few bugs there . What I know is that once I keep the barrier (i could also keep it inside GenericFamily::Select ) the bug disappears. I verified this experimentally. I asked @dranikpg to deep dive into it but this PR provides a patch to preserve the current behaviour for now. I justify this PR by stating that it preserves the previous logic that had such a barrier before. It does not mean the code is good, hence the TODO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's what you meant. Gotcha

@romange romange requested a review from adiholden November 20, 2024 04:11
@romange
Copy link
Collaborator Author

romange commented Nov 20, 2024

@romange romange merged commit 36135f5 into main Nov 20, 2024
17 checks passed
@romange romange deleted the Pr1 branch November 20, 2024 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_replication_all fails
3 participants