Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use atomic to check if leader bank changed #4596

Merged
merged 5 commits into from
Jan 24, 2025

Conversation

jstarry
Copy link

@jstarry jstarry commented Jan 23, 2025

Problem

Banking stage consume workers don't check if a leader bank got interrupted while consuming work. This can cause them to get stuck consuming work for a bank that was abandoned rather than switching to the newest leader bank.

Summary of Changes

Add a lightweight atomic variable which holds the current leader bank id and check it before each consumed piece of work to see if the bank was interrupted.

Fixes #

Copy link

@alessandrod alessandrod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just ordering nits!

@@ -81,7 +81,10 @@ impl<Tx: TransactionWithMeta> ConsumeWorker<Tx> {
.fetch_add(get_bank_us, Ordering::Relaxed);

for work in try_drain_iter(work, &self.consume_receiver) {
if bank.is_complete() {
if bank.is_complete() || {
// check if the bank got interrupted before completion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own understanding: when can a bank get interrupted before completion?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During replay stage we could have discovered a better fork to build our leader blocks on top of. When that happens we immediately abandon the current block we are building and never complete it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see thank you!

@@ -43,6 +61,8 @@ impl LeaderBankNotifier {
let mut state = self.state.lock().unwrap();
assert_eq!(state.status, Status::StandBy);

self.current_bank_id
.store(bank.bank_id(), Ordering::Release);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this can be ::Relaxed? Afaict there is no dependent state we
write/read based on this so we just need atomicity

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure that current_bank_id and state changes are always visible together so that a consume worker doesn't see that current_bank_id changed but it's unable to read the latest state change yet. I just moved the release store to after the mutex release to make sure that this condition would hold. Let me know if that makes more sense. I also changed the load to relaxed to make it faster and just act as an atomic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I just went with your suggestion and changed it all to relaxed. It's good enough to have an indication that the bank has been updated and we don't need to be perfect about memory access ordering here. It's better to have this check be fast in order to not slow down transaction processing.

@@ -61,12 +81,23 @@ impl LeaderBankNotifier {
assert_eq!(state.status, Status::InProgress);
assert_eq!(state.slot, Some(slot));

self.current_bank_id
.store(STAND_BY_SENTINEL_ID, Ordering::Release);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too?

state.status = Status::StandBy;
drop(state);

self.condvar.notify_all();
}

pub fn get_current_bank_id(&self) -> Option<u64> {
let current_bank_id = self.current_bank_id.load(Ordering::Acquire);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

@jstarry jstarry added v2.0 Backport to v2.0 branch v2.1 Backport to v2.1 branch labels Jan 24, 2025
@jstarry jstarry merged commit 4cc49ac into anza-xyz:master Jan 24, 2025
49 checks passed
@jstarry jstarry deleted the fix/consume-bank branch January 24, 2025 09:40
Copy link

mergify bot commented Jan 24, 2025

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

Copy link

mergify bot commented Jan 24, 2025

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

mergify bot pushed a commit that referenced this pull request Jan 24, 2025
* fix: use atomic to check if leader bank changed

* fix default value and tests

* ordering update

* eedback

* add comment

(cherry picked from commit 4cc49ac)
mergify bot pushed a commit that referenced this pull request Jan 24, 2025
* fix: use atomic to check if leader bank changed

* fix default value and tests

* ordering update

* eedback

* add comment

(cherry picked from commit 4cc49ac)
willhickey pushed a commit that referenced this pull request Jan 25, 2025
) (#4612)

fix: use atomic to check if leader bank changed (#4596)

* fix: use atomic to check if leader bank changed

* fix default value and tests

* ordering update

* eedback

* add comment

(cherry picked from commit 4cc49ac)

Co-authored-by: Justin Starry <[email protected]>
willhickey pushed a commit that referenced this pull request Jan 25, 2025
) (#4611)

fix: use atomic to check if leader bank changed (#4596)

* fix: use atomic to check if leader bank changed

* fix default value and tests

* ordering update

* eedback

* add comment

(cherry picked from commit 4cc49ac)

Co-authored-by: Justin Starry <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v2.0 Backport to v2.0 branch v2.1 Backport to v2.1 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants