-
Notifications
You must be signed in to change notification settings - Fork 338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use atomic to check if leader bank changed #4596
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, just ordering nits!
@@ -81,7 +81,10 @@ impl<Tx: TransactionWithMeta> ConsumeWorker<Tx> { | |||
.fetch_add(get_bank_us, Ordering::Relaxed); | |||
|
|||
for work in try_drain_iter(work, &self.consume_receiver) { | |||
if bank.is_complete() { | |||
if bank.is_complete() || { | |||
// check if the bank got interrupted before completion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my own understanding: when can a bank get interrupted before completion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During replay stage we could have discovered a better fork to build our leader blocks on top of. When that happens we immediately abandon the current block we are building and never complete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see thank you!
poh/src/leader_bank_notifier.rs
Outdated
@@ -43,6 +61,8 @@ impl LeaderBankNotifier { | |||
let mut state = self.state.lock().unwrap(); | |||
assert_eq!(state.status, Status::StandBy); | |||
|
|||
self.current_bank_id | |||
.store(bank.bank_id(), Ordering::Release); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this can be ::Relaxed? Afaict there is no dependent state we
write/read based on this so we just need atomicity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure that current_bank_id
and state
changes are always visible together so that a consume worker doesn't see that current_bank_id
changed but it's unable to read the latest state
change yet. I just moved the release store to after the mutex release to make sure that this condition would hold. Let me know if that makes more sense. I also changed the load to relaxed to make it faster and just act as an atomic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just went with your suggestion and changed it all to relaxed. It's good enough to have an indication that the bank has been updated and we don't need to be perfect about memory access ordering here. It's better to have this check be fast in order to not slow down transaction processing.
poh/src/leader_bank_notifier.rs
Outdated
@@ -61,12 +81,23 @@ impl LeaderBankNotifier { | |||
assert_eq!(state.status, Status::InProgress); | |||
assert_eq!(state.slot, Some(slot)); | |||
|
|||
self.current_bank_id | |||
.store(STAND_BY_SENTINEL_ID, Ordering::Release); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too?
poh/src/leader_bank_notifier.rs
Outdated
state.status = Status::StandBy; | ||
drop(state); | ||
|
||
self.condvar.notify_all(); | ||
} | ||
|
||
pub fn get_current_bank_id(&self) -> Option<u64> { | ||
let current_bank_id = self.current_bank_id.load(Ordering::Acquire); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. |
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis. |
* fix: use atomic to check if leader bank changed * fix default value and tests * ordering update * eedback * add comment (cherry picked from commit 4cc49ac)
* fix: use atomic to check if leader bank changed * fix default value and tests * ordering update * eedback * add comment (cherry picked from commit 4cc49ac)
) (#4612) fix: use atomic to check if leader bank changed (#4596) * fix: use atomic to check if leader bank changed * fix default value and tests * ordering update * eedback * add comment (cherry picked from commit 4cc49ac) Co-authored-by: Justin Starry <[email protected]>
) (#4611) fix: use atomic to check if leader bank changed (#4596) * fix: use atomic to check if leader bank changed * fix default value and tests * ordering update * eedback * add comment (cherry picked from commit 4cc49ac) Co-authored-by: Justin Starry <[email protected]>
Problem
Banking stage consume workers don't check if a leader bank got interrupted while consuming work. This can cause them to get stuck consuming work for a bank that was abandoned rather than switching to the newest leader bank.
Summary of Changes
Add a lightweight atomic variable which holds the current leader bank id and check it before each consumed piece of work to see if the bank was interrupted.
Fixes #