-
Notifications
You must be signed in to change notification settings - Fork 990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(transaction): Add special barrier for blocking tx #2512
Conversation
} | ||
|
||
bool Transaction::SingleClaimBarrier::TryClaim() { | ||
return !claimed_.exchange(true, memory_order_relaxed); // false means first means success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why relaxed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand why relaxed is enough here. Do not see any obvious reasons why not and it makes me nervous. Usually, it should not be like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It was relaxed to begin with 😆
- Currently code touches only shard local state or expire/cancel are placed on the coordinator thread
- If we acquire, what is our corresponding release pair? We had an acuiqre/release for the hop that prepared suspension, since then we did no writes to our local state, so there is noting to sync
- I can make it acquire because though I assume it might have real impact on the conclusion writes
Will think about it
src/server/transaction.cc
Outdated
@@ -1470,7 +1484,10 @@ void Transaction::FinishLogJournalOnShard(EngineShard* shard, uint32_t shard_cnt | |||
} | |||
|
|||
void Transaction::CancelBlocking(std::function<OpStatus(ArgSlice)> status_cb) { | |||
if ((coordinator_state_ & COORD_BLOCKED) == 0) | |||
// We're on the owning thread of this transaction, so accessing local data is safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean by local data
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to phrase it more clearly. My thought is that we're on the owning thread and we don't suspend, so we can for example read keys as we do below. Of course just accessing transaction state wouldn't be allowed for anyone else
} | ||
|
||
bool Transaction::SingleClaimBarrier::TryClaim() { | ||
return !claimed_.exchange(true, memory_order_relaxed); // false means first means success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It was relaxed to begin with 😆
- Currently code touches only shard local state or expire/cancel are placed on the coordinator thread
- If we acquire, what is our corresponding release pair? We had an acuiqre/release for the hop that prepared suspension, since then we did no writes to our local state, so there is noting to sync
- I can make it acquire because though I assume it might have real impact on the conclusion writes
Will think about it
src/server/transaction.cc
Outdated
} | ||
|
||
cv_status Transaction::SingleClaimBarrier::Wait(time_point tp) { | ||
auto cb = [this] { return released_.load(memory_order_acquire); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I have an acquire here because currently our event-count doesn't provide us with one on the fast path
src/server/transaction.cc
Outdated
@@ -1470,7 +1484,10 @@ void Transaction::FinishLogJournalOnShard(EngineShard* shard, uint32_t shard_cnt | |||
} | |||
|
|||
void Transaction::CancelBlocking(std::function<OpStatus(ArgSlice)> status_cb) { | |||
if ((coordinator_state_ & COORD_BLOCKED) == 0) | |||
// We're on the owning thread of this transaction, so accessing local data is safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to phrase it more clearly. My thought is that we're on the owning thread and we don't suspend, so we can for example read keys as we do below. Of course just accessing transaction state wouldn't be allowed for anyone else
Signed-off-by: Vladislav Oleshko <[email protected]>
2283d8d
to
396ca3d
Compare
I've ran sidekiq loadtests on graviton, seems to pass |
Signed-off-by: Vladislav Oleshko <[email protected]>
577cc6c
to
f8a8e76
Compare
src/server/transaction.h
Outdated
// "Single claim - single modification" barrier. Multiple threads might try to claim it, only one | ||
// will succeed and will be allowed to modify the guarded object until it closes the barrier. | ||
// A closed barrier can't be claimed again or re-used in any way. | ||
class SingleClaimBarrier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I only know of folly::Baton and it's somewhat different, but in some way it makes sense and is a more entertaining name 😆
std::atomic_uint32_t wakeup_requested_{0}; // incremented when blocking transaction gets notified | ||
EventCount blocking_ec_; // to wait for wakeup_requested > 0 (or cancelled) | ||
// Barrier for waking blocking transactions that ensures exclusivity of waking operation. | ||
BatonBarrierrier blocking_barrier_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something is wrong here. Also maybe "baton_barrier_" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Baton barrier doesn't indicate where its used, it like calling an atomic variable just atomic
.
blocking_ec
just became blocking_barrier
... how else can we call blocking transactions? Suspension_barrier, watch_barrier, notify_barrier?
Signed-off-by: Vladislav Oleshko <[email protected]>
62f804a
to
80b532f
Compare
Currently we assume waking a transaction is exlusive and we ensured exclusivity with
in
NotifySuspended()
However we ensured no exclusivity with cancelling or timing out. I.e. it was totally possible for a transaction to time out and start preparing the cleanup hops, while another thread would call
NotifySuspended()
. What is more, we removedCOORD_BLOCKED
only after we calledExpire
, so we could be cancelled while we are expiring. This results in a lot of unsafe data parallel accesses.Another problem here is also that once we do
we technically allow blocking_ec_ in the coordinator to pass. If it wasn't suspended and just reached the check (fast path), it will proceed without waiting for the foreign thread to actually finish modifying the values. Paired with no proper acquire/release semantics later on, this is a very unsafe inter-thread exchange
Instead I suggest to use
SingleClaimBarrier
with proper expiration
It results in very simple guarantees:
TryClaim()
, it is totally safe to for it perform modifications untilRelease()
Expire()
duringNotifySuspended()
TryClaim()
returned false we have to discard our operation, no exceptions, if it returned true, we can expect the transaction to be in a clean state - no one modified it before us since registering in the blocking controllerWith those guarantees we also don't need
EXPIRED_Q
andCOORD_BLOCKED
Linked:
NotifySuspended()
, but does it unsafely)