-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DocDB] Lock inversion in master/snapshot codepath #14677
Labels
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/medium
Medium priority issue
Comments
amitanandaiyer
added
area/docdb
YugabyteDB core features
status/awaiting-triage
Issue awaiting triage
labels
Oct 26, 2022
yugabyte-ci
added
kind/bug
This issue is a bug
priority/medium
Medium priority issue
labels
Oct 26, 2022
May also be affecting
|
msun07
added a commit
that referenced
this issue
Jan 17, 2023
Summary: Case 1: M0 == CatalogManager::state_lock_ M1 == ReplicaState::update_lock_ (of sys catalog's tablet peer) Thread A: CatalogManager::CollectEntriesForSnapshot() CatalogManager::CollectEntries() CatalogManager::CheckIsLeaderAndReady() std::lock_guard<simple_spinlock> l(state_lock_); <---- M0 acquired RaftConsensus::ConsensusState() auto lock = state_->LockForRead(); <---- try to acquire M1 Thread B: PeerMessageQueue::ResponseFromPeer() PeerMessageQueue::NotifyObserversOfMajorityReplOpChange() PeerMessageQueue::NotifyObserversOfMajorityReplOpChangeTask() RaftConsensus::UpdateMajorityReplicated() Status s = state_->LockForMajorityReplicatedIndexUpdate(&lock); <---- M1 acquired ReplicaState::UpdateMajorityReplicatedUnlocked() ReplicaState::AdvanceCommittedOpIdUnlocked() ReplicaState::ApplyPendingOperationsUnlocked() ReplicaState::NotifyReplicationFinishedUnlocked() ConsensusRound::NotifyReplicationFinished() OperationDriver::ReplicationFinished() OperationDriver::ApplyTask() Operation::Replicated() SnapshotOperation::DoReplicated() SnapshotOperation::Apply() MasterSnapshotCoordinator::Impl::RestoreSysCatalogReplicated() <---- TabletSnapshotOpRequestPB::RESTORE_SYS_CATALOG CatalogManager::PrepareRestore() std::lock_guard<simple_spinlock> l(state_lock_); <---- try to acquire M0 Case 2: M0 == CatalogManager::state_lock_ M1 == ReplicaState::update_lock_ (of sys catalog's tablet peer) M2 == MasterSnapshotCoordinator::Impl::mutex_ Thread A: CatalogManager::CollectEntriesForSnapshot() CatalogManager::CollectEntries() CatalogManager::CheckIsLeaderAndReady() std::lock_guard<simple_spinlock> l(state_lock_); <---- M0 acquired RaftConsensus::ConsensusState() auto lock = state_->LockForRead(); <---- try to acquire M1 Thread B: PeerMessageQueue::ResponseFromPeer() PeerMessageQueue::NotifyObserversOfMajorityReplOpChange() PeerMessageQueue::NotifyObserversOfMajorityReplOpChangeTask() RaftConsensus::UpdateMajorityReplicated() Status s = state_->LockForMajorityReplicatedIndexUpdate(&lock); <---- M1 acquired ReplicaState::UpdateMajorityReplicatedUnlocked() ReplicaState::AdvanceCommittedOpIdUnlocked() ReplicaState::ApplyPendingOperationsUnlocked() ReplicaState::NotifyReplicationFinishedUnlocked() ConsensusRound::NotifyReplicationFinished() OperationDriver::ReplicationFinished() OperationDriver::ApplyTask() OperationDriver::Replicated() SnapshotOperation::DoReplicated() SnapshotOperation::Apply() MasterSnapshotCoordinator::Impl::CreateReplicated() <---- TabletSnapshotOpRequestPB::CREATE_ON_MASTER std::lock_guard<std::mutex> lock(mutex_); <---- try to acquire M2 Thread C: CatalogManager::LoadSysCatalogDataTask() MasterSnapshotCoordinator::Impl::SysCatalogLoaded() std::lock_guard<decltype(mutex_)> lock(mutex_); <---- M2 acquired MasterSnapshotCoordinator::Impl::VerifyRestoration() CatalogManager::VerifyRestoredObjects() CatalogManager::CollectEntriesForSnapshot() CatalogManager::CollectEntries() CatalogManager::CheckIsLeaderAndReady() std::lock_guard<simple_spinlock> l(state_lock_); <---- try to acquire M0 Above two deadlock cases are observed. The CheckIsLeaderAndReady function holds the state_lock_ in the entire function scope and then tries to acquire update_lock_ to read the ConsensusStatePB, which is problematic. Holding state_lock_ is not needed for getting the ConsensusStatePB, so this diff narrows the lock scope for state_lock_. Test Plan: N/A Reviewers: amitanand, skedia, zdrudi Reviewed By: zdrudi Subscribers: mbautin, bogdan, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D21432
amitanandaiyer
added a commit
that referenced
this issue
Feb 3, 2023
Summary: # Enable deadlock detection in TSAN_OPTIONS # Some minor fixes for issues found # Issues related to cassandra-cpp-driver will be fixed with https://phabricator.dev.yugabyte.com/D20480. Until it lands and third-party is updated, we suppress errors from libcassandra # For MasterSnapshot related issues. opened #14677. Until it lands we suppress MasterSnapshotCoordinator related errors Test Plan: jenkins Reviewers: mbautin, bogdan, rthallam Reviewed By: rthallam Subscribers: asrivastava, rsami Differential Revision: https://phabricator.dev.yugabyte.com/D20426
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/medium
Medium priority issue
Jira Link: DB-4037
Description
Extracted from SnapshotScheduleTest_RemoveNewTablets.log
in
https://detective-gcp.dev.yugabyte.com/D20426
The text was updated successfully, but these errors were encountered: