Summary:
While working on diff https://phorge.dev.yugabyte.com/D32682, I found a deadlock
that caused a test to be flaky. The deadlock involves three threads.
Thread A is executing `yb::RWCLock::UpgradeToCommitLock`:
```
void RWCLock::UpgradeToCommitLock() {
  lock_.lock();
  DCHECK(write_locked_);
#ifndef NDEBUG
  bool first_wait = true;
  while (reader_count_ > 0) {
    if (!no_readers_.TimedWait(first_wait ? kFirstWait : kSecondWait)) {
      std::ostringstream ss;
      ss << "Too long commit lock wait, num readers: " << reader_count_
         << ", current thread stack: " << GetStackTrace();
```
Thread A blocks because `reader_count_ > 0` while it still holds the write lock:
```
DCHECK(write_locked_);
```
Thread A is executing the workflow of an ALTER TABLE statement, with the call chain:
```
yb::RWCLock::UpgradeToCommitLock
yb::CowObject<yb::master::PersistentTableInfo>::CommitMutation
yb::CowWriteLock<yb::master::PersistentTableInfo>::Commit
yb::master::CatalogManager::ClearYsqlDdlTxnState
yb::master::CatalogManager::HandleSuccessfulYsqlDdlTxn
yb::master::CatalogManager::YsqlDdlTxnCompleteCallbackInternal
```
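The commit at the top of that chain is the final step of the usual write-then-commit
pattern on the table's copy-on-write metadata. Roughly (a sketch of the pattern, not
an exact excerpt from the code):
```
auto l = table->LockForWrite();  // RWCLock::WriteLock sets write_locked_
......                           // mutate the dirty (copy-on-write) side
l.Commit();                      // CowWriteLock::Commit -> CommitMutation
                                 //   -> RWCLock::UpgradeToCommitLock, which
                                 //      waits until reader_count_ drops to 0
```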
From the test log, I see that the thread that caused `reader_count_ > 0` (Thread B) is
calling `GetTableSchemaInternal`:
```
Status CatalogManager::GetTableSchemaInternal(const GetTableSchemaRequestPB* req,
                                              GetTableSchemaResponsePB* resp,
                                              bool get_fully_applied_indexes) {
  VLOG(1) << "Servicing GetTableSchema request for " << req->ShortDebugString();
  // Lookup the table and verify if it exists.
  TRACE("Looking up table");
  scoped_refptr<TableInfo> table = VERIFY_RESULT(FindTable(req->table()));
  TRACE("Locking table");
  auto l = table->LockForRead();
  ......
  auto nsinfo = FindNamespaceById(table->namespace_id());
```
Thread B holds a read lock on the table (which keeps `reader_count_ > 0`) via
`table->LockForRead()`, and it is blocked inside `FindNamespaceById`, waiting for a
shared lock on the `CatalogManager` `mutex_`:
```
Result<scoped_refptr<NamespaceInfo>> CatalogManager::FindNamespaceById(
    const NamespaceId& id) const {
  SharedLock lock(mutex_);
  return FindNamespaceByIdUnlocked(id);
}
```
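`SharedLock` acquires `mutex_` in shared mode, so it blocks whenever another thread
holds `mutex_` exclusively. The interaction is the same as with a plain
`std::shared_mutex` (shown below only as an illustration; the actual YugabyteDB lock
types differ):
```
#include <mutex>
#include <shared_mutex>

std::shared_mutex m;  // stand-in for CatalogManager::mutex_

void writer() {
  std::lock_guard<std::shared_mutex> guard(m);  // exclusive, like LockGuard
  // While this is held, every shared acquisition below has to wait.
}

void reader() {
  std::shared_lock<std::shared_mutex> lock(m);  // shared, like SharedLock
  // Blocks until no thread holds m exclusively.
}
```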
`mutex_` is held exclusively by Thread C, which is executing `CatalogManager::CreateTable`:
```
  {
    LockGuard lock(mutex_);
    auto ns_lock = ns->LockForRead();
    ......
    if (IsIndex(req)) {
      TRACE("Locking indexed table");
      indexed_table_write_lock = indexed_table->LockForWrite();
    }
    ......
  }
```
Note that Thread C holds `mutex_` exclusively via `LockGuard` and is blocked on
`LockForWrite`, which calls `RWCLock::WriteLock` and cannot proceed because
`write_locked_` is set (it is held by Thread A).
```
void RWCLock::WriteLock() {
  ThreadRestrictions::AssertWaitAllowed();
  MutexLock l(lock_);
  // Wait for any other mutations to finish.
#ifndef NDEBUG
  bool first_wait = true;
  while (write_locked_) {
    if (!no_mutators_.TimedWait(first_wait ? kFirstWait : kSecondWait)) {
```
In summary, we have a deadlock cycle:
- Thread A (holds `write_locked_`) waits for Thread B to release `reader_count_`.
- Thread B (holds `reader_count_`) waits for Thread C to release `mutex_`.
- Thread C (holds `mutex_`) waits for Thread A to release `write_locked_`.
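For illustration only, here is a minimal standalone program (plain `std::mutex`, not
the actual `RWCLock`/`CatalogManager` code) that reproduces the same wait-for cycle;
it hangs by design when run:
```
#include <chrono>
#include <mutex>
#include <thread>

int main() {
  // Each mutex stands in for one resource in the cycle above:
  // m_write ~ write_locked_, m_reader ~ reader_count_, m_mutex ~ mutex_.
  std::mutex m_write, m_reader, m_mutex;

  auto hold_then_wait = [](std::mutex& held, std::mutex& wanted) {
    std::lock_guard<std::mutex> h(held);
    // Give the other threads time to grab their "held" locks first.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::lock_guard<std::mutex> w(wanted);  // blocks forever once the cycle forms
  };

  std::thread a(hold_then_wait, std::ref(m_write), std::ref(m_reader));  // Thread A
  std::thread b(hold_then_wait, std::ref(m_reader), std::ref(m_mutex));  // Thread B
  std::thread c(hold_then_wait, std::ref(m_mutex), std::ref(m_write));   // Thread C
  a.join();  // none of these joins ever return
  b.join();
  c.join();
  return 0;
}
```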
To break the deadlock cycle, I made changes in Thread C so that it releases the
exclusive lock on `mutex_` before calling `LockForWrite`. In addition, it does not
need to re-acquire `mutex_` after `LockForWrite`, because the rest of the code in the
relevant scope does not need to be protected by an exclusive lock on `mutex_`.
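As a rough sketch of the shape of the change (not the literal diff), the `LockGuard`
scope in `CreateTable` now ends before the table write lock is taken:
```
  {
    LockGuard lock(mutex_);
    auto ns_lock = ns->LockForRead();
    ......
  }
  // mutex_ is no longer held here, so waiting in LockForWrite can no longer
  // block Thread B's SharedLock (and, transitively, Thread A's commit).
  if (IsIndex(req)) {
    TRACE("Locking indexed table");
    indexed_table_write_lock = indexed_table->LockForWrite();
  }
  ......
```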
Jira: DB-11784
Test Plan: Tested in the context of https://phorge.dev.yugabyte.com/D32682.
Reviewers: hsunder, asrivastava, zdrudi
Reviewed By: asrivastava, zdrudi
Subscribers: zdrudi, ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D35843