
[A code causes deadlock] (Version: [rippled 1.6]) #4023

Open
luleigreat opened this issue Dec 10, 2021 · 2 comments
luleigreat commented Dec 10, 2021

Hi all,
I have discovered code that can cause a deadlock:

void
BatchWriter::store(std::shared_ptr<NodeObject> const& object)
{
    std::unique_lock<decltype(mWriteMutex)> sl(mWriteMutex);

    // If the batch has reached its limit, we wait
    // until the batch writer is finished
    while (mWriteSet.size() >= batchWriteLimitSize)
        mWriteCondition.wait(sl);

    mWriteSet.push_back(object);

    if (!mWritePending)
    {
        mWritePending = true;

        m_scheduler.scheduleTask(*this);
    }
}

void
NodeStoreScheduler::scheduleTask(NodeStore::Task& task)
{
    if (jobQueue_.isStopped())
        return;

    if (!jobQueue_.addJob(jtWRITE, "NodeObject::store", [&task](Job&) {
            task.performScheduledTask();
        }))
    {
        // Job not added, presumably because we're shutting down.
        // Recover by executing the task synchronously.
        task.performScheduledTask();
    }
}

If the mWriteSet.size() >= batchWriteLimitSize condition is met, all jobs in the job queue may end up waiting on this condition variable: the last successful call to m_scheduler.scheduleTask merely added a job to the queue, which does not guarantee the job will execute immediately.
If all job-queue threads are blocked (that is the scene I encountered: every job was either waiting for the InboundLedger::update lock or waiting on the mWriteCondition condition variable), then performScheduledTask can never execute, and the server deadlocks!
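
To make the failure mode concrete, here is a minimal standalone reduction (my own sketch, not rippled code; batchLimit and the worker count are arbitrary). Every pool thread is a producer blocked in store(), and the drain task that would signal the condition variable could only run on one of those same threads, so the program never terminates:

// Standalone reduction of the pattern: all workers block on the condition
// variable, and the only code that would notify it can never be scheduled.
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

int main()
{
    constexpr std::size_t batchLimit = 1;  // stand-in for batchWriteLimitSize
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> batch;

    // Like BatchWriter::store: block while the batch is full.
    auto store = [&](int v) {
        std::unique_lock<std::mutex> lk(m);
        while (batch.size() >= batchLimit)
            cv.wait(lk);  // waits for a notify that no thread can deliver
        batch.push_back(v);
    };

    // All "job queue" workers are producers; the task that would drain the
    // batch and call cv.notify_all() sits behind them and never executes.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back(store, i);

    for (auto& t : workers)
        t.join();  // never returns once all workers are waiting: deadlock
}

The first worker succeeds in storing its object; every later worker sleeps on the condition variable forever, which mirrors what I saw in the job queue.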

I am working on rippled 1.6, and I have reviewed the related code on rippled:develop; the problem seems to still be there.

Has this problem been resolved? I think the condition variable (mWriteCondition) could be removed entirely; is there a better resolution?
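
To illustrate what I mean, one possible direction (a sketch of the idea only, not a tested patch; flushBatchLocked is a hypothetical helper that writes out the current batch, releasing the lock around the disk I/O) is to have the calling thread flush the batch itself when the limit is reached, so forward progress never depends on a queued job:

void
BatchWriter::store(std::shared_ptr<NodeObject> const& object)
{
    std::unique_lock<decltype(mWriteMutex)> sl(mWriteMutex);

    // Instead of waiting on mWriteCondition, drain the batch on the
    // calling thread; progress no longer depends on a queued job running.
    if (mWriteSet.size() >= batchWriteLimitSize)
        flushBatchLocked(sl);  // hypothetical helper: writes the current
                               // batch, dropping the lock around disk I/O

    mWriteSet.push_back(object);

    if (!mWritePending)
    {
        mWritePending = true;

        m_scheduler.scheduleTask(*this);
    }
}

The trade-off is that an occasional store() caller pays the write latency itself, but no thread ever sleeps waiting for another pool job to run.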

HowardHinnant self-assigned this Dec 13, 2021

HowardHinnant (Contributor) commented
Can you reliably reproduce this deadlocked state? If so, can you include directions for doing so? Thanks.

luleigreat (Author) commented Dec 14, 2021

[workers] is configured to 9, and there were exactly 9 jobs running when the deadlock occurred.
As for directions to reproduce: the server has 16 CPU cores at 3.0 GHz, with [workers] set to only 9. I think the deadlock occurred because the CPU is far faster than the disk I/O, so many nodes could not be written to disk immediately and the write batch filled up.
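
For reference, the setting in question in my rippled.cfg (the value 9 is just my setup):

[workers]
9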
