Thread sanitizer warnings #5

Closed
dvicini opened this issue Nov 9, 2023 · 2 comments

@dvicini
Member

dvicini commented Nov 9, 2023

Hi Wenzel,

I've recently run the Nanothread tests with Clang's ThreadSanitizer. I previously only ran address and undefined behavior sanitizers, but for a threading library the ThreadSanitizer might be interesting as well.

To run the sanitizer, I compile with the following options and then run the tests as-is (on a Linux system; I'm not sure this works on an ARM Mac):

cmake ..  -DCMAKE_C_FLAGS="-fsanitize=thread -g" -DCMAKE_CXX_FLAGS="-fsanitize=thread -g"  -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=thread -g" -DCMAKE_MODULE_LINKER_FLAGS="-fsanitize=thread -g"  -DNANOTHREAD_ENABLE_TESTS=ON -GNinja && ninja

I haven't had time to dig into the details, but here are some initial observations. As usual with these sanitizers, some warnings might be overly conservative. So far I noticed:

  1. In pool_destroy, setting the pool size to 0 sets pool->workers[pool->workers.size() + i]->stop = true; (nanothread.cpp:194). The sanitizer sees this as a race condition with the read access in nanothread.cpp:488. I'm not sure if there is a good solution, since pool_destroy does what it's supposed to do here (tests/test01 reproduces this).

  2. The write task->next = node; in task release (queue.cpp:231) is marked as a race condition with the atomic read in pop() (queue.cpp:338). Here I am unsure whether that's a real issue or not (tests/test02 reproduces this; a distilled sketch of this kind of plain-write/concurrent-read pattern follows the list below).

  3. The atomic read in alloc (queue.cpp:133) seems to be in a data race with release as well (queue.cpp:231). This happens during task_submit_dep. (tests/test03 reproduces this).

  4. A similar race appears to happen between the assignment of func in task_submit_dep (nanothread.cpp:301) and the access to it in pool_execute_task (nanothread.cpp:346).
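
For context, the reports in items 2–4 appear to share a similar shape: a plain (non-atomic) write to a node field in one thread while another thread, which obtained a pointer to that node separately, reads the same field without synchronization against that write. A minimal, self-contained sketch of this class of report (hypothetical code, not nanothread's) looks roughly like this:

```cpp
// Minimal, hypothetical illustration of the class of warning reported above:
// a plain write to a node field racing with a read of the same field from a
// thread that obtained the node pointer separately. This is *not* nanothread
// code, just the distilled pattern that TSAN flags.
#include <atomic>
#include <thread>

struct Node { Node *next = nullptr; };

Node node;
std::atomic<Node *> slot { &node };

int main() {
    // Thread 1: "recycles" the node and rewrites its 'next' field with a plain store.
    std::thread t1([] { node.next = nullptr; });

    // Thread 2: reads 'next' through a pointer loaded from the atomic slot.
    // The load-acquire orders this read against stores *to the slot*, but not
    // against the plain write above, so TSAN reports a data race on 'next'.
    std::thread t2([] {
        Node *n = slot.load(std::memory_order_acquire);
        Node *observed = n->next; // racing read
        (void) observed;
    });

    t1.join();
    t2.join();
    return 0;
}
```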

Curious to hear if any of these could actually be real issues or if it's the sanitizer being overly cautious. Maybe there is a connection to this rare issue that some people have seen: mitsuba-renderer/mitsuba3#849

@wjakob
Member

wjakob commented Dec 3, 2023

Hi Delio (cc @Speierers, @njroussel) -- thank you for this report.

I believe I have found the root cause of the rare issue you mentioned. It's actually a race condition in Dr.Jit that leads to incorrect use of the nanothread API by releasing a task twice in a row. This corrupts internal data structures and eventually produces an assertion failure. I can get @Speierers' reproducer to work after fixing this bug.

I actually began by using TSAN to investigate the problem, as you suggested. Although it is a nifty tool, it seems unusable for analyzing lock-free queues.

Here is some information that I had to swap back into memory as I was going through this, in case it's useful for posterity (including future @wjakob):

  • Nanothread implements a Michael-Scott queue, which is a standard design pattern for lockless queues.
  • Lockless algorithms are incredibly tricky to get right; the general consensus is that it's almost always a bad idea to use them. Task distribution queues are a case where I think an exception to the rule is unavoidable: the data structure is heavily contended, and the default OS abstraction for locks (futex) will detect this contention and park/wake up threads, which comes with a significant cost. When submitting many small work items, the park/wakeup cost dominates. With an earlier std::mutex-protected design of nanothread, the Dr.Jit LLVM test suite became slower by several factors just from the task distribution.
  • On AArch64 machines, concurrent writes/reads by multiple processors can be reordered almost arbitrarily, which further complicates the design of lockless algorithms. Nanothread uses C11/C++11 load-acquire / store-release operations to make the enqueuing/dequeuing well-defined on platforms with such weakly ordered memory semantics.
  • The Michael-Scott queue walks along a linked list that may be concurrently extended by other threads. A recurring pattern is that it loads the same memory address twice in a row using a load-acquire (LDAR) to obtain a certificate that a concurrent write has not taken place.
  • It also generally performs operations/further reads between these two paired reads. To TSAN this looks like a race condition whenever a concurrent update modifies the additional data, but TSAN is unaware that the double-read logic will then cause nanothread to discard the operation and retry. You can find a discussion of the same issue in the Boost implementation of a lockless queue: Lockfree Queue Triggers Thread Sanitizer Data Race Warning boostorg/lockfree#78. A distilled sketch of this double-read pattern is included after this list.
  • These lock-free data structures are sufficiently tricky that there is an unmentioned problem even in the original paper (e.g. discussed here). The traversal of the linked list may go through released nodes, which is in principle fine because of the logic to detect updates (such as part of the list being concurrently updated and released). However, if the implementation actually free()s unused nodes, then such a read has a side effect: it can crash the application with a page fault. nanothread avoids that issue by never freeing nodes; they are put into a queue for later reuse (a small sketch of such a reuse list is included below).
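
To make the double-read "certificate" idea above a bit more concrete, here is a rough sketch. It is illustrative only; the names, structure, and memory orders are assumptions rather than nanothread's actual implementation:

```cpp
// Rough sketch (illustrative, not nanothread's implementation) of the
// "read twice with load-acquire" pattern used when traversing a lock-free
// list: data read between the two paired loads is only trusted if the second
// load shows that no concurrent update took place in the meantime.
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<Node *> next { nullptr };
    uint32_t payload = 0;
};

std::atomic<Node *> head { nullptr };

uint32_t peek_front_payload() {
    while (true) {
        Node *first = head.load(std::memory_order_acquire); // 1st load (LDAR on AArch64)
        if (!first)
            return 0;

        // Speculative read in between: TSAN may flag this as racing with a
        // concurrent writer that recycles 'first', even though the value is
        // discarded below in that case.
        uint32_t value = first->payload;

        // 2nd load: the "certificate" that no concurrent update happened
        // between the two paired loads.
        if (head.load(std::memory_order_acquire) == first)
            return value;

        // Otherwise discard 'value' and retry.
    }
}

int main() {
    Node n;
    n.payload = 42;
    head.store(&n, std::memory_order_release);
    return peek_front_payload() == 42 ? 0 : 1;
}
```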
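
And a minimal sketch of the last point: recycling released nodes through a reuse list instead of calling free(), so that a stale traversal can still dereference them without a page fault. Again, this is an assumption-laden illustration (a naive Treiber-stack style free list), not the actual nanothread code:

```cpp
// Minimal illustration (not nanothread's actual code) of recycling nodes via
// a reuse list instead of free(): released nodes remain valid memory, so a
// stale reader that still dereferences one cannot fault.
#include <atomic>

struct Node { std::atomic<Node *> next { nullptr }; };

std::atomic<Node *> free_list { nullptr };

// Push a released node onto the reuse list (Treiber-stack style push).
void recycle(Node *n) {
    Node *old = free_list.load(std::memory_order_relaxed);
    do {
        n->next.store(old, std::memory_order_relaxed);
    } while (!free_list.compare_exchange_weak(old, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
}

// Pop a node for reuse, or allocate a fresh one if the list is empty.
// (Note: this naive pop ignores the ABA problem; it only illustrates reuse.)
Node *acquire() {
    Node *n = free_list.load(std::memory_order_acquire);
    while (n && !free_list.compare_exchange_weak(
                    n, n->next.load(std::memory_order_relaxed),
                    std::memory_order_acquire, std::memory_order_acquire)) {
        // CAS failure reloaded 'n'; retry with the new head.
    }
    return n ? n : new Node();
}

int main() {
    Node *a = acquire();
    recycle(a);
    Node *b = acquire(); // likely returns the recycled node
    delete b;            // cleanup would only happen at shutdown in a real design
    return 0;
}
```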

@wjakob wjakob closed this as completed Dec 3, 2023
@dvicini
Member Author

dvicini commented Dec 3, 2023

Yes, that makes sense. I am not surprised that TSAN cannot handle the edge cases of lock-free queues. Regardless, it's good that you caught the incorrect use of the nanothread API. Let's hope this assertion error is gone for good 🤞
