Thread sanitizer warnings #5
Hi Delio (cc @Speierers, @njroussel) -- thank you for this report. I believe I have found the root cause of the rare issue you mentioned. It is actually a race condition in Dr.Jit that leads to an incorrect use of the nanothread API: a task is released twice in a row, which corrupts internal data structures and eventually produces an assertion failure. With that bug fixed, I can get @Speierers' reproducer to work. I actually began by using TSAN to investigate the problem, as you suggested. Although a nifty tool, it seems unusable for analyzing lock-free queues. Here is some information that I had to swap back into memory as I was going through this, in case it's useful for posterity (including future @wjakob):
Yes, that makes sense. I am not surprised that TSAN cannot handle the edge cases of lock-free queues. Regardless, it's good that you caught the incorrect use of the nanothread API. Let's hope this assertion error is gone for good 🤞
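For posterity, here is a minimal, self-contained sketch of the double-release failure mode discussed above. It assumes nothing about nanothread's actual internals; the free list is modeled as a generic Treiber stack, and the `Node`, `release()`, and `alloc()` names are purely illustrative. Releasing the same node twice leaves the list with a self-referencing entry, so two subsequent allocations hand out the same object, which is the kind of corruption that later surfaces as an assertion failure.

```cpp
#include <atomic>
#include <cassert>

// Generic Treiber-stack free list; illustrative only, not nanothread code.
struct Node {
    std::atomic<Node *> next{nullptr};
};

static std::atomic<Node *> free_list{nullptr};

// Stand-in for a task "release": push the node onto the free list.
static void release(Node *n) {
    Node *head = free_list.load(std::memory_order_relaxed);
    do {
        n->next.store(head, std::memory_order_relaxed);
    } while (!free_list.compare_exchange_weak(head, n,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
}

// Stand-in for a task allocation: pop a node from the free list
// (the list is never empty at the call sites below).
static Node *alloc() {
    Node *head = free_list.load(std::memory_order_acquire);
    while (!free_list.compare_exchange_weak(
        head, head->next.load(std::memory_order_relaxed),
        std::memory_order_acquire, std::memory_order_acquire)) { }
    return head;
}

int main() {
    Node *n = new Node();

    release(n);
    release(n); // bug: double release, 'n' now links to itself on the list

    Node *a = alloc();
    Node *b = alloc();

    // Two distinct allocation requests received the same node; in a real
    // task system this is the kind of corruption that eventually trips an
    // assertion (or worse).
    assert(a != b && "free list corrupted by double release");

    delete n;
    return 0;
}
```

Even single-threaded, the corruption is deterministic here; in a real task system the racy double release just makes it intermittent.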
Hi Wenzel,
I've recently run the Nanothread tests with Clang's ThreadSanitizer. I previously only ran address and undefined behavior sanitizers, but for a threading library the ThreadSanitizer might be interesting as well.
To run the sanitizer, I compile with the following options and then run the tests as is (on a Linux system; I am not sure this works on an ARM Mac):
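Something along these lines (a sketch of a typical ThreadSanitizer setup with Clang and CMake; the exact flags are an approximation):

```sh
# Sketch of a typical ThreadSanitizer build with Clang + CMake;
# the exact flags may differ from those used for this report.
cmake -S . -B build-tsan \
      -DCMAKE_C_COMPILER=clang \
      -DCMAKE_CXX_COMPILER=clang++ \
      -DCMAKE_C_FLAGS="-g -fsanitize=thread" \
      -DCMAKE_CXX_FLAGS="-g -fsanitize=thread" \
      -DCMAKE_EXE_LINKER_FLAGS="-fsanitize=thread"
cmake --build build-tsan -j
# ...then run the test binaries from build-tsan/ as usual
# (enable the project's test targets if they are not built by default)
```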
I haven't had time to dig into the details, but here are some things I noticed so far. As usual with these sanitizers, some warnings might be overly conservative:
- In `pool_destroy`, setting the pool size to 0 sets `pool->workers[pool->workers.size() + i]->stop = true;` (`nanothread.cpp:194`). The sanitizer sees this as a race condition with the read access in `nanothread.cpp:488`. Not sure if there is a good solution, since `pool_destroy` does what it's supposed to do here (`tests/test01` reproduces this).
- The write `task->next = node;` in task `release` (`queue.cpp:231`) is marked as a race condition with the atomic read in `pop()` (`queue.cpp:338`). Here I am unsure if that's a real issue or not (`tests/test02` reproduces this).
- The atomic read in `alloc` (`queue.cpp:133`) seems to be in a data race with `release` as well (`queue.cpp:231`). This happens during `task_submit_dep` (`tests/test03` reproduces this).
- A similar race appears to happen between the assignment of `func` in `task_submit_dep` (`nanothread.cpp:301`) and the access to it in `pool_execute_task` (`nanothread.cpp:346`).

Curious to hear if any of these could actually be real issues or if it's the sanitizer being overly cautious. Maybe there is a connection to this rare issue that some people have seen: mitsuba-renderer/mitsuba3#849
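To make the kind of report in question more concrete, here is a minimal standalone sketch; it is unrelated to nanothread's actual sources, and the `Task` struct below is purely illustrative. It shows the mixed access pattern that typically triggers such warnings: a plain, non-atomic write to a field in one thread and a read of the same field in another, with only a relaxed atomic flag in between. With relaxed ordering there is no happens-before edge, so the report describes a genuine race; with a release store and an acquire load the code is fine and TSAN stays quiet. Whether the warnings above are real therefore hinges on the memory orderings and fences used at the flagged locations, and TSAN notably does not understand standalone `std::atomic_thread_fence` calls, a common source of false positives in lock-free code.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Illustrative only; not nanothread's Task type.
struct Task {
    Task *next = nullptr;                 // plain, non-atomic field
    std::atomic<bool> published{false};
};

int main() {
    Task task, node;

    // Writer: plain write to task.next, then a relaxed "publish" flag.
    // memory_order_relaxed creates no happens-before edge with the reader,
    // so TSAN (correctly) reports a data race on task.next.
    std::thread writer([&] {
        task.next = &node;
        task.published.store(true, std::memory_order_relaxed);
    });

    // Reader: checks the flag, then reads the plain field.
    std::thread reader([&] {
        if (task.published.load(std::memory_order_relaxed))
            std::printf("next = %p\n", (void *) task.next);
    });

    writer.join();
    reader.join();

    // Changing the store to memory_order_release and the load to
    // memory_order_acquire removes both the race and the TSAN report.
    return 0;
}
```

Building this with `clang++ -fsanitize=thread -g -pthread race.cpp` produces a data-race report pointing at the plain write and the plain read.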