[BUG] Segmentation fault in interruptible.hpp #1225
Does cugraph run something in multiple processes there? The […] (raft/cpp/include/raft/core/interruptible.hpp, lines 181 to 184 in 6a7e125)
(update: no, this didn't work)
@achirkin, I'm wondering if we could use something at the global level or non-member thread-local level like a […]. cc @jrhemstad for thoughts as well. It looks like we have a race condition happening in the callback for […]
As I mentioned in the Slack thread, this is due to the arbitrary destruction order between […]. It seems like there is no guaranteed destruction order between a thread-local static variable defined inside a member function and static class member variables. If […]. I confirmed this is indeed happening by defining […]
and adding […]. I also added a print statement in the custom deleter (https://github.com/rapidsai/raft/blob/branch-23.02/cpp/include/raft/core/interruptible.hpp#L213); this custom deleter will be called when […]. In testing, the print statement from […]
Here's a rather ugly, but seemingly working, fix #1229 - wrap the […]
Because there's no way to control the order of destruction between the global and thread-local static objects, the token registry may sometimes be accessed after it has already been destructed (in the program exit handlers). This fix wraps the registry in a shared pointer and keeps the weak pointers in the deleters which cause the problem, thus it avoids accessing the registry after it's been destroyed.

Closes #1225
Closes #1275

Authors:
- Artem M. Chirkin (https://github.com/achirkin)
- Corey J. Nolet (https://github.com/cjnolet)
- Allard Hendriksen (https://github.com/ahendriksen)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: #1229
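The idea described in that fix can be sketched roughly as follows. This is a minimal illustration of the shared-pointer/weak-pointer pattern, not the actual code from interruptible.hpp: the names `token`, `registry`, and `make_token` are hypothetical and chosen only for this example. The token's custom deleter holds a weak pointer to the registry and touches the registry only if it can still be locked.

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a per-thread cancellation token.
struct token {
  int id;
};

// Hypothetical registry: maps an id to a weakly held token.
struct registry {
  std::mutex mutex;
  std::unordered_map<int, std::weak_ptr<token>> tokens;
};

// Create a token and register it. The deleter captures only a weak_ptr
// to the registry, so it never extends the registry's lifetime and never
// dereferences it after it has been destroyed.
std::shared_ptr<token> make_token(const std::shared_ptr<registry>& reg, int id)
{
  std::weak_ptr<registry> weak_reg = reg;
  std::shared_ptr<token> t(new token{id}, [weak_reg](token* p) {
    // lock() yields an empty pointer if the registry is already gone;
    // in that case there is nothing left to unregister from.
    if (auto r = weak_reg.lock()) {
      std::lock_guard<std::mutex> guard(r->mutex);
      r->tokens.erase(p->id);
    }
    delete p;
  });
  {
    std::lock_guard<std::mutex> guard(reg->mutex);
    reg->tokens[id] = t;
  }
  return t;
}
```

With this arrangement, a token released while the registry is alive removes its own entry, and a token released after the registry has been destroyed just frees itself. The cost is one extra `lock()` per token destruction.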
cugraph has been encountering a segfault in interruptible.hpp. It seems to only appear when taking the Python path through their code, for some reason, but it's very consistent. The reproducer for this issue (assumes #1224 has been reverted) is: […]
And it's important to run this in a loop because it doesn't happen every time: […]
The following shows the error happening on line 214 here, which might be related to an object being used in a callback after it's been cleaned up (something like registry_.find(xxx) after registry_ has already been deallocated, maybe?)
cc @achirkin @tfeher
I'm not sure why this doesn't seem to be happening in pylibraft, cuml, or cuopt but it's definitely consistently reproducible in cugraph.
@ChuckHastings @alexbarghi-nv @rlratzel @seunghwak @BradReesWork FYI