-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deliberately leak PTDS thread_local events in stream ordered mr #1375
Conversation
If an object with thread_local modifier has static storage duration, its destructor (if it exists) will run below main (https://eel.is/c++draft/basic.start.term). The CUDA runtime also sets up (when the first call into the runtime is made) a teardown of the driver that runs atexit. Although [basic.start.term#5](https://eel.is/c++draft/basic.start.term#5) provides guarantees on the order in which these destructors are called, it appears that no C++ stdlib implementation correctly implements this for thread_local objects with static storage duration. Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below main. Hence, we cannot call cudaEventDestroy to deallocate our thread_local events. Since there are a finite number of these event (ndevices * nparticipating_threads), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems. - Closes rapidsai#1371
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this fix is quite right?
I couldn't think of a good way to test this (though locally it fixes the issue in #1371). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 praise: Ignore my previous review. Nice work. I like the simplification fixing this bug uncovers!
include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
Outdated
Show resolved
Hide resolved
include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
Outdated
Show resolved
Hide resolved
@wence- if you want to make the two suggested changes above I think we can get this merged. |
Co-authored-by: Jake Hemstad <[email protected]>
include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
Outdated
Show resolved
Hide resolved
/merge |
Lawrence Mitchell Wed Nov 8 23:36:18 2023 +0000 Deliberately leak PTDS thread_local events in stream ordered mr (#1375) An object with `thread_local` modifier has thread storage duration, its destructor (if it exists) will after the thread exits, which, on the main thread, is below `main` (https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runs `atexit`. Although [basic.start.term#5](https://eel.is/c++draft/basic.start.term#5) provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed _before_ any `atexit` handlers run), it appears that gnu libstdc++ does not always implement this correctly (if not compiled with `_GLIBCXX_HAVE___CXA_THREAD_ATEXIT`). Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below `main`. Hence, we cannot call `cudaEventDestroy` to deallocate our `thread_local` events. Since there are a finite number of these event (`ndevices * nparticipating_threads`), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems. - Closes #1371 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Mark Harris (https://github.com/harrism) - Jake Hemstad (https://github.com/jrhemstad) URL: rapidsai/rmm#1375
Description
An object with
thread_local
modifier has thread storage duration, its destructor (if it exists) will after the thread exits, which, on the main thread, is belowmain
(https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runsatexit
. Although basic.start.term#5 provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed before anyatexit
handlers run), it appears that gnu libstdc++ does not always implement this correctly (if not compiled with_GLIBCXX_HAVE___CXA_THREAD_ATEXIT
).Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below
main
. Hence, we cannot callcudaEventDestroy
to deallocate ourthread_local
events. Since there are a finite number of these event (ndevices * nparticipating_threads
), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.thread_local
struct before intializing rmm in PTDS causes a cuda error at exit #1371Checklist