Deliberately leak PTDS thread_local events in stream ordered mr #1375

wence- · 2023-11-08T10:27:19Z

Description

An object with the thread_local modifier has thread storage duration, so its destructor (if it exists) will run after the thread exits, which, on the main thread, is below main (https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runs via atexit. Although basic.start.term#5 provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed before any atexit handlers run), it appears that GNU libstdc++ does not always implement this correctly (if not compiled with _GLIBCXX_HAVE___CXA_THREAD_ATEXIT).

Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below main. Hence, we cannot call cudaEventDestroy to deallocate our thread_local events. Since there is a finite number of these events (ndevices * nparticipating_threads), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.
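The leak-on-purpose pattern described above can be sketched without CUDA. This is only an illustration under stated assumptions, not the code in this PR: `fake_event`, `live_events`, `get_thread_event` and `live_events_after_worker` are hypothetical names, and `fake_event` stands in for a real `cudaEvent_t`. The point it shows is that a `thread_local` raw pointer is trivially destructible, so nothing is scheduled to run at thread exit or below `main`.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for cudaEvent_t plus its create/destroy calls.
// The counter exists only to make the leak observable in this sketch.
std::atomic<int> live_events{0};

struct fake_event {
  fake_event() { ++live_events; }
  ~fake_event() { --live_events; }  // plays the role of cudaEventDestroy
};

// Returns the calling thread's event, creating it on first use.
// The thread_local variable is a raw pointer, which is trivially
// destructible, so no destructor is registered for thread exit or for
// program termination. The pointee is never deleted: that is the
// deliberate leak.
inline fake_event& get_thread_event()
{
  thread_local fake_event* event = new fake_event{};
  return *event;
}

// Touches the event on a short-lived worker thread and reports how many
// events are still alive once that thread has exited.
inline int live_events_after_worker()
{
  std::thread worker([] { get_thread_event(); });
  worker.join();
  return live_events.load();
}
```

Repeated calls to `get_thread_event()` on one thread return the same object, and the count reported by `live_events_after_worker()` still includes the worker's event after the join, because its destructor never runs. The number of leaked objects is bounded by the number of threads that ever ask for one (times the number of devices in the real setting).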

Closes [BUG] Initializing a non-trivial thread_local struct before intializing rmm in PTDS causes a cuda error at exit #1371

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

If an object with thread_local modifier has static storage duration, its destructor (if it exists) will run below main (https://eel.is/c++draft/basic.start.term). The CUDA runtime also sets up (when the first call into the runtime is made) a teardown of the driver that runs via atexit. Although [basic.start.term#5](https://eel.is/c++draft/basic.start.term#5) provides guarantees on the order in which these destructors are called, it appears that no C++ stdlib implementation correctly implements this for thread_local objects with static storage duration. Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below main. Hence, we cannot call cudaEventDestroy to deallocate our thread_local events. Since there is a finite number of these events (ndevices * nparticipating_threads), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.

- Closes rapidsai#1371

harrism

I don't think this fix is quite right?

include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp

wence- · 2023-11-08T11:37:58Z

I couldn't think of a good way to test this (though locally it fixes the issue in #1371).

harrism

👏 praise: Ignore my previous review. Nice work. I like the simplification that fixing this bug uncovers!

include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp

harrism · 2023-11-08T22:58:28Z

@wence- if you want to make the two suggested changes above I think we can get this merged.

Co-authored-by: Jake Hemstad <[email protected]>

include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp

wence- · 2023-11-08T23:27:49Z

/merge

Lawrence Mitchell Wed Nov 8 23:36:18 2023 +0000

Deliberately leak PTDS thread_local events in stream ordered mr (#1375)

An object with the `thread_local` modifier has thread storage duration, so its destructor (if it exists) will run after the thread exits, which, on the main thread, is below `main` (https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runs via `atexit`. Although [basic.start.term#5](https://eel.is/c++draft/basic.start.term#5) provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed _before_ any `atexit` handlers run), it appears that GNU libstdc++ does not always implement this correctly (if not compiled with `_GLIBCXX_HAVE___CXA_THREAD_ATEXIT`).

Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below `main`. Hence, we cannot call `cudaEventDestroy` to deallocate our `thread_local` events. Since there is a finite number of these events (`ndevices * nparticipating_threads`), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.

- Closes #1371

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Mark Harris (https://github.com/harrism)
- Jake Hemstad (https://github.com/jrhemstad)

URL: rapidsai/rmm#1375

wence- requested a review from a team as a code owner November 8, 2023 10:27

wence- requested review from rongou and jrhemstad November 8, 2023 10:27

github-actions bot added the cpp (Pertains to C++ code) label Nov 8, 2023

wence- mentioned this pull request Nov 8, 2023

[BUG] Initializing a non-trivial thread_local struct before intializing rmm in PTDS causes a cuda error at exit #1371

Closed

harrism requested changes Nov 8, 2023

harrism approved these changes Nov 8, 2023

harrism added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Nov 8, 2023

wence- commented Nov 8, 2023

jrhemstad reviewed Nov 8, 2023

jrhemstad approved these changes Nov 8, 2023

Temporary to simplify event creation

76324e6

Co-authored-by: Jake Hemstad <[email protected]>

wence- commented Nov 8, 2023

Style

d89bd8d

rapids-bot bot merged commit d407fd3 into rapidsai:branch-23.12 Nov 8, 2023
44 checks passed

wence- deleted the wence/fix/1371 branch November 8, 2023 23:37

vyasr mentioned this pull request Oct 18, 2024

[BUG] cuFile driver closing causes segfault upon program termination rapidsai/cudf#17121

Open
