Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deliberately leak PTDS thread_local events in stream ordered mr #1375

Merged
merged 3 commits into from
Nov 8, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Nov 8, 2023

Description

An object with thread_local modifier has thread storage duration, its destructor (if it exists) will after the thread exits, which, on the main thread, is below main (https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runs atexit. Although basic.start.term#5 provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed before any atexit handlers run), it appears that gnu libstdc++ does not always implement this correctly (if not compiled with _GLIBCXX_HAVE___CXA_THREAD_ATEXIT).

Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below main. Hence, we cannot call cudaEventDestroy to deallocate our thread_local events. Since there are a finite number of these event (ndevices * nparticipating_threads), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

If an object with thread_local modifier has static storage duration,
its destructor (if it exists) will run below
main (https://eel.is/c++draft/basic.start.term). The CUDA runtime also
sets up (when the first call into the runtime is made) a teardown of
the driver that runs atexit. Although
[basic.start.term#5](https://eel.is/c++draft/basic.start.term#5)
provides guarantees on the order in which these destructors are
called, it appears that no C++ stdlib implementation correctly
implements this for thread_local objects with static storage duration.

Moreover (possibly consequently) it is considered undefined behaviour
to call into the CUDA runtime below main. Hence, we cannot call
cudaEventDestroy to deallocate our thread_local events. Since there
are a finite number of these event (ndevices *
nparticipating_threads), rather than attempting to destroy them we
choose to leak them, thus avoiding any sequencing problems.

- Closes rapidsai#1371
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this fix is quite right?

@wence-
Copy link
Contributor Author

wence- commented Nov 8, 2023

I couldn't think of a good way to test this (though locally it fixes the issue in #1371).

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 praise: ‏ Ignore my previous review. Nice work. I like the simplification fixing this bug uncovers!

@harrism harrism added bug Something isn't working non-breaking Non-breaking change labels Nov 8, 2023
@harrism
Copy link
Member

harrism commented Nov 8, 2023

@wence- if you want to make the two suggested changes above I think we can get this merged.

@wence-
Copy link
Contributor Author

wence- commented Nov 8, 2023

/merge

@rapids-bot rapids-bot bot merged commit d407fd3 into rapidsai:branch-23.12 Nov 8, 2023
44 checks passed
@wence- wence- deleted the wence/fix/1371 branch November 8, 2023 23:37
domcharrier added a commit to ROCm/hipMM that referenced this pull request Dec 12, 2024
Lawrence Mitchell
Wed Nov 8 23:36:18 2023 +0000
Deliberately leak PTDS thread_local events in stream ordered mr (#1375)
An object with `thread_local` modifier has thread storage duration, its destructor (if it exists) will after the thread exits, which, on the main thread, is below `main` (https://eel.is/c++draft/basic.start.term). The CUDA runtime sets up (when the first call into the runtime is made) a teardown of the driver that runs `atexit`. Although [basic.start.term#5](https://eel.is/c++draft/basic.start.term#5) provides guarantees on the order in which these destructors are called (thread storage duration objects are destructed _before_ any `atexit` handlers run), it appears that gnu libstdc++ does not always implement this correctly (if not compiled with `_GLIBCXX_HAVE___CXA_THREAD_ATEXIT`).

Moreover (possibly consequently) it is considered undefined behaviour to call into the CUDA runtime below `main`. Hence, we cannot call `cudaEventDestroy` to deallocate our `thread_local` events. Since there are a finite number of these event (`ndevices * nparticipating_threads`), rather than attempting to destroy them we choose to leak them, thus avoiding any sequencing problems.

- Closes #1371

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: rapidsai/rmm#1375
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cpp Pertains to C++ code non-breaking Non-breaking change
Projects
Status: Done
3 participants