Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333

Merged
merged 14 commits into from
Sep 13, 2023

Conversation

harrism
Copy link
Member

@harrism harrism commented Aug 31, 2023

Description

As described in #1306 , stream_ordered_memory_resource maintained thread-local storage of an event to synchronize in the per-thread default stream. However, this caused the event to be recorded on the wrong stream if allocations were made on memory resources associated with different devices, because allocation on the first device on the PTDS would initialize the TLS for that stream, and subsequent device pools would try to use the already initialized TLS.

This PR adds a new test that only runs on multidevice systems (more correctly, does nothing on single device systems). The test creates two per-device pools, and creates and destroys a device buffer on each.

It also fixes stream_ordered_memory_resource to store the ID of the device that is current when it is constructed, and then to store a vector of one event per device in TLS rather than a single event. When the PTDS is passed to get_event, the event for the MR's stored device ID is used. This should correctly support PTDS with multiple threads and multiple devices (still one MR per device).

The PR also includes some changes to the device ID utilities in cuda_device.hpp. There is a new RAII device helper class, and a get_num_cuda_devices() function.

Finally, there is a small addition to the .clang-tidy to disable warnings about do...while loops inside of RMM error checking macros.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@harrism harrism added bug Something isn't working 3 - Ready for review Ready for review by team non-breaking Non-breaking change labels Aug 31, 2023
@harrism harrism self-assigned this Aug 31, 2023
@harrism harrism requested a review from a team as a code owner August 31, 2023 07:23
@harrism harrism requested review from cwharris and jrhemstad August 31, 2023 07:23
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 31, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@harrism harrism requested a review from wence- August 31, 2023 07:23
@github-actions github-actions bot added the cpp Pertains to C++ code label Aug 31, 2023
@harrism
Copy link
Member Author

harrism commented Aug 31, 2023

/ok to test

@harrism
Copy link
Member Author

harrism commented Aug 31, 2023

/ok to test

1 similar comment
@harrism
Copy link
Member Author

harrism commented Aug 31, 2023

/ok to test

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harrism. These changes look correct to me, I mostly have an additional query about what guarantees we want RMM to be providing in a multi-device setting wrt user management of device context switching.

Rationale: I still think there are circumstances which one might expect to work but which do not with these changes, and I suppose we should decide if we want to make those work, or document that they do not, and why.

tests/mr/device/pool_mr_tests.cpp Show resolved Hide resolved
rmm::device_buffer buf_b(16, rmm::cuda_stream_per_thread, mrs[1].get());
}

RMM_CUDA_TRY(cudaSetDevice(0));
Copy link
Contributor

@wence- wence- Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user forgets to call cudaSetDevice(0) here then ~buf_a() runs with device one's context active. In this case both with rmm::cuda_stream_default and rmm::cuda_stream_per_thread, the context attached to the stream will not match the context active when the event was created.

Do we want to handle that case transparently in this PR? I think no is fine, but if not we should add some documentation to the pool memory resource noting that it is the user's responsibility to ensure that in a multi-device setting, destructors are run with the current device set to the device of the allocating mr.

I think the same comment also applies to the allocation. Edit: not necessary, since the docs already make use of an MR with a different device active UB.

Do we want to support this pattern below? Edit: No, this usage of a memory resource created with device-zero active and then used with device-one active is explicitly called out as undefined behaviour in the documentation.

  {
    cudaSetDevice(0);
    {
      cudaSetDevice(1);
      rmm::device_uvector<int> a(10, rmm::cuda_stream_default, mrs[0].get());
      rmm::device_uvector<int> b(20, rmm::cuda_stream_default, mrs[1].get());
    }
    cudaSetDevice(0);
    rmm::device_uvector<int> c{0, rmm::cuda_stream_default, mrs[0].get()};
  }

Note use of cuda_stream_default (rather than stream_per_thread). Here, the user has first used the mr created with device-zero active to allocate a buffer with device-one active. So first time through the stream-event map gets populated with an event that was created with the device-one context active. This is then picked up when c is destructed, but now the device-zero context is active and we get an error again.

Again, I think it's fine to punt on fixing this in this PR, but we should at least document the restrictions on the mr object (which in this case would be that allocation must run with the device that the mr was created on active). Edit: this is mentioned as UB in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to spend some time thinking about this. We may want our RAII data structures to ensure they activate the device they were allocated on in their destructors...

Or we may not. That may forcing performance costs on the user under the covers when they might not need the help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Given your comment below about UB in the docs, I've updated my comment to answer some of the queries: we don't need to handle device switching in allocation, and my usage example is invalid.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we may not. That may forcing performance costs on the user under the covers when they might not need the help.

Yeah, I am not sure which side of then fence I am on here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this, and discussing with @jrhemstad , I am inclined to leave this as UB for now. I don't think we want to be switching devices back and forth every time we mr->deallocate(). I'm not confident that this can be done without a large impact on performance, and there will be situations where the user has more information and can optimize device switching.

If it becomes a problem for containers, perhaps we can add it at the allocator level, rather than the memory_resource level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense.

Suggestion (non-blocking): Shall we add a sentence explicitly mentioning destructors to the "Multiple devices" section of the documentation.

Perhaps after the existing first paragraph:

A device_memory_resource should only be used when the active CUDA device is the same device that was active when the device_memory_resource was created. Otherwise behavior is undefined.

"Note that this means that destructors must run with the same CUDA device active as used when creating their associated memory resource."

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Let me add something along these lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added doc and example.

tests/mr/device/pool_mr_tests.cpp Show resolved Hide resolved
@wence-
Copy link
Contributor

wence- commented Aug 31, 2023

The cpp builds failed due to (I think) conda-forge/rhash-feedstock#25. It looks like an updated cmake conda package is now built so I have rerun the tests 🤞

@wence-
Copy link
Contributor

wence- commented Sep 5, 2023

/ok to test

@harrism
Copy link
Member Author

harrism commented Sep 6, 2023

/ok to test

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harrism, I've dismissed my requested changes, and propose one non-blocking suggestion to add an explicit sentence about destructors in the multiple-device setting to the docs. (My review has no power here though, since not a cpp-codeowner...)

@harrism
Copy link
Member Author

harrism commented Sep 6, 2023

@wence- Thank you for your attention to detail! Even if you are not a cpp-codeowner, since you reviewed so carefully, I count your approval toward the 2 C++ approvals we require.

@harrism
Copy link
Member Author

harrism commented Sep 7, 2023

/ok to test

@harrism harrism added 4 - Needs Reviewer Waiting for reviewer to review or respond and removed 3 - Ready for review Ready for review by team labels Sep 7, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the issue, code, and prior PR discussions here. To the best of my knowledge, this looks like a good fix. I learned a fair bit while reviewing this PR, so thanks for the lessons @harrism @wence- !

README.md Outdated Show resolved Hide resolved
@harrism
Copy link
Member Author

harrism commented Sep 12, 2023

/ok to test

@harrism
Copy link
Member Author

harrism commented Sep 12, 2023

/merge

@harrism harrism added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 12, 2023
Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extra shared_ptr copy isn't the end of the world, so you could always address that later.

@harrism harrism removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 13, 2023
@harrism
Copy link
Member Author

harrism commented Sep 13, 2023

/ok to test

@harrism
Copy link
Member Author

harrism commented Sep 13, 2023

/merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Reviewer Waiting for reviewer to review or respond bug Something isn't working cpp Pertains to C++ code non-breaking Non-breaking change
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[QST] Is get_event()'s thread_local problematic?
4 participants