Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333
Fix stream_ordered_memory_resource attempt to record event in stream from another device #1333
Changes from 13 commits
39edcd0
421007b
429c71d
f01d797
79a298d
d27f574
f299c03
28c094d
951c34b
9f5c6db
bbe46bd
397336b
df469f4
44c89cc
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user forgets to call
cudaSetDevice(0)
here then~buf_a()
runs with device one's context active. In this case both withrmm::cuda_stream_default
andrmm::cuda_stream_per_thread
, the context attached to the stream will not match the context active when the event was created.Do we want to handle that case transparently in this PR? I think no is fine, but if not we should add some documentation to the pool memory resource noting that it is the user's responsibility to ensure that in a multi-device setting, destructors are run with the current device set to the device of the allocating
mr
.I think the same comment also applies to the allocation.Edit: not necessary, since the docs already make use of an MR with a different device active UB.Do we want to support this pattern below? Edit: No, this usage of a memory resource created with device-zero active and then used with device-one active is explicitly called out as undefined behaviour in the documentation.
Note use ofcuda_stream_default
(rather thanstream_per_thread
). Here, the user has first used themr
created with device-zero active to allocate a buffer with device-one active. So first time through the stream-event map gets populated with an event that was created with the device-one context active. This is then picked up whenc
is destructed, but now the device-zero context is active and we get an error again.Again, I think it's fine to punt on fixing this in this PR, but we should at least document the restrictions on theEdit: this is mentioned as UB in the docs.mr
object (which in this case would be that allocation must run with the device that the mr was created on active).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to spend some time thinking about this. We may want our RAII data structures to ensure they activate the device they were allocated on in their destructors...
Or we may not. That may forcing performance costs on the user under the covers when they might not need the help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Given your comment below about UB in the docs, I've updated my comment to answer some of the queries: we don't need to handle device switching in allocation, and my usage example is invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am not sure which side of then fence I am on here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this, and discussing with @jrhemstad , I am inclined to leave this as UB for now. I don't think we want to be switching devices back and forth every time we
mr->deallocate()
. I'm not confident that this can be done without a large impact on performance, and there will be situations where the user has more information and can optimize device switching.If it becomes a problem for containers, perhaps we can add it at the
allocator
level, rather than thememory_resource
level.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this makes sense.
Suggestion (non-blocking): Shall we add a sentence explicitly mentioning destructors to the "Multiple devices" section of the documentation.
Perhaps after the existing first paragraph:
"Note that this means that destructors must run with the same CUDA device active as used when creating their associated memory resource."
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Let me add something along these lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added doc and example.