Disable opportunistic reuse in async mr when cuda driver < 11.5 #993
This is not fully bulletproof: the old cudaDevAttrMemoryPoolsSupported check needs to be kept, as the support could be hardware dependent.
Also, #990 is a renovation of the async MR support, so I'd suggest keeping others in the loop.
Added back the device attribute check.
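For readers following along, the device attribute check discussed above might look roughly like the sketch below. It is not the code from this PR: the helper names `memory_pools_supported` and `device_supports_memory_pools` are hypothetical, and the CUDA part only compiles when the CUDA runtime headers (11.2 or newer) are available. The only CUDA calls used are `cudaDeviceGetAttribute` and the `cudaDevAttrMemoryPoolsSupported` attribute.

```cpp
#if defined(__has_include)
#if __has_include(<cuda_runtime_api.h>)
#include <cuda_runtime_api.h>
#endif
#endif

// Hypothetical helper: the attribute query reports a nonzero value when the
// device supports memory pools (and therefore cudaMallocAsync).
constexpr bool memory_pools_supported(int attribute_value) { return attribute_value != 0; }

#if defined(CUDART_VERSION) && CUDART_VERSION >= 11020
// Hypothetical helper: query the device itself, because support can depend on
// the hardware and not only on the driver or toolkit version.
inline bool device_supports_memory_pools(int device_id)
{
  int value = 0;
  cudaError_t const status =
    cudaDeviceGetAttribute(&value, cudaDevAttrMemoryPoolsSupported, device_id);
  return status == cudaSuccess && memory_pools_supported(value);
}
#endif
```

Treating a failed query as "not supported" is a design choice of this sketch; a real memory resource would more likely surface the error.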
There is no need to fully disable cudaMallocAsync for <11.5. We just need to disable opportunistic reuse. This is done via cudaMemPoolSetAttribute and setting cudaMemPoolReuseAllowOpportunistic to zero.
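A minimal sketch of that suggestion, assuming the driver version is read with `cudaDriverGetVersion`, which encodes versions as 1000 * major + 10 * minor (so 11.5 is 11050). The names `kMinDriverForOpportunisticReuse`, `should_disable_opportunistic_reuse`, and `configure_pool_reuse` are hypothetical and not taken from this PR. The CUDA part compiles only when the runtime headers (11.2 or newer) are present.

```cpp
#if defined(__has_include)
#if __has_include(<cuda_runtime_api.h>)
#include <cuda_runtime_api.h>
#endif
#endif

// cudaDriverGetVersion reports 1000 * major + 10 * minor, so 11.5 is 11050.
constexpr int kMinDriverForOpportunisticReuse = 11050;

// Hypothetical helper: true for drivers older than 11.5.
constexpr bool should_disable_opportunistic_reuse(int driver_version)
{
  return driver_version < kMinDriverForOpportunisticReuse;
}

#if defined(CUDART_VERSION) && CUDART_VERSION >= 11020
// Hypothetical helper: leave the pool usable, but turn off opportunistic
// reuse when the installed driver is older than 11.5.
inline cudaError_t configure_pool_reuse(cudaMemPool_t pool)
{
  int driver_version = 0;
  cudaError_t status = cudaDriverGetVersion(&driver_version);
  if (status != cudaSuccess) { return status; }

  if (should_disable_opportunistic_reuse(driver_version)) {
    int disabled = 0;  // the attribute takes an int; zero disables it
    status = cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &disabled);
  }
  return status;
}
#endif
```

With this shape, `cudaMallocAsync` stays available on older drivers and only the reuse policy changes, which matches the intent of the comment above.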
Do we want to disable it for all CUDA driver versions?
No, just for less than 11.5.
Done.
@rongou can you please edit the title and description to better reflect the changes?
@harrism done.
@leofang please re-review.
@leofang please take another look. Thanks!
@robertmaynard @rongou FYI, this is going to need to be updated with #990, so whichever merges first, the other will need to update.
@gpucibot merge
With NVIDIA/spark-rapids#4710 we found some issues with the async pool that may cause memory errors with older drivers. This was confirmed with the CUDA team. For driver versions < 11.5, we'll disable cudaMemPoolReuseAllowOpportunistic.
@abellina