-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip async mr tests when cuda runtime/driver < 11.2 #986
Conversation
cc @ajschmidt8 (for awareness) |
@rongou, thanks for this! Can you apply the patch below to your PR? This will correctly execute all of the tests and let us ensure that these changes are working as expected. diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
index d2065e0d..95f95d4a 100755
--- a/ci/gpu/build.sh
+++ b/ci/gpu/build.sh
@@ -102,7 +102,7 @@ else
gpuci_logger "Running googletests"
# run gtests from librmm_tests package
- for gt in "$CONDA_PREFIX/bin/gtests/librmm/*" ; do
+ for gt in "$CONDA_PREFIX/bin/gtests/librmm/"*; do
${gt} --gtest_output=xml:${TESTRESULTS_DIR}/
exitcode=$?
if (( ${exitcode} != 0 )); then
|
@ajschmidt8 done. |
@rongou, ah. So these failures are where I got stumped in my previous PR. Not sure what to make of them. |
int driver_version{}; | ||
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int driver_version{}; | |
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version)); | |
int driver_supports_pool; | |
RMM_CUDA_TRY(cudaDeviceGetAttribute(&driver_supports_pool, cudaDevAttrMemoryPoolsSupported); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also needed here:
rmm/tests/mr/device/mr_test.hpp
Lines 258 to 261 in d8dc715
inline auto make_arena() | |
{ | |
return rmm::mr::make_owning_wrapper<rmm::mr::arena_memory_resource>(make_cuda()); | |
} |
We should wrap this logic in a function like is_cuda_pool_available
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe static bool cuda_async_memory_resource::is_supported()
. Then use this in the cuda_async_mr
ctor as well as here in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get a symbol not found error if I use cudaDevAttrMemoryPoolsSupported
with cuda 11.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well you still have to guard it just like in the ctor using the flag we already have. The problem we are solving isn't compiling with cuda 11.0. It's compiling with 11.2+ and running on <11.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing. I think the checks can be simpler and all be encapsulated in a static MR method.
tests/mr/device/skip_async.hpp
Outdated
{ | ||
static auto runtime_version{[] { | ||
int runtime_version{}; | ||
RMM_CUDA_TRY(cudaRuntimeGetVersion(&runtime_version)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a static member of the cuda_async_memory_resource
. And it should use the simple check that is already in the ctor, not check explicit driver and runtime versions. The check in the ctor already factors in both driver and runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
static bool is_supported() | ||
{ | ||
static auto runtime_version{[] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convince me that this isn't sufficient.
static auto runtime_version{[] { | |
static bool is_supported() | |
{ | |
#ifdef RMM_CUDA_MALLOC_ASYNC_SUPPORT | |
// Check if cudaMallocAsync Memory pool supported | |
auto const device = rmm::detail::current_device(); | |
int cuda_pool_supported{}; | |
auto result = | |
cudaDeviceGetAttribute(&cuda_pool_supported, cudaDevAttrMemoryPoolsSupported, device.value()); | |
return result == cudaSuccess and cuda_pool_supported; | |
#else | |
return false; | |
#endif | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can compile with a newer cuda toolkit but run in an older runtime, then you'd get a symbol not found error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which symbol would not be found? cudaDeviceAttribute will exist on all CUDA versions we support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudaDevAttrMemoryPoolsSupported
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once compiled that would be a value compiled into our binary that would get passed to the API, wouldn't it? Not a symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure this code already works in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm actually I think it's just the check doesn't work (build with cuda 11.6, run with 11.0):
./gtests/CUDA_ASYNC_MR_TEST: symbol lookup error: ./gtests/CUDA_ASYNC_MR_TEST: undefined symbol: cudaMemPoolCreate, version libcudart.so.11.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I printed out the value of cuda_pool_supported
, it's actually 1
when running under 11.0.
tests/mr/device/mr_test.hpp
Outdated
@@ -233,7 +232,7 @@ struct mr_factory { | |||
struct mr_test : public ::testing::TestWithParam<mr_factory> { | |||
void SetUp() override | |||
{ | |||
if (GetParam().name == "CUDA_Async" && should_skip_async()) { | |||
if (GetParam().name == "CUDA_Async" && !rmm::mr::cuda_async_memory_resource::is_supported()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be more robust to put it in the make_async
factory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would it return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is better, but please take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One very small request. Otherwise looks good. Thanks @rongou !
tests/mr/device/mr_test.hpp
Outdated
return std::make_shared<rmm::mr::cuda_async_memory_resource>(); | ||
} | ||
return std::shared_ptr<rmm::mr::cuda_async_memory_resource>{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit subtle since the only difference is ()
vs. {}
. Maybe use an explicit nullptr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -234,6 +234,10 @@ struct mr_test : public ::testing::TestWithParam<mr_factory> { | |||
{ | |||
auto factory = GetParam().factory; | |||
mr = factory(); | |||
if (mr == nullptr) { | |||
GTEST_SKIP() << "Skipping tests since the memory resource is not supported with this CUDA " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting the GTEST_SKIP
directly in the make_cuda_async
doesn't work? I'm not sure what GTEST_SKIP
actually does or what scope it needs to be used in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a macro that only works in a test method or SetUp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I thought it might throw an exception that gtest knows to catch or something.
rerun tests |
1 similar comment
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! LGTM 🔥
@gpucibot merge |
@rongou and @ajschmidt8 in the future please remember that we require two C++ codeowner approvals before merging C++ PRs. There was only one approval. |
The cuda async allocator was added in 11.2, if the runtime or driver is older than that the tests wouldn't work.
Fixes #985