Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip async mr tests when cuda runtime/driver < 11.2 #986

Merged
merged 8 commits into from
Feb 25, 2022

Conversation

rongou
Copy link
Contributor

@rongou rongou commented Feb 23, 2022

The cuda async allocator was added in 11.2, if the runtime or driver is older than that the tests wouldn't work.

Fixes #985

@rongou rongou added bug Something isn't working 3 - Ready for review Ready for review by team non-breaking Non-breaking change labels Feb 23, 2022
@rongou rongou requested review from harrism and ajschmidt8 February 23, 2022 17:58
@rongou rongou requested a review from a team as a code owner February 23, 2022 17:58
@rongou rongou self-assigned this Feb 23, 2022
@rongou rongou requested a review from vyasr February 23, 2022 17:58
@github-actions github-actions bot added the cpp Pertains to C++ code label Feb 23, 2022
@jakirkham
Copy link
Member

cc @ajschmidt8 (for awareness)

@ajschmidt8
Copy link
Member

ajschmidt8 commented Feb 23, 2022

@rongou, thanks for this! Can you apply the patch below to your PR? This will correctly execute all of the tests and let us ensure that these changes are working as expected.

diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
index d2065e0d..95f95d4a 100755
--- a/ci/gpu/build.sh
+++ b/ci/gpu/build.sh
@@ -102,7 +102,7 @@ else
 
     gpuci_logger "Running googletests"
     # run gtests from librmm_tests package
-    for gt in "$CONDA_PREFIX/bin/gtests/librmm/*" ; do
+    for gt in "$CONDA_PREFIX/bin/gtests/librmm/"*; do
         ${gt} --gtest_output=xml:${TESTRESULTS_DIR}/
         exitcode=$?
         if (( ${exitcode} != 0 )); then

@rongou rongou requested a review from a team as a code owner February 23, 2022 18:56
@github-actions github-actions bot added the gpuCI label Feb 23, 2022
@rongou
Copy link
Contributor Author

rongou commented Feb 23, 2022

@ajschmidt8 done.

@ajschmidt8
Copy link
Member

ajschmidt8 commented Feb 23, 2022

@rongou, ah. So these failures are where I got stumped in my previous PR. Not sure what to make of them.

image

Comment on lines 33 to 34
int driver_version{};
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int driver_version{};
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version));
int driver_supports_pool;
RMM_CUDA_TRY(cudaDeviceGetAttribute(&driver_supports_pool, cudaDevAttrMemoryPoolsSupported);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also needed here:

inline auto make_arena()
{
return rmm::mr::make_owning_wrapper<rmm::mr::arena_memory_resource>(make_cuda());
}

We should wrap this logic in a function like is_cuda_pool_available.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe static bool cuda_async_memory_resource::is_supported(). Then use this in the cuda_async_mr ctor as well as here in the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get a symbol not found error if I use cudaDevAttrMemoryPoolsSupported with cuda 11.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well you still have to guard it just like in the ctor using the flag we already have. The problem we are solving isn't compiling with cuda 11.0. It's compiling with 11.2+ and running on <11.2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing. I think the checks can be simpler and all be encapsulated in a static MR method.

{
static auto runtime_version{[] {
int runtime_version{};
RMM_CUDA_TRY(cudaRuntimeGetVersion(&runtime_version));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a static member of the cuda_async_memory_resource. And it should use the simple check that is already in the ctor, not check explicit driver and runtime versions. The check in the ctor already factors in both driver and runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
static bool is_supported()
{
static auto runtime_version{[] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convince me that this isn't sufficient.

Suggested change
static auto runtime_version{[] {
static bool is_supported()
{
#ifdef RMM_CUDA_MALLOC_ASYNC_SUPPORT
// Check if cudaMallocAsync Memory pool supported
auto const device = rmm::detail::current_device();
int cuda_pool_supported{};
auto result =
cudaDeviceGetAttribute(&cuda_pool_supported, cudaDevAttrMemoryPoolsSupported, device.value());
return result == cudaSuccess and cuda_pool_supported;
#else
return false;
#endif
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can compile with a newer cuda toolkit but run in an older runtime, then you'd get a symbol not found error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which symbol would not be found? cudaDeviceAttribute will exist on all CUDA versions we support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudaDevAttrMemoryPoolsSupported

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once compiled that would be a value compiled into our binary that would get passed to the API, wouldn't it? Not a symbol.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this code already works in the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually I think it's just the check doesn't work (build with cuda 11.6, run with 11.0):

./gtests/CUDA_ASYNC_MR_TEST: symbol lookup error: ./gtests/CUDA_ASYNC_MR_TEST: undefined symbol: cudaMemPoolCreate, version libcudart.so.11.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I printed out the value of cuda_pool_supported, it's actually 1 when running under 11.0.

@@ -233,7 +232,7 @@ struct mr_factory {
struct mr_test : public ::testing::TestWithParam<mr_factory> {
void SetUp() override
{
if (GetParam().name == "CUDA_Async" && should_skip_async()) {
if (GetParam().name == "CUDA_Async" && !rmm::mr::cuda_async_memory_resource::is_supported()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be more robust to put it in the make_async factory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would it return?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is better, but please take a look.

@rongou rongou requested review from harrism and jrhemstad February 24, 2022 01:12
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One very small request. Otherwise looks good. Thanks @rongou !

Comment on lines 256 to 258
return std::make_shared<rmm::mr::cuda_async_memory_resource>();
}
return std::shared_ptr<rmm::mr::cuda_async_memory_resource>{};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit subtle since the only difference is () vs. {}. Maybe use an explicit nullptr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -234,6 +234,10 @@ struct mr_test : public ::testing::TestWithParam<mr_factory> {
{
auto factory = GetParam().factory;
mr = factory();
if (mr == nullptr) {
GTEST_SKIP() << "Skipping tests since the memory resource is not supported with this CUDA "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting the GTEST_SKIP directly in the make_cuda_async doesn't work? I'm not sure what GTEST_SKIP actually does or what scope it needs to be used in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a macro that only works in a test method or SetUp.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought it might throw an exception that gtest knows to catch or something.

@ajschmidt8
Copy link
Member

rerun tests

1 similar comment
@ajschmidt8
Copy link
Member

rerun tests

@rongou rongou requested a review from jrhemstad February 25, 2022 16:45
Copy link
Member

@ajschmidt8 ajschmidt8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! LGTM 🔥

@rongou
Copy link
Contributor Author

rongou commented Feb 25, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ace0964 into rapidsai:branch-22.04 Feb 25, 2022
@harrism
Copy link
Member

harrism commented Feb 27, 2022

@rongou and @ajschmidt8 in the future please remember that we require two C++ codeowner approvals before merging C++ PRs. There was only one approval.

@rongou rongou deleted the skip-async branch June 10, 2024 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for review Ready for review by team bug Something isn't working cpp Pertains to C++ code gpuCI non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Configure CUDA Malloc Async tests to be skipped in unsupported environments
5 participants