Attempt to address oom failures in test suite #1672

wence- · 2024-09-06T16:48:11Z

Description

Audit the existing MR tests and serialize those that make large allocations (specifically, a pool with 90% of the available device memory). This also allows us to remove serialization from some of the tests which don't make large allocations.

Closes [BUG] Flaky tests #1671

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


bdice

Thanks for auditing the test suite and finding the root causes! I have one suggestion for us to refactor/simplify the arena test.

bdice · 2024-09-06T16:53:21Z

tests/CMakeLists.txt

@@ -125,7 +125,7 @@ ConfigureTest(DEVICE_MR_REF_TEST mr/device/mr_ref_tests.cpp
 ConfigureTest(ADAPTOR_TEST mr/device/adaptor_tests.cpp)

 # pool mr tests
-ConfigureTest(POOL_MR_TEST mr/device/pool_mr_tests.cpp GPUS 1 PERCENT 60)
+ConfigureTest(POOL_MR_TEST mr/device/pool_mr_tests.cpp GPUS 1 PERCENT 100)


Note: Large allocation occurs here:

rmm/tests/mr/device/pool_mr_tests.cpp

Line 74 in 9864b51

auto const ninety_percent_pool = rmm::percent_of_free_device_memory(90);

bdice · 2024-09-06T16:57:51Z

tests/CMakeLists.txt

@@ -182,7 +182,7 @@ ConfigureTest(PREFETCH_TEST prefetch_tests.cpp)
 ConfigureTest(LOGGER_TEST logger_tests.cpp)

 # arena MR tests
-ConfigureTest(ARENA_MR_TEST mr/device/arena_mr_tests.cpp GPUS 1 PERCENT 60)
+ConfigureTest(ARENA_MR_TEST mr/device/arena_mr_tests.cpp GPUS 1 PERCENT 100)


Note: Large allocation occurs here:

rmm/tests/mr/device/arena_mr_tests.cpp

Lines 487 to 495 in 9864b51

TEST_F(ArenaTest, AllocateNinetyPercent)  // NOLINT
{
  EXPECT_NO_THROW([]() {  // NOLINT(cppcoreguidelines-avoid-goto)
    auto const free = rmm::available_device_memory().first;
    auto const ninety_percent = rmm::align_up(
      static_cast<std::size_t>(static_cast<double>(free) * 0.9), rmm::CUDA_ALLOCATION_ALIGNMENT);
    arena_mr mr(rmm::mr::get_current_device_resource(), ninety_percent);
  }());
}

We should refactor this test to use rmm::percent_of_free_device_memory. I think the only difference is whether it uses rmm::align_up vs. rmm::align_down, but that seems unimportant here?

wence- · 2024-09-09

I'll do this one separately.

harrism · 2024-09-06T21:08:57Z

Thanks @wence- !

wence- · 2024-09-09T13:26:16Z

/merge

github-actions bot added the CMake and cpp (Pertains to C++ code) labels Sep 6, 2024

wence- added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Sep 6, 2024

bdice approved these changes Sep 6, 2024

harrism approved these changes Sep 6, 2024

wence- marked this pull request as ready for review September 9, 2024 13:25

wence- requested a review from a team as a code owner September 9, 2024 13:25

rapids-bot bot merged commit 1e5fa03 into rapidsai:branch-24.10 Sep 9, 2024
57 checks passed

wence- deleted the wence/fix/1671 branch September 9, 2024 13:26

wence- mentioned this pull request Sep 9, 2024

Use rmm::percent_of_free_device_memory in arena_mr_tests #1674

Closed
