Fix Arena MR to support simultaneous access by PTDS and other streams #1395
Conversation
another pool of streams being used at the same time
Signed-off-by: Thomas Graves <[email protected]>
/ok to test
making sure deallocations work Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Note, I added a test here. If I remove my fix to arena_memory_resource, the test fails because it throws:
If there are better ideas for a test, I'm happy to update it or add another.
Mostly commentary so that I understand what the test is doing, and a few small questions.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
arena_mr mr(rmm::mr::get_current_device_resource(), arena_size);
std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
```
nit: Does this need to be shared, or can you just use `auto view = rmm::cuda_stream_per_thread`? And then replace `view->value()` with `view` below.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
std::vector<std::thread> threads;
std::size_t num_threads{3};
auto view = std::make_shared<rmm::cuda_stream_view>(rmm::cuda_stream_per_thread);
void* thread_ptr = mr.allocate(256, view->value());
```
So this one allocates from a thread arena (because it's the PTDS).
Yes, because it's `cuda_stream_per_thread`.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
```
These can't be satisfied by the defragged existing superblock because that only has 1 MiB - 256 bytes of space. So now, the superblock that contains `thread_ptr` lives in the set maintained by the global arena. Deallocating at this point was something that was already handled.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
auto* ptr1 = mr.allocate(superblock::minimum_size);
auto* ptr2 = mr.allocate(superblock::minimum_size);
auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
```
This allocation can be satisfied by the superblock that contains the allocation of `thread_ptr`. So now, the superblock which contains the allocation for `thread_ptr` lives in the set maintained in the `stream_arenas_` (specifically the default stream).
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
auto* ptr3 = mr.allocate(superblock::minimum_size);
auto* ptr4 = mr.allocate(32_KiB);
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, view->value()));
```
And now this deallocation will not find the superblock for this pointer in the thread arena it was initially allocated from, nor in the global arena, but rather in one of the stream arenas.
Previously this would fail. Now, your fix searches the stream arenas (after failing to find it in the global arena), and the deallocation succeeds.
Correct.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, superblock::minimum_size);
mr.deallocate(ptr3, superblock::minimum_size);
mr.deallocate(ptr4, 32_KiB);
```
And cleanup: these deallocations have "always" been fine.
Yes, just cleanup.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
}
for (auto& thread : threads) {
  thread.join();
}
```
question: It seems to me that you don't actually need more than one thread for this test. Just to have this allocate/deallocate pattern using a combination of PTDS and non-default streams. Do I have it right?
That is, I think that (prior to your fix), replacing this thread-based allocation with:

```cpp
for (int i = 0; i < 3; i++) {
  cuda_stream stream{};
  void* ptr = mr.allocate(32_KiB, stream);
  mr.deallocate(ptr, 32_KiB, stream);
}
```

would also provoke the failure.
Correct, and I don't really need 3, so I'm simplifying this.
Co-authored-by: Lawrence Mitchell <[email protected]>
Co-authored-by: Lawrence Mitchell <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Pushed changes to the test to simplify it and add comments. @wence- let me know if I missed any comments or you have further questions.
Signed-off-by: Thomas Graves <[email protected]>
Thanks, I think this looks good! Top debugging work too.
I'll wait for @harrism to take a look as well (this was the first time I was reading the arena code...)
This looks fine to me, from my (fairly limited) knowledge of the arena allocator.
Co-authored-by: Bradley Dice <[email protected]>
```diff
-if (!global_arena_.deallocate(ptr, bytes)) { RMM_FAIL("allocation not found"); }
+if (!global_arena_.deallocate(ptr, bytes)) {
+  // It's possible to use per thread default streams along with another pool of streams.
+  // This means that it's possible for an arena to move from a thread or stream arena back
```
for an allocation to move ...
Updated.
Signed-off-by: Thomas Graves <[email protected]>
Praise: This looks great -- nice that a relatively simple test verifies it. Nice debugging and good explanatory comments.
I would like to just make the streams explicit in all allocations/deallocations for clarity. There are a few nits that I marked "non-blocking" that I will leave up to you whether you change them.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
auto per_thread_stream = rmm::cuda_stream_per_thread;
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, per_thread_stream);
```
Nit [non-blocking]: No need to store this constant in a variable.

```suggestion
// Create an allocation from a per thread arena
void* thread_ptr = mr.allocate(256, rmm::cuda_stream_per_thread);
```
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
// The original thread ptr is now owned by a stream arena so make
// sure deallocation works.
// NOLINTNEXTLINE(cppcoreguidelines-avoid-goto)
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, per_thread_stream));
```
Nit [non-blocking]:

```suggestion
EXPECT_NO_THROW(mr.deallocate(thread_ptr, 256, rmm::cuda_stream_per_thread));
```
Nit [non-blocking]: Tests will fail if there is an exception whether or not you use `EXPECT_NO_THROW`. Not throwing is expected by default, so you can remove the macro and the `NOLINTNEXTLINE`.
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
// the next allocation causes defrag. Defrag causes all superblocks
// from the thread and stream arena allocated above to go back to
// global arena and it allocates one superblock to the stream arena.
auto* ptr1 = mr.allocate(superblock::minimum_size);
```
Question: What stream will this use?
Suggestion: Can you make the stream explicit?
e.g.

```suggestion
auto* ptr1 = mr.allocate(superblock::minimum_size, rmm::cuda_stream_view{});
```
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
// Allocate again to make sure all superblocks from
// global arena are owned by a stream arena instead of a thread arena
// or the global arena.
auto* ptr2 = mr.allocate(32_KiB);
```
Question: What stream will this use?
Suggestion: Can you make the stream explicit?
tests/mr/device/arena_mr_tests.cpp
Outdated
```cpp
mr.deallocate(ptr1, superblock::minimum_size);
mr.deallocate(ptr2, 32_KiB);
```
Question: What stream will this use?
Suggestion: Can you make the stream explicit?
Signed-off-by: Thomas Graves <[email protected]>
Fixed all nits and requested changes. I reran the test without the fix in arena_memory_resource and verified it fails, then applied the patch, reran the test, and it passes.
Very nice, clear test. Nice that it requires no threads or even loops or branches!
/merge
…rapidsai#1395)
Replaces rapidsai#1394; this is targeted for 24.02. Fixes rapidsai#1393.

In Spark with the Spark RAPIDS accelerator using a cudf 23.12 snapshot, we have an application that is reading ORC files, doing some light processing, and then writing ORC files. It consistently fails while doing the ORC write with:

```
terminate called after throwing an instance of 'rmm::logic_error'
what(): RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-594-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:238: allocation not found
```

The underlying issue is brought about because Spark with the RAPIDS accelerator is using the arena allocator with per-thread default streams enabled. cuDF recently added its own stream pool that is used in addition to the per-thread default streams.

It's now possible to use per-thread default streams along with another pool of streams. This means that it's possible for an arena to move from a thread or stream arena back into the global arena during a defragmentation and then move down into another arena type. For instance, thread arena -> global arena -> stream arena. If this happens and there was an allocation from it while it was a thread arena, we now have to check whether the allocation is part of a stream arena.

I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.

Authors:
- Thomas Graves (https://github.com/tgravescs)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Bradley Dice (https://github.com/bdice)
- Rong Ou (https://github.com/rongou)
- Mark Harris (https://github.com/harrism)

URL: rapidsai#1395
…#1395) (#1396) This PR backports #1395 from 24.02 to 23.12. It contains an arena MR fix for simultaneous access by PTDS and other streams. Backport requested by @sameerz @GregoryKimball. Authors: - Thomas Graves (https://github.com/tgravescs) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Mark Harris (https://github.com/harrism)
Description
Replaces #1394, this is targeted for 24.02.
Fixes #1393.
In Spark with the Spark RAPIDS accelerator using a cudf 23.12 snapshot, we have an application that is reading ORC files, doing some light processing, and then writing ORC files. It consistently fails while doing the ORC write with:
The underlying issue is brought about because Spark with the RAPIDS accelerator is using the arena allocator with per-thread default streams enabled. cuDF recently added its own stream pool that is used in addition to the per-thread default streams.
It's now possible to use per-thread default streams along with another pool of streams. This means that it's possible for an arena to move from a thread or stream arena back into the global arena during a defragmentation and then move down into another arena type. For instance, thread arena -> global arena -> stream arena. If this happens and there was an allocation from it while it was a thread arena, we now have to check whether the allocation is part of a stream arena.
I added a test here. I was trying to make sure that all the allocations were now in stream arenas; if there is a better way to do this, please let me know.