Correct signatures for torch allocator plug in #1407
Conversation
Since pytorch/pytorch#91398, the signature of the pluggable allocate and deallocate functions must accept the device id. The current version only accepts a device id for allocate, which means that when using a stream-ordered allocator with devices other than device zero, we pass an invalid stream into the deallocation function. To fix this, adapt the signature to match the one PyTorch expects.

Now, since we have the device available during allocation and deallocation, we would like to use that device to obtain the appropriate memory resource.

Unfortunately, since RMM's cuda_device_id does not have a nullary constructor, we can't use it in Cython without some hacky workarounds.

However, since we don't actually need to build a Python module, but rather just a single shared library that offers two extern "C" functions, let's just write our allocator hooks directly in C++.

Closes rapidsai#1405
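For illustration, here is a minimal sketch of what a pair of hooks with the corrected signatures could look like, written against RMM's public API (rmm::cuda_device_id, rmm::mr::get_per_device_resource, rmm::cuda_stream_view). This is an assumption-laden sketch for orientation, not necessarily the exact contents of the _torch_allocator.cpp this PR adds:

#include <cstddef>
#include <cuda_runtime_api.h>
#include <rmm/cuda_device.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Both hooks receive the device id, matching the pluggable-allocator
// signatures PyTorch expects since pytorch/pytorch#91398.
extern "C" void* allocate(std::size_t size, int device, void* stream)
{
  rmm::cuda_device_id const device_id{device};
  auto* mr = rmm::mr::get_per_device_resource(device_id);
  return mr->allocate(size, rmm::cuda_stream_view{static_cast<cudaStream_t>(stream)});
}

extern "C" void deallocate(void* ptr, std::size_t size, int device, void* stream)
{
  rmm::cuda_device_id const device_id{device};
  auto* mr = rmm::mr::get_per_device_resource(device_id);
  mr->deallocate(ptr, size, rmm::cuda_stream_view{static_cast<cudaStream_t>(stream)});
}

Note that this version only looks up the per-device resource; whether the active CUDA device also needs to be switched is discussed below.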
@shwina I think there was some discussion when that pytorch PR went through about whether the ...
Why do you need a nullary ctor? It's probably OK to just add one.

However, since the device is being passed to the function, it may be better to use it.

I really prefer NOT to do this if we can find out the PyTorch semantics, because we shouldn't have to query and set the current device on every device memory allocation / deallocation... But this would be a good solution if we need it.
Due to the way Cython transpiles, it complains if you declare a stack-allocated object of a C++ class that has no nullary constructor (such as cuda_device_id), producing an error that the class must have a nullary constructor to be stack allocated. But having moved this plug-in to being written directly in C++, I don't need to worry about workarounds for this.
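For context, the C++ side of the problem is that rmm::cuda_device_id only exposes an explicit value constructor (as far as I can tell there is no default constructor), while Cython relies on default construction to stack-allocate a cdef'd C++ object. A minimal sketch of the distinction:

#include <rmm/cuda_device.hpp>

void example()
{
  rmm::cuda_device_id ok{0};   // fine: explicit construction from a device ordinal
  // rmm::cuda_device_id bad;  // does not compile: no nullary constructor, which is
  //                           // what Cython needs to stack-allocate a cdef'd C++
  //                           // object before assigning to it
  (void)ok;                    // silence unused-variable warnings
}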
PyTorch does not guarantee that plug-in allocator functions are called with the active device matching the device on which the allocation/deallocation is requested. Hence, we must explicitly handle that here by selecting the appropriate device.
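One way to handle that explicitly is a small RAII guard around the CUDA runtime's device query/switch calls. The sketch below shows the general pattern only; the struct name and its exact placement are illustrative rather than what this PR necessarily does:

#include <cuda_runtime_api.h>

// Hypothetical RAII helper: remember the caller's current device, switch to
// the requested one, and restore the original device on scope exit.
struct device_guard {
  int previous{0};
  explicit device_guard(int device)
  {
    cudaGetDevice(&previous);
    cudaSetDevice(device);
  }
  ~device_guard() { cudaSetDevice(previous); }
  device_guard(device_guard const&)            = delete;
  device_guard& operator=(device_guard const&) = delete;
};

Each hook would then construct a device_guard{device} before touching the memory resource, which is precisely the per-allocation device query and switch discussed in the benchmarking below.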
Looks good. Just one nit.
I wonder if we should benchmark before and after, given the repeated device queries this will introduce; I don't know if we have a benchmark for this, though.
I wrote something simple. I can't easily benchmark directly from C++ (I have not attempted to build and link against torch...). I check for Amdahl-style overhead (so, small allocations) on a two-device system with ... The allocation/deallocation I measure is just a loop doing a single allocation immediately followed by its deallocation.
And then I use three different allocation functions: ...
Results: ...
... in the BAD_CXX case, for a call that takes ~3500 ns.
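For reference, a sketch of the shape of such a measurement, written here as a C++ timing loop that calls the hooks directly. This is illustrative only: the numbers quoted above came from the author's own harness (which, per the comment, did not go through C++/torch), and the iteration count and allocation size below are assumptions:

#include <chrono>
#include <cstddef>
#include <cstdio>

// Assumes the extern "C" hooks sketched earlier in the thread are linked in.
extern "C" void* allocate(std::size_t size, int device, void* stream);
extern "C" void deallocate(void* ptr, std::size_t size, int device, void* stream);

int main()
{
  constexpr int iterations     = 100000;  // illustrative
  constexpr std::size_t nbytes = 16;      // small allocations expose per-call overhead
  auto const start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    void* ptr = allocate(nbytes, 0, nullptr);  // device 0, default stream
    deallocate(ptr, nbytes, 0, nullptr);
  }
  auto const stop = std::chrono::steady_clock::now();
  auto const ns =
    std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
  std::printf("%.1f ns per allocate/deallocate pair\n",
              static_cast<double>(ns) / iterations);
  return 0;
}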
/merge
Strange, I don't see any reason the new code should be faster.
Since the original code was written in Cython, the transpiled C++ was a little heavier weight (even with all the type annotations); it looked like this:
void *allocate(size_t size, int device, void *stream) {
void *ptr = NULL;
PyGILState_STATE __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
try {
rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource();
rmm::cuda_stream_view stream_view = rmm::cuda_stream_view((cudaStream_t)stream);
ptr = mr->allocate(size, stream_view);
} catch(...) {
__Pyx_CppExn2PyErr();
goto error;
}
goto success;
error:;
const char *filename = NULL;
__Pyx_AddTraceback("rmm._lib.old_torch.allocate", filename);
success:;
__Pyx_PyGILState_Release(__pyx_gilstate_save);
return ptr;
}
In particular, the main difference is that we always used to pay to acquire and release the Python GIL. As an aside, this was probably another bug waiting to happen in multithreaded environments.
The RMM Python source now contains a non-generated C++ file, `_torch_allocator.cpp`, from rapidsai/rmm#1407, so we need to avoid deleting it when cleaning rmm-python. This approach might not be the best if RMM Python goes on to include more C++ sources, but this fixes the clean / build process for now.