Update dlpack implementation for PbTensor #223
Conversation
src/pb_tensor.cc
Outdated
// assume the legacy default stream.
// Reference:
// https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
return FromDLPackCapsule(name, tensor.attr("__dlpack__")());
I think based on the API descriptions we should provide -1?
If stream is -1, the value may be used by the consumer to signal “producer must not perform any synchronization”.
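For reference, here is a minimal sketch (not code from this PR; `ProducerTensor` and its helper methods are hypothetical) of how a producer is expected to interpret the `stream` argument per the Array API:

```python
class ProducerTensor:
    """Hypothetical producer used only to illustrate the stream contract."""

    def _make_capsule(self):
        # A real producer would build and return a DLPack PyCapsule here.
        raise NotImplementedError("capsule creation omitted in this sketch")

    def _sync_with_consumer_stream(self, stream_handle: int) -> None:
        # A real producer would record an event on its own stream and make the
        # consumer's stream (identified by `stream_handle`) wait on that event.
        pass

    def __dlpack__(self, stream=None):
        if stream is None or stream == 1:
            # None (also the only legal value for CPU data) or 1: the consumer
            # is on the legacy default stream; the producer may assume it.
            pass
        elif stream == -1:
            # Consumer signals "producer must not perform any synchronization".
            pass
        else:
            # An integer stream handle: the producer must make the data safe to
            # consume on that stream before handing out the capsule.
            self._sync_with_consumer_stream(stream)
        return self._make_capsule()
```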
We can do this. The way it is now, we are telling the producer that it can assume the default stream, which is the stream we are currently on. I believe this is also an indication that sync is not required, since the Python backend does not allow async execution.
I didn't infer that from the API description. It just mentions that the tensor would be on the default stream but the consumer would still need to synchronize even if it is on the default stream.
I see. I believe PyTorch treats a passed `None` as no sync as well; I was probably influenced by their implementation a little bit. I'll make a change to pass `-1` for no sync.
I'm passing `stream=None` in Revision 1. This way, if the tensor is on CPU, this is the only option that is allowed, and if the tensor is on GPU, the legacy stream is assumed according to the API standard.
README.md
Outdated
@@ -1199,6 +1199,13 @@ class TritonPythonModel:
  # tensor.
  input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(pytorch_tensor))
Starting from 23.04 release, Python backend allows external tensors being passed
directly to `pb_utils.Tensor.from_dlpack()` without an explicit DLPack encoding.
For instance:
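The README example that follows "For instance:" is not shown in this hunk; a possible illustration (assuming a PyTorch tensor, which implements `__dlpack__`/`__dlpack_device__`, and run inside a Python backend model where `triton_python_backend_utils` is importable) would be:

```python
import torch
import triton_python_backend_utils as pb_utils

pytorch_tensor = torch.tensor([1, 2, 3], dtype=torch.float32)

# No explicit to_dlpack() capsule step is needed; the tensor is passed directly
# and the Python backend consumes it through its __dlpack__ interface.
input0 = pb_utils.Tensor.from_dlpack("INPUT0", pytorch_tensor)
```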
"Starting from 23.04 release, Python backend allows tensors that implement __dlpack__
and __dlpack_device__
interface to be converted to Python backend tensors."
Also link the __dlpack__
to corresponding API descriptions.
Added
nit: seems like the copyright for some of the files needs to be updated, e.g. `src/pb_stub_utils.cc`, `src/pb_tensor.h`, ...
Force-pushed from 93aed88 to 2799865.
LGTM! I'll defer the final approval to Iman.
src/pb_tensor.cc
Outdated
@@ -31,6 +31,7 @@
#ifdef TRITON_PB_STUB
#include "pb_stub_utils.h"
namespace py = pybind11;
using namespace py::literals;
I would rather not use `using namespace`, since it can pollute the symbol space and create conflicts. You can create aliases for easier usability, for example `namespace pyl = py::literals;`.
src/pb_tensor.cc
Outdated
{
  if (name == "") {
    throw PythonBackendException("Tensor name cannot be an empty string.");
  }
  if (py::isinstance<py::capsule>(tensor)) {
    return FromDLPackCapsule(name, tensor);
  } else if (py::hasattr(tensor, "__dlpack__")) {
Do we need to make sure it implements both `__dlpack__` and `__dlpack_device__`? It looks like we don't need `__dlpack_device__` when Python backend is the consumer, since the same information is available in the DLPack capsule?
We don't need to ensure `__dlpack_device__` is implemented here, since we don't call it explicitly; and it is true, the capsule should contain the device info.
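For illustration, `__dlpack_device__` reports the same `(device_type, device_id)` pair that the producer later embeds in the capsule's `DLTensor`; a quick check with a PyTorch tensor (used here only as an example producer):

```python
import torch

t = torch.zeros(4)  # CPU tensor
device_type, device_id = t.__dlpack_device__()
# device_type == 1 corresponds to kDLCPU in DLPack's DLDeviceType enum;
# a CUDA tensor would report 2 (kDLCUDA) and its device index instead.
print(int(device_type), device_id)
```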
src/pb_tensor.cc
Outdated
    const std::string& name, const py::capsule& dlpack_tensor)
{
#ifdef TRITON_ENABLE_GPU
  cudaError_t err = cudaStreamSynchronize(0);
We only need the synchronization when the tensor type is GPU. We should add a condition around this so that it is executed only on GPU tensors. Also, what happens in the multi-GPU case? Do we need to change the context to the correct GPU before synchronization?
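For context, a rough Python analogue (using PyTorch's CUDA utilities, not the PR's C++ code) of "switch to the tensor's device, then synchronize its default stream" might look like:

```python
import torch

def sync_default_stream_on(device_id: int) -> None:
    # Temporarily make `device_id` the current device, then block until all
    # work queued on that device's default stream has finished.
    with torch.cuda.device(device_id):
        torch.cuda.current_stream().synchronize()

sync_default_stream_on(0)  # e.g. the device reported by the DLPack capsule
```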
src/pb_tensor.cc
Outdated
      (stream.is(py::int_(1)) || stream.is(py::none()))
          ? 0
          : reinterpret_cast<cudaStream_t>(py::cast<uint64_t>(stream));
  cudaError_t err = cudaStreamSynchronize(cuda_stream);
We shouldn't synchronize their stream. I think to guarantee stream ordering we don't have to do anything with their stream since the pointers are already synchronized.
I don't think that we can assume that the pointers are synchronized at this point. When an external stream calls PbTensor's `__dlpack__` method, which invokes this function, I believe we are not guaranteed that the pointers will be synced. In older DLPack implementations with an intermediate capsule, I believe some implementations assumed that `to_dlpack` and `from_dlpack` are on the same stream, and thus they are aligned. Here, `__dlpack__` is called on the external tensor's stream and is accessing the PbTensor on the legacy stream, so we must ensure that any pending operations on the default stream are finished. I agree that the current implementation is not the right way of ensuring that, but we still need to make sure that all work on the PbTensor is done before passing it to the external tensor.
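For reference, the non-blocking way to provide that ordering is the event trick mentioned later in this thread; a sketch with PyTorch stream objects standing in for the raw CUDA event calls:

```python
import torch

producer_stream = torch.cuda.default_stream()  # where the PbTensor work was issued
consumer_stream = torch.cuda.Stream()          # the external tensor's stream

event = torch.cuda.Event()
event.record(producer_stream)      # mark the point all producer work must reach
consumer_stream.wait_event(event)  # consumer waits on the GPU; the host is not blocked
```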
Force-pushed from 2799865 to f2e2722.
Force-pushed from f2e2722 to ecf1888.
Force-pushed from 426e0fa to b2af292.
@Tabrizian it is ready for re-review. My only concern and topic for discussion is the device sync here. It is necessary for non-blocking compute streams (in case they are used). My issue is that it is quite enforcing and may have an impact on speed, for example in BLS cases probably? As an alternative, we can make this sync conditional on either an environment variable or a parameter set with …. And because of the device sync, synchronization is only left for the new API. I think it makes sense to keep the old API without device sync. I can add a note to the documentation regarding this.
src/pb_tensor.cc
Outdated
    // In case there is a pending job on the data, where this capsule
    // is pointing to, we need to wait for it before consuming.
    // This is important for when data is located in different
    // context (GPU) or work is done on non-blocking streams.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
      throw PythonBackendException(
          "Failed to synchronize CUDA device with id " +
          std::to_string(
              overridden ? capsule_device_info.second : current_device));
    }
Do we still need this synchronization even if we are passing the default stream to `__dlpack__`?
"DLPack capsule passed pointer to memory allocated on GPU device, \ | ||
when GPU is not available"); | ||
#endif | ||
} |
We need to make sure that the device type is CPU. DLPack supports other device types too which we are not able to handle.
I've added a check on L407-L409 to throw an exception if the device is not CPU or CUDAHost (which we support).
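As a consumer-side illustration of the same kind of guard (expressed through `__dlpack_device__` rather than the capsule, and not the PR's actual C++ check), using DLPack's `DLDeviceType` values 1 = kDLCPU, 2 = kDLCUDA, 3 = kDLCUDAHost:

```python
SUPPORTED_DLPACK_DEVICE_TYPES = {1, 2, 3}  # kDLCPU, kDLCUDA, kDLCUDAHost

def check_supported_device(tensor) -> None:
    # Reject device types the backend cannot handle before converting.
    device_type, _ = tensor.__dlpack_device__()
    if int(device_type) not in SUPPORTED_DLPACK_DEVICE_TYPES:
        raise TypeError(f"Unsupported DLPack device type: {int(device_type)}")
```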
I was curious why the sync was required. Did you observe a failing test without synchronization?
@Tabrizian Yes, in the multi-GPU, multi-stream case, when streams are non-blocking, I found that without synchronization this test fails and we don't get the expected data.
@Tabrizian I found out that the issue in the observed failing tests may not be due to the lack of synchronization in the …
@Tabrizian It's ready for re-review. The only time we need synchronization is when we need to change context and there is pending work on the default stream. Since the standard does not say to sync on the default stream, not everyone does. Then in this case we can return earlier than the work is done, thus …
  // and we will return pointer to the data on different GPU too early
  // (i.e. before pending work is done). Thus we sync on the default stream
  // only in the case we switched to a different context.
  err = overridden ? cudaStreamSynchronize(0) : cudaSuccess;
I'm not sure why synchronization is needed. Looking at CuPy's dlpack implementation it doesn't seem like they are synchronizing.
This is based on testing and research: when we need to switch devices, e.g. from GPU:0 to GPU:1, to consume the capsule, and there is pending work on the default stream on GPU:1, the wait-for-the-event trick will not be invoked, since the external tensor is on the default stream and CuPy and PyTorch do not use the wait-for-the-event trick if the tensor and the passed stream are both on the default stream. If we don't sync on the default stream, we just consume the capsule and switch back, so a situation like the one in the multi-GPU test will fail, i.e. there is still a pending job on that tensor on the default stream.
Note for this PR: there is a need to follow up with CuPy and PyTorch to investigate why they do not (?) handle stream=1, i.e. do not convert it to the default stream (legacy stream).
This PR updates the DLPack implementation for PbTensor according to the Array API and DLPack v0.8 requirements.

Major changes:

- Added `__dlpack__` and `__dlpack_device__` implementations according to the Array API standard.
- `PbTensor.from_dlpack()` accepts a PyCapsule (as before) and an external tensor directly; `PbTensor::FromDLPack` takes care of that. The former `PbTensor::FromDLPack` is now `PbTensor::FromDLPackCapsule`. (See the usage sketch at the end of this description.)
- `PbTensor::DLPack`, which is bound to `__dlpack__(self, Optional[Union[int, Any]] stream)`, makes sure to synchronize streams if needed (the passed stream is not a default stream, not -1, and not a legacy stream). Since PbTensor does not support async execution, if `DLPack` receives the default stream, sync is not needed.
- Refactored the `switch` into `PbTensor::DeviceType`, because I realized I need this for `DLPackDevice`.
- `DLPackDevice` has return type `std::pair<int32_t, int64_t>` with the following in mind. This isn't finished yet; when it is done, a more elegant type description can be put in instead. For now I used the types indicated in the code. Open to any suggestions/alternatives.

Fixes issue: 3944
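Usage sketch of the two directions this change enables (PyTorch is used as an assumed external framework; only the documented `pb_utils.Tensor` calls are from this PR, and the code is meant to run inside a Python backend model):

```python
import torch
import triton_python_backend_utils as pb_utils

external = torch.tensor([1.0, 2.0, 3.0])

# External tensor -> Python backend tensor, no explicit to_dlpack() step.
pb_tensor = pb_utils.Tensor.from_dlpack("INPUT0", external)

# Python backend tensor -> external framework; torch.from_dlpack accepts any
# object exposing __dlpack__/__dlpack_device__, which PbTensor now implements.
roundtrip = torch.from_dlpack(pb_tensor)
```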