
Update dlpack implementation for PbTensor #223

Merged (8 commits into main, May 8, 2023)

Conversation

oandreeva-nv (Contributor):

This PR updates the DLPack implementation for PbTensor according to the Array API and DLPack v0.8 requirements.

Major changes:

  • Added __dlpack__ and __dlpack_device__ implementations according to the Array API standard (see the sketch after this list).
  • The current implementation is backward-compatible: PbTensor.from_dlpack() accepts a PyCapsule (as before) as well as an external tensor directly, and PbTensor::FromDLPack takes care of the dispatch. The former PbTensor::FromDLPack was renamed to PbTensor::FromDLPackCapsule.
  • PbTensor::DLPack, which is bound to __dlpack__(self, Optional[Union[int, Any]] stream), makes sure to synchronize streams if needed (i.e. when the passed stream is not the default stream, not -1, and not a legacy stream). Since PbTensor does not support async execution, no sync is needed when __dlpack__ receives the default stream.
  • Some refactoring: the device-type switch was separated into PbTensor::DeviceType, since I realized I need it for DLPackDevice.
  • DLPackDevice has the return type std::pair<int32_t, int64_t>; this choice isn't final, and when the related work is done a more elegant type description can be put in instead. For now I used the types indicated in the code. Open to any suggestions/alternatives.
  • Added bool support.
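A minimal sketch of how these two hooks could be exposed through pybind11, assuming the DLPack/DLPackDevice method names from the bullets above; the BindDLPackHooks helper and the binding details are illustrative, not the actual PR code:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // converts std::pair to a Python tuple

namespace py = pybind11;

// Hypothetical helper: binds the Array API hooks onto the Python-facing
// Tensor class. __dlpack__ forwards to PbTensor::DLPack (which handles any
// stream synchronization), and __dlpack_device__ returns the
// (device_type, device_id) pair required by the standard.
void BindDLPackHooks(py::class_<PbTensor, std::shared_ptr<PbTensor>>& tensor)
{
  tensor
      .def("__dlpack__", &PbTensor::DLPack, py::arg("stream") = py::none())
      .def("__dlpack_device__", &PbTensor::DLPackDevice);
}
```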

Fixes issue: 3944

src/pb_tensor.cc Outdated
// assume the legacy default stream.
// Reference:
// https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
return FromDLPackCapsule(name, tensor.attr("__dlpack__")());
Member:

I think based on the API descriptions we should provide -1?

If stream is -1, the value may be used by the consumer to signal “producer must not perform any synchronization”.

Contributor Author (oandreeva-nv):

We can do this. The way it is now, we are telling the producer that it can assume the default stream, which is the one we are currently on. I believe this is also an indication that sync is not required, since the Python backend does not allow async execution.

Member:

I didn't infer that from the API description. It just mentions that the tensor would be on the default stream but the consumer would still need to synchronize even if it is on the default stream.

Contributor Author (oandreeva-nv):

I see. I believe PyTorch treats a passed None as no-sync as well; I was probably influenced by their implementation a bit. I'll make a change to pass -1 for no sync.

Contributor Author (oandreeva-nv):

I'm passing stream=None in Revision 1. This way, if the tensor is on CPU, it is the only option that is allowed, and if the tensor is on GPU, the legacy stream is assumed, according to the API standard.
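For illustration, the consumer-side call with stream=None might look like the sketch below; this is not the PR code, just a hedged example of the call shape:

```cpp
// Ask the producer for the capsule without naming a stream. For CPU tensors
// None is the only allowed value; for GPU tensors the producer assumes the
// legacy default stream, per the Array API standard.
py::capsule capsule = tensor.attr("__dlpack__")(py::arg("stream") = py::none());
```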

README.md Outdated
@@ -1199,6 +1199,13 @@ class TritonPythonModel:
# tensor.
input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(pytorch_tensor))
```
Starting from the 23.04 release, Python backend allows external tensors to be passed
directly to `pb_utils.Tensor.from_dlpack()` without an explicit DLPack encoding.
For instance:
Tabrizian (Member), Mar 24, 2023:

"Starting from 23.04 release, Python backend allows tensors that implement __dlpack__ and __dlpack_device__ interface to be converted to Python backend tensors."

Also link the __dlpack__ to corresponding API descriptions.

Contributor Author (oandreeva-nv):

Added

krishung5 (Contributor) left a comment:

nit: seems like the copyright for some of the files needs to be updated, e.g. src/pb_stub_utils.cc, src/pb_tensor.h ...

@oandreeva-nv force-pushed the oandreeva_dlpack_update branch 3 times, most recently from 93aed88 to 2799865, March 28, 2023 06:01
krishung5 previously approved these changes on Mar 28, 2023
krishung5 (Contributor) left a comment:

LGTM! I'll defer the final approval to Iman.

src/pb_tensor.cc Outdated
@@ -31,6 +31,7 @@
#ifdef TRITON_PB_STUB
#include "pb_stub_utils.h"
namespace py = pybind11;
using namespace py::literals;
Member:

I would rather not use using namespace since it can pollute the symbol space and create conflicts. You can create aliases for easier usability. For example, namespace pyl = py::literals.

src/pb_tensor.cc Outdated
{
if (name == "") {
throw PythonBackendException("Tensor name cannot be an empty string.");
}
if (py::isinstance<py::capsule>(tensor)) {
return FromDLPackCapsule(name, tensor);
} else if (py::hasattr(tensor, "__dlpack__")) {
Member:

Do we need to make sure it implements both __dlpack__ and __dlpack_device__? It looks like we don't need __dlpack_device__ when Python backend is the consumer, since the same information is available in the DLPack capsule?

Contributor Author (oandreeva-nv):

We don't need to ensure __dlpack_device__ is implemented here, since we don't call it explicitly, and it is true that the capsule should contain the device info.
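For reference, a sketch of reading the device info straight off the capsule; variable names and the dlpack include path are illustrative, not taken from the PR:

```cpp
#include "dlpack/dlpack.h"

// A "dltensor" capsule wraps a DLManagedTensor; its dl_tensor member already
// carries the device type and id, so calling __dlpack_device__ is not needed
// on the consumer path.
DLManagedTensor* managed =
    static_cast<DLManagedTensor*>(dlpack_capsule.get_pointer());
const DLDevice device = managed->dl_tensor.device;  // {device_type, device_id}
```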

src/pb_tensor.cc Outdated
const std::string& name, const py::capsule& dlpack_tensor)
{
#ifdef TRITON_ENABLE_GPU
cudaError_t err = cudaStreamSynchronize(0);
Member:

We only need the synchronization when the tensor type is GPU. We should add a condition around this so that it is executed only on GPU tensors. Also, what happens in the multi-GPU case? Do we need to change the context to the correct GPU before synchronization?
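Something along these lines would cover both points; this is only a sketch, where capsule_device stands for the DLDevice read from the capsule and is not a variable from the PR:

```cpp
#ifdef TRITON_ENABLE_GPU
if (capsule_device.device_type == kDLCUDA) {
  int current_device = 0;
  cudaError_t err = cudaGetDevice(&current_device);
  const bool switched =
      (err == cudaSuccess) && (capsule_device.device_id != current_device);
  if (switched) {
    // Synchronize in the context of the GPU that actually owns the data.
    err = cudaSetDevice(capsule_device.device_id);
  }
  if (err == cudaSuccess) {
    err = cudaStreamSynchronize(0);
  }
  if (switched) {
    cudaSetDevice(current_device);  // restore the original device
  }
  if (err != cudaSuccess) {
    throw PythonBackendException(
        "Failed to synchronize on CUDA device " +
        std::to_string(capsule_device.device_id));
  }
}
#endif  // TRITON_ENABLE_GPU
```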

src/pb_tensor.cc Outdated
(stream.is(py::int_(1)) || stream.is(py::none()))
? 0
: reinterpret_cast<cudaStream_t>(py::cast<uint64_t>(stream));
cudaError_t err = cudaStreamSynchronize(cuda_stream);
Member:

We shouldn't synchronize their stream. I think to guarantee stream ordering we don't have to do anything with their stream since the pointers are already synchronized.

oandreeva-nv (Contributor Author), Mar 29, 2023:

I don't think we can assume that the pointers are synchronized at this point. When an external stream calls PbTensor's __dlpack__ method, which invokes this function, I believe we are not guaranteed that the pointers are synced. In older DLPack implementations with an intermediate capsule, I believe some implementations assumed that to_dlpack and from_dlpack run on the same stream and are therefore aligned. Here, __dlpack__ is called on the external tensor's stream and accesses the PbTensor on the legacy stream, so we must ensure that any pending operations on the default stream have finished. I agree that the current implementation is not the right way of ensuring that, but we still need to make sure that all work on the PbTensor is done before passing it to the external tensor.

@oandreeva-nv force-pushed the oandreeva_dlpack_update branch from f2e2722 to ecf1888, April 25, 2023 21:55
@oandreeva-nv force-pushed the oandreeva_dlpack_update branch from 426e0fa to b2af292, April 29, 2023 00:44
@oandreeva-nv changed the title from "Update dlpack implementation for PbTensor" to "[WIP] Update dlpack implementation for PbTensor" on May 1, 2023
oandreeva-nv (Contributor Author) commented May 2, 2023:

@Tabrizian It is ready for re-review. My only concern and topic for discussion is the device sync here. It is necessary for non-blocking compute streams (in case they are used). My issue is that it is quite heavy-handed and may have an impact on speed, for example in BLS cases. As an alternative, we can make this sync conditional on either an environment variable or a parameter set with the --backend-config flag. Let me know what you think.

And because of the device sync, synchronization is only kept for the new API; I think it makes sense to keep the old API without device sync. I can add a note to the documentation regarding this.

src/pb_tensor.cc Outdated
Comment on lines 367 to 377
// In case there is a pending job on the data, where this capsule
// is pointing to, we need to wait for it before consuming.
// This is important for when data is located in different
// context (GPU) or work is done on non-blocking streams.
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
throw PythonBackendException(
"Failed to synchronize CUDA device with id " +
std::to_string(
overridden ? capsule_device_info.second : current_device));
}
Member:

Do we still need this synchronization even if we are passing default stream to __dlpack__?

"DLPack capsule passed pointer to memory allocated on GPU device, \
when GPU is not available");
#endif
}
Member:

We need to make sure that the device type is CPU. DLPack supports other device types too which we are not able to handle.

Contributor Author (oandreeva-nv):

I've added a check on L407-L409 to throw an exception if the device is not CPU or CUDAHost (which we support).
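A sketch of that guard, using the DLPack device-type enum values; the exact code and message in the PR differ:

```cpp
if (capsule_device.device_type != kDLCPU &&
    capsule_device.device_type != kDLCUDAHost) {
  throw PythonBackendException(
      "DLPack capsule is on a device type that Python backend does not "
      "support.");
}
```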

Tabrizian (Member):

> My issue is that it is quite heavy-handed and may have an impact on speed, for example in BLS cases. As an alternative, we can make this sync conditional on either an environment variable or a parameter set with the --backend-config flag. Let me know what you think.

I was curious why the sync was required. Did you observe a failing test without synchronization?

oandreeva-nv (Contributor Author):

@Tabrizian Yes, in the multi-GPU, multi-stream case, when streams are non-blocking, I found that without synchronization this test fails and we don't get the expected data.

oandreeva-nv (Contributor Author):

@Tabrizian I found out that the issue in the observed failing tests may not be due to the lack of synchronization in the from_dlpack method; rather, something affects the tests when they run together. In isolation, each test passes without device sync. Currently investigating the issue.

oandreeva-nv (Contributor Author):

@Tabrizian It's ready for re-review. The only time we need synchronization is when we need to change context and there is pending work on the default stream. Since the standard does not say to sync on the default stream, not everyone does; in that case we can return before the work is done, which is why cudaStreamSynchronize(0) is added. I've also filed a ticket to investigate strange behavior with CuPy and multi-GPU non-blocking streams: DLIS-4887. For now, I run this test on a different (second) GPU.

@oandreeva-nv requested a review from Tabrizian on May 8, 2023 17:40
@oandreeva-nv changed the title from "[WIP] Update dlpack implementation for PbTensor" to "Update dlpack implementation for PbTensor" on May 8, 2023
// and we will return pointer to the data on different GPU too early
// (i.e. before pending work is done). Thus we sync on the default stream
// only in the case we switched to a different context.
err = overridden ? cudaStreamSynchronize(0) : cudaSuccess;
Member:

I'm not sure why synchronization is needed. Looking at CuPy's dlpack implementation it doesn't seem like they are synchronizing.

Contributor Author (oandreeva-nv):

This is based on testing and research: when we need to switch devices to consume the capsule, e.g. from GPU:0 to GPU:1, and there is pending work on the default stream on GPU:1, the wait-for-the-event trick will not be invoked, since the external tensor is on the default stream and CuPy and PyTorch do not use the wait-for-the-event trick when the tensor and the passed stream are both on the default stream. If we don't sync on the default stream, we just consume the capsule and switch back, so a situation like the one in the multi-GPU test will fail, i.e. there is still a pending job on that tensor on the default stream.
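For context, the "wait-for-the-event trick" mentioned above is the usual stream-ordering pattern a consumer applies when a non-default stream is passed to __dlpack__; a rough sketch with illustrative stream names, not code from this PR:

```cpp
// The producer records an event on its stream and the consumer stream waits
// on it, so no host-side blocking is needed. Per the comment above, CuPy and
// PyTorch skip this step when both the tensor and the passed stream are on
// the default stream, which is why the explicit cudaStreamSynchronize(0) is
// added after a device switch.
cudaEvent_t event;
cudaEventCreateWithFlags(&event, cudaEventDisableTiming);
cudaEventRecord(event, producer_stream);
cudaStreamWaitEvent(consumer_stream, event, 0);
cudaEventDestroy(event);
```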

oandreeva-nv (Contributor Author):

Note for this PR: we need to follow up with CuPy and PyTorch to investigate why they do not (?) handle stream=1, i.e. do not convert it to the default (legacy) stream.

@oandreeva-nv merged commit 6c4b817 into main on May 8, 2023
@Tabrizian deleted the oandreeva_dlpack_update branch on November 10, 2023 19:58