Update dlpack implementation for PbTensor #223
Conversation
src/pb_tensor.cc
Outdated
// assume the legacy default stream.
// Reference:
// https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html
return FromDLPackCapsule(name, tensor.attr("__dlpack__")());
I think based on the API descriptions we should provide -1?
If stream is -1, the value may be used by the consumer to signal “producer must not perform any synchronization”.
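For reference, here is a minimal sketch (not code from this PR; `ProducerTensor` and its helper methods are hypothetical) of how a producer is expected to interpret the `stream` argument per the Array API:

```python
class ProducerTensor:
    """Hypothetical producer used only to illustrate the stream contract."""

    def _make_capsule(self):
        # A real producer would build and return a DLPack PyCapsule here.
        raise NotImplementedError("capsule creation omitted in this sketch")

    def _sync_with_consumer_stream(self, stream_handle: int) -> None:
        # A real producer would record an event on its own stream and make the
        # consumer's stream (identified by `stream_handle`) wait on that event.
        pass

    def __dlpack__(self, stream=None):
        if stream is None or stream == 1:
            # None (also the only legal value for CPU data) or 1: the consumer
            # is on the legacy default stream; the producer may assume it.
            pass
        elif stream == -1:
            # Consumer signals "producer must not perform any synchronization".
            pass
        else:
            # An integer stream handle: the producer must make the data safe to
            # consume on that stream before handing out the capsule.
            self._sync_with_consumer_stream(stream)
        return self._make_capsule()
```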
We can do this. The way it is now, we are telling the producer that it can assume the default stream, which is the stream we are currently on. I believe this is also an indication that sync is not required, since the Python backend does not allow async execution.
I didn't infer that from the API description. It just mentions that the tensor would be on the default stream but the consumer would still need to synchronize even if it is on the default stream.
I see. I believe PyTorch treats a passed `None` as no sync as well; I was probably influenced by their implementation a little bit. I'll make a change to pass `-1` for no sync.
I'm passing `stream=None` in Revision 1. This way, if the tensor is on CPU, this is the only option that is allowed, and if the tensor is on GPU, the legacy stream is assumed according to the API standard.
README.md
Outdated
@@ -1199,6 +1199,13 @@ class TritonPythonModel:
  # tensor.
  input0 = pb_utils.Tensor.from_dlpack("INPUT0", to_dlpack(pytorch_tensor))
Starting from 23.04 release, Python backend allows external tensors being passed
directly to `pb_utils.Tensor.from_dlpack()` without an explicit DLPack encoding.
For instance:
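The README example that follows "For instance:" is not shown in this hunk; a possible illustration (assuming a PyTorch tensor, which implements `__dlpack__`/`__dlpack_device__`, and run inside a Python backend model where `triton_python_backend_utils` is importable) would be:

```python
import torch
import triton_python_backend_utils as pb_utils

pytorch_tensor = torch.tensor([1, 2, 3], dtype=torch.float32)

# No explicit to_dlpack() capsule step is needed; the tensor is passed directly
# and the Python backend consumes it through its __dlpack__ interface.
input0 = pb_utils.Tensor.from_dlpack("INPUT0", pytorch_tensor)
```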
"Starting from 23.04 release, Python backend allows tensors that implement __dlpack__
and __dlpack_device__
interface to be converted to Python backend tensors."
Also link the __dlpack__
to corresponding API descriptions.
Added
nit: seems like the copyright for some of the files needs to be updated, e.g. `src/pb_stub_utils.cc`, `src/pb_tensor.h`, ...
Force-pushed from 93aed88 to 2799865.
LGTM! I'll defer the final approval to Iman.
src/pb_tensor.cc
Outdated
@@ -31,6 +31,7 @@
#ifdef TRITON_PB_STUB
#include "pb_stub_utils.h"
namespace py = pybind11;
using namespace py::literals;
I would rather not use `using namespace`, since it can pollute the symbol space and create conflicts. You can create aliases for easier usability, for example `namespace pyl = py::literals;`.
src/pb_tensor.cc
Outdated
{
  if (name == "") {
    throw PythonBackendException("Tensor name cannot be an empty string.");
  }
  if (py::isinstance<py::capsule>(tensor)) {
    return FromDLPackCapsule(name, tensor);
  } else if (py::hasattr(tensor, "__dlpack__")) {
Do we need to make sure it implements both `__dlpack__` and `__dlpack_device__`? It looks like we don't need `__dlpack_device__` when Python backend is the consumer, since the same information is available in the DLPack capsule?
We don't need to ensure `__dlpack_device__` is implemented here, since we don't call it explicitly; and it is true, the capsule should contain the device info.
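For illustration, `__dlpack_device__` reports the same `(device_type, device_id)` pair that the producer later embeds in the capsule's `DLTensor`; a quick check with a PyTorch tensor (used here only as an example producer):

```python
import torch

t = torch.zeros(4)  # CPU tensor
device_type, device_id = t.__dlpack_device__()
# device_type == 1 corresponds to kDLCPU in DLPack's DLDeviceType enum;
# a CUDA tensor would report 2 (kDLCUDA) and its device index instead.
print(int(device_type), device_id)
```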
src/pb_tensor.cc
Outdated
    const std::string& name, const py::capsule& dlpack_tensor)
{
#ifdef TRITON_ENABLE_GPU
  cudaError_t err = cudaStreamSynchronize(0);
We only need the synchronization when the tensor type is GPU. We should add a condition around this so that it is executed only on GPU tensors. Also, what happens in the multi-GPU case? Do we need to change the context to the correct GPU before synchronization?
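For context, a rough Python analogue (using PyTorch's CUDA utilities, not the PR's C++ code) of "switch to the tensor's device, then synchronize its default stream" might look like:

```python
import torch

def sync_default_stream_on(device_id: int) -> None:
    # Temporarily make `device_id` the current device, then block until all
    # work queued on that device's default stream has finished.
    with torch.cuda.device(device_id):
        torch.cuda.current_stream().synchronize()

sync_default_stream_on(0)  # e.g. the device reported by the DLPack capsule
```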
src/pb_tensor.cc
Outdated
      (stream.is(py::int_(1)) || stream.is(py::none()))
          ? 0
          : reinterpret_cast<cudaStream_t>(py::cast<uint64_t>(stream));
  cudaError_t err = cudaStreamSynchronize(cuda_stream);
We shouldn't synchronize their stream. I think to guarantee stream ordering we don't have to do anything with their stream since the pointers are already synchronized.
I don't think that we can assume that the pointers are synchronized at this point. When an external stream calls PbTensor's `__dlpack__` method, which invokes this function, I believe we are not guaranteed that the pointers will be synced. In older DLPack implementations with an intermediate capsule, I believe some implementations assumed that `to_dlpack` and `from_dlpack` are on the same stream, and thus they are aligned. Here, `__dlpack__` is called on the external tensor's stream and is accessing the PbTensor on the legacy stream, so we must ensure that any pending operations on the default stream are finished. I agree that the current implementation is not the right way of ensuring that, but we still need to make sure that all work on the PbTensor is done before passing it to the external tensor.
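For reference, the non-blocking way to provide that ordering is the event trick mentioned later in this thread; a sketch with PyTorch stream objects standing in for the raw CUDA event calls:

```python
import torch

producer_stream = torch.cuda.default_stream()  # where the PbTensor work was issued
consumer_stream = torch.cuda.Stream()          # the external tensor's stream

event = torch.cuda.Event()
event.record(producer_stream)      # mark the point all producer work must reach
consumer_stream.wait_event(event)  # consumer waits on the GPU; the host is not blocked
```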
Force-pushed from 2799865 to f2e2722.
Force-pushed from f2e2722 to ecf1888.
Force-pushed from 426e0fa to b2af292.
@Tabrizian it is ready for re-review. My only concern and topic for discussion is the device sync here. It is necessary for non-blocking compute streams (in case they are used). My issue is that it is quite enforcing and may have an impact on speed, for example in BLS cases probably? As an alternative, we can make this sync conditional on either an environment variable or a parameter set with …. And because of the device sync, synchronization is only left for the new API. I think it makes sense to keep the old API without device sync. I can add a note to the documentation regarding this.
src/pb_tensor.cc
Outdated
    // In case there is a pending job on the data, where this capsule
    // is pointing to, we need to wait for it before consuming.
    // This is important for when data is located in different
    // context (GPU) or work is done on non-blocking streams.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
      throw PythonBackendException(
          "Failed to synchronize CUDA device with id " +
          std::to_string(
              overridden ? capsule_device_info.second : current_device));
    }
Do we still need this synchronization even if we are passing the default stream to `__dlpack__`?
"DLPack capsule passed pointer to memory allocated on GPU device, \ | ||
when GPU is not available"); | ||
#endif | ||
} |
We need to make sure that the device type is CPU. DLPack supports other device types too which we are not able to handle.
I've added a check on L407-L409 to throw an exception if the device is not CPU or CUDAHost (which we support).
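As a consumer-side illustration of the same kind of guard (expressed through `__dlpack_device__` rather than the capsule, and not the PR's actual C++ check), using DLPack's `DLDeviceType` values 1 = kDLCPU, 2 = kDLCUDA, 3 = kDLCUDAHost:

```python
SUPPORTED_DLPACK_DEVICE_TYPES = {1, 2, 3}  # kDLCPU, kDLCUDA, kDLCUDAHost

def check_supported_device(tensor) -> None:
    # Reject device types the backend cannot handle before converting.
    device_type, _ = tensor.__dlpack_device__()
    if int(device_type) not in SUPPORTED_DLPACK_DEVICE_TYPES:
        raise TypeError(f"Unsupported DLPack device type: {int(device_type)}")
```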
I was curious why the sync was required. Did you observe a failing test without synchronization?
@Tabrizian Yes, in the multi-GPU, multi-stream case, when streams are non-blocking, I found that without synchronization this test fails and we don't get the expected data.
@Tabrizian I found out that the issue in the observed failing tests may not be due to the lack of synchronization in the …
@Tabrizian It's ready for re-review. The only time we need synchronization is when we need to change context and there is pending work on the default stream. Since the standard does not say to sync on the default stream, not everyone does. Then in this case we can return earlier than the work is done, thus …
  // and we will return pointer to the data on different GPU too early
  // (i.e. before pending work is done). Thus we sync on the default stream
  // only in the case we switched to a different context.
  err = overridden ? cudaStreamSynchronize(0) : cudaSuccess;
I'm not sure why synchronization is needed. Looking at CuPy's dlpack implementation it doesn't seem like they are synchronizing.
This is based on testing and research: when we need to switch devices, e.g. from GPU:0 to GPU:1, to consume the capsule, and there is pending work on the default stream on GPU:1, the wait-for-the-event trick will not be invoked, since the external tensor is on the default stream and CuPy and PyTorch do not use the wait-for-the-event trick if the tensor and the passed stream are both on the default stream. If we don't sync on the default stream, we just consume the capsule and switch back, so a situation like the one in the multi-GPU test will fail, i.e. there is still a pending job on that tensor on the default stream.
Note for this PR: there is a need to follow up with CuPy and PyTorch to investigate why they do not (?) handle stream=1, i.e. do not convert it to the default stream (legacy stream).
This PR updates the DLPack implementation for PbTensor according to the Array API and DLPack v0.8 requirements.

Major changes:

- Added `__dlpack__` and `__dlpack_device__` implementations according to the Array API standard.
- `PbTensor.from_dlpack()` accepts a PyCapsule (as before) and an external tensor directly; `PbTensor::FromDLPack` takes care of that. The former `PbTensor::FromDLPack` is now `PbTensor::FromDLPackCapsule`. (See the usage sketch at the end of this description.)
- `PbTensor::DLPack`, which is bound to `__dlpack__(self, Optional[Union[int, Any]] stream)`, makes sure to synchronize streams if needed (the passed stream is not a default stream, not -1, and not a legacy stream). Since PbTensor does not support async execution, if `DLPack` receives the default stream, sync is not needed.
- Refactored the `switch` into `PbTensor::DeviceType`, because I realized I need this for `DLPackDevice`.
- `DLPackDevice` has return type `std::pair<int32_t, int64_t>` with the following in mind. This isn't finished yet; when it is done, a more elegant type description can be put in instead. For now I used the types indicated in the code. Open to any suggestions/alternatives.

Fixes issue: 3944
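Usage sketch of the two directions this change enables (PyTorch is used as an assumed external framework; only the documented `pb_utils.Tensor` calls are from this PR, and the code is meant to run inside a Python backend model):

```python
import torch
import triton_python_backend_utils as pb_utils

external = torch.tensor([1.0, 2.0, 3.0])

# External tensor -> Python backend tensor, no explicit to_dlpack() step.
pb_tensor = pb_utils.Tensor.from_dlpack("INPUT0", external)

# Python backend tensor -> external framework; torch.from_dlpack accepts any
# object exposing __dlpack__/__dlpack_device__, which PbTensor now implements.
roundtrip = torch.from_dlpack(pb_tensor)
```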