
CUDA Graph Error - CUDA failure 900: operation not permitted when stream is capturing #15002

Open
tianleiwu opened this issue Mar 11, 2023 · 10 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

@tianleiwu
Contributor

Describe the issue

During CUDA graph capture, ORT triggers cudaStreamSynchronize, which is not allowed while a stream is capturing. The call stack is like the following:

libonnxruntime_providers_cuda.so!onnxruntime::CudaStream::CleanUpOnRunEnd(onnxruntime::CudaStream * const this) git\onnxruntime\onnxruntime\core\providers\cuda\cuda_stream_handle.cc:141)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollectionImpl::CleanUp(onnxruntime::DeviceStreamCollectionImpl * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:30)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollection::CleanUp(onnxruntime::DeviceStreamCollection * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:113)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const bool & terminate_flag, const onnxruntime::logging::Logger & logger, bool sync_execution_provider, bool only_execute_path_to_fetches, onnxruntime::Stream * parent_stream) git\onnxruntime\onnxruntime\core\framework\utils.cc:782)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const onnxruntime::RunOptions & run_options, const onnxruntime::logging::Logger & logger) git\onnxruntime\onnxruntime\core\framework\utils.cc:817)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> feed_names, gsl::span<OrtValue const, 18446744073709551615> feeds, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> output_names, std::vector<OrtValue, std::allocator<OrtValue> > * p_fetches, const std::vector<OrtDevice, std::allocator<OrtDevice> > * p_fetches_device_info) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2001)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2155)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2160)
onnxruntime_pybind11_state.so!onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>::operator()(onnxruntime::python::PyInferenceSession *, onnxruntime::SessionIOBinding &, onnxruntime::RunOptions *) const(const onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> * const __closure, onnxruntime::python::PyInferenceSession * sess, onnxruntime::SessionIOBinding & io_binding, onnxruntime::RunOptions * run_options) git\onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc:1668)
onnxruntime_pybind11_state.so!pybind11::detail::argument_loader<onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, OrtRunOptions*>::call_impl<void, onnxruntime::python::addObjectMethods(pybind11::module&, onnxruntime::Environment&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>&, 0, 1, 2, pybind11::detail::void_type>(onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> &, std::index_sequence, pybind11::detail::void_type &&)

The error is like the following (I added file and line to the log):

2023-03-10 11:24:05.061767687 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=??; file=/work/tlwu/git/onnxruntime/onnxruntime/core/providers/cuda/cuda_stream_handle.cc ; line=141 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(GetHandle()));

To reproduce

The error is not always triggered with a small model, but with a larger model like UNet it reproduces consistently.
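
For reference, a minimal Python sketch of the kind of script that hits the error (the model path, input/output names, and shapes below are placeholders; it assumes a CUDA build of onnxruntime with the CUDA EP's enable_cuda_graph option turned on):

import numpy as np
import onnxruntime as ort

# Enable CUDA graph capture on the CUDA execution provider.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("unet.onnx", providers=providers)  # placeholder model path

# CUDA graphs require inputs/outputs bound to fixed GPU buffers via IOBinding.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((2, 4, 64, 64), dtype=np.float32), "cuda", 0)   # placeholder input shape
y = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((2, 4, 64, 64), dtype=np.float32), "cuda", 0)   # placeholder output shape

binding = sess.io_binding()
binding.bind_ortvalue_input("sample", x)        # placeholder input name
binding.bind_ortvalue_output("out_sample", y)   # placeholder output name

sess.run_with_iobinding(binding)  # first run: regular execution
# In ORT 1.14 the CUDA graph is captured during the second run; the
# cudaStreamSynchronize in CleanUpOnRunEnd then fires while the stream
# is still capturing and produces CUDA failure 900.
sess.run_with_iobinding(binding)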

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Mar 11, 2023
@hariharans29
Member

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph? AFAIK, the CUDA stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

@hariharans29
Member

@tianleiwu - Could it be that the "large" UNet model uses a kernel that internally calls cudaStreamSynchronize()? That may be one of the cases where we unfortunately can't use CUDA Graphs.

For the "small" model, it may be that the op/kernel that uses the stream synchronize doesn't kick in. If you look at the CUDA EP setup that captures the graph, we first finish capturing the graph in OnRunEnd() here -

and only then do the stream sync here -
CUDA_RETURN_IF_ERROR(cudaStreamSynchronize(static_cast<cudaStream_t>(stream_)));
before returning control back to the caller.

Unfortunately, if one of the intermediate kernels encountered between graph-capture begin and graph-capture end contains synchronization logic, the graph cannot be captured.
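
To make the restriction concrete, here is a small standalone sketch (not ONNX Runtime code; it assumes PyTorch with a working CUDA device is available) showing that a synchronization call issued inside a capture region fails with the same "operation not permitted when stream is capturing" family of errors:

import torch

assert torch.cuda.is_available()

x = torch.ones(16, device="cuda")
y = x * 2                  # warm up so lazy initialization happens before capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):   # begins stream capture on a side stream
        y = x * 2
        # Synchronizing inside the capture region is illegal, just like the
        # cudaStreamSynchronize() in CleanUpOnRunEnd discussed above.
        torch.cuda.synchronize()
except RuntimeError as err:
    print(err)  # expect a CUDA stream-capture error (the failure-900 family)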

@feihugis
Contributor

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph? AFAIK, the CUDA stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

Hi @hariharans29 and @tianleiwu, sorry for the late response. I did not see this message earlier and only came across it when searching my email for something else.

Yes, the model we shipped around one year ago did not hit any issues when capturing the CUDA graph.

Recently, when I tried GPT-2 + Beam Search, I ran into similar issues. After making some code changes (feihugis@de67b88), CUDA graph capturing works, but because some of the ops are not on GPU, the outputs are not correct.

Please feel free to ping me on Teams if I miss your comments.

tianleiwu added a commit that referenced this issue Jun 15, 2023
Fix two issues related to cuda graph capture:
#14942 and
#15002

Issue 1: Previously, graph capture started at the second run. However,
memory pattern optimization allocates memory in the second run, and
cudaMalloc is not allowed during graph capture. In this PR, graph
capture starts after two runs to avoid the issue.

Issue 2: #13495 introduced multiple-stream support, but stream cleanup
calls cudaStreamSynchronize, which is not allowed during CUDA graph
capture. In this PR, stream cleanup is moved to after CUDA graph capture.

Also update the SqueezeNet test model with a dynamic axis so that we can
test with a larger batch size, and add a test that reproduces the bug
(when changing min runs from 2 back to 1).
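
As an illustration of the fixed behavior (a sketch, not code from the PR; the model path, input/output names, and shapes are placeholders), the first runs execute normally and only later runs replay the captured graph:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "squeezenet_dynamic.onnx",  # placeholder for a model with a dynamic batch axis
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})])

x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((8, 3, 224, 224), dtype=np.float32), "cuda", 0)  # larger batch via the dynamic axis
y = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((8, 1000), dtype=np.float32), "cuda", 0)          # placeholder output shape

binding = sess.io_binding()
binding.bind_ortvalue_input("data", x)      # placeholder input name
binding.bind_ortvalue_output("output", y)   # placeholder output name

for _ in range(3):
    sess.run_with_iobinding(binding)  # runs 1-2 execute normally; capture happens on run 3 per this PR
sess.run_with_iobinding(binding)      # later runs replay the captured CUDA graph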
@snnn
Member

snnn commented Jul 23, 2024

I still see this error when running multiple models in parallel. You can reproduce the error by running:

./onnx_test_runner -e cuda /data/onnx 

The folder /data/onnx holds test models and their input/output data from https://github.com/onnx/onnx

@snnn snnn reopened this Jul 23, 2024
@snnn
Member

snnn commented Jul 23, 2024

2024-07-23 16:30:08.420038342 [E:onnxruntime:Default, dataitem_request.cc:32 operator()] argmin_default_axis_random:Non-zero status code returned while running ArgMin node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
2024-07-23 16:30:08.889316320 [E:onnxruntime:clip, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Clip node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing

@tianleiwu
Contributor Author

tianleiwu commented Jul 23, 2024

@snnn, this issue is about the CUDA graph error in a single thread. The error you reported is a different, multi-threading issue.

The stream-capturing error should not appear when CUDA graph is not enabled. If you see that error in onnx_test_runner, it basically means some ORT code is not thread-safe, which causes a buffer overrun and corrupts the call stack.

@tham-tran-ts

Is there any update on this issue? I have the same problem when starting Triton Server.

@fclearner

Is there any update on this issue? I have the same problem when starting Triton Server.

Hi, did you solve it? I ran into the same problem.

@tianleiwu
Contributor Author

@tham-tran-ts, @fclearner, please provide a repro (test script and model) if you need help.

@fclearner

Thanks, I have solved it. The cause was setting export CUDA_LAUNCH_BLOCKING=1, which makes ONNX Runtime not thread-safe on GPU.
