
CUDA Graph Error - CUDA failure 900: operation not permitted when stream is capturing #15002

Open
tianleiwu opened this issue Mar 11, 2023 · 10 comments
Labels
ep:CUDA issues related to the CUDA execution provider

Comments

@tianleiwu
Contributor

Describe the issue

During CUDA graph capture, ORT triggers cudaStreamSynchronize, which is not allowed while a stream is capturing. The call stack is like the following:

libonnxruntime_providers_cuda.so!onnxruntime::CudaStream::CleanUpOnRunEnd(onnxruntime::CudaStream * const this) git\onnxruntime\onnxruntime\core\providers\cuda\cuda_stream_handle.cc:141)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollectionImpl::CleanUp(onnxruntime::DeviceStreamCollectionImpl * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:30)
onnxruntime_pybind11_state.so!onnxruntime::DeviceStreamCollection::CleanUp(onnxruntime::DeviceStreamCollection * const this, bool sync_streams) git\onnxruntime\onnxruntime\core\framework\device_stream_collection.cc:113)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const bool & terminate_flag, const onnxruntime::logging::Logger & logger, bool sync_execution_provider, bool only_execute_path_to_fetches, onnxruntime::Stream * parent_stream) git\onnxruntime\onnxruntime\core\framework\utils.cc:782)
onnxruntime_pybind11_state.so!onnxruntime::utils::ExecuteGraph(const onnxruntime::SessionState & session_state, onnxruntime::FeedsFetchesManager & feeds_fetches_manager, gsl::span<OrtValue const, 18446744073709551615> feeds, std::vector<OrtValue, std::allocator<OrtValue> > & fetches, ExecutionMode execution_mode, const onnxruntime::RunOptions & run_options, const onnxruntime::logging::Logger & logger) git\onnxruntime\onnxruntime\core\framework\utils.cc:817)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> feed_names, gsl::span<OrtValue const, 18446744073709551615> feeds, gsl::span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, 18446744073709551615> output_names, std::vector<OrtValue, std::allocator<OrtValue> > * p_fetches, const std::vector<OrtDevice, std::allocator<OrtDevice> > * p_fetches_device_info) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2001)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, const onnxruntime::RunOptions & run_options, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2155)
onnxruntime_pybind11_state.so!onnxruntime::InferenceSession::Run(onnxruntime::InferenceSession * const this, onnxruntime::IOBinding & io_binding) git\onnxruntime\onnxruntime\core\session\inference_session.cc:2160)
onnxruntime_pybind11_state.so!onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>::operator()(onnxruntime::python::PyInferenceSession *, onnxruntime::SessionIOBinding &, onnxruntime::RunOptions *) const(const onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> * const __closure, onnxruntime::python::PyInferenceSession * sess, onnxruntime::SessionIOBinding & io_binding, onnxruntime::RunOptions * run_options) git\onnxruntime\onnxruntime\python\onnxruntime_pybind_state.cc:1668)
onnxruntime_pybind11_state.so!pybind11::detail::argument_loader<onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, OrtRunOptions*>::call_impl<void, onnxruntime::python::addObjectMethods(pybind11::module&, onnxruntime::Environment&, onnxruntime::python::ExecutionProviderRegistrationFn)::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)>&, 0, 1, 2, pybind11::detail::void_type>(onnxruntime::python::<lambda(onnxruntime::python::PyInferenceSession*, onnxruntime::SessionIOBinding&, onnxruntime::RunOptions*)> &, std::index_sequence, pybind11::detail::void_type &&)

The error is like the following (I added file and line to the log):

2023-03-10 11:24:05.061767687 [E:onnxruntime:Default, cuda_call.cc:116 CudaCall] CUDA failure 900: operation not permitted when stream is capturing ; GPU=0 ; hostname=??; file=/work/tlwu/git/onnxruntime/onnxruntime/core/providers/cuda/cuda_stream_handle.cc ; line=141 ; expr=cudaStreamSynchronize(static_cast<cudaStream_t>(GetHandle()));

To reproduce

The error is not always triggered with a small model, but with a larger model like UNet it reproduces consistently.
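
For reference, a minimal Python sketch of the kind of script that hits the error (the model path, input/output names, and shapes below are placeholders; it assumes a CUDA build of onnxruntime with the CUDA EP's enable_cuda_graph option turned on):

import numpy as np
import onnxruntime as ort

# Enable CUDA graph capture on the CUDA execution provider.
providers = [("CUDAExecutionProvider", {"enable_cuda_graph": "1"})]
sess = ort.InferenceSession("unet.onnx", providers=providers)  # placeholder model path

# CUDA graphs require inputs/outputs bound to fixed GPU buffers via IOBinding.
x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((2, 4, 64, 64), dtype=np.float32), "cuda", 0)   # placeholder input shape
y = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((2, 4, 64, 64), dtype=np.float32), "cuda", 0)   # placeholder output shape

binding = sess.io_binding()
binding.bind_ortvalue_input("sample", x)        # placeholder input name
binding.bind_ortvalue_output("out_sample", y)   # placeholder output name

sess.run_with_iobinding(binding)  # first run: regular execution
# In ORT 1.14 the CUDA graph is captured during the second run; the
# cudaStreamSynchronize in CleanUpOnRunEnd then fires while the stream
# is still capturing and produces CUDA failure 900.
sess.run_with_iobinding(binding)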

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

@github-actions github-actions bot added the ep:CUDA issues related to the CUDA execution provider label Mar 11, 2023
@hariharans29
Member

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph? AFAIK, the CUDA stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

@hariharans29
Member

@tianleiwu - Could it be that the "large" UNet model uses a kernel that internally calls cudaStreamSynchronize()? That may be one of the cases where we unfortunately can't use CUDA Graphs.

For the "small" model, it may be that the op/kernel that uses the stream synchronize doesn't kick in. If you look at the CUDA EP setup that captures the graph, we first finish capturing the graph in OnRunEnd() here -

and only then do the stream sync here -
CUDA_RETURN_IF_ERROR(cudaStreamSynchronize(static_cast<cudaStream_t>(stream_)));
before returning control back to the caller.

Unfortunately, if one of the intermediate kernels encountered between graph-capture begin and graph-capture end contains synchronization logic, the graph cannot be captured.
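
To make the restriction concrete, here is a small standalone sketch (not ONNX Runtime code; it assumes PyTorch with a working CUDA device is available) showing that a synchronization call issued inside a capture region fails with the same "operation not permitted when stream is capturing" family of errors:

import torch

assert torch.cuda.is_available()

x = torch.ones(16, device="cuda")
y = x * 2                  # warm up so lazy initialization happens before capture
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
try:
    with torch.cuda.graph(g):   # begins stream capture on a side stream
        y = x * 2
        # Synchronizing inside the capture region is illegal, just like the
        # cudaStreamSynchronize() in CleanUpOnRunEnd discussed above.
        torch.cuda.synchronize()
except RuntimeError as err:
    print(err)  # expect a CUDA stream-capture error (the failure-900 family)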

@feihugis
Contributor

Hi @feihugis - I recall you saying that the model your team flighted also used CUDA Graph. Did you run into issues like the above while trying to capture the graph? AFAIK, the CUDA stream synchronize has always existed in the code. I wonder why we didn't see something like this while testing your model.

Hi @hariharans29 and @tianleiwu, sorry for the late response. I did not see this message earlier and only came across it when searching my email for something else.

Yes, the model we shipped around one year ago did not hit any issues when capturing the CUDA graph.

Recently, when I tried GPT-2 + Beam Search, I ran into similar issues. After making some code changes (feihugis@de67b88), CUDA graph capturing works, but because some of the ops are not on GPU, the outputs are not correct.

Please feel free to ping me on Teams if I miss your comments.

tianleiwu added a commit that referenced this issue Jun 15, 2023
Fix two issues related to cuda graph capture:
#14942 and
#15002

Issue 1: Previously, graph capture started at the second run. However,
memory pattern optimization allocates memory in the second run, and
cudaMalloc is not allowed during graph capture. In this PR, graph
capture starts after two runs to avoid the issue.

Issue 2: #13495 introduced multiple-stream support, but stream cleanup
calls cudaStreamSynchronize, which is not allowed during CUDA graph
capture. In this PR, stream cleanup is moved to after CUDA graph capture.

Also update the SqueezeNet test model with a dynamic axis so that we can
test with a larger batch size, and add a test that reproduces the bug
(when changing min runs from 2 back to 1).
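
As an illustration of the fixed behavior (a sketch, not code from the PR; the model path, input/output names, and shapes are placeholders), the first runs execute normally and only later runs replay the captured graph:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "squeezenet_dynamic.onnx",  # placeholder for a model with a dynamic batch axis
    providers=[("CUDAExecutionProvider", {"enable_cuda_graph": "1"})])

x = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((8, 3, 224, 224), dtype=np.float32), "cuda", 0)  # larger batch via the dynamic axis
y = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((8, 1000), dtype=np.float32), "cuda", 0)          # placeholder output shape

binding = sess.io_binding()
binding.bind_ortvalue_input("data", x)      # placeholder input name
binding.bind_ortvalue_output("output", y)   # placeholder output name

for _ in range(3):
    sess.run_with_iobinding(binding)  # runs 1-2 execute normally; capture happens on run 3 per this PR
sess.run_with_iobinding(binding)      # later runs replay the captured CUDA graph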
@snnn
Member

snnn commented Jul 23, 2024

I still see this error when running multiple models in parallel. You can reproduce the error by running:

./onnx_test_runner -e cuda /data/onnx 

The folder /data/onnx holds test models and their input/output data from https://github.com/onnx/onnx

@snnn snnn reopened this Jul 23, 2024
@snnn
Member

snnn commented Jul 23, 2024

2024-07-23 16:30:08.420038342 [E:onnxruntime:Default, dataitem_request.cc:32 operator()] argmin_default_axis_random:Non-zero status code returned while running ArgMin node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing
2024-07-23 16:30:08.889316320 [E:onnxruntime:clip, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running Clip node. Name:'' Status Message: CUDA error cudaErrorStreamCaptureUnsupported:operation not permitted when stream is capturing

@tianleiwu
Contributor Author

tianleiwu commented Jul 23, 2024

@snnn, this issue is about the CUDA graph error in a single thread. The error you reported is a different, multi-threading issue.

The stream-capturing error should not appear when CUDA graph is not enabled. If you see that error in onnx_test_runner, it basically means some ORT code is not thread-safe, which causes a buffer overrun and corrupts the call stack.

@tham-tran-ts

Is there any update on this issue? I have the same problem when starting Triton Server.

@fclearner

Is there any update on this issue? I have the same problem when starting Triton Server.

Hi, did you solve it? I ran into the same problem.

@tianleiwu
Contributor Author

@tham-tran-ts, @fclearner, please provide a repro (test script and model) if you need help.

@fclearner

Thanks, I have solved it. The cause was setting export CUDA_LAUNCH_BLOCKING=1, which makes ONNX Runtime not thread-safe on GPU.
