CUDA support #57

Open
dllu opened this issue Feb 7, 2021 · 4 comments
dllu commented Feb 7, 2021

I am interested in getting onnxruntime-rs running with CUDA-based inference. (I'm also interested in getting AMD MIGraphX inference working, but that's a whole other can of worms.)

Anyway in onnxruntime-rs/onnxruntime-sys/examples/c_api_sample.rs there is:

c_api_sample.rs:52:    // E.g. for CUDA include cuda_provider_factory.h and uncomment the following line:
c_api_sample.rs:53:    // OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);

But uncommenting the line doesn't work, since the symbols OrtSessionOptionsAppendExecutionProvider_CUDA and sessionOptions are not available.

More generally, CUDA doesn't seem to be working: inference still runs on the CPU even though I compiled with

ORT_LIB_LOCATION /usr/local/
ORT_STRATEGY system
ORT_USE_CUDA 1

with onnxruntime compiled with ./build.sh --use_cuda --cudnn_home /usr/ --cuda_home /opt/cuda/ --config RelWithDebInfo --parallel --build_shared_lib and installed in /usr/local.

Author

dllu commented Feb 7, 2021

Hmm, I was able to compile it by adding cuda_provider_factory.h to wrapper.h and changing

-    // OrtSessionOptionsAppendExecutionProvider_CUDA(sessionOptions, 0);
+    unsafe {
+        OrtSessionOptionsAppendExecutionProvider_CUDA(session_options_ptr, 0);
+    }

onnxruntime-sys » cargo run --release --example c_api_sample
    Finished release [optimized] target(s) in 0.03s
     Running `/home/dllu/builds/onnxruntime-rs/target/release/examples/c_api_sample`
Using Onnxruntime C API
2021-02-07 14:22:25.905008087 [I:onnxruntime:, inference_session.cc:225 operator()] Flush-to-zero and denormal-as-zero are off
2021-02-07 14:22:25.905026622 [I:onnxruntime:, inference_session.cc:232 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true
2021-02-07 14:22:26.065756648 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065778619 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065789219 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for CudaPinned with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065794719 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065800780 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for CUDA_CPU with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065805940 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.065829023 [I:onnxruntime:, inference_session.cc:1083 Initialize] Initializing session.
2021-02-07 14:22:26.065835946 [I:onnxruntime:, inference_session.cc:1108 Initialize] Adding default CPU execution provider.
2021-02-07 14:22:26.065841687 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.065847277 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.068332739 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.068805883 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.069032486 [I:onnxruntime:, reshape_fusion.cc:37 ApplyImpl] Total fused reshape node count: 0
2021-02-07 14:22:26.069129407 [V:onnxruntime:, inference_session.cc:877 TransformGraph] Node placements
2021-02-07 14:22:26.069134326 [V:onnxruntime:, inference_session.cc:879 TransformGraph] All nodes have been placed on [CUDAExecutionProvider].
2021-02-07 14:22:26.069315986 [V:onnxruntime:, session_state.cc:76 CreateGraphInfo] SaveMLValueNameIndexMapping
2021-02-07 14:22:26.069354908 [V:onnxruntime:, session_state.cc:122 CreateGraphInfo] Done saving OrtValue mappings.
2021-02-07 14:22:26.398456273 [I:onnxruntime:test, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-02-07 14:22:26.398479366 [V:onnxruntime:test, bfc_arena.cc:41 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-02-07 14:22:26.398711981 [I:onnxruntime:, session_state_utils.cc:100 SaveInitializedTensors] Saving initialized tensors.
2021-02-07 14:22:26.398872991 [I:onnxruntime:, session_state_utils.cc:170 SaveInitializedTensors] [Memory] SessionStateInitializer statically allocates 4942848 bytes for Cuda

2021-02-07 14:22:26.401419217 [I:onnxruntime:, session_state_utils.cc:212 SaveInitializedTensors] Done saving initialized tensors
2021-02-07 14:22:26.401758791 [I:onnxruntime:, inference_session.cc:1258 Initialize] Session successfully initialized.
Number of inputs = 1
Input 0 : name=data_0
Input 0 : type=1
Input 0 : num_dims=4
Input 0 : dim 0=1
Input 0 : dim 1=3
Input 0 : dim 2=224
Input 0 : dim 3=224
2021-02-07 14:22:26.401885488 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:11 rounded_bytes:602112
2021-02-07 14:22:26.401899093 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.401904694 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 5991424
2021-02-07 14:22:26.401912919 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc264e46c00 to 0x7fc264f46c00
2021-02-07 14:22:26.401968072 [I:onnxruntime:, sequential_executor.cc:157 Execute] Begin execution
2021-02-07 14:22:26.401978962 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:13 rounded_bytes:3154176
2021-02-07 14:22:26.402159369 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 4194304 bytes.
2021-02-07 14:22:26.402164990 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 10185728
2021-02-07 14:22:26.402171021 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc265c00000 to 0x7fc266000000
2021-02-07 14:22:26.425433288 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:17 rounded_bytes:33554432
2021-02-07 14:22:26.425716587 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 33554432 bytes.
2021-02-07 14:22:26.425722548 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 43740160
2021-02-07 14:22:26.425726035 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc21a000000 to 0x7fc21c000000
2021-02-07 14:22:26.546222827 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:17 rounded_bytes:33554432
2021-02-07 14:22:26.546430695 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 33554432 bytes.
2021-02-07 14:22:26.546439421 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 77294592
2021-02-07 14:22:26.546443970 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc218000000 to 0x7fc21a000000
2021-02-07 14:22:26.563948623 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 rounded_bytes:256
2021-02-07 14:22:26.564005620 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.564011200 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 1048576
2021-02-07 14:22:26.564017101 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x7fc265a00200 to 0x7fc265b00200
2021-02-07 14:22:26.580391023 [I:onnxruntime:, sequential_executor.cc:475 Execute] [Memory] ExecutionFrame dynamically allocates 10003456 bytes for Cuda

2021-02-07 14:22:26.580407113 [I:onnxruntime:test, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:4 rounded_bytes:4096
2021-02-07 14:22:26.580417713 [I:onnxruntime:test, bfc_arena.cc:165 Extend] Extended allocation by 1048576 bytes.
2021-02-07 14:22:26.580422953 [I:onnxruntime:test, bfc_arena.cc:168 Extend] Total allocated bytes: 1048576
2021-02-07 14:22:26.580428403 [I:onnxruntime:test, bfc_arena.cc:171 Extend] Allocated memory at 0x556cf8b023c0 to 0x556cf8c023c0
Score for class [0] =  0.000045440655
Score for class [1] =  0.0038458658
Score for class [2] =  0.00012494654
Score for class [3] =  0.0011804515
Score for class [4] =  0.0013169352
Done!
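The line in that log that confirms GPU use is the verbose `TransformGraph` message "All nodes have been placed on [CUDAExecutionProvider]". As a rough sketch, a captured log could be checked for it programmatically. The helper name `placed_provider` is made up for illustration, and this assumes verbose logging is enabled and that the message wording matches the log above (it may differ between ONNX Runtime versions).

```rust
/// Returns the execution provider named in the verbose
/// "All nodes have been placed on [...]" log line, if such a line exists.
/// Hypothetical helper: it only does string matching on captured log text.
fn placed_provider(log: &str) -> Option<&str> {
    const MARKER: &str = "All nodes have been placed on [";
    log.lines().find_map(|line| {
        // Position just past the opening bracket of the provider name.
        let start = line.find(MARKER)? + MARKER.len();
        // Position of the closing bracket, relative to the whole line.
        let end = line[start..].find(']')? + start;
        Some(&line[start..end])
    })
}

fn main() {
    // Log line copied from the output above.
    let log = "2021-02-07 14:22:26.069134326 [V:onnxruntime:, inference_session.cc:879 TransformGraph] All nodes have been placed on [CUDAExecutionProvider].";
    match placed_provider(log) {
        Some(provider) => println!("nodes placed on {}", provider),
        None => println!("no node-placement line found"),
    }
}
```

If the line is missing, that does not by itself prove the CPU was used; it may only mean the log level was too low to emit it.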

Author

dllu commented Feb 7, 2021

Oh yeah, I was actually able to get CUDA-based inference working with just

diff --git a/onnxruntime-sys/wrapper.h b/onnxruntime-sys/wrapper.h
index e63d352..c7c0cde 100644
--- a/onnxruntime-sys/wrapper.h
+++ b/onnxruntime-sys/wrapper.h
@@ -1 +1,2 @@
-#include "onnxruntime_c_api.h"
+#include "onnxruntime/core/providers/cuda/cuda_provider_factory.h"
+#include "onnxruntime/core/session/onnxruntime_c_api.h"
diff --git a/onnxruntime/src/session.rs b/onnxruntime/src/session.rs
index c3b6e88..53d2b0b 100644
--- a/onnxruntime/src/session.rs
+++ b/onnxruntime/src/session.rs
@@ -125,6 +125,14 @@ impl<'a> SessionBuilder<'a> {
         Ok(self)
     }

+    /// Use CUDA
+    pub fn use_cuda(self) -> Result<SessionBuilder<'a>> {
+        unsafe {
+            sys::OrtSessionOptionsAppendExecutionProvider_CUDA(self.session_options_ptr, 0);
+        }
+        Ok(self)
+    }
+
     /// Set the session's allocator
     ///
     /// Defaults to [`AllocatorType::Arena`](../enum.AllocatorType.html#variant.Arena)

and then regenerating the bindings and building with the environment variables listed above (ORT_USE_CUDA and so on). On my machine, a Titan Xp with CUDA is about 8 to 10 times faster than the CPU (AMD Ryzen 9 3900X).

I'm not sure how to make these changes work for people who don't use CUDA, though. It may need some kind of cfg gate.


dllu changed the title from "Uncommenting suggested line in c_api_sample.rs doesn't work" to "CUDA support" on Feb 8, 2021
@nbigaouette
Owner

Great! I'm glad you were able to make it work.

As you found out, uncommenting the lines in the example will not work; it's a copy-paste from the original C example which I left when I ported the example. See https://github.com/microsoft/onnxruntime/blob/v1.4.0/csharp/test/Microsoft.ML.OnnxRuntime.EndToEndTests.Capi/C_Api_Sample.cpp#L41-L43

From your last patch it seems there is no need for a cfg since the API exists in the runtime.

I'm not sure about how the function is being called, though. If you look at how other functions are called, there is a difference. For example, in with_number_threads() the function is accessed through g_ort() and the returned status is checked for an error.

I don't have access to an NVIDIA system for now, so it's hard for me to test this...

Author

dllu commented Feb 9, 2021

Unlike SetIntraOpNumThreads, it seems that the function OrtSessionOptionsAppendExecutionProvider_CUDA exists in the global scope and you don't need to call it through g_ort(). However, we should probably check its return value for error.
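A minimal sketch of what checking the return value could look like. The C function returns an `OrtStatus*` that is null on success, so the check is a null test. Everything below is a stand-in so the snippet runs on its own: `OrtStatus` here is a dummy opaque type, `append_cuda_stub` is a hypothetical replacement for the real FFI call, and `status_to_result` is an illustrative helper name, not something taken from the crate.

```rust
use std::ptr::{self, NonNull};

/// Stand-in for the opaque `OrtStatus` type from the generated bindings.
#[repr(C)]
pub struct OrtStatus {
    _private: [u8; 0],
}

/// Converts a status pointer into a `Result`: a null pointer means success.
fn status_to_result(status: *mut OrtStatus) -> Result<(), String> {
    if status.is_null() {
        Ok(())
    } else {
        // Real code would read the message via the C API's GetErrorMessage
        // and then free the status with ReleaseStatus. The stub pointer used
        // here is never dereferenced.
        Err(String::from("failed to append the CUDA execution provider"))
    }
}

/// Hypothetical stand-in for
/// `sys::OrtSessionOptionsAppendExecutionProvider_CUDA(options, device_id)`.
/// Returns null on success and a non-null dummy pointer on simulated failure.
/// The real function is an FFI call and would need an `unsafe` block.
fn append_cuda_stub(simulate_failure: bool) -> *mut OrtStatus {
    if simulate_failure {
        NonNull::<OrtStatus>::dangling().as_ptr()
    } else {
        ptr::null_mut()
    }
}

fn main() {
    println!("success case: {:?}", status_to_result(append_cuda_stub(false)));
    println!("failure case: {:?}", status_to_result(append_cuda_stub(true)));
}
```

In `use_cuda()` this would presumably turn the body into something like "call the function, convert the status, then return `Ok(self)`", mirroring how the other builder methods handle their status values.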

Relevant documentation from onnxruntime_c_api.h:

  /**
    * To use additional providers, you must build ORT with the extra providers enabled. Then call one of these
    * functions to enable them in the session:
    *   OrtSessionOptionsAppendExecutionProvider_CPU
    *   OrtSessionOptionsAppendExecutionProvider_CUDA
    *   OrtSessionOptionsAppendExecutionProvider_<remaining providers...>
    * The order they are called indicates the preference order as well. In other words call this method
    * on your most preferred execution provider first followed by the less preferred ones.
    * If none are called Ort will use its internal CPU execution provider.
    */

The cfg is needed because linking may fail if ONNX Runtime isn't compiled with the CUDA execution provider.
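One possible shape for such a gate, sketched with a Cargo feature. The feature name `cuda` is an assumption (it would have to be declared under `[features]` in Cargo.toml), and `SessionBuilder` here is a simplified stand-in rather than the crate's real type. The point is only that the CUDA symbol is referenced solely in builds that opt in.

```rust
/// Simplified stand-in for the crate's session builder.
#[derive(Debug, Default)]
pub struct SessionBuilder {
    cuda_device: Option<i32>,
}

impl SessionBuilder {
    /// Compiled only when the assumed `cuda` feature is enabled, so builds
    /// without it never reference the CUDA provider symbol at link time.
    #[cfg(feature = "cuda")]
    pub fn use_cuda(mut self, device_id: i32) -> Result<Self, String> {
        // The real FFI call to OrtSessionOptionsAppendExecutionProvider_CUDA
        // would go here; this sketch only records the request.
        self.cuda_device = Some(device_id);
        Ok(self)
    }

    /// Fallback for builds without the feature: report the problem at run
    /// time instead of failing to link.
    #[cfg(not(feature = "cuda"))]
    pub fn use_cuda(self, _device_id: i32) -> Result<Self, String> {
        Err(String::from("built without the `cuda` feature"))
    }

    /// Device requested for CUDA, if any.
    pub fn cuda_device(&self) -> Option<i32> {
        self.cuda_device
    }
}

fn main() {
    match SessionBuilder::default().use_cuda(0) {
        Ok(builder) => println!("CUDA device: {:?}", builder.cuda_device()),
        Err(err) => println!("CUDA unavailable: {}", err),
    }
}
```

An alternative would be to omit the fallback method entirely so that calling `use_cuda()` without the feature is a compile error; which of the two is friendlier is a design choice for the crate.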
