
CUDA: update compilation flags for improved performance #1099

Open · wants to merge 2 commits into master

Conversation


@royshil royshil commented Feb 3, 2025

This adds nvcc compile parallelization to speed up compilation of the .cu files (which currently takes more than 3 hours).
Setting --threads=0 lets nvcc determine how many cores it can use for parallelization.
Per the NVIDIA documentation: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#threads-number-t

```diff
@@ -96,7 +96,7 @@ if (CUDAToolkit_FOUND)

     set(CUDA_CXX_FLAGS "")

-    set(CUDA_FLAGS -use_fast_math)
+    set(CUDA_FLAGS -use_fast_math --threads=0 --split-compile=0)
```
Contributor

Should we instead create a new CMake option, e.g. CUDA_COMPILE_THREADS?
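
A minimal sketch of what such an option could look like, building on the diff above; the name GGML_CUDA_COMPILE_THREADS and the version guard are assumptions for illustration, not existing project code:

```cmake
# Hypothetical cache option controlling nvcc's internal parallelism.
set(GGML_CUDA_COMPILE_THREADS "0" CACHE STRING
    "Threads nvcc may use per .cu file (0 = let nvcc decide)")

set(CUDA_FLAGS -use_fast_math)
# nvcc gained --threads in CUDA 11.2; guard older toolkits.
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.2")
    list(APPEND CUDA_FLAGS --threads=${GGML_CUDA_COMPILE_THREADS})
endif()
```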

Author

Yeah, that's an easy fix. Let me do that.

@JohannesGaessler
Collaborator

What is the advantage vs. specifying the number of threads via CMake? For example, this is the command that I use locally:

cmake -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON .. && time cmake --build . -j 32 -- --quiet

@royshil
Author

royshil commented Feb 3, 2025

> What is the advantage vs. specifying the number of threads via CMake? For example, this is the command that I use locally:
>
> cmake -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON .. && time cmake --build . -j 32 -- --quiet

@JohannesGaessler this is for the nvcc compiler building the .cu files, rather than the C/C++ compiler. For nvcc to parallelize internally, it needs the --threads option, which CMake does not pass to it by default.
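
As a sketch of an alternative that requires no change to the project's CMakeLists.txt, the flag can also be injected at configure time via the standard CMAKE_CUDA_FLAGS variable (the build directory and options here are illustrative):

```
# Give nvcc its internal parallelism (--threads) while still building
# independent translation units in parallel with -j.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
      -DCMAKE_CUDA_FLAGS="--threads=0"
cmake --build build -j
```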

@royshil
Author

royshil commented Feb 3, 2025

BTW, using this option I was able to cut from-scratch compile time by 50%, from ~4 hrs to ~2 hrs.

@JohannesGaessler
Collaborator

CMake -j works for CUDA device code compilation.

@royshil
Author

royshil commented Feb 3, 2025

> CMake -j works for CUDA device code compilation.

@JohannesGaessler on Windows it does not.

@slaren
Member

slaren commented Feb 3, 2025

How are you testing? -j does work on Windows for me.

@slaren
Member

slaren commented Feb 3, 2025

> BTW using this option i was able to cut from-scratch compile time by 50% from ~4hrs to 2hr.

Even at 2 hours, that's much higher than expected, even when building for all the supported architectures. Can you share more details about the setup that you are using to build? Hardware, MSVC and CUDA toolkit versions, and anything else that you think may be relevant. -j alone usually works well enough because the CUDA backend has a lot of files that can be built in parallel.

@royshil
Author

royshil commented Feb 3, 2025

@slaren if you look at recent Windows cuBLAS builds for whisper.cpp, e.g. https://github.com/ggerganov/whisper.cpp/actions/runs/13115822916/job/36589762164, you'll notice they take roughly 4 hours to complete.
ggml doesn't build Windows CUDA in CI, so you're likely not seeing this, but my ggml/whisper.cpp build project consistently hits ~4 hr compile times: https://github.com/locaal-ai/occ-ai-dep-whispercpp/actions
This is all on GitHub runners. My home machine (i9 + 4090) completes compilation in ~45 minutes with --threads=0.

@slaren
Member

slaren commented Feb 3, 2025

Um yeah, the whisper CI does not even use -j, so I am not surprised. The llama.cpp CI takes ~35 minutes even when everything needs to be rebuilt (usually it is much less thanks to ccache). Most computers have many more threads available than the GitHub runners, and due to memory limitations we actually use one fewer thread, so this should be close to worst-case performance.
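
For reference, a ccache setup like the one mentioned here is typically wired through CMake's compiler-launcher variables, which also cover the CUDA compiler; a configure-time sketch, assuming ccache is installed on the runner:

```
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache
```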

@royshil
Author

royshil commented Feb 3, 2025

@slaren so I just tried cmake --build . --parallel on my i9 and it completed in less than 10 minutes. I'll apply that in my CI as well and see what the impact is. But the --threads and --split-compile args for nvcc are different from CMake's process-level parallelization: they parallelize the build of an individual .cu file internally.

@slaren
Member

slaren commented Feb 3, 2025

Yes, absolutely: if it improves performance we should add it. But it may also cause thread contention when used together with -j, potentially resulting in overall lower performance, which is why it would be good to see a comparison.

@royshil
Author

royshil commented Feb 4, 2025

Here are my overall results:

  • -j && --threads: 1h 48m 18s
  • only --threads: 1h 53m 25s
  • only -j: 1h 48m 40s

So we can conclude that either -j or --threads gives a boost, but using both at the same time doesn't give any extra boost, as expected.
