
CUDA: update compilation flags for improved performance #1099

Open · wants to merge 2 commits into master

Conversation


@royshil royshil commented Feb 3, 2025

This adds nvcc compile parallelization to speed up compilation of the .cu files (which currently takes more than 3 hours).
Setting --threads=0 lets nvcc determine how many cores it can use for parallelization.
Per the NVIDIA documentation: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#threads-number-t

```diff
@@ -96,7 +96,7 @@ if (CUDAToolkit_FOUND)

     set(CUDA_CXX_FLAGS "")

-    set(CUDA_FLAGS -use_fast_math)
+    set(CUDA_FLAGS -use_fast_math --threads=0 --split-compile=0)
```
Contributor

Should we instead create a new CMake option, e.g. CUDA_COMPILE_THREADS?
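
A minimal sketch of what such an option could look like, building on the diff above; the name GGML_CUDA_COMPILE_THREADS and the version guard are assumptions for illustration, not existing project code:

```cmake
# Hypothetical cache option controlling nvcc's internal parallelism.
set(GGML_CUDA_COMPILE_THREADS "0" CACHE STRING
    "Threads nvcc may use per .cu file (0 = let nvcc decide)")

set(CUDA_FLAGS -use_fast_math)
# nvcc gained --threads in CUDA 11.2; guard older toolkits.
if (CUDAToolkit_VERSION VERSION_GREATER_EQUAL "11.2")
    list(APPEND CUDA_FLAGS --threads=${GGML_CUDA_COMPILE_THREADS})
endif()
```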

Author

Yeah, that's an easy fix. Let me do that.

@JohannesGaessler
Collaborator

What is the advantage vs. specifying the number of threads via CMake? For example, this is the command that I use locally:

cmake -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON .. && time cmake --build . -j 32 -- --quiet

@royshil
Author

royshil commented Feb 3, 2025

> What is the advantage vs. specifying the number of threads via CMake? For example, this is the command that I use locally:
>
> cmake -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON .. && time cmake --build . -j 32 -- --quiet

@JohannesGaessler this is for the nvcc compiler building the .cu files, rather than the C/C++ compiler. For nvcc to parallelize internally, it needs the --threads option, which CMake does not pass to it by default.
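
As a sketch of an alternative that requires no change to the project's CMakeLists.txt, the flag can also be injected at configure time via the standard CMAKE_CUDA_FLAGS variable (the build directory and options here are illustrative):

```
# Give nvcc its internal parallelism (--threads) while still building
# independent translation units in parallel with -j.
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
      -DCMAKE_CUDA_FLAGS="--threads=0"
cmake --build build -j
```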

@royshil
Author

royshil commented Feb 3, 2025

BTW, using this option I was able to cut from-scratch compile time by 50%, from ~4 hrs to ~2 hrs.

@JohannesGaessler
Collaborator

CMake -j works for CUDA device code compilation.

@royshil
Author

royshil commented Feb 3, 2025

> CMake -j works for CUDA device code compilation.

@JohannesGaessler on Windows it does not.

@slaren
Member

slaren commented Feb 3, 2025

How are you testing? -j does work on Windows for me.

@slaren
Member

slaren commented Feb 3, 2025

> BTW using this option i was able to cut from-scratch compile time by 50% from ~4hrs to 2hr.

Even at 2 hours, that's much higher than expected, even when building for all the supported architectures. Can you share more details about the setup that you are using to build? Hardware, MSVC and CUDA toolkit versions, and anything else that you think may be relevant. -j alone usually works well enough because the CUDA backend has a lot of files that can be built in parallel.

@royshil
Author

royshil commented Feb 3, 2025

@slaren if you look at recent Windows cuBLAS builds for whisper.cpp, e.g. https://github.com/ggerganov/whisper.cpp/actions/runs/13115822916/job/36589762164, you'll notice they take roughly 4 hours to complete.
ggml doesn't build Windows CUDA in CI, so you're likely not seeing this, but my ggml/whisper.cpp build project consistently hits ~4 hr compile times: https://github.com/locaal-ai/occ-ai-dep-whispercpp/actions
This is all on GitHub runners. My home machine (i9 + 4090) completes compilation in ~45 minutes with --threads=0.

@slaren
Member

slaren commented Feb 3, 2025

Um yeah, the whisper CI does not even use -j, so I am not surprised. The llama.cpp CI takes ~35 minutes even when everything needs to be rebuilt (usually it is much less thanks to ccache). Most computers have many more threads available than the GitHub runners, and due to memory limitations we actually use one fewer thread, so this should be close to worst-case performance.
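
For reference, a ccache setup like the one mentioned here is typically wired through CMake's compiler-launcher variables, which also cover the CUDA compiler; a configure-time sketch, assuming ccache is installed on the runner:

```
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
      -DCMAKE_C_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
      -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache
```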

@royshil
Author

royshil commented Feb 3, 2025

@slaren so I just tried cmake --build . --parallel on my i9 and it completed in less than 10 minutes. I'll apply that in my CI as well and see what the impact is. But the --threads and --split-compile args for nvcc are different from CMake's process-level parallelization: they parallelize the build of an individual .cu file internally.

@slaren
Member

slaren commented Feb 3, 2025

Yes, absolutely: if it improves performance we should add it. But it may also cause thread contention when used together with -j, potentially resulting in overall lower performance, which is why it would be good to see a comparison.

@royshil
Author

royshil commented Feb 4, 2025

Here are my overall results:

  • -j && --threads: 1h 48m 18s
  • only --threads: 1h 53m 25s
  • only -j: 1h 48m 40s

So we can conclude that either -j or --threads gives a boost, but using both at the same time doesn't give any extra boost, as expected.
