How to verify if my local build has OpenMP support #1137
This is something I was wondering about recently (I don't have an M1/M2 Mac at the moment to verify this). Is only a single thread used for the entire execution? I was under the impression that Apple Accelerate uses multiple threads by default and that it can be controlled with the environment variable VECLIB_MAXIMUM_THREADS.
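(A minimal sketch of that check; the script name is just a placeholder:)

```bash
# Cap Apple Accelerate's internal thread pool for a single run and compare CPU usage.
VECLIB_MAXIMUM_THREADS=4 python run_inference.py  # run_inference.py is hypothetical
```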
Can you post your CMake command line and its output? Before running the command, please remove the file CMakeCache.txt.
Yeah, I verified that only 1 thread is being used, even with VECLIB_MAXIMUM_THREADS=4. ctranslate2 was installed via pip and I cycled through a couple of reinstalls. Since that didn't seem to have an effect, I went with building my own. To get CMake to find OpenMP, I had to use this:
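(The exact command was stripped when this page was archived; a plausible reconstruction, assuming Homebrew's libomp under /opt/homebrew/opt/libomp, uses CMake's FindOpenMP override variables:)

```bash
cmake .. -DOPENMP_RUNTIME=COMP \
  -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp -I/opt/homebrew/opt/libomp/include" \
  -DOpenMP_CXX_LIB_NAMES=omp \
  -DOpenMP_omp_LIBRARY=/opt/homebrew/opt/libomp/lib/libomp.dylib
```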
This is the output of cmake:
This looks fine to me, but I don't know if the flags are correct. It would be better to rely on the flags that are automatically set by CMake. What happens when you only set -DOPENMP_RUNTIME=COMP? If CMake does not find the OpenMP library, can you try the following?
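(The suggested snippet was likewise stripped; a common fallback on Apple Silicon, assuming Homebrew, is to install libomp and point CMake at it:)

```bash
brew install libomp
cmake .. -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp
```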
I think I know what's going on. I still have the default pip-installed version in my Python environment. Also, the pip package installs a bunch of dependent packages alongside ctranslate2.
However, the custom build only builds ctranslate2. How do I build and install the full package? This is lower priority for me now since I am able to make full use of the hardware on GPUs, and those are much faster than my M1 anyway. Really appreciate what you have done here and how helpful you are :)
I have the same issue as the OP. In addition to the CMake flags discussed above, I also tried the following:
Which gives this output:
I also see the following warnings:
Any suggestions would be highly appreciated.
@yadamonk Do you confirm that you also recompiled the Python package and installed it with pip?
Yes, after building the library I proceed as follows:
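(The commands were stripped from this comment; a typical build-from-source flow for the Python wrapper looks like the following, with the exact file names as assumptions:)

```bash
# After building and installing the C++ library:
cd python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install --force-reinstall dist/*.whl
```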
The package installs fine and works, but it seems to utilize only one core.
How did you verify that only one core is used? Did you try setting a number of threads explicitly? Note that most of the computation happens in Apple Accelerate, which only uses one core anyway, so the impact of OpenMP could be more subtle.
I run inference in a Jupyter notebook as follows:
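(The notebook cell itself was stripped; a minimal sketch of such a cell, assuming a converted model in ct2_model/ and pre-tokenized input:)

```python
import ctranslate2

# inter_threads: number of batches processed in parallel;
# intra_threads: threads per batch (this is where OpenMP matters).
translator = ctranslate2.Translator("ct2_model", device="cpu",
                                    inter_threads=1, intra_threads=8)
results = translator.translate_batch([["▁Hello", "▁world"]])  # placeholder tokens
print(results[0].hypotheses[0])
```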
In Activity Monitor I only see around 170% CPU utilization on a 12-core machine.
Can you try setting intra_threads to an explicit value?
I looked into it.
It’s possible that the lack of multithreading in Apple Accelerate starts to be an issue with large models. There are other configurations you can try on Apple M1. For example, you can try compiling with the Ruy backend to enable int8 execution. You can also try using OpenBLAS instead of Apple Accelerate.
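(For concreteness, those two configurations map to CTranslate2's CMake options roughly as follows; the exact combinations here are assumptions:)

```bash
# Option 1: enable the Ruy backend for int8 execution.
cmake .. -DWITH_RUY=ON -DOPENMP_RUNTIME=COMP

# Option 2: use OpenBLAS instead of Apple Accelerate.
cmake .. -DWITH_ACCELERATE=OFF -DWITH_OPENBLAS=ON -DOPENMP_RUNTIME=COMP
```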
It would be awesome to get int8 working. I found CTranslate2 via llama.cpp, which works like a charm even for the 30B parameter model. Sadly, neither suggestion made a difference for me. Both versions compile and install fine.
Could this be a Python thing? From a pure C++ implementation, everything seems to work. However, here are a few things I found out when I was trying to make it all work:
Thank you so much @panosk and @guillaumekln. You were right, I needed both settings.
Great! Is it now faster than Apple Accelerate?
Yes, significantly. Inference time went from minutes to seconds. On my 12-core M2 Max, optimal performance seems to come from tuning the thread count.
I'm using a generalized approach to cover all platforms, CPUs, and use cases, and these settings seem to work best:
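(The actual settings were stripped from this comment; purely as an illustration, a platform-agnostic default might scale the intra-op threads to the visible CPUs:)

```python
import os
import ctranslate2

# Illustrative defaults only, not the commenter's exact values.
cpus = os.cpu_count() or 4
translator = ctranslate2.Translator("model_dir", device="cpu", compute_type="auto",
                                    inter_threads=1, intra_threads=max(1, cpus - 2))
```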
Should this error still be triggered on M2 Macs with the CTranslate2 3.10.3 prebuilt binaries? Edit: I just realized the changes here were only merged into master, and 3.10.3 is behind.
Hello! First I installed OpenMP, then I installed Ruy (using a Bazel build), and I built ctranslate2 with the flags above. Using this method on a 3-minute audio file, the CPU usage increased from 120% to about 450% and the inference time dropped from 33s to 25s, which is great! Thanks! Edit: BTW, using 12 threads (instead of 8) made the speed much worse!
You are running a different model than the previous comments, so it is expected that you are seeing a different behavior. Since you are running a Whisper model, you could try using a larger model size and see how the CPU usage changes.
Thank you for your help. @guillaumekln
The latest version (3.11.0) enables the Ruy backend for the macOS ARM Python wheels, so now you don't need to recompile the package to execute in 8-bit with multiple threads.
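(So with the stock wheel, something like this should run int8 with multiple threads; the model path is a placeholder:)

```python
import ctranslate2

# With the ctranslate2>=3.11.0 macOS ARM wheels, Ruy handles int8 execution.
translator = ctranslate2.Translator("model_dir", device="cpu",
                                    compute_type="int8", intra_threads=8)
```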
That's great! Thanks!
I wonder about support for Apple Silicon GPU/MPS.
@ciekawy You can open another issue for this feature if you want to, but I don't plan to work on this in the short term.
In #1188, I'm adding a custom threading implementation that is used when OpenMP is not available. I'm wondering how useful this is for Apple Silicon, especially in the context of Whisper models. @zara0m Could you help test this new build and see how it impacts the performance? You could try it with both of the configurations discussed above.
Note that Apple Accelerate will still use a single core, but now other operations are multithreaded.
I just tested it, and it improves performance on my machine.
Thanks @zara0m for the test. So I will include this change in the next version.
System: M1 Mac
With vanilla ctranslate2 (installed via pip), I was unable to use more than 1 thread and was getting this warning when I tried to increase the threads: "The number of threads (intra_threads) is ignored in this build".
So, with great effort, I was finally able to build my local version from source. I supplied OPENMP_RUNTIME as "COMP". I installed it and did the Python wrapper installation too. Still getting the same error.
How do I check how my local build of ctranslate2 is configured?
Thanks!
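(Not spelled out in the thread, but a standard macOS way to check is to list what the installed native library links against; libomp.dylib in the output means OpenMP is in:)

```bash
# Find where the package is installed, then inspect its shared library.
python -c "import ctranslate2, os; print(os.path.dirname(ctranslate2.__file__))"
otool -L /path/printed/above/*.so   # look for libomp.dylib among the dependencies
```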