
How to verify if my local build has OpenMP support #1137

Closed
maxcribe opened this issue Mar 17, 2023 · 30 comments

@maxcribe

System: M1 Mac

With vanilla ctranslate2 (installed via pip), I was unable to use more than one thread and got this warning when I tried to increase the thread count: "The number of threads (intra_threads) is ignored in this build"

So, with great effort, I was finally able to build my local version from source. I set the OpenMP runtime to COMP (-DOPENMP_RUNTIME=COMP). I installed it and installed the Python wrapper as well. I'm still getting the same warning.

How do I check how my local build of ctranslate2 is configured?

Thanks!
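To make the question concrete: one way to inspect a build's configuration (an assumption on my part: that the CT2_VERBOSE=1 environment variable still enables the info-level startup log, whose format is taken from logs posted later in this thread) is to run any model with verbose logging and parse the banner. The presence of the "intra_threads is ignored" warning itself is also a signal that the build lacks OpenMP.

```python
import re

# Sample startup banner as printed when CT2_VERBOSE=1 is set (format
# copied from logs in this thread; treat it as an assumption, not a
# stable API).
SAMPLE_LOG = """\
[thread 1] [info] CPU: ARM (NEON=true)
[thread 1] [info]  - Selected ISA: NEON
[thread 1] [info]  - Use Intel MKL: false
[thread 1] [info]  - SGEMM backend: Accelerate (packed: false)
[thread 1] [info]  - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
"""

def parse_backend_info(log_text):
    """Extract '<name> backend: <value>' pairs from the startup banner."""
    info = {}
    for line in log_text.splitlines():
        m = re.search(r"- (\w+ backend): (\w+)", line)
        if m:
            info[m.group(1)] = m.group(2)
    return info

print(parse_backend_info(SAMPLE_LOG))
# {'SGEMM backend': 'Accelerate', 'GEMM_S8 backend': 'Ruy'}
```

A "GEMM_S8 backend: none" entry would mean no int8 support was compiled in.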

@guillaumekln
Collaborator

This is something I was wondering about recently (I don't have an M1/M2 Mac at the moment to verify this).

Is only a single thread used for the entire execution? I was under the impression that Apple Accelerate uses multiple threads by default and that this can be controlled with the environment variable VECLIB_MAXIMUM_THREADS, but that could be wrong.

> So, with great effort, I was finally able to build my local version from source. I supplied OPENMP as "COMP". I installed it and did the python wrapper installation too. Still getting the same error.

Can you post your CMake command line and its output? Before running the command, please remove the file CMakeCache.txt to get the full output log.

Also, if the ctranslate2 package was already installed, did you make sure to use pip install --force-reinstall when installing the compiled wheel?

@maxcribe
Author

maxcribe commented Mar 18, 2023

Yeah, I verified that only one thread is being used, even with VECLIB_MAXIMUM_THREADS=4.

ctranslate2 was already installed and I cycled through reinstalls a couple of times. Since that didn't seem to have an effect, I went with building my own.

So, just to get it to find OpenMP, I had to use this:

cmake -DOPENMP_RUNTIME=COMP -DWITH_ACCELERATE=ON \
  -DOpenMP_libomp_LIBRARY="/opt/homebrew/opt/libomp/lib/libomp.dylib" \
  -DOpenMP_C_FLAGS=-fopenmp=lomp \
  -DOpenMP_CXX_FLAGS=-fopenmp=lomp \
  -DOpenMP_C_LIB_NAMES="libomp" \
  -DOpenMP_CXX_LIB_NAMES="libomp" \
  -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include" \
  -DOpenMP_CXX_LIB_NAMES="libomp" \
  -DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include"

This is the output of cmake

CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.


-- The C compiler identification is AppleClang 14.0.0.14000029
-- The CXX compiler identification is AppleClang 14.0.0.14000029
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Build spdlog: 1.10.0
-- Build type: Release
-- Compiling for multiple CPU ISA and enabling runtime dispatch
-- Found OpenMP_C: -Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include (found version "5.0") 
-- Found OpenMP_CXX: -Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Using OpenMP: /opt/homebrew/opt/libomp/lib/libomp.dylib
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/System/Library/Frameworks/Accelerate.framework  
-- Configuring done
-- Generating done

-- Build files have been written to: <dir>

@guillaumekln
Collaborator

guillaumekln commented Mar 20, 2023

This looks fine to me but I don't know if the flags are correct. It would be better to rely on the flags that are automatically set by CMake.

What happens when you only set -DOPENMP_RUNTIME=COMP? (After removing CMakeCache.txt as always.)

If CMake does not find the OpenMP library, can you try the following?

-DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/

@maxcribe
Author

maxcribe commented Mar 21, 2023

I think I know what's going on. I have the default pip-installed version in /Library/Frameworks/Python.framework/Versions/3.10/bin. However, my custom-built version goes into /usr/local/bin/.

Also, pip installs a bunch of converter scripts, like

/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-fairseq-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-marian-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-openai-gpt2-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opennmt-py-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opennmt-tf-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opus-mt-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-transformers-converter

However, the custom build only builds ctranslate2. How do I build and install the full package?

This is lower priority for me now since I am able to make full use of the hardware on GPUs, and those are much faster than my M1 anyway.

Really appreciate what you have done here and how helpful you are :)

@yadamonk

I have the same issue as the OP. In addition to the CMake flags discussed above, I also tried the following:

export CC=/opt/homebrew/opt/llvm/bin/clang
export CXX=/opt/homebrew/opt/llvm/bin/clang++
export LDFLAGS="-L/opt/homebrew/opt/llvm/lib"
export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"

cmake .. -DWITH_MKL=OFF -DWITH_ACCELERATE=ON -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/
make -j$(sysctl -n hw.logicalcpu)
sudo make install

Which gives this output

-- The C compiler identification is Clang 15.0.7
-- The CXX compiler identification is Clang 15.0.7
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/homebrew/opt/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/homebrew/opt/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Build spdlog: 1.10.0
-- Build type: Release
-- Compiling for multiple CPU ISA and enabling runtime dispatch
-- Found OpenMP_C: -fopenmp=libomp (found version "5.0") 
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Using OpenMP: /opt/homebrew/opt/llvm/lib/libomp.dylib
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/System/Library/Frameworks/Accelerate.framework  

I also see the following warnings

CTranslate2/src/cpu/kernels.cc:427:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
CTranslate2/src/cpu/kernels.cc:533:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
2 warnings generated.
CTranslate2/src/cpu/primitives.cc:815:43: warning: unused parameter 'transpose_a' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:815:61: warning: unused parameter 'transpose_b' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:44: warning: unused parameter 'm' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:53: warning: unused parameter 'n' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:62: warning: unused parameter 'k' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:817:44: warning: unused parameter 'alpha' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:818:52: warning: unused parameter 'a' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:818:61: warning: unused parameter 'lda' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:819:52: warning: unused parameter 'b' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:819:61: warning: unused parameter 'ldb' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:820:44: warning: unused parameter 'beta' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:821:47: warning: unused parameter 'c' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:821:56: warning: unused parameter 'ldc' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:822:53: warning: unused parameter 'a_shift_compensation' [-Wunused-parameter]
14 warnings generated.
CTranslate2/build/kernels_neon.cc:427:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
CTranslate2/build/kernels_neon.cc:533:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
2 warnings generated.

Any suggestions would be highly appreciated.

@guillaumekln
Collaborator

@yadamonk Do you confirm that you also recompiled the Python package and installed it with pip install --force-reinstall?

@yadamonk

yadamonk commented Mar 27, 2023

Yes, after building the library I proceed as follows

export C_INCLUDE_PATH="/usr/local/include/:$C_INCLUDE_PATH"
export CPLUS_INCLUDE_PATH="/usr/local/include/:$CPLUS_INCLUDE_PATH"
export LIBRARY_PATH="/usr/local/lib/:$LIBRARY_PATH"
export DYLD_LIBRARY_PATH="/usr/local/lib/:$DYLD_LIBRARY_PATH"

cd ../python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl --force-reinstall

The package installs fine and works but seems to only utilize one core.

@guillaumekln
Collaborator

How did you verify that only one core is used? Did you try setting a number of threads explicitly?

Note that most of the computation happens in Apple Accelerate, which only uses one core anyway. So the impact of OpenMP could be more subtle.
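One empirical check along these lines (my own sketch, not part of the CTranslate2 API): time one inference pass per thread count and see whether the wall time responds to intra_threads at all. Here run_inference is a user-supplied placeholder that should build a Translator with intra_threads=n and run one batch.

```python
import time

def best_intra_threads(run_inference, candidates=(1, 2, 4, 8)):
    """Time one inference pass per thread count; return the fastest
    count and the full timing table.

    run_inference(n) is a caller-provided function (hypothetical) that
    runs one inference pass using n intra-op threads.
    """
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        run_inference(n)
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

If the timings barely change across candidates, the build is most likely ignoring the thread setting.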

@yadamonk

I run inference in a Jupyter notebook as follows

import os
os.environ['VECLIB_MAXIMUM_THREADS'] = '12'

from ctranslate2 import Translator
from transformers import AutoTokenizer

model_id = 'google/flan-ul2'
model_path = './flan-ul2/default'

translator = Translator(model_path,
                        device='cpu',
                        compute_type='auto',
                        inter_threads=1)

tokenizer = AutoTokenizer.from_pretrained(model_id)

input_string = 'Translate English to German: How old are you?'

input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_string))

results = translator.translate_batch(source=[input_tokens])

output_tokens = results[0].hypotheses[0]

In Activity Monitor I only see around 170% CPU utilization on a 12-core machine.

@guillaumekln
Collaborator

Can you try setting intra_threads=12? (Not inter_threads!)

@yadamonk

intra_threads does not seem to have an impact, but it's good to see that the flag works; I think the pip package was throwing an error. Maybe the problem is the size of the model. I'll try a smaller one and report back.

@yadamonk

I looked into t5-small, t5-large, and flan-t5-xxl. For the smaller models, inference is very fast (much faster than HF using PyTorch MPS). But for the large FLAN model, CTranslate2 is over 4 times slower, and even with intra_threads=12, CPU utilization is around 125% for most of the prediction time, with a brief peak at 195%.

@guillaumekln
Collaborator

It’s possible that the lack of multithreading in Apple Accelerate starts to be an issue with large models.

There are other configurations you can try on Apple M1. For example you can try compiling with -DWITH_RUY=ON and use the int8 compute type.

You can also try using OpenBLAS instead of Apple Accelerate.
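A small decision helper to make the suggestion concrete (my own sketch, not part of the CTranslate2 API): prefer the int8 compute type only when the startup log reports an int8 GEMM backend such as Ruy, since requesting int8 without one may not work.

```python
def pick_compute_type(gemm_s8_backend):
    """Return a CPU compute type based on the 'GEMM_S8 backend' value
    reported in the startup log ('none' means no int8 support)."""
    if gemm_s8_backend and gemm_s8_backend.lower() != "none":
        return "int8"
    return "auto"

print(pick_compute_type("Ruy"))   # int8
print(pick_compute_type("none"))  # auto
```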

@yadamonk

It would be awesome to get int8 working. I found CTranslate2 via llama.cpp, which works like a charm even for the 30B-parameter model. Sadly, using -DWITH_RUY=ON instead of -DWITH_ACCELERATE=ON crashes my notebook kernel.

Using -DWITH_OPENBLAS=ON instead of -DWITH_ACCELERATE=ON gives me: RuntimeError: No SGEMM backend on CPU.

Both versions compile and install fine.

@panosk
Contributor

panosk commented Mar 28, 2023

Could this be a Python thing? From a pure C++ implementation, everything seems to work. However, here are a few things I found out when I was trying to make it all work:

  • Apple's Clang doesn't support OpenMP; Homebrew's Clang has to be used.
  • Accelerate seems to be a requirement, even when using WITH_RUY, otherwise no GEMM backend is detected. However, with Ruy, int8 works and the speed is great.

@yadamonk

Thank you so much @panosk and @guillaumekln. You were right. I need both -DWITH_ACCELERATE=ON and -DWITH_RUY=ON. Inference now works in int8 and CPU utilization peaks at over 1000% when using intra_threads=12. 🚀

@guillaumekln
Collaborator

Great! Is it now faster than Apple Accelerate?

@yadamonk

yadamonk commented Mar 28, 2023

Yes, significantly. Inference time went from minutes to seconds. On my 12-core M2 Max, optimal performance seems to be with intra_threads=10. @panosk, how do you do inference? Any tips to share?

@panosk
Contributor

panosk commented Mar 28, 2023

I'm using a generalized approach to cover all platforms, CPUs, and use cases, and these settings seem to work best:

  • reported CPUs from the OS divided by 2 for num_replicas_per_device (inter_threads in Python, I think) and num_threads_per_replica = 2 (intra_threads in Python, I think; @guillaumekln please confirm the mapping :) )
  • batch type "examples" and batch size usually 16
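Those heuristics can be written down as a tiny helper (my own sketch; the inter/intra naming follows the Python-side mapping guessed at above, and max_batch_size is a name I chose for illustration):

```python
import os

def default_thread_settings(batch_size=16):
    """Heuristic from the comment above: half the reported CPUs as
    parallel replicas (inter_threads), 2 threads per replica
    (intra_threads), and a batch size of 16 examples."""
    cpus = os.cpu_count() or 1
    return {
        "inter_threads": max(1, cpus // 2),
        "intra_threads": 2,
        "max_batch_size": batch_size,
    }
```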

@ururk

ururk commented Mar 30, 2023

Should this warning still be triggered on M2 Macs:

[2023-03-30 13:20:53.657] [ctranslate2] [thread 1853092] [warning] The number of threads (intra_threads) is ignored in this build

with CTranslate2 3.10.3 prebuilt binaries?

Edit: I just realized the changes here were only merged into master, and 3.10.3 is behind.

@zara0m

zara0m commented Apr 3, 2023

Hello,
Thanks for sharing these solutions to improve performance on Mac.
I tried them with:
WhisperModel(model_path, device="cpu", compute_type="auto", cpu_threads=8) on an M1 Mac with 10 cores.

First, I installed OpenMP, then I installed ruy (using bazel build), and I built ctranslate2 using:
-DWITH_MKL=OFF -DWITH_ACCELERATE=ON -DWITH_RUY=ON -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/

My logs were:
[thread 73741] [info] CPU: ARM (NEON=true)
[thread 73741] [info] - Selected ISA: NEON
[thread 73741] [info] - Use Intel MKL: false
[thread 73741] [info] - SGEMM backend: Accelerate (packed: false)
[thread 73741] [info] - GEMM_S16 backend: none (packed: false)
[thread 73741] [info] - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
[thread 73741] [info] Loaded model small-int8 on device cpu:0
[thread 73741] [info] - Binary version: 6
[thread 73741] [info] - Model specification revision: 3
[thread 73741] [info] - Selected compute type: int8

Using this method on a 3-minute audio file, the CPU usage increased from 120% to about 450% and the inference time dropped from 33 s to 25 s, which is great!
However, I couldn't reach the 1000% peak of CPU utilization mentioned in the comments above; I wonder if I'm missing something and whether I can improve it even more.

Thanks!

Edit: BTW, using 12 threads (instead of 8) made the speed much worse!

@guillaumekln
Collaborator

guillaumekln commented Apr 4, 2023

You are running a different model than the previous comments, so it is expected that you are seeing a different behavior.

Since you are running a Whisper model, you could try using a larger model size and see how the CPU usage changes.

@zara0m

zara0m commented Apr 4, 2023

Thank you for your help. @guillaumekln

@guillaumekln
Collaborator

The latest version (3.11.0) enables the Ruy backend for the macOS ARM Python wheels. So now you don't need to recompile the package to execute in 8-bit with multiple threads.

@zara0m

zara0m commented Apr 7, 2023

That's great! Thanks!

@ciekawy

ciekawy commented Apr 14, 2023

I wonder about support for the Apple Silicon GPU (MPS).

@guillaumekln
Collaborator

@ciekawy You can open another issue for this feature if you want to, but I don't plan to work on this in the short term.

@guillaumekln
Collaborator

In #1188, I'm adding a custom threading implementation that is used when OpenMP is not available.

I'm wondering how useful this is for Apple Silicon, especially in the context of Whisper models.

@zara0m Could you help test this new build and see how it impacts performance? You could try with both the int8 and float32 compute types.

  1. Go to the build page
  2. Download the artifact "python-wheels"
  3. Extract the archive
  4. Install the appropriate wheel with pip install --force-reinstall

Note that Apple Accelerate will still use a single core, but now other operations are multithreaded.

@zara0m

zara0m commented Apr 24, 2023

@guillaumekln

I just tested it:
WhisperModel(model_path, device="cpu", compute_type="int8", cpu_threads=8) on an M1 Mac with 10 cores, small-int8 model, 3-minute audio that originally took 33 s (without Ruy and without threads).
With OpenMP installed together with Ruy (ctranslate2 3.11), it was 25 s.
Now, without OpenMP and with the new wheel installed, it is 20 s! That's really great! Thank you very much!

@guillaumekln
Collaborator

Thanks @zara0m for the test. So I will include this change in the next version.


7 participants