
How to verify if my local build has OpenMP support #1137

Closed
maxcribe opened this issue Mar 17, 2023 · 30 comments

@maxcribe

System: M1 Mac

With vanilla ctranslate2 (installed via pip), I was unable to use more than one thread and got this warning when I tried to increase the thread count: "The number of threads (intra_threads) is ignored in this build"

So, with great effort, I was finally able to build my local version from source. I set the OpenMP runtime to COMP (-DOPENMP_RUNTIME=COMP). I installed it and installed the Python wrapper as well. I'm still getting the same warning.

How do I check how my local build of ctranslate2 is configured?

Thanks!
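To make the question concrete: one way to inspect a build's configuration (an assumption on my part: that the CT2_VERBOSE=1 environment variable still enables the info-level startup log, whose format is taken from logs posted later in this thread) is to run any model with verbose logging and parse the banner. The presence of the "intra_threads is ignored" warning itself is also a signal that the build lacks OpenMP.

```python
import re

# Sample startup banner as printed when CT2_VERBOSE=1 is set (format
# copied from logs in this thread; treat it as an assumption, not a
# stable API).
SAMPLE_LOG = """\
[thread 1] [info] CPU: ARM (NEON=true)
[thread 1] [info]  - Selected ISA: NEON
[thread 1] [info]  - Use Intel MKL: false
[thread 1] [info]  - SGEMM backend: Accelerate (packed: false)
[thread 1] [info]  - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
"""

def parse_backend_info(log_text):
    """Extract '<name> backend: <value>' pairs from the startup banner."""
    info = {}
    for line in log_text.splitlines():
        m = re.search(r"- (\w+ backend): (\w+)", line)
        if m:
            info[m.group(1)] = m.group(2)
    return info

print(parse_backend_info(SAMPLE_LOG))
# {'SGEMM backend': 'Accelerate', 'GEMM_S8 backend': 'Ruy'}
```

A "GEMM_S8 backend: none" entry would mean no int8 support was compiled in.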

@guillaumekln
Collaborator

This is something I was wondering about recently (I don't have an M1/M2 Mac at the moment to verify this).

Is only a single thread used for the entire execution? I was under the impression that Apple Accelerate uses multiple threads by default and that this can be controlled with the environment variable VECLIB_MAXIMUM_THREADS, but that could be wrong.

> So, with great effort, I was finally able to build my local version from source. I supplied OPENMP as "COMP". I installed it and did the python wrapper installation too. Still getting the same error.

Can you post your CMake command line and its output? Before running the command, please remove the file CMakeCache.txt to get the full output log.

Also, if the ctranslate2 package was already installed, did you make sure to use pip install --force-reinstall when installing the compiled wheel?

@maxcribe
Author

maxcribe commented Mar 18, 2023

Yeah, I verified that only one thread is being used, even with VECLIB_MAXIMUM_THREADS=4.

ctranslate2 was already installed and I cycled through reinstalls a couple of times. Since that didn't seem to have an effect, I went with building my own.

So, just to get it to find OpenMP, I had to use this:

cmake -DOPENMP_RUNTIME=COMP -DWITH_ACCELERATE=ON \
  -DOpenMP_libomp_LIBRARY="/opt/homebrew/opt/libomp/lib/libomp.dylib" \
  -DOpenMP_C_FLAGS=-fopenmp=lomp \
  -DOpenMP_CXX_FLAGS=-fopenmp=lomp \
  -DOpenMP_C_LIB_NAMES="libomp" \
  -DOpenMP_CXX_LIB_NAMES="libomp" \
  -DOpenMP_CXX_FLAGS="-Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include" \
  -DOpenMP_CXX_LIB_NAMES="libomp" \
  -DOpenMP_C_FLAGS="-Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include"

This is the output of cmake

CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.


-- The C compiler identification is AppleClang 14.0.0.14000029
-- The CXX compiler identification is AppleClang 14.0.0.14000029
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Build spdlog: 1.10.0
-- Build type: Release
-- Compiling for multiple CPU ISA and enabling runtime dispatch
-- Found OpenMP_C: -Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include (found version "5.0") 
-- Found OpenMP_CXX: -Xpreprocessor -fopenmp /opt/homebrew/opt/libomp/lib/libomp.dylib -I/opt/homebrew/opt/libomp/include (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Using OpenMP: /opt/homebrew/opt/libomp/lib/libomp.dylib
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Library/Developer/CommandLineTools/SDKs/MacOSX13.1.sdk/System/Library/Frameworks/Accelerate.framework  
-- Configuring done
-- Generating done

-- Build files have been written to: <dir>

@guillaumekln
Collaborator

guillaumekln commented Mar 20, 2023

This looks fine to me but I don't know if the flags are correct. It would be better to rely on the flags that are automatically set by CMake.

What happens when you only set -DOPENMP_RUNTIME=COMP? (After removing CMakeCache.txt as always.)

If CMake does not find the OpenMP library, can you try the following?

-DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/

@maxcribe
Author

maxcribe commented Mar 21, 2023

I think I know what's going on. I have the default pip-installed version in /Library/Frameworks/Python.framework/Versions/3.10/bin. However, my custom-built version goes into /usr/local/bin/.

Also, pip installs a bunch of converter scripts, like

/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-fairseq-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-marian-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-openai-gpt2-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opennmt-py-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opennmt-tf-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-opus-mt-converter
/Library/Frameworks/Python.framework/Versions/3.10/bin/ct2-transformers-converter

However, the custom build only builds ctranslate2. How do I build and install the full package?

This is lower priority for me now since I am able to make full use of the hardware on GPUs, and those are much faster than my M1 anyway.

Really appreciate what you have done here and how helpful you are :)

@yadamonk

I have the same issue as the OP. In addition to the CMake flags discussed above, I also tried the following:

export CC=/opt/homebrew/opt/llvm/bin/clang
export CXX=/opt/homebrew/opt/llvm/bin/clang++
export LDFLAGS="-L/opt/homebrew/opt/llvm/lib"
export CPPFLAGS="-I/opt/homebrew/opt/llvm/include"

cmake .. -DWITH_MKL=OFF -DWITH_ACCELERATE=ON -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/
make -j$(sysctl -n hw.logicalcpu)
sudo make install

Which gives this output

-- The C compiler identification is Clang 15.0.7
-- The CXX compiler identification is Clang 15.0.7
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/homebrew/opt/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/homebrew/opt/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
-- Build spdlog: 1.10.0
-- Build type: Release
-- Compiling for multiple CPU ISA and enabling runtime dispatch
-- Found OpenMP_C: -fopenmp=libomp (found version "5.0") 
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.0") 
-- Found OpenMP: TRUE (found version "5.0")  
-- Using OpenMP: /opt/homebrew/opt/llvm/lib/libomp.dylib
-- Looking for dgemm_
-- Looking for dgemm_ - found
-- Found BLAS: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/System/Library/Frameworks/Accelerate.framework  

I also see the following warnings

CTranslate2/src/cpu/kernels.cc:427:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
CTranslate2/src/cpu/kernels.cc:533:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
2 warnings generated.
CTranslate2/src/cpu/primitives.cc:815:43: warning: unused parameter 'transpose_a' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:815:61: warning: unused parameter 'transpose_b' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:44: warning: unused parameter 'm' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:53: warning: unused parameter 'n' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:816:62: warning: unused parameter 'k' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:817:44: warning: unused parameter 'alpha' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:818:52: warning: unused parameter 'a' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:818:61: warning: unused parameter 'lda' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:819:52: warning: unused parameter 'b' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:819:61: warning: unused parameter 'ldb' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:820:44: warning: unused parameter 'beta' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:821:47: warning: unused parameter 'c' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:821:56: warning: unused parameter 'ldc' [-Wunused-parameter]
CTranslate2/src/cpu/primitives.cc:822:53: warning: unused parameter 'a_shift_compensation' [-Wunused-parameter]
14 warnings generated.
CTranslate2/build/kernels_neon.cc:427:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
CTranslate2/build/kernels_neon.cc:533:5: warning: '#pragma float_control' is not supported on this target - ignored [-Wignored-pragmas]
2 warnings generated.

Any suggestions would be highly appreciated.

@guillaumekln
Collaborator

@yadamonk Do you confirm that you also recompiled the Python package and installed it with pip install --force-reinstall?

@yadamonk

yadamonk commented Mar 27, 2023

Yes, after building the library I proceed as follows

export C_INCLUDE_PATH="/usr/local/include/:$C_INCLUDE_PATH"
export CPLUS_INCLUDE_PATH="/usr/local/include/:$CPLUS_INCLUDE_PATH"
export LIBRARY_PATH="/usr/local/lib/:$LIBRARY_PATH"
export DYLD_LIBRARY_PATH="/usr/local/lib/:$DYLD_LIBRARY_PATH"

cd ../python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl --force-reinstall

The package installs fine and works but seems to only utilize one core.

@guillaumekln
Collaborator

How did you verify that only one core is used? Did you try setting a number of threads explicitly?

Note that most of the computation happens in Apple Accelerate, which only uses one core anyway. So the impact of OpenMP could be more subtle.
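One empirical check along these lines (my own sketch, not part of the CTranslate2 API): time one inference pass per thread count and see whether the wall time responds to intra_threads at all. Here run_inference is a user-supplied placeholder that should build a Translator with intra_threads=n and run one batch.

```python
import time

def best_intra_threads(run_inference, candidates=(1, 2, 4, 8)):
    """Time one inference pass per thread count; return the fastest
    count and the full timing table.

    run_inference(n) is a caller-provided function (hypothetical) that
    runs one inference pass using n intra-op threads.
    """
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        run_inference(n)
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```

If the timings barely change across candidates, the build is most likely ignoring the thread setting.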

@yadamonk

I run inference in a Jupyter notebook as follows

import os
os.environ['VECLIB_MAXIMUM_THREADS'] = '12'

from ctranslate2 import Translator
from transformers import AutoTokenizer

model_id = 'google/flan-ul2'
model_path = './flan-ul2/default'

translator = Translator(model_path,
                        device='cpu',
                        compute_type='auto',
                        inter_threads=1)

tokenizer = AutoTokenizer.from_pretrained(model_id)

input_string = 'Translate English to German: How old are you?'

input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_string))

results = translator.translate_batch(source=[input_tokens])

output_tokens = results[0].hypotheses[0]

In Activity Monitor I only see around 170% CPU utilization on a 12-core machine.

@guillaumekln
Collaborator

Can you try setting intra_threads=12? (Not inter_threads!)

@yadamonk

intra_threads does not seem to have an impact, but it's good to see that the flag works; I think the pip package was throwing an error. Maybe the problem is the size of the model. I'll try a smaller one and report back.

@yadamonk

I looked into t5-small, t5-large, and flan-t5-xxl. For the smaller models, inference is very fast (much faster than HF using PyTorch MPS). But for the large FLAN model, CTranslate2 is over 4 times slower, and even with intra_threads=12, CPU utilization is around 125% for most of the prediction time, with a brief peak at 195%.

@guillaumekln
Collaborator

It’s possible that the lack of multithreading in Apple Accelerate starts to be an issue with large models.

There are other configurations you can try on Apple M1. For example you can try compiling with -DWITH_RUY=ON and use the int8 compute type.

You can also try using OpenBLAS instead of Apple Accelerate.
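A small decision helper to make the suggestion concrete (my own sketch, not part of the CTranslate2 API): prefer the int8 compute type only when the startup log reports an int8 GEMM backend such as Ruy, since requesting int8 without one may not work.

```python
def pick_compute_type(gemm_s8_backend):
    """Return a CPU compute type based on the 'GEMM_S8 backend' value
    reported in the startup log ('none' means no int8 support)."""
    if gemm_s8_backend and gemm_s8_backend.lower() != "none":
        return "int8"
    return "auto"

print(pick_compute_type("Ruy"))   # int8
print(pick_compute_type("none"))  # auto
```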

@yadamonk

It would be awesome to get int8 working. I found CTranslate2 via llama.cpp, which works like a charm even for the 30B-parameter model. Sadly, using -DWITH_RUY=ON instead of -DWITH_ACCELERATE=ON crashes my notebook kernel.

Using -DWITH_OPENBLAS=ON instead of -DWITH_ACCELERATE=ON gives me: RuntimeError: No SGEMM backend on CPU.

Both versions compile and install fine.

@panosk
Contributor

panosk commented Mar 28, 2023

Could this be a Python thing? From a pure C++ implementation, everything seems to work. However, here are a few things I found out when I was trying to make it all work:

  • Apple's Clang doesn't support OpenMP; Homebrew's Clang has to be used.
  • Accelerate seems to be a requirement, even when using WITH_RUY, otherwise no GEMM backend is detected. However, with Ruy, int8 works and the speed is great.

@yadamonk

Thank you so much @panosk and @guillaumekln. You were right. I need both -DWITH_ACCELERATE=ON and -DWITH_RUY=ON. Inference now works in int8 and CPU utilization peaks at over 1000% when using intra_threads=12. 🚀

@guillaumekln
Collaborator

Great! Is it now faster than Apple Accelerate?

@yadamonk

yadamonk commented Mar 28, 2023

Yes, significantly. Inference time went from minutes to seconds. On my 12-core M2 Max, optimal performance seems to be with intra_threads=10. @panosk, how do you do inference? Any tips to share?

@panosk
Contributor

panosk commented Mar 28, 2023

I'm using a generalized approach to cover all platforms, CPUs, and use cases, and these settings seem to work best:

  • reported CPUs from the OS divided by 2 for num_replicas_per_device (inter_threads in Python, I think) and num_threads_per_replica = 2 (intra_threads in Python, I think; @guillaumekln please confirm the mapping :) )
  • batch type "examples" and batch size usually 16
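Those heuristics can be written down as a tiny helper (my own sketch; the inter/intra naming follows the Python-side mapping guessed at above, and max_batch_size is a name I chose for illustration):

```python
import os

def default_thread_settings(batch_size=16):
    """Heuristic from the comment above: half the reported CPUs as
    parallel replicas (inter_threads), 2 threads per replica
    (intra_threads), and a batch size of 16 examples."""
    cpus = os.cpu_count() or 1
    return {
        "inter_threads": max(1, cpus // 2),
        "intra_threads": 2,
        "max_batch_size": batch_size,
    }
```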

@ururk

ururk commented Mar 30, 2023

Should this warning still be triggered on M2 Macs:

[2023-03-30 13:20:53.657] [ctranslate2] [thread 1853092] [warning] The number of threads (intra_threads) is ignored in this build

with CTranslate2 3.10.3 prebuilt binaries?

Edit: I just realized the changes here were only merged into master, and 3.10.3 is behind.

@zara0m

zara0m commented Apr 3, 2023

Hello,
Thanks for sharing these solutions to improve performance on Mac.
I tried them with:
WhisperModel(model_path, device="cpu", compute_type="auto", cpu_threads=8) on an M1 Mac with 10 cores.

First, I installed OpenMP, then I installed ruy (using bazel build), and I built ctranslate2 using:
-DWITH_MKL=OFF -DWITH_ACCELERATE=ON -DWITH_RUY=ON -DOPENMP_RUNTIME=COMP -DCMAKE_PREFIX_PATH=/opt/homebrew/opt/libomp/

My logs were:
[thread 73741] [info] CPU: ARM (NEON=true)
[thread 73741] [info] - Selected ISA: NEON
[thread 73741] [info] - Use Intel MKL: false
[thread 73741] [info] - SGEMM backend: Accelerate (packed: false)
[thread 73741] [info] - GEMM_S16 backend: none (packed: false)
[thread 73741] [info] - GEMM_S8 backend: Ruy (packed: false, u8s8 preferred: false)
[thread 73741] [info] Loaded model small-int8 on device cpu:0
[thread 73741] [info] - Binary version: 6
[thread 73741] [info] - Model specification revision: 3
[thread 73741] [info] - Selected compute type: int8

Using this method on a 3-minute audio file, the CPU usage increased from 120% to about 450% and the inference time dropped from 33 s to 25 s, which is great!
However, I couldn't reach the 1000% peak of CPU utilization mentioned in the comments above; I wonder if I'm missing something and whether I can improve it even more.

Thanks!

Edit: BTW, using 12 threads (instead of 8) made the speed much worse!

@guillaumekln
Collaborator

guillaumekln commented Apr 4, 2023

You are running a different model than the previous comments, so it is expected that you are seeing a different behavior.

Since you are running a Whisper model, you could try using a larger model size and see how the CPU usage changes.

@zara0m

zara0m commented Apr 4, 2023

Thank you for your help. @guillaumekln

@guillaumekln
Collaborator

The latest version (3.11.0) enables the Ruy backend for the macOS ARM Python wheels. So now you don't need to recompile the package to execute in 8-bit with multiple threads.

@zara0m

zara0m commented Apr 7, 2023

That's great! Thanks!

@ciekawy

ciekawy commented Apr 14, 2023

I wonder about support for the Apple Silicon GPU (MPS).

@guillaumekln
Collaborator

@ciekawy You can open another issue for this feature if you want to, but I don't plan to work on this in the short term.

@guillaumekln
Collaborator

In #1188, I'm adding a custom threading implementation that is used when OpenMP is not available.

I'm wondering how useful this is for Apple Silicon, especially in the context of Whisper models.

@zara0m Could you help test this new build and see how it impacts performance? You could try with both the int8 and float32 compute types.

  1. Go to the build page
  2. Download the artifact "python-wheels"
  3. Extract the archive
  4. Install the appropriate wheel with pip install --force-reinstall

Note that Apple Accelerate will still use a single core, but now other operations are multithreaded.

@zara0m

zara0m commented Apr 24, 2023

@guillaumekln

I just tested it:
WhisperModel(model_path, device="cpu", compute_type="int8", cpu_threads=8) on an M1 Mac with 10 cores, small-int8 model, 3-minute audio that originally took 33 s (without Ruy and without threads).
With OpenMP installed together with Ruy (ctranslate2 3.11), it was 25 s.
Now, without OpenMP and with the new wheel installed, it is 20 s! That's really great! Thank you very much!

@guillaumekln
Collaborator

Thanks @zara0m for the test. So I will include this change in the next version.


7 participants