Installation failed with cmake error #355

Closed
RuiWang1998 opened this issue Aug 3, 2023 · 23 comments

Comments

@RuiWang1998

Hi,

We are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but we are having trouble installing TransformerEngine. The build reports RuntimeError: Error when running CMake: Command '['/usr/local/bin/cmake', '-S', '/tmp/pip-req-build-p6kjladj/transformer_engine', '-B', '/tmp/tmps08o01xi', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-p6kjladj/build/lib.linux-x86_64-cpython-310', '-GNinja']' returned non-zero exit status 1.

We tried invoking the command outside of pip, and it just reports that there is no source directory.

We are trying Docker right now, but our network configuration makes Docker inconvenient to use, so we would usually prefer to avoid it. Could you show us where we could find any clues on how to proceed? Much appreciated.

@ptrendx
Member

ptrendx commented Aug 3, 2023

Hi @RuiWang1998, could you share the command you use for installation and a full error message that you are getting? Thank you!

@RuiWang1998
Author

Hi @ptrendx, we tried both pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable and pip install git+https://github.com/NVIDIA/TransformerEngine.git@main, with Python versions from 3.9 to 3.11. Each time we simply installed pytorch==2.0.1 and packaging and then ran the two commands. Both returned the same error.

@RuiWang1998
Author

Hi @ptrendx, after a little digging, we think we have located the problem, but we are not sure what the solution is:

/usr/bin/c++ -Dtransformer_engine_EXPORTS -I/home/rui/TransformerEngine/transformer_engine -I/home/rui/TransformerEngine/transformer_engine/common/include -I/usr/local/cuda-11.8/targets/x86_64-linux/include -I/home/rui/TransformerEngine/transformer_engine/../3rdparty/cudnn-frontend/include -I/tmp/tmp9cj2vyni/common/string_headers -isystem /usr/local/cuda-11.8/include -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -MF common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -c /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp
In file included from /usr/local/cuda-11.8/include/cuda_fp8.h:350,
                 from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/../common.h:14,
                 from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp:8:
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:735:16: error: ‘__half2ushort_rz’ was not declared in this scope
  735 |         return __half2ushort_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:744:16: error: ‘__half2uint_rz’ was not declared in this scope
  744 |         return __half2uint_rz(__half(*this));
      |                ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:753:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  753 |         return __half2ull_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:791:16: error: ‘__half2short_rz’ was not declared in this scope
  791 |         return __half2short_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:800:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  800 |         return __half2int_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:809:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
  809 |         return __half2ll_rz(__half(*this));
      |                ^~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1248:16: error: ‘__half2ushort_rz’ was not declared in this scope
 1248 |         return __half2ushort_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1257:16: error: ‘__half2uint_rz’ was not declared in this scope
 1257 |         return __half2uint_rz(__half(*this));
      |                ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1266:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1266 |         return __half2ull_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1303:16: error: ‘__half2short_rz’ was not declared in this scope
 1303 |         return __half2short_rz(__half(*this));
      |                ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1311:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1311 |         return __half2int_rz(__half(*this));
      |                ^~~~~~~~~~~~~
      |                __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1319:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
 1319 |         return __half2ll_rz(__half(*this));
      |                ^~~~~~~~~~~~
      |                __half2_raw
ninja: build stopped: subcommand failed.

It seems we are missing some headers. Where can we include them?

We have machines with CUDA 11.8 and machines with CUDA 12, and we believe both fail for the same reason.

@RuiWang1998
Author

Hi,

Some updates: our H800 machines can now install successfully, but the A100 machines cannot yet. The H800 machines just needed cuDNN, but the A100 machines still hit the error above even after installing cuDNN.

@ptrendx
Member

ptrendx commented Aug 7, 2023

Hi, this is a pretty strange error: functions like __half2ushort_rz are declared in the cuda_fp16.hpp file, which should be in the include directory of your CUDA installation (in this case /usr/local/cuda-11.8/include or /usr/local/cuda-11.8/targets/x86_64-linux/include). Could you confirm that this file exists there?
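
One way to run that check is sketched below. It is only an illustration: the helper name find_declaration is hypothetical, and a plain substring search shows that the symbol is mentioned in the header, not that it is visible to the host compiler (it could sit behind a preprocessor guard).

```python
from pathlib import Path


def find_declaration(include_dirs, header="cuda_fp16.hpp", symbol="__half2ushort_rz"):
    """Return (path, mentioned) for every copy of `header` found in `include_dirs`.

    `mentioned` is True when `symbol` appears anywhere in the file text.
    Directories that do not contain the header are skipped silently.
    """
    results = []
    for directory in include_dirs:
        candidate = Path(directory) / header
        if candidate.is_file():
            text = candidate.read_text(errors="replace")
            results.append((str(candidate), symbol in text))
    return results
```

Calling find_declaration(["/usr/local/cuda-11.8/include", "/usr/local/cuda-11.8/targets/x86_64-linux/include"]) would cover the two directories named above. An empty result means the header was not found in either location.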

@RuiWang1998
Author

Hi, yes, it is in /usr/local/cuda-11.8/include, and __half2ushort_rz does seem to be declared there.

@MicPie

MicPie commented Aug 31, 2023

Any update on this issue?

@RuiWang1998
Author

Hi @MicPie,

We have been able to install this with newer commits now. Were you trying the stable release?

@mahdip72

mahdip72 commented Nov 21, 2023

I have the same problem on my workstation with an A6000 Ada.

raise RuntimeError(f"Error when running CMake: {e}")
      RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-hnl1xnl7/transformer_engine', '-B', '/tmp/tmp6vkf06mc', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-hnl1xnl7/build/lib.linux-x86_64-cpython-311']' returned non-zero exit status 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for transformer-engine

@RuiWang1998 Could you help me figure out what to do? Should I install cuDNN?
CUDA 11.8
PyTorch 2.1.0
Python 3.11
Ubuntu 22.04

@RuiWang1998
Author

RuiWang1998 commented Nov 21, 2023 via email

@liuchangdm

liuchangdm commented Feb 19, 2024

@RuiWang1998 Could you share which release version you used? I have the same problem. Thanks.

@hellangleZ

Same issue

File "/aml2/TransformerEngine/setup.py", line 338, in _build_cmake
raise RuntimeError(f"Error when running CMake: {e}")
RuntimeError: Error when running CMake: Command '['/aml/conda/bin/cmake', '-S', '/aml2/TransformerEngine/transformer_engine', '-B', '/aml2/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/aml2/ds2/bin/python', '-DPython_INCLUDE_DIR=/aml2/ds2/include/python3.10', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/aml2/TransformerEngine/build/lib.linux-x86_64-cpython-310', '-GNinja', '-Dpybind11_DIR=/aml2/ds2/lib/python3.10/site-packages/pybind11/share/cmake/pybind11']' returned non-zero exit status 1.
[end of output]

@timmoon10
Collaborator

The CMake error message should already be printed to stderr, although it is somewhat buried within the Python stacktrace from setup.py. It may be helpful to search for "Building CMake extension transformer_engine" within your build logs.

If the error is happening during CMake configuration, it's probably because CUDA or cuDNN are not properly installed. See CUDA instructions at #700 (comment). For cuDNN, make sure CUDNN_PATH is set in your environment.
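
As a rough sketch of that log search, the helper below pulls out the lines that follow the marker. The function name lines_after_marker and the default window of 40 lines are assumptions chosen for illustration, not part of TE's tooling.

```python
def lines_after_marker(log_text,
                       marker="Building CMake extension transformer_engine",
                       context=40):
    """Return up to `context` lines of `log_text`, starting at the first line
    that contains `marker`. Returns an empty list if the marker never appears."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index:index + context]
    return []
```

Given a saved build log (for example, one captured by redirecting the output of a verbose pip install to a file), printing the returned lines should show the CMake output without the surrounding Python stack trace.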

@BrunoFANG1

I solved this issue by simply running this command in the TransformerEngine directory:

git submodule update --init --recursive

I hope this helps.
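
To tell whether this fix applies before rebuilding, a small check like the following may help. It is a sketch that assumes the checkout came from git clone (where an uninitialized submodule is left as an empty directory) and that the cudnn-frontend submodule lives at 3rdparty/cudnn-frontend, as the include path in the compiler command earlier in this thread suggests. The helper name is hypothetical.

```python
from pathlib import Path


def submodule_initialized(repo_root, rel_path="3rdparty/cudnn-frontend"):
    """Return True if the submodule directory exists and is non-empty.

    An uninitialized submodule in a fresh clone is an empty directory,
    so an empty or missing directory suggests that
    `git submodule update --init --recursive` has not been run.
    """
    path = Path(repo_root) / rel_path
    return path.is_dir() and any(path.iterdir())
```

Running submodule_initialized(".") from the TransformerEngine directory and getting False would point at the submodule fix above.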

@sfdeggb

sfdeggb commented Jul 16, 2024

I am hitting the same problem. The error details are:

raise RuntimeError(f"Error when running CMake: {e}")
RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-yvwm9h7r/transformer_engine', '-B', '/tmp/pip-req-build-yvwm9h7r/build/cmake',
DPython_EXECUTABLE=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/bin/python3.1', '-DPython_INCLUDE_DIR=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/include/python3.11', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-yvwm9h7r/build/lib.linux-x86_64-cpython-311', '-GNinja']' returned non-zero exit status 1.

My environment is below:
Ubuntu 22.04
CUDA: 11.7
Python: 3.11
torch: 2.3.1
NVIDIA driver: 535.183.06

Looking forward to a solution!

@wplf
Contributor

wplf commented Jul 16, 2024

Hello, my friend!
You can check whether nvcc is on your PATH:

nvcc --version

If that fails, you may be able to fix it with export PATH=/usr/local/cuda/bin:$PATH or something similar.
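
The same check can be scripted. The sketch below looks up nvcc on PATH and parses the "release X.Y" field from its version output; the function names are hypothetical, and the regular expression assumes the output format that nvcc prints in the example later in this thread.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_release():
    """Return the CUDA release reported by the nvcc on PATH, or None if
    nvcc is not found or its output cannot be parsed."""
    executable = shutil.which("nvcc")
    if executable is None:
        return None
    result = subprocess.run([executable, "--version"],
                            capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)
```

A return value of None from nvcc_release() suggests that nvcc is not on PATH, which is the case the export PATH line above is meant to address.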

@sfdeggb

sfdeggb commented Jul 16, 2024

@wplf Yes, my nvcc seems OK. The output is below:

ubuntu@ip-172-31-38-93:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Are there any other solutions?

@wplf
Contributor

wplf commented Jul 16, 2024

Can you check your CMake version?
You can install CMake with pip install cmake.

@sfdeggb

sfdeggb commented Jul 16, 2024

@wplf
the cmake version is below:

(yuxunlian) ubuntu@ip-172-31-38-93:~$ cmake --version
cmake version 3.22.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).

Is this version appropriate?

@wplf
Contributor

wplf commented Jul 16, 2024

Yes, this version is OK.
Sorry, I can't help any further.

@sfdeggb

sfdeggb commented Jul 16, 2024

@wplf
No worries! Thank you for your reply!

@FidanVural

Any update on this issue? I'm still getting the same error.

@timmoon10
Collaborator

timmoon10 commented Oct 4, 2024

If you are experiencing an error that looks like RuntimeError: Error when running CMake, then something has failed in the build process (probably a CMake configuration error or a compilation error). Please look through the build logs to find more details or post enough of the build logs so we can figure out what's going on. To print the maximum amount of information during the build process:

cd transformer_engine
pip install -v -v -v .

Some common build errors and fixes:

  • Uninitialized Git submodules: Run git submodule update --init --recursive.
  • CMake can't find a C++ compiler: Set CXX in the environment.
  • CMake can't find CUDA: Set CUDA_PATH in the environment.
  • CMake can't find cuDNN: Set CUDNN_PATH in the environment.
  • Invalid dependency versions: Consult TE's requirements. As of TE 1.11, TE requires CUDA 12.0+ and cuDNN 8.1+.
  • Hang during compilation: Try disabling parallelism in the build process by setting MAX_JOBS=1 and NVTE_BUILD_THREADS_PER_JOB=1 in the environment. See the comment on "stuck at building wheel" (#1077) for more guidance.
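
Several of the items above can be checked up front. The following is a minimal sketch, not part of TE: the function name build_hints is hypothetical, and an unset variable is only reported as a hint, since CMake can often locate the compiler and CUDA without it. The 12.0 minimum comes from the TE 1.11 requirement stated above.

```python
import os

# Taken from the list above: as of TE 1.11, CUDA 12.0+ is required.
MIN_CUDA = (12, 0)


def build_hints(env, cuda_version=None):
    """Return a list of hints about the build environment.

    `env` is a mapping such as os.environ. `cuda_version` is an optional
    (major, minor) tuple. An empty list means nothing obvious was flagged.
    """
    hints = []
    for var, what in (("CXX", "a C++ compiler"),
                      ("CUDA_PATH", "CUDA"),
                      ("CUDNN_PATH", "cuDNN")):
        if not env.get(var):
            hints.append(f"{var} is unset; set it if CMake cannot find {what}")
    if cuda_version is not None and tuple(cuda_version) < MIN_CUDA:
        found = ".".join(str(part) for part in cuda_version)
        hints.append(f"CUDA {found} is older than the required 12.0")
    return hints


if __name__ == "__main__":
    for hint in build_hints(os.environ):
        print(hint)
```

For instance, passing cuda_version=(11, 7) would produce a version hint, because 11.7 is below the 12.0 minimum.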

I'll lock this issue to make this comment easier for users to find, but please open a new issue if you are encountering a build error (with enough of the build log for us to help).

@NVIDIA NVIDIA locked as resolved and limited conversation to collaborators Oct 4, 2024