CUDA RT IB fails to build onnxruntime pkg #8571

Closed
aandvalenzuela opened this issue Jun 27, 2023 · 5 comments

@aandvalenzuela
Contributor

aandvalenzuela commented Jun 27, 2023

Hello,

The dedicated CUDA runtime (CUDART) IB fails to build the onnxruntime package. From the latest CUDART IB log:

FAILED: CMakeFiles/onnxruntime_providers_cuda.dir/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu.o 
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/cuda/11.5.2-2a1e4dd2237c71998d9badc1052421af/bin/nvcc -forward-unknown-to-host-compiler -DCPUINFO_SUPPORTED_PLATFORM=1 -DEIGEN_MPL2_ONLY -DEIGEN_USE_THREADS -DENABLE_CPU_FP16_TRAINING_OPS -DNSYNC_ATOMIC_CPP11 -DONNX_ML=1 -DONNX_NAMESPACE=onnx -DORT_ENABLE_STREAM -DPLATFORM_POSIX -DUSE_CUDA=1 -Donnxruntime_providers_cuda_EXPORTS -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/include/onnxruntime -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/include/onnxruntime/core/session -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/pytorch_cpuinfo-src/include -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/google_nsync-src/public -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/abseil_cpp-src -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/safeint-src -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/gsl-src/include 
-I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/onnx-src -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/onnx-build -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/protobuf-src/src -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/flatbuffers-src/include -I/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/cudnn/8.8.0.121-b294749e5f0cb76cc4ef362a1d43fd69/include -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/eigen-src -I/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/cuda/11.5.2-2a1e4dd2237c71998d9badc1052421af/targets/x86_64-linux/include -I/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/build/_deps/mp11-src/include -cudart shared --expt-relaxed-constexpr --Werror default-stream-launch -Xcudafe "--diag_suppress=bad_friend_decl" -Xcudafe "--diag_suppress=unsigned_compare_with_zero" -Xcudafe "--diag_suppress=expr_has_no_effect" -O3 -DNDEBUG --generate-code=arch=compute_60,code=[compute_60,sm_60] --generate-code=arch=compute_70,code=[compute_70,sm_70] --generate-code=arch=compute_75,code=[compute_75,sm_75] -Xcompiler=-fPIC --diag-suppress 554 --compiler-options -Wall --compiler-options -Wno-deprecated-copy --compiler-options -Wno-nonnull-compare -Xcompiler -Wno-nonnull-compare --threads "" -Xcompiler -Wno-reorder -Xcompiler -Wno-error=sign-compare -Werror all-warnings -MD -MT 
CMakeFiles/onnxruntime_providers_cuda.dir/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu.o -MF CMakeFiles/onnxruntime_providers_cuda.dir/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu.o.d -x cu -c /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu -o CMakeFiles/onnxruntime_providers_cuda.dir/data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el8_amd64_gcc11/external/onnxruntime/1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3/onnxruntime-1.14.1/onnxruntime/core/providers/cuda/math/binary_elementwise_ops_impl.cu.o
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_function.h:435:145: error: parameter packs not expanded with '...':
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_function.h:435:145: note:         '_ArgTypes'
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_function.h:530:146: error: parameter packs not expanded with '...':
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_function.h:530:146: note:         '_ArgTypes'
ninja: build stopped: subcommand failed.
error: Bad exit status from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/rpm-tmp.NqgYhJ (%build)


RPM build errors:
    line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+onnxruntime+1.14.1-e4f32e7ee87ac6022c8ef3aa4cfda0b3
    Macro expanded in comment on line 350: %{pkginstroot}/${PYTHON3_LIB_SITE_PACKAGES}

    Bad exit status from /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/rpm-tmp.NqgYhJ (%build)

This issue seems to have appeared with the gcc update to 11.4.1 (#8545).
I think it is the same issue reported in NVIDIA/nccl#650 (comment). I will apply the patch proposed there to see whether it solves the problem.

FYI, @smuzaffar @fwyzard

@cmsbuild
Contributor

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@iarspider
Contributor

TensorFlow fails with the same error.

@smuzaffar
Contributor

It looks like this IB is using CUDA 11.5. Please update it to CUDA 11.8.0. (I think the automatic forward port of #8545 to the cudart branch did not work because the CUDA specs differ between the master and cudart branches.)
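
One way to confirm which toolchain versions a build actually used is to read them off the install paths in the log. The following is a minimal sketch, assuming the `external/<package>/<version>-<hash>/` path layout visible in the log quoted in this issue; the function name `tool_versions` is hypothetical and not part of any CMS tooling.

```python
import re

# Matches install paths such as .../external/cuda/11.5.2-<hash>/bin/nvcc.
# Assumption: external packages live under external/<name>/<version>-<hash>/,
# as in the log quoted in this issue.
_PATH_RE = re.compile(r"/external/([A-Za-z0-9_+]+)/([0-9][0-9.]*)-[0-9a-f]+/")


def tool_versions(log_text):
    """Return {package: set of versions} for every external package path found."""
    versions = {}
    for name, version in _PATH_RE.findall(log_text):
        versions.setdefault(name, set()).add(version)
    return versions


if __name__ == "__main__":
    sample = (
        "/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/cuda/"
        "11.5.2-2a1e4dd2237c71998d9badc1052421af/bin/nvcc "
        "/data/cmsbld/jenkins/workspace/build-any-ib/w/el8_amd64_gcc11/external/gcc/"
        "11.4.1-30ebdc301ebd200f2ae0e3d880258e65/include/c++/11.4.1/bits/std_function.h"
    )
    for package, found in sorted(tool_versions(sample).items()):
        print(package, sorted(found))
```

Run against the full build log, this would list each external package with the version(s) it was picked up in, which makes a stale CUDA version next to a newer gcc easy to spot.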

@smuzaffar
Contributor

Also make sure that CUDA 11.8.0 is installed in the special cvmfs area ( /cvmfs/projects.cern.ch/cms-restricted/$(uname -m)/rhel8/external/cuda ). If it is not, you can run the https://cmssdt.cern.ch/jenkins/view/CVMFS-CMS/job/cvmfs-install-cuda/ job to install it.
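
The check described above could be scripted roughly as follows. This is a sketch only: it assumes each CUDA installation appears in that area as a directory named after its version (optionally followed by `-<hash>`), which this thread does not confirm, and `cuda_area` and `cuda_installed` are hypothetical helper names.

```python
import os
import platform


def cuda_area(arch=None):
    """Build the restricted cvmfs CUDA path quoted in the comment above."""
    # platform.machine() normally reports the same value as `uname -m` on Linux.
    arch = arch or platform.machine()
    return f"/cvmfs/projects.cern.ch/cms-restricted/{arch}/rhel8/external/cuda"


def cuda_installed(version, area):
    """Return True if `area` contains an entry for `version`.

    Assumption: entries are named <version> or <version>-<hash>.
    """
    if not os.path.isdir(area):
        return False
    return any(
        name == version or name.startswith(version + "-")
        for name in os.listdir(area)
    )


if __name__ == "__main__":
    area = cuda_area()
    print(area, cuda_installed("11.8.0", area))
```

If the result is `False`, the Jenkins job mentioned above is the route the comment suggests for installing the missing version.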

@aandvalenzuela
Contributor Author

Closing, since the CUDA update solved the issue. Thanks!
