nccl errors after moving to gcc 10.3.0 #494

Closed
edrozenberg opened this issue Apr 13, 2021 · 12 comments

Comments

@edrozenberg

edrozenberg commented Apr 13, 2021

Updated my OS (Slackware -current aka 14.2+ aka 15.0- :), which updated the various gcc-* pkgs from 10.2.0 to 10.3.0.

After these gcc updates I'm getting errors when building anything that uses nccl, e.g. the nccl test samples and pytorch. Downgrading gcc back to 10.2.0 is a workaround, for now.

nccl test samples build error:

[eduardr@work1 ~/Developer/External/nvidia/nccl/nccl-tests-git]$ make
make -C src build
make[1]: Entering directory '/home/eduardr/Developer/External/nvidia/nccl/nccl-tests-git/src'
Compiling  all_reduce.cu                       > ../build/all_reduce.o
/usr/include/c++/10.3.0/chrono: In substitution of ‘template<class _Rep, class _Period> template<class _Period2> using __is_harmonic = std::__bool_constant<(std::ratio<((_Period2::num / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::num, _Period::num)) * (_Period::den / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den, _Period::den))), ((_Period2::den / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::den, _Period::den)) * (_Period::num / std::chrono::duration<_Rep, _Period>::_S_gcd(_Period2::num, _Period::num)))>::den == 1)> [with _Period2 = _Period2; _Rep = _Rep; _Period = _Period]’:
/usr/include/c++/10.3.0/chrono:473:154:   required from here
/usr/include/c++/10.3.0/chrono:428:27: internal compiler error: Segmentation fault
  428 |  _S_gcd(intmax_t __m, intmax_t __n) noexcept
      |                           ^~~~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.
make[1]: *** [Makefile:84: ../build/all_reduce.o] Error 1
make[1]: Leaving directory '/home/eduardr/Developer/External/nvidia/nccl/nccl-tests-git/src'
make: *** [Makefile:17: src.build] Error 2
@edrozenberg
Author

edrozenberg commented Apr 13, 2021

Using nccl-2.8.4.1_11.2, cuda 11.2.2. Same issue with both binary nccl download (built by nvidia), and with nccl I built from source.

Maybe it's a gcc problem, not sure - if it is, then I'd have to file it with the gcc project.

@sjeaugey
Member

This does indeed look like a bug in GCC 10.3.0 (segmentation fault), so I would suggest reporting it to the GCC project rather than NCCL. If this GCC version is supported by CUDA then you could also open an issue on developer.nvidia.com.

@edrozenberg
Author

Same issue with newest nccl 2.9.6.1 and CUDA 11.3.0.

Will look into reporting to gcc project.

@edrozenberg
Author

edrozenberg commented Apr 25, 2021

In the mentions above that link to this issue, the following GCC bug reports are reported as possibly relevant:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100101
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102

This one also looks relevant; I found it by searching GCC Bugzilla for recent reports mentioning intmax_t:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100240

I'm no compiler developer so can't follow the technical descriptions.

If GCC fixes whatever the problem is only in GCC 11 and GCC 12, those of us on OSes that provide GCC 10.3 might be stuck in a bad place for a while. In that case hopefully there will be patches available for GCC 10.3.

BTW "ICE" = internal compiler error :)

@Medoalmasry

I know this isn't strictly how bugs should be fixed. However, if you're desperate and under a deadline like me, comment out this block of code in the chrono header (/usr/include/c++/10.3.0/chrono in the build log above):

        /*
        static constexpr intmax_t
        _S_gcd(intmax_t __m, intmax_t __n) noexcept
        {
          // Duration only allows positive periods so we don't need to
          // support negative values here (unlike __static_gcd and std::gcd).
          return (__m == 0) ? __n : (__n == 0) ? __m : _S_gcd(__n, __m % __n);
        }
        */

The segmentation fault disappears.

@edrozenberg
Author

@Medoalmasry thanks (the change refers to gcc/libstdc++-v3/include/std/chrono). I don't have even 5% of the knowledge required to know if this is an OK thing to do.

@Medoalmasry

@edrozenberg Let's hope one day the gcc compiler is stable enough for us NOT to have to go seek that knowledge.

@edrozenberg
Author

edrozenberg commented Jun 4, 2021

The GCC project has committed a patch:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5357ab75dedef403b0eebf9277d61d1cbeb5898f
(in response to the problem report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102)

The patch removes ~ 40 lines from gcc/cp/pt.c and adds a couple of test cases.

Works fine for me - I tested as follows:

  • Rebuilt (Slackware) gcc 10.3.0 packages with the patched pt.c and updated to them
  • Built Nvidia nccl-tests - build succeeded (was failing before the patch)
  • Built latest pytorch git with Nvidia cuda/nccl support - build succeeded (was failing before the patch)

@johndpope

johndpope commented Jul 3, 2021

For clarity - depending on your machine - even the latest cudatoolkit 11.4 ONLY supports
gcc 9.x - https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

gcc 10.2 working (ubuntu 20.04).
gcc 10.3 broken.
gcc 11.0, 11.1 - broken.
gcc 10.4, gcc 11.2 - fix things.
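
The version list above can be turned into a quick check. What follows is only a minimal sketch that encodes the versions reported in this thread and nothing else; `gcc_ice_status` is a made-up helper name, not part of any tool. A distro build of 10.3 that carries the backported fix would still be reported as broken, since the version string alone cannot tell the difference.

```shell
#!/bin/sh
# Classify a GCC version string using only the versions reported in this thread.
# gcc_ice_status is a hypothetical helper name chosen for this sketch.
gcc_ice_status() {
    case "$1" in
        10.3|10.3.*|11.0|11.0.*|11.1|11.1.*) echo "broken" ;;
        10.2|10.2.*|10.4|10.4.*|11.2|11.2.*) echo "ok" ;;
        *) echo "unknown" ;;   # not reported either way in this thread
    esac
}

# Check the compiler currently on PATH, if there is one.
# -dumpfullversion prints major.minor.patch on GCC 7 and later.
if command -v gcc >/dev/null 2>&1; then
    ver="$(gcc -dumpfullversion 2>/dev/null)"
    echo "gcc ${ver:-?}: $(gcc_ice_status "$ver")"
fi
```

For example, `gcc_ice_status 10.3.0` prints `broken` and `gcc_ice_status 10.2.0` prints `ok`.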

WARNING - use timeshift to take a snapshot of your working system.
https://github.com/teejee2008/timeshift

UPDATE
Once you install gcc-9 you can make a symlink to it in the CUDA bin folder (I had to create the bin folder).
https://stackoverflow.com/questions/44792279/set-default-host-compiler-for-nvcc

sudo apt install gcc-9 g++-9
sudo mkdir -p /usr/local/cuda/bin
sudo ln -s /usr/bin/gcc-9 /usr/local/cuda/bin/gcc

nvcc will then pick up gcc-9.

The rest of your system can remain as is:

gcc --version                                                      
gcc (Ubuntu 10.3.0-1ubuntu1~20.10) 10.3.0 (your broken version)
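
If you would rather not add a symlink, nvcc also accepts the host compiler per invocation through its `-ccbin` (`--compiler-bindir`) option. The following is a small dry-run sketch, assuming g++-9 is installed at /usr/bin/g++-9 as in the apt command above; `nvcc_cmd_with_host` is a hypothetical helper that only prints the command line, and the file names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print the nvcc command line that selects a host compiler.
# It only prints; run the printed command yourself once it looks right.
nvcc_cmd_with_host() {
    host_cxx="$1"   # path to the host C++ compiler, e.g. /usr/bin/g++-9 (assumed path)
    shift
    printf '%s\n' "nvcc -ccbin $host_cxx $*"
}

nvcc_cmd_with_host /usr/bin/g++-9 -c all_reduce.cu -o all_reduce.o
# prints: nvcc -ccbin /usr/bin/g++-9 -c all_reduce.cu -o all_reduce.o
```

Compared with the symlink, this leaves /usr/local/cuda untouched, but every build has to pass the option, so it fits best when you control the build command yourself.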

@cponder

cponder commented Jul 8, 2021

I think it's worth mentioning that CUDA 11.4 is listed as compatible with GCC 10.2 here

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements

but the page doesn't mention any later version, which is consistent with the failures listed above.

@edrozenberg
Author

> For clarity - depending on your machine - even the latest cudatoolkit 11.4 ONLY supports
> gcc 9.x - https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

@johndpope That's not accurate, because Nvidia does a terrible job of updating their various docs. For ex. gcc 10 support was already mentioned months ago in https://docs.nvidia.com/cuda/archive/11.1.1/cuda-toolkit-release-notes/index.html

The real fix will come with gcc 10.4 and future gcc 11 and 12 releases. Meanwhile my distro Slackware has patched gcc 10.3 with patches from gcc dev source and it's working great for me with Nvidia stuff (PR100xxxx patches ftp://ftp.slackware.com/pub/slackware/slackware64-current/source/d/gcc/patches).

rigaya added a commit to rigaya/NVEnc that referenced this issue Feb 20, 2022
Building as-is on Fedora 32 + CUDA 11 fails with an error around chrono in g++ 10.3.
See: NVIDIA/nccl#494
  gcc 10.2 OK
  gcc 10.3, 11.0, 11.1 broken
  gcc 10.4, gcc 11.2 OK
As a workaround, gcc was rolled back to 10.0 with dnf downgrade.

The current gcc on Fedora 34 is 11.2.1, so this problem is resolved there, but from gcc-11.2.1-9 onward a different problem involving std::function occurs.
See: NVIDIA/nccl#102
@feboz

feboz commented Apr 20, 2022

> The GCC project has committed a patch:
>
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5357ab75dedef403b0eebf9277d61d1cbeb5898f (in response to the problem report https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100102)
>
> The patch removes ~ 40 lines from gcc/cp/pt.c and adds a couple of test cases.
>
> Works fine for me - I tested as follows:
>
>   • Rebuilt (Slackware) gcc 10.3.0 packages with the patched pt.c and updated to them
>   • Built Nvidia nccl-tests - build succeeded (was failing before the patch)
>   • Built latest pytorch git with Nvidia cuda/nccl support - build succeeded (was failing before the patch)

Hi,
where can I download the patched version?
Thank you
