nccl errors after moving to gcc 10.3.0 #494
Comments
Using Maybe it's a |
This does indeed look like a bug in GCC 10.3.0 (segmentation fault), so I would suggest reporting it to the GCC project rather than to NCCL. If this GCC version is supported by CUDA, you could also open an issue on developer.nvidia.com. |
Same issue with newest Will look into reporting to |
In the mentions above that link to this issue, the following GCC bug reports are listed as possibly relevant: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100101 Also looks relevant, I found it searching GCC Bugzilla for recent I'm no compiler developer, so I can't follow the technical descriptions. If GCC fixes the problem only in GCC 11 and GCC 12, those of us on OSes that provide GCC 10.3 might be stuck in a bad place for a while. In that case, hopefully there will be patches available for GCC 10.3. BTW "ICE" = |
I know this isn't strictly how bugs should be fixed. However, if you're desperate and under a deadline like me, comment out that block of code
The segmentation fault disappears. |
@Medoalmasry thanks (the change refers to |
@edrozenberg Let's hope one day the gcc compiler is stable enough for us NOT to have to go seek that knowledge. |
The GCC project has committed a patch: https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5357ab75dedef403b0eebf9277d61d1cbeb5898f The patch removes ~ 40 lines from Works fine for me - I tested as follows:
|
For clarity: depending on your machine, even the latest cudatoolkit 11.4 is ONLY supported with gcc 10.2 (working on Ubuntu 20.04). WARNING: use timeshift to take a snapshot of your working system first. UPDATE
nvcc will correctly pick up gcc-9. Your system can remain as is.
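As a rough illustration of pointing nvcc at an older host compiler without changing the system default, here is a minimal Python sketch. It assumes nvcc's `-ccbin` option (a real nvcc flag for choosing the host compiler), which may not be the method the commenter used. The helper names, the candidate list, and the file names are hypothetical.

```python
import shutil


def find_host_compiler(candidates=("gcc-9", "gcc-10")):
    """Return the path of the first candidate compiler found on PATH, else None.

    The candidate names are assumptions for illustration; adjust them to
    whatever versioned gcc binaries your distribution installs.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def nvcc_command(source, host_compiler="gcc-9", output="a.out"):
    """Build an nvcc argument list that pins the host compiler via -ccbin."""
    return ["nvcc", "-ccbin", host_compiler, "-o", output, source]


if __name__ == "__main__":
    compiler = find_host_compiler()
    if compiler is None:
        print("no candidate host compiler found on PATH")
    else:
        # Print the command instead of running it, so the sketch is side-effect free.
        print(" ".join(nvcc_command("test.cu", host_compiler=compiler)))
```

The script only prints the command line, so it can be inspected before anything is compiled.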
|
I think it's worth mentioning here that CUDA 11.4 is listed as being compatible with GCC 10.2 at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements but the page doesn't mention any later GCC version, which is consistent with the failures listed above. |
@johndpope That's not accurate, because Nvidia does a terrible job of updating their various docs. For ex. gcc 10 support was already mentioned months ago in https://docs.nvidia.com/cuda/archive/11.1.1/cuda-toolkit-release-notes/index.html The real fix will come with gcc 10.4 and future gcc 11 and 12 releases. Meanwhile, my distro Slackware has patched gcc 10.3 with patches from the gcc dev source, and it's working great for me with Nvidia stuff (PR100xxxx patches ftp://ftp.slackware.com/pub/slackware/slackware64-current/source/d/gcc/patches). |
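Building as-is on fedora32 + cuda11 triggers errors around chrono with g++ 10.3. See: NVIDIA/nccl#494 gcc 10.2 OK; gcc 10.3, 11.0, 11.1 broken; gcc 10.4, gcc 11.2 OK. I therefore worked around it by using dnf downgrade to revert gcc to 10.0. The current gcc on fedora34 is 11.2.1, so this problem is resolved there, but starting with gcc-11.2.1-9 a different problem related to std::function occurs. See: NVIDIA/nccl#102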
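The version list in the comment above (gcc 10.2 OK; 10.3, 11.0, 11.1 broken; 10.4, 11.2 OK) can be turned into a quick check. The following is a minimal Python sketch that assumes that single commenter's report is accurate; it is not an official compatibility table. The function names are hypothetical, and the input is assumed to be a plain version string such as the output of `gcc -dumpfullversion`.

```python
import re

# (major, minor) pairs reported as broken by one commenter in this thread.
# This is an assumption taken from that report, not an official list.
REPORTED_BROKEN = {(10, 3), (11, 0), (11, 1)}


def parse_gcc_version(text):
    """Extract (major, minor, patch) from a string such as '10.3.0' or '11.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_reported_broken(text):
    """Return True if the gcc version falls in the range reported as broken."""
    major, minor, _patch = parse_gcc_version(text)
    return (major, minor) in REPORTED_BROKEN


if __name__ == "__main__":
    for version in ("10.2.0", "10.3.0", "11.1.0", "11.2.1"):
        status = "reported broken" if is_reported_broken(version) else "not in the reported-broken list"
        print(f"gcc {version}: {status}")
```

A version that is not in the reported-broken list is not guaranteed to work; the same comment notes a separate std::function problem starting with gcc-11.2.1-9.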
Hi,
Updated my OS (Slackware -current aka 14.2+ aka 15.0- :), which updated the various gcc-* pkgs from 10.2.0 to 10.3.0.
After these gcc updates, getting nccl errors, for ex. when trying to build the nccl test samples, and also when trying to build pytorch which uses nccl. Downgrading gcc back to 10.2.0 is a workaround, for now.
nccl test samples build error: