
🐛[BUG]: Graphcast: Error when running mpirun --allow-run-as-root -np 3 for GraphCast model, but works with -np 2 #539

Closed
Flionay opened this issue May 31, 2024 · 1 comment

Flionay commented May 31, 2024

Version

0.5.0

On which installation method(s) does this occur?

Docker

Describe the issue

When I run the GraphCast model with mpirun --allow-run-as-root -np 3 python train_graphcast.py, I encounter an error. However, when I use mpirun --allow-run-as-root -np 2 python train_graphcast.py, the model runs without any issues.

Any help identifying the cause would be appreciated. The output log from my program is below.

Minimum reproducible example

mpirun --allow-run-as-root -np 3 python train_graphcast.py
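For reference, the error message itself suggests rerunning with `NCCL_DEBUG=INFO`. NCCL's verbose logging is controlled by environment variables, which can be exported before launching; `NCCL_DEBUG_SUBSYS` is optional and narrows the output (e.g. to initialization messages):

```shell
export NCCL_DEBUG=INFO         # verbose NCCL logging, as the error message suggests
export NCCL_DEBUG_SUBSYS=INIT  # optional: limit output to initialization-phase messages
# then rerun the failing command, e.g.:
# mpirun --allow-run-as-root -np 3 python train_graphcast.py
```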

Relevant log output

Cuda failure 1 'invalid argument'
Traceback (most recent call last):
  File "/graphcast/train_graphcast_2to1.py", line 523, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/graphcast/train_graphcast_2to1.py", line 340, in main
    trainer = GraphCastTrainer(cfg, dist, rank_zero_logger)
  File "/graphcast/train_graphcast_2to1.py", line 145, in __init__
    self.model = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1727, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 1 'invalid argument'
----------------------------------------------
59b88225e9fb:14085:14085 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14085:14085 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14085:14085 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.19.3+cuda12.3
59b88225e9fb:14087:14087 [2] NCCL INFO cudaDriverVersion 12030
59b88225e9fb:14087:14087 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14087:14087 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14085:14711 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14085:14711 [0] NCCL INFO P2P plugin IBext
59b88225e9fb:14085:14711 [0] NCCL INFO NET/IB : No device found.
59b88225e9fb:14085:14711 [0] NCCL INFO NET/IB : No device found.
59b88225e9fb:14085:14711 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14085:14711 [0] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14085:14711 [0] NCCL INFO Using network Socket
59b88225e9fb:14087:14712 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14087:14712 [2] NCCL INFO P2P plugin IBext
59b88225e9fb:14087:14712 [2] NCCL INFO NET/IB : No device found.
59b88225e9fb:14087:14712 [2] NCCL INFO NET/IB : No device found.
59b88225e9fb:14087:14712 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14087:14712 [2] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14087:14712 [2] NCCL INFO Using network Socket
59b88225e9fb:14086:14086 [1] NCCL INFO cudaDriverVersion 12030
59b88225e9fb:14086:14086 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.6<0>
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
59b88225e9fb:14086:14086 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
59b88225e9fb:14086:14713 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
59b88225e9fb:14086:14713 [1] NCCL INFO P2P plugin IBext
59b88225e9fb:14086:14713 [1] NCCL INFO NET/IB : No device found.
59b88225e9fb:14086:14713 [1] NCCL INFO NET/IB : No device found.
59b88225e9fb:14086:14713 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.6<0>
59b88225e9fb:14086:14713 [1] NCCL INFO Using non-device net plugin version 0
59b88225e9fb:14086:14713 [1] NCCL INFO Using network Socket
59b88225e9fb:14086:14713 [1] NCCL INFO comm 0x5574f22e0f80 rank 1 nranks 3 cudaDev 1 nvmlDev 3 busId 66000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14085:14711 [0] NCCL INFO comm 0x5613bb376e20 rank 0 nranks 3 cudaDev 0 nvmlDev 2 busId 3f000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14087:14712 [2] NCCL INFO comm 0x55eb79b74d90 rank 2 nranks 3 cudaDev 2 nvmlDev 4 busId 9b000 commId 0xb80ced506beebbca - Init START
59b88225e9fb:14086:14713 [1] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
59b88225e9fb:14086:14713 [1] NCCL INFO NVLS multicast support is available on dev 1
59b88225e9fb:14085:14711 [0] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
59b88225e9fb:14087:14712 [2] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
59b88225e9fb:14087:14712 [2] NCCL INFO NVLS multicast support is available on dev 2
59b88225e9fb:14085:14711 [0] NCCL INFO NVLS multicast support is available on dev 0
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 00/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 01/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 02/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 03/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 04/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 05/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/24 :    0   1   2
59b88225e9fb:14087:14712 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1 [4] -1/-1/-1->2->1 [5] -1/-1/-1->2->1 [6] 0/-1/-1->2->1 [7] 0/-1/-1->2->1 [8] 0/-1/-1->2->1 [9] 0/-1/-1->2->-1 [10] 0/-1/-1->2->-1 [11] 0/-1/-1->2->-1 [12] -1/-1/-1->2->1 [13] -1/-1/-1->2->1 [14] -1/-1/-1->2->1 [15] -1/-1/-1->2->1 [16] -1/-1/-1->2->1 [17] -1/-1/-1->2->1 [18] 0/-1/-1->2->1 [19] 0/-1/-1->2->1 [20] 0/-1/-1->2->1 [21] 0/-1/-1->2->-1 [22] 0/-1/-1->2->-1 [23] 0/-1/-1->2->-1
59b88225e9fb:14087:14712 [2] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 12/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 13/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 14/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 15/24 :    0   1   2
59b88225e9fb:14086:14713 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->-1 [7] 2/-1/-1->1->-1 [8] 2/-1/-1->1->-1 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->-1 [19] 2/-1/-1->1->-1 [20] 2/-1/-1->1->-1 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
59b88225e9fb:14086:14713 [1] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 16/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 17/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/24 :    0   1   2
59b88225e9fb:14085:14711 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->2 [7] -1/-1/-1->0->2 [8] -1/-1/-1->0->2 [9] 1/-1/-1->0->2 [10] 1/-1/-1->0->2 [11] 1/-1/-1->0->2 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->2 [19] -1/-1/-1->0->2 [20] -1/-1/-1->0->2 [21] 1/-1/-1->0->2 [22] 1/-1/-1->0->2 [23] 1/-1/-1->0->2
59b88225e9fb:14085:14711 [0] NCCL INFO P2P Chunksize set to 524288
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 00/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 01/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 02/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 03/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 04/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 05/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 06/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 07/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 08/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 09/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 10/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 11/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 12/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 13/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 14/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 00/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 15/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 01/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 16/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 02/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 17/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 03/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 18/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 04/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 19/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 05/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 20/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 21/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 22/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 23/0 : 2[4] -> 0[2] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 12/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 13/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 14/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 15/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 16/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 17/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/0 : 0[2] -> 1[3] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 00/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 01/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 02/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 03/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 04/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 05/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 06/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 07/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 08/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 09/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 10/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 11/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 12/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 13/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 14/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 15/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 16/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 17/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 18/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 19/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 20/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 21/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 22/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 23/0 : 1[3] -> 2[4] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Connected all rings
59b88225e9fb:14085:14711 [0] NCCL INFO Connected all rings
59b88225e9fb:14087:14712 [2] NCCL INFO Connected all rings
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 06/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 07/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 08/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 09/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 10/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 11/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 18/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 19/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 20/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 21/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 22/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14085:14711 [0] NCCL INFO Channel 23/0 : 0[2] -> 2[4] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 00/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 01/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 02/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 03/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 04/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 05/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 06/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 07/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 08/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 12/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 13/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 14/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 15/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 16/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 17/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 18/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 19/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14087:14712 [2] NCCL INFO Channel 20/0 : 2[4] -> 1[3] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 00/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 01/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 02/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 03/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 04/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 05/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 09/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 10/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 11/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 12/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 13/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 14/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 15/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 16/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 17/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 21/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 22/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Channel 23/0 : 1[3] -> 0[2] via P2P/CUMEM
59b88225e9fb:14086:14713 [1] NCCL INFO Connected all trees
59b88225e9fb:14087:14712 [2] NCCL INFO Connected all trees
59b88225e9fb:14085:14711 [0] NCCL INFO Connected all trees
59b88225e9fb:14086:14713 [1] NCCL INFO NVLS comm 0x5574f22e0f80 headRank 1 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776
59b88225e9fb:14087:14712 [2] NCCL INFO NVLS comm 0x55eb79b74d90 headRank 2 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776
59b88225e9fb:14085:14711 [0] NCCL INFO NVLS comm 0x5613bb376e20 headRank 0 nHeads 3 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 603979776

59b88225e9fb:14086:14713 [1] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14086:14713 [1] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO init.cc:1131 -> 1

59b88225e9fb:14085:14711 [0] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14085:14711 [0] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14085:14711 [0] NCCL INFO init.cc:1131 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14086:14713 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
59b88225e9fb:14085:14711 [0] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14085:14711 [0] NCCL INFO group.cc:64 -> 1 [Async thread]

59b88225e9fb:14087:14712 [2] transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'
59b88225e9fb:14087:14712 [2] NCCL INFO transport/nvls.cc:339 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO init.cc:1131 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO init.cc:1396 -> 1
59b88225e9fb:14087:14712 [2] NCCL INFO group.cc:64 -> 1 [Async thread]
59b88225e9fb:14087:14087 [2] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14087:14087 [2] NCCL INFO group.cc:95 -> 1
59b88225e9fb:14086:14086 [1] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14086:14086 [1] NCCL INFO group.cc:95 -> 1
59b88225e9fb:14085:14085 [0] NCCL INFO group.cc:418 -> 1
59b88225e9fb:14085:14085 [0] NCCL INFO group.cc:95 -> 1
Error executing job with overrides: []
Traceback (most recent call last):
  File "/graphcast/train_graphcast_2to1.py", line 523, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/graphcast/train_graphcast_2to1.py", line 340, in main
    trainer = GraphCastTrainer(cfg, dist, rank_zero_logger)
  File "/graphcast/train_graphcast_2to1.py", line 145, in __init__
    self.model = DistributedDataParallel(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1727, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 1 'invalid argument'

Environment details

No response

@Flionay Flionay added ? - Needs Triage Need team to review and classify bug Something isn't working labels May 31, 2024
@mnabian mnabian self-assigned this Jun 3, 2024

Flionay commented Jun 7, 2024

I wanted to follow up on this issue. Upon further investigation, I realized that the problem was not with the project code but with my local environment. Therefore, I am closing this issue.

For anyone encountering a similar error: the cause and an environment-level fix are described in this discussion: NVIDIA/nccl#976.
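The failure in the log above occurs during NVLS (NVLink SHARP) transport setup (`transport/nvls.cc:169 NCCL WARN Cuda failure 1 'invalid argument'`). If upgrading or fixing the environment is not immediately possible, one commonly reported workaround is to disable NVLS entirely via NCCL's `NCCL_NVLS_ENABLE` environment variable. A minimal sketch (assuming, as NCCL documents, that the variable is read when the communicator is created):

```python
import os

# Disable NCCL's NVLS (NVLink SHARP) transport, whose initialization fails in
# the log above. This must be set before the NCCL communicator is created,
# i.e. before torch.distributed.init_process_group() / DistributedDataParallel;
# setting it afterwards has no effect.
os.environ["NCCL_NVLS_ENABLE"] = "0"

# ... then proceed with the usual distributed setup, e.g.:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```

Equivalently, the variable can be exported in the shell before `mpirun`, e.g. `NCCL_NVLS_ENABLE=0 mpirun ...`. Note this trades away NVLS bandwidth for working initialization; it is a workaround, not a fix for the underlying driver/container issue.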

Thank you for your time and support.

@Flionay Flionay closed this as completed Jun 7, 2024