FYI: Notes on using nccl and MPS with Torch... #46
YulunW added a commit to YulunW/nccl that referenced this issue on Mar 21, 2024:
Summary: Pull Request resolved: facebookresearch#46. Differential Revision: D55168758
YulunW added a commit to YulunW/nccl that referenced this issue on Mar 23, 2024:
Summary: Pull Request resolved: facebookresearch#46. Add start-time information in CollTrace. The worker thread now also waits for the start event of each collective, which helps post-hoc analysis during hangs by revealing dependencies between collectives. Reviewed By: minsii. Differential Revision: D55168758. fbshipit-source-id: df908efae5d96c03f31b3672640c2e001ae68af9
I have written up some notes and cookbook examples of using MPS and nccl with Torch, which may help Torch users who are new to multi-process, multi-GPU environments.
My notes can be found at:
https://github.com/CCorfield/Torch-parallel-nccl-MPS-Example
Please advise on corrections and additions.
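For readers skimming this issue, here is a minimal sketch of the kind of multi-process setup the linked notes cover: one worker process per GPU, torch.distributed with the nccl backend, and a single all-reduce. This example is not taken from the notes themselves; the master address, port, and tensor contents are illustrative assumptions.

```python
# Minimal sketch, not taken from the linked notes: one worker process per GPU,
# torch.distributed with the NCCL backend, and a single all-reduce.
# The master address/port and tensor contents are illustrative assumptions.
# If several processes instead share one GPU, the CUDA MPS control daemon
# is typically started beforehand (e.g. `nvidia-cuda-mps-control -d`).
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # assumed single-node setup
    os.environ["MASTER_PORT"] = "29500"       # assumed free port
    torch.cuda.set_device(rank)               # bind this process to its GPU
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each rank contributes a tensor; all_reduce sums them across all GPUs.
    t = torch.full((4,), float(rank + 1), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```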