-
Notifications
You must be signed in to change notification settings - Fork 479
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
move tensor concatenation out of update and before compute collective…
… call (#1656) Summary: Pull Request resolved: #1656 Previously unlanded D51399384 because there was a NCCL AllGather hang caused by the previous version of the optimization. This diff fixes the issue by moving the tensor concatenation before the collective is called. By doing this, model outputs are transformed to be identical to what was previous, and thus the flow (sync, collective call, compute) is identical to previous AUC implementation. But at the same time, we avoid concatting every update thus leveraging significant gains. **Why did the NCCL collective hang last time?** There existed an edge case where ranks could have mismatched number of tensors, the way torchmetrics sync and collective call is written it calls an allgather per tensor in a list of tensors. Since number of tensors in a list across ranks is not guaranteed to be the same, the issue can arise where an allgather is being called for a tensor that does not exist. Imagine, 3 tensors on rank 0 and 2 tensors on rank 1. For calling allgather on each tensor, on rank 0 calling on the 3rd tensor will cause a NCCL hang and subsequent timeout since there is no tensor on rank 1 that corresponds to it. In the case where other ranks have more tensors than rank 0, the collective would not hang but those additional tensors would not be collected on to rank 0 resulting in a wrong calculation. The unit test covers for this case as well. A additional GPU unit test is added to cover this scenario, where on rank 0 - 2 tensors are passed in and on rank 1 - 1 tensor is passed in (and vice versa). With the concat before the collective, the test passes. Without the concat, the test will hang. Reviewed By: dstaay-fb Differential Revision: D53027097 fbshipit-source-id: 37746acda3a1b83120e46520f1b22a98dc6d51f1
- Loading branch information
1 parent
84c92d2
commit bc52746
Showing
4 changed files
with
362 additions
and
49 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.