AUC metric optionally move concat to compute step instead of update through flag #1656

Closed
wants to merge 1 commit

Conversation

zainhuda

Summary:
D51399384 was previously unlanded due to issues with some models. It is being introduced again, this time behind an experimental flag, since this optimization can benefit many models. It should not conflict with any models running the AUC metric. The feature flag is akin to fusing task computation in RecMetrics, which likewise does not work across all families of models.

The change in this diff:
We do not need to concatenate the state tensors on every update step. Since concatenation is an expensive operation (it creates a new tensor and allocates new memory on every call, because tensors are contiguous), we can instead perform the concatenation in the compute step, which only happens every compute_interval_step batches. This optimization should boost the performance of models using AUC with no regression in metric quality.
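As a rough illustration of the idea (a hypothetical, minimal sketch, not the actual torchrec AUC code; the class and method names here are made up), the metric state can be kept as a list of per-update tensors and only concatenated when compute runs:

```python
from typing import List, Tuple

import torch


class DeferredConcatState:
    """Toy sketch: append per-batch tensors in update(), concatenate in compute()."""

    def __init__(self) -> None:
        self._predictions: List[torch.Tensor] = []
        self._labels: List[torch.Tensor] = []

    def update(self, predictions: torch.Tensor, labels: torch.Tensor) -> None:
        # Cheap per-batch bookkeeping: just keep references, no concatenation here.
        self._predictions.append(predictions.detach())
        self._labels.append(labels.detach())

    def compute(self) -> Tuple[torch.Tensor, torch.Tensor]:
        # One concatenation per compute interval instead of one per update step.
        preds = torch.cat(self._predictions)
        labels = torch.cat(self._labels)
        # The actual AUC math would consume (preds, labels) unchanged from here.
        return preds, labels


state = DeferredConcatState()
for _ in range(8):  # e.g. 8 update steps between compute calls
    state.update(torch.rand(16), torch.randint(0, 2, (16,)))
preds, labels = state.compute()
print(preds.shape, labels.shape)  # torch.Size([128]) torch.Size([128])
```

The trade-off is holding more small tensors between compute calls in exchange for a single large allocation per compute interval instead of one per batch.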

Differential Revision: D53027097

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 24, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Jan 24, 2024
…hrough flag (pytorch#1656)

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

@Andrew12Liu

Looking forward to the performance boost!

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Feb 1, 2024
… call (pytorch#1656)

Summary:

D51399384 was previously unlanded because the earlier version of this optimization caused a NCCL AllGather hang. This diff fixes the issue by moving the tensor concatenation to just before the collective is called. By doing this, the model outputs are transformed to be identical to what they were previously, so the flow (sync, collective call, compute) matches the previous AUC implementation, while we still avoid concatenating on every update and keep the significant gains.

**Why did the NCCL collective hang last time?**
There was an edge case where ranks could hold mismatched numbers of tensors. The way the torchmetrics sync and collective call is written, it issues one allgather per tensor in a list of tensors. Since the number of tensors in that list is not guaranteed to be the same across ranks, an allgather can end up being called for a tensor that does not exist on some rank.

Imagine 3 tensors on rank 0 and 2 tensors on rank 1. When allgather is called per tensor, the call for rank 0's 3rd tensor causes a NCCL hang and a subsequent timeout, since there is no corresponding tensor on rank 1.

An additional GPU unit test is added to cover this scenario: 2 tensors are passed in on rank 0 and 1 tensor on rank 1. With the concat before the collective, the test passes; without the concat, the test hangs.
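For illustration only, here is a minimal sketch (a made-up script using torch.distributed with the gloo backend, not the actual torchrec/torchmetrics sync path) contrasting the per-tensor allgather pattern that can hang with the concat-first pattern that issues exactly one collective per rank:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    # Assumed local setup for the sketch: gloo backend, fixed TCP port.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=rank, world_size=world_size
    )

    # Rank 0 has accumulated 2 update tensors, rank 1 only 1 -> mismatched list lengths.
    states = [torch.randn(4) for _ in range(2 if rank == 0 else 1)]

    # Hanging pattern: one collective per tensor in the list. Rank 0 would issue a
    # second allgather that rank 1 never joins, so the job hangs (left commented out).
    # for t in states:
    #     dist.all_gather([torch.empty_like(t) for _ in range(world_size)], t)

    # Concat-first pattern: every rank issues exactly one size exchange + one allgather.
    local = torch.cat(states)
    sizes = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
    dist.all_gather(sizes, torch.tensor([local.numel()]))
    max_len = int(max(s.item() for s in sizes))
    padded = torch.cat([local, local.new_zeros(max_len - local.numel())])
    gathered = [torch.empty(max_len) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    merged = torch.cat([g[: int(s.item())] for g, s in zip(gathered, sizes)])
    print(f"rank {rank}: gathered {merged.numel()} elements in total")

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```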

Differential Revision: D53027097
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Feb 23, 2024
… call (pytorch#1656)

Summary:

D51399384 was previously unlanded because the earlier version of this optimization caused a NCCL AllGather hang. This diff fixes the issue by moving the tensor concatenation to just before the collective is called. By doing this, the model outputs are transformed to be identical to what they were previously, so the flow (sync, collective call, compute) matches the previous AUC implementation, while we still avoid concatenating on every update and keep the significant gains.

**Why did the NCCL collective hang last time?**
There was an edge case where ranks could hold mismatched numbers of tensors. The way the torchmetrics sync and collective call is written, it issues one allgather per tensor in a list of tensors. Since the number of tensors in that list is not guaranteed to be the same across ranks, an allgather can end up being called for a tensor that does not exist on some rank.

Imagine 3 tensors on rank 0 and 2 tensors on rank 1. When allgather is called per tensor, the call for rank 0's 3rd tensor causes a NCCL hang and a subsequent timeout, since there is no corresponding tensor on rank 1. In the opposite case, where other ranks have more tensors than rank 0, the collective would not hang, but the extra tensors would not be gathered onto rank 0, resulting in an incorrect calculation. The unit test covers this case as well.

An additional GPU unit test is added to cover this scenario: 2 tensors are passed in on rank 0 and 1 tensor on rank 1 (and vice versa). With the concat before the collective, the test passes; without the concat, the test hangs.
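A hedged sketch of what such a test could look like (hypothetical structure and names; the real GPU unit test in torchrec differs and runs with NCCL):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int, counts, port: int) -> None:
    # counts[rank] = number of state tensors held on this rank, e.g. (2, 1) or (1, 2).
    dist.init_process_group(
        "gloo", init_method=f"tcp://127.0.0.1:{port}", rank=rank, world_size=world_size
    )
    states = [torch.ones(3) for _ in range(counts[rank])]
    local = torch.cat(states)  # concat before any collective is issued

    # Exchange lengths, pad to the max, gather once, then trim the per-rank padding.
    lengths = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
    dist.all_gather(lengths, torch.tensor([local.numel()]))
    max_len = int(max(n.item() for n in lengths))
    padded = torch.cat([local, local.new_zeros(max_len - local.numel())])
    gathered = [torch.empty(max_len) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    merged = torch.cat([g[: int(n.item())] for g, n in zip(gathered, lengths)])

    # Every rank should end up with all elements from both ranks: 3 * (2 + 1) = 9.
    assert merged.numel() == 3 * sum(counts), merged.numel()
    dist.destroy_process_group()


def test_uneven_state_counts() -> None:
    # Cover both directions: rank 0 holds more tensors, then rank 1 holds more.
    for port, counts in ((29501, (2, 1)), (29502, (1, 2))):
        mp.spawn(_worker, args=(2, counts, port), nprocs=2)


if __name__ == "__main__":
    test_uneven_state_counts()
```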

Reviewed By: dstaay-fb

Differential Revision: D53027097
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097
