AUC metric optionally move concat to compute step instead of update through flag #1656

Closed
wants to merge 1 commit

Conversation

zainhuda

Summary:
D51399384 was previously unlanded due to issues with some models. It is being introduced again, this time behind an experimental flag, since this optimization can benefit many models. It should not conflict with any models running the AUC metric. The feature flag is akin to fusing task computation in RecMetrics, which likewise does not work across all families of models.

The change in this diff:
We do not need to concatenate the state tensors on every update step. Since concatenation is an expensive operation (it creates a new tensor and allocates new memory on every call, because tensors are contiguous), we can instead perform the concatenation in the compute step, which only happens every compute_interval_step batches. This optimization should boost the performance of models using AUC with no regression in metric quality.
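As a rough illustration of the idea (a hypothetical, minimal sketch, not the actual torchrec AUC code; the class and method names here are made up), the metric state can be kept as a list of per-update tensors and only concatenated when compute runs:

```python
from typing import List, Tuple

import torch


class DeferredConcatState:
    """Toy sketch: append per-batch tensors in update(), concatenate in compute()."""

    def __init__(self) -> None:
        self._predictions: List[torch.Tensor] = []
        self._labels: List[torch.Tensor] = []

    def update(self, predictions: torch.Tensor, labels: torch.Tensor) -> None:
        # Cheap per-batch bookkeeping: just keep references, no concatenation here.
        self._predictions.append(predictions.detach())
        self._labels.append(labels.detach())

    def compute(self) -> Tuple[torch.Tensor, torch.Tensor]:
        # One concatenation per compute interval instead of one per update step.
        preds = torch.cat(self._predictions)
        labels = torch.cat(self._labels)
        # The actual AUC math would consume (preds, labels) unchanged from here.
        return preds, labels


state = DeferredConcatState()
for _ in range(8):  # e.g. 8 update steps between compute calls
    state.update(torch.rand(16), torch.randint(0, 2, (16,)))
preds, labels = state.compute()
print(preds.shape, labels.shape)  # torch.Size([128]) torch.Size([128])
```

The trade-off is holding more small tensors between compute calls in exchange for a single large allocation per compute interval instead of one per batch.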

Differential Revision: D53027097

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 24, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Jan 24, 2024
…hrough flag (pytorch#1656)

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

@Andrew12Liu

Looking forward to the performance boost!

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Feb 1, 2024
… call (pytorch#1656)

Summary:

D51399384 was previously unlanded because the earlier version of this optimization caused a NCCL AllGather hang. This diff fixes the issue by moving the tensor concatenation to just before the collective is called. By doing this, the model outputs are transformed to be identical to what they were previously, so the flow (sync, collective call, compute) matches the previous AUC implementation, while we still avoid concatenating on every update and keep the significant gains.

**Why did the NCCL collective hang last time?**
There was an edge case where ranks could hold mismatched numbers of tensors. The way the torchmetrics sync and collective call is written, it issues one allgather per tensor in a list of tensors. Since the number of tensors in that list is not guaranteed to be the same across ranks, an allgather can end up being called for a tensor that does not exist on some rank.

Imagine 3 tensors on rank 0 and 2 tensors on rank 1. When allgather is called per tensor, the call for rank 0's 3rd tensor causes a NCCL hang and a subsequent timeout, since there is no corresponding tensor on rank 1.

An additional GPU unit test is added to cover this scenario: 2 tensors are passed in on rank 0 and 1 tensor on rank 1. With the concat before the collective, the test passes; without the concat, the test hangs.
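For illustration only, here is a minimal sketch (a made-up script using torch.distributed with the gloo backend, not the actual torchrec/torchmetrics sync path) contrasting the per-tensor allgather pattern that can hang with the concat-first pattern that issues exactly one collective per rank:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank: int, world_size: int) -> None:
    # Assumed local setup for the sketch: gloo backend, fixed TCP port.
    dist.init_process_group(
        "gloo", init_method="tcp://127.0.0.1:29500", rank=rank, world_size=world_size
    )

    # Rank 0 has accumulated 2 update tensors, rank 1 only 1 -> mismatched list lengths.
    states = [torch.randn(4) for _ in range(2 if rank == 0 else 1)]

    # Hanging pattern: one collective per tensor in the list. Rank 0 would issue a
    # second allgather that rank 1 never joins, so the job hangs (left commented out).
    # for t in states:
    #     dist.all_gather([torch.empty_like(t) for _ in range(world_size)], t)

    # Concat-first pattern: every rank issues exactly one size exchange + one allgather.
    local = torch.cat(states)
    sizes = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
    dist.all_gather(sizes, torch.tensor([local.numel()]))
    max_len = int(max(s.item() for s in sizes))
    padded = torch.cat([local, local.new_zeros(max_len - local.numel())])
    gathered = [torch.empty(max_len) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    merged = torch.cat([g[: int(s.item())] for g, s in zip(gathered, sizes)])
    print(f"rank {rank}: gathered {merged.numel()} elements in total")

    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```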

Differential Revision: D53027097
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097

zainhuda pushed a commit to zainhuda/torchrec that referenced this pull request Feb 23, 2024
… call (pytorch#1656)

Summary:

D51399384 was previously unlanded because the earlier version of this optimization caused a NCCL AllGather hang. This diff fixes the issue by moving the tensor concatenation to just before the collective is called. By doing this, the model outputs are transformed to be identical to what they were previously, so the flow (sync, collective call, compute) matches the previous AUC implementation, while we still avoid concatenating on every update and keep the significant gains.

**Why did the NCCL collective hang last time?**
There was an edge case where ranks could hold mismatched numbers of tensors. The way the torchmetrics sync and collective call is written, it issues one allgather per tensor in a list of tensors. Since the number of tensors in that list is not guaranteed to be the same across ranks, an allgather can end up being called for a tensor that does not exist on some rank.

Imagine 3 tensors on rank 0 and 2 tensors on rank 1. When allgather is called per tensor, the call for rank 0's 3rd tensor causes a NCCL hang and a subsequent timeout, since there is no corresponding tensor on rank 1. In the opposite case, where other ranks have more tensors than rank 0, the collective would not hang, but the extra tensors would not be gathered onto rank 0, resulting in an incorrect calculation. The unit test covers this case as well.

An additional GPU unit test is added to cover this scenario: 2 tensors are passed in on rank 0 and 1 tensor on rank 1 (and vice versa). With the concat before the collective, the test passes; without the concat, the test hangs.
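A hedged sketch of what such a test could look like (hypothetical structure and names; the real GPU unit test in torchrec differs and runs with NCCL):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank: int, world_size: int, counts, port: int) -> None:
    # counts[rank] = number of state tensors held on this rank, e.g. (2, 1) or (1, 2).
    dist.init_process_group(
        "gloo", init_method=f"tcp://127.0.0.1:{port}", rank=rank, world_size=world_size
    )
    states = [torch.ones(3) for _ in range(counts[rank])]
    local = torch.cat(states)  # concat before any collective is issued

    # Exchange lengths, pad to the max, gather once, then trim the per-rank padding.
    lengths = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
    dist.all_gather(lengths, torch.tensor([local.numel()]))
    max_len = int(max(n.item() for n in lengths))
    padded = torch.cat([local, local.new_zeros(max_len - local.numel())])
    gathered = [torch.empty(max_len) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    merged = torch.cat([g[: int(n.item())] for g, n in zip(gathered, lengths)])

    # Every rank should end up with all elements from both ranks: 3 * (2 + 1) = 9.
    assert merged.numel() == 3 * sum(counts), merged.numel()
    dist.destroy_process_group()


def test_uneven_state_counts() -> None:
    # Cover both directions: rank 0 holds more tensors, then rank 1 holds more.
    for port, counts in ((29501, (2, 1)), (29502, (1, 2))):
        mp.spawn(_worker, args=(2, counts, port), nprocs=2)


if __name__ == "__main__":
    test_uneven_state_counts()
```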

Reviewed By: dstaay-fb

Differential Revision: D53027097
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D53027097
