[TIR] ThreadAllreduce warp-level primitive support with multi-warp #15327
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Force-pushed from 3f16761 to 035ac24
cc @masahi
Force-pushed from d811d87 to 0d58998
LGTM in general, left some suggestions.
Force-pushed from 7a07622 to 5efb770
Force-pushed from 5efb770 to bd3448e
@MasterJH5574 LGTM! I also encountered the same problem when I searched for reduce_sum on Ansor. Does your work also cover Ansor?
PR apache#15327 introduces warp-level primitive support for multi-warp allreduce. However, due to the two-stage shuffle-down implementation of allreduce in multi-warp scenarios, PR apache#15327 did not broadcast the allreduce result to every reduction thread. This behavior does not align with the semantics of allreduce and is not ideal for many use cases. Therefore, this PR completes the implementation by inserting a stage that writes the reduction result to shared memory, so that every reduction thread across all reduction warps can access it. This shared-memory write-back stage is only inserted in multi-warp allreduce cases; in single-warp allreduce, a `shfl_sync` is used to broadcast the reduction result across the reduction threads. Since in multi-warp settings we cannot leverage warp-level primitives to broadcast the value, we can only make use of shared memory. Numerical correctness was verified locally.
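For illustration, here is a hand-written CUDA sketch of the two broadcast strategies described above. It is not the code generated by the pass, and the names (such as `red_result`) are made up for the example.

```cuda
// Hand-written sketch of the two broadcast strategies described above
// (illustrative only; names such as `red_result` are not from the pass).

// Single-warp case: after the shuffle-down reduction, lane 0 holds the
// result, and a warp-level broadcast distributes it to every lane.
__device__ float broadcast_single_warp(float reduced_val) {
  const unsigned FULL_MASK = 0xffffffffu;
  return __shfl_sync(FULL_MASK, reduced_val, /*srcLane=*/0);
}

// Multi-warp case: warp-level primitives cannot cross warp boundaries, so
// thread 0 writes the final result to shared memory and every reduction
// thread reads it back after a barrier. All threads in the block must call
// this function so that the __syncthreads() is reached uniformly.
__device__ float broadcast_multi_warp(float reduced_val, float* red_result) {
  if (threadIdx.x == 0) {
    red_result[0] = reduced_val;  // reduced_val is only valid on thread 0 here
  }
  __syncthreads();
  return red_result[0];
}
```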
PR apache#15327 and apache#15373 introduced the multi-warp allreduce implementation. At the time, I tested numerical correctness with the workload of "taking a matrix of ones as input, computing the summation over each row". Both PRs passed this numerical test, but I didn't realize that the test is incomplete and cannot guarantee correctness: with an all-ones input, any reduction that combines the right number of elements produces the correct sum, no matter which lanes the values came from. The previous implementation has a bug that can be exposed by turning the input matrix from ones to random floating-point numbers. Therefore, this PR fixes the issue and adds numerical tests for multi-warp allreduce to `test_allreduce_cuda.py`. By removing some of the redundant tests in that file, we hope to reduce the testing time a bit while still guaranteeing correctness. Sorry for not testing the implementation completely before.
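As a rough illustration of the testing idea (random inputs checked against a host reference, instead of all ones), here is a standalone CUDA sketch. It is not the actual TVM test in `test_allreduce_cuda.py`; the kernel is a simple one-warp-per-row reduction chosen only to keep the example short.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// One warp per row: strided loads followed by a shuffle-down reduction.
__global__ void row_sum_warp(const float* in, float* out, int row_len) {
  int row = blockIdx.x;
  float val = 0.f;
  for (int i = threadIdx.x; i < row_len; i += 32) {
    val += in[row * row_len + i];
  }
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  if (threadIdx.x == 0) out[row] = val;
}

int main() {
  const int rows = 64, cols = 256;
  std::vector<float> h_in(rows * cols), h_out(rows), ref(rows, 0.f);

  // Random inputs (not all ones), so lane-mapping bugs would change the sums.
  for (float& x : h_in) x = static_cast<float>(rand()) / RAND_MAX;
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) ref[r] += h_in[r * cols + c];

  float *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(float) * rows * cols);
  cudaMalloc(&d_out, sizeof(float) * rows);
  cudaMemcpy(d_in, h_in.data(), sizeof(float) * rows * cols, cudaMemcpyHostToDevice);
  row_sum_warp<<<rows, 32>>>(d_in, d_out, cols);
  cudaMemcpy(h_out.data(), d_out, sizeof(float) * rows, cudaMemcpyDeviceToHost);

  // Compare against the host reference with a small relative tolerance.
  for (int r = 0; r < rows; ++r) {
    if (std::fabs(h_out[r] - ref[r]) > 1e-3f * ref[r]) {
      std::printf("mismatch at row %d: %f vs %f\n", r, h_out[r], ref[r]);
      return 1;
    }
  }
  std::printf("all row sums match the reference\n");
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```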
This PR enhances the implementation of the LowerThreadAllreduce pass.
Prior to this PR, for the CUDA backend we leveraged warp-level primitives only when
* the reducing threads form a sub-warp (i.e., of size 16, 8, 4, or 2), or
* the number of reducing threads is less than 32 and equals the reduction extent.
Under the requirement above, for reductions over a large number of threads (e.g., 128, 256, or more), the generated code is inefficient.
This PR improves the LowerThreadAllreduce pass so that, when the number of reducing threads is a multiple of the warp size, we generate more efficient CUDA code with the help of warp-level primitives.
Specifically, in such cases we first reduce the 32 elements within each warp and store each warp's result in shared memory. We then trigger a second round of warp-level reduction within the first warp to obtain the final reduction result.
In addition to using warp-level primitives, this approach also reduces the shared memory size: for example, even when reducing over 1024 threads, we now only need shared memory of size 32, compared with 1024 prior to this PR.
Tests are added to ensure correctness.
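To make the two-stage scheme concrete, below is a minimal hand-written CUDA sketch of the pattern described above. It is illustrative only, not the code TVM emits: the kernel name, the one-row-per-block mapping, and the assumption that `blockDim.x` is a multiple of 32 are choices made for the example.

```cuda
// Minimal sketch of the two-stage warp-level allreduce pattern described
// above (illustrative only; not the exact code generated by TVM).
// Assumes blockDim.x is a multiple of 32 and each block reduces one row.
__global__ void row_sum_multiwarp(const float* __restrict__ in,
                                  float* __restrict__ out, int row_len) {
  __shared__ float warp_results[32];  // one slot per warp; at most 32 warps per block

  const unsigned FULL_MASK = 0xffffffffu;
  int tid = threadIdx.x;
  int lane = tid & 31;
  int warp = tid >> 5;
  int num_warps = blockDim.x >> 5;

  // Each thread accumulates a partial sum over its strided slice of the row.
  float val = 0.f;
  for (int i = tid; i < row_len; i += blockDim.x) {
    val += in[blockIdx.x * row_len + i];
  }

  // Stage 1: shuffle-down reduction within each warp.
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(FULL_MASK, val, offset);
  }
  if (lane == 0) warp_results[warp] = val;  // lane 0 holds the warp's sum
  __syncthreads();

  // Stage 2: the first warp reduces the per-warp results.
  if (warp == 0) {
    val = (lane < num_warps) ? warp_results[lane] : 0.f;
    for (int offset = 16; offset > 0; offset >>= 1) {
      val += __shfl_down_sync(FULL_MASK, val, offset);
    }
    if (lane == 0) out[blockIdx.x] = val;
  }
}
```

Note that the per-warp buffer needs only 32 slots (a block has at most 32 warps), which matches the shared-memory saving described above.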