
[TIR] ThreadAllreduce warp-level primitive support with multi-warp #15327

Merged

Conversation

MasterJH5574
Contributor

This PR enhances the implementation of the LowerThreadAllreduce pass.

Prior to this PR, for the CUDA backend we leverage warp-level primitives only when

  • the reducing threads form a sub-warp (i.e., size 16, 8, 4, 2), or
  • the number of reducing threads is less than 32 and equals the reduction extent.

Under these requirements, for reductions with a large number of reducing threads (e.g., reducing over 128, 256, or more threads), the generated code is inefficient.

This PR improves the LowerThreadAllreduce pass so that, when the number of reducing threads is a multiple of the warp size, we now generate more efficient CUDA code with the help of warp-level primitives.

Specifically, in such cases we first reduce the 32 elements within each warp and store each warp's result in shared memory. We then trigger a second round of warp-level reduction within the first warp to obtain the final reduction result.

In addition to using warp-level primitives, this approach also reduces the amount of shared memory needed. For example, even when reducing over 1024 threads, we now only require shared memory of size 32, compared with 1024 prior to this PR.
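
For illustration, here is a minimal hand-written CUDA sketch of this two-stage pattern for a block of 256 reducing threads (8 warps). It is only a sketch of the idea, not the literal code generated by the pass; the kernel name `block_sum_kernel` and the fixed `NUM_THREADS` are assumptions made for the example.

```cuda
// Minimal sketch of the two-stage warp-level reduction pattern described above.
// Assumes blockDim.x == NUM_THREADS, a multiple of the warp size (32).
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define NUM_THREADS 256  // 8 warps; must be a multiple of WARP_SIZE

__global__ void block_sum_kernel(const float* in, float* out) {
  // Only NUM_THREADS / WARP_SIZE shared-memory slots are needed (8 here),
  // instead of one slot per reducing thread as before this PR.
  __shared__ float warp_sums[NUM_THREADS / WARP_SIZE];

  float val = in[blockIdx.x * NUM_THREADS + threadIdx.x];

  // Stage 1: shuffle-down reduction within each warp.
  for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xffffffff, val, offset);
  }
  // Lane 0 of each warp holds the warp's partial sum; store it to shared memory.
  if (threadIdx.x % WARP_SIZE == 0) {
    warp_sums[threadIdx.x / WARP_SIZE] = val;
  }
  __syncthreads();

  // Stage 2: the first warp reduces the per-warp partial sums.
  if (threadIdx.x < WARP_SIZE) {
    val = (threadIdx.x < NUM_THREADS / WARP_SIZE) ? warp_sums[threadIdx.x] : 0.0f;
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2) {
      val += __shfl_down_sync(0xffffffff, val, offset);
    }
    if (threadIdx.x == 0) {
      out[blockIdx.x] = val;  // final reduction result for this block
    }
  }
}

int main() {
  float *in, *out;
  cudaMallocManaged(&in, NUM_THREADS * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < NUM_THREADS; ++i) in[i] = 1.0f;
  block_sum_kernel<<<1, NUM_THREADS>>>(in, out);
  cudaDeviceSynchronize();
  printf("sum = %f\n", out[0]);  // expect 256 for all-ones input
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```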

Tests are added to ensure correctness.

@tvm-bot
Collaborator

tvm-bot commented Jul 15, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

MasterJH5574 force-pushed the tvm-dev/2023-07-15-multi-warp-allreduce branch from 3f16761 to 035ac24 on July 15, 2023 19:34
@tqchen
Member

tqchen commented Jul 15, 2023

cc @masahi

MasterJH5574 force-pushed the tvm-dev/2023-07-15-multi-warp-allreduce branch 3 times, most recently from d811d87 to 0d58998 on July 15, 2023 19:51
Member

@yzh119 left a comment


LGTM in general, left some suggestions.

MasterJH5574 force-pushed the tvm-dev/2023-07-15-multi-warp-allreduce branch 2 times, most recently from 7a07622 to 5efb770 on July 16, 2023 06:53
MasterJH5574 force-pushed the tvm-dev/2023-07-15-multi-warp-allreduce branch from 5efb770 to bd3448e on July 16, 2023 08:26
tqchen merged commit e25b1ba into apache:main on Jul 16, 2023
@MrJungle1

@MasterJH5574 LGTM! I also encountered the same problem when searching for reduce_sum with Ansor. Is your work also considered for Ansor?

MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 21, 2023
PR apache#15327 introduced warp-level primitive support for multi-warp
allreduce. However, due to the structure of the two-stage shuffle-down
reduction used for allreduce in multi-warp scenarios, PR apache#15327
did not broadcast the allreduce result to every reducing thread. This
behavior does not align with the semantics of allreduce and is not ideal
for many use cases. Therefore, this PR completes the implementation by
inserting a stage that writes the reduction result to shared memory, so
that every reducing thread across all the reduction warps can access it.

This shared-memory write-back stage is only inserted in multi-warp
allreduce cases. In single-warp allreduce, a `shfl_sync` is used to
broadcast the reduction result across the reducing threads. Since in
multi-warp settings we cannot leverage warp-level primitives to
broadcast the value, we have to go through shared memory.

The numerical correctness is verified locally.
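
For illustration, a minimal CUDA sketch of the added write-back stage is shown below. This is my own sketch rather than the literal generated code; the kernel name `block_allreduce` and variable names are assumed, and the block size is assumed to be a multiple of 32.

```cuda
// Sketch of multi-warp allreduce with the shared-memory write-back broadcast.
#include <cuda_runtime.h>

__global__ void block_allreduce(const float* in, float* out) {
  __shared__ float warp_sums[32];   // per-warp partial results (at most 32 warps)
  __shared__ float reduce_result;   // broadcast slot added by this follow-up

  float val = in[blockIdx.x * blockDim.x + threadIdx.x];

  // Two-stage shuffle-down reduction as introduced in PR 15327 (abridged).
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  if (threadIdx.x % 32 == 0) warp_sums[threadIdx.x / 32] = val;
  __syncthreads();
  if (threadIdx.x < 32) {
    val = (threadIdx.x < blockDim.x / 32) ? warp_sums[threadIdx.x] : 0.0f;
    for (int offset = 16; offset > 0; offset /= 2)
      val += __shfl_down_sync(0xffffffff, val, offset);
    // Write-back stage: store the final result so all warps can read it.
    if (threadIdx.x == 0) reduce_result = val;
  }
  __syncthreads();

  // Every reducing thread across all warps now observes the same result,
  // matching allreduce semantics.
  out[blockIdx.x * blockDim.x + threadIdx.x] = reduce_result;
}
```

In the single-warp case this extra shared-memory round trip is unnecessary, since a single `shfl_sync` from lane 0 can broadcast the value within the warp.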
tqchen pushed a commit that referenced this pull request Jul 22, 2023
…15373)

junrushao pushed a commit to junrushao/tvm that referenced this pull request Jul 24, 2023
…pache#15327)

MasterJH5574 added a commit to MasterJH5574/tvm that referenced this pull request Jul 25, 2023
PR apache#15327 and apache#15373 introduced the multi-warp allreduce
implementation. At the time of the introduction, I tested the numerical
correctness with the workload "take a matrix of ones as input and compute
the sum over each row". Both PRs passed this numerical test, but I did
not realize that the test is incomplete and cannot guarantee correctness.

The previous implementation has a bug, which can be exposed by switching
the input matrix from ones to random floating-point numbers.

Therefore, this PR fixes the issues and adds numerical tests for
multi-warp allreduce to `test_allreduce_cuda.py`. By removing some of the
redundant tests in that file, we hope to reduce the testing time a bit
while still guaranteeing correctness.

Sorry for not testing the implementation completely before.
tqchen pushed a commit that referenced this pull request Jul 25, 2023
junrushao pushed a commit to junrushao/tvm that referenced this pull request Jul 27, 2023
…pache#15327)

junrushao pushed a commit to junrushao/tvm that referenced this pull request Jul 30, 2023
…pache#15327)
