Question and suggestions about custom allreduce #2918
Unanswered
leizhao1234 asked this question in Q&A
Replies: 3 comments 5 replies
-
Block 0 is always the first block started, and each rank's data is already in global memory (via cudaMemcpy).
-
I think that when copy_mode is true, it still needs a block barrier without a fence, but when copy_mode is false, a multi_gpu_barrier without a memory fence is sufficient.
-
What is the meaning of copy_mode?
-
In TensorRT-LLM's twoshot allreduce there is one multi_gpu_barrier and one block_barrier, whereas SGLang's twoshot allreduce appears to use two block_barriers.
I noticed that the first block barrier does not use a memory fence, similar to vLLM. However, does the first barrier need to be at the block level? I believe a multi_gpu_barrier without a memory fence would suffice. Here is my code:
```cuda
inline __device__ void multi_gpu_barrier(uint32_t** signals, uint32_t const flag, size_t const local_rank,
    size_t const world_size, int const tidx, int const bidx)
{
    // After this function, at least one block in each GPU has reached the barrier
    if (tidx < world_size)
    {
        // we can think of signals having the shape [world_size, world_size]
        // Dimension 0 is the "listening" dimension, dimension 1 is "emitting" dimension
        // ... (rest of the body not shown in the post)
    }
}

inline __device__ void block_barrier(uint32_t** signals, uint32_t const flag, size_t const local_rank,
    size_t const world_size, int const tidx, int const bidx, int const grid_size)
{
    __syncthreads();
    // After this function, the block of id == bidx of each GPU has reached the barrier
    if (tidx < world_size)
    {
        // we can think of signals having the shape [world_size, 2, num_blocks, world_size]
        // (+ an offset on dim 2 to account for flags used in multi_gpu_barrier)
        // Dimension 0 is the "listening" dimension, dimension 3 is "emitting" dimension
        // ... (rest of the body not shown in the post)
    }
}
```
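For reference, the flag layout described in the code comments above can be sketched on the host in plain C++. This is only a sketch under assumptions: the helper names (`signal_offset`, `multi_gpu_offset`, `block_offset`), the half selected by `flag % 2`, and the choice of slot 0 for `multi_gpu_barrier` are illustrative and are not taken from the TensorRT-LLM or SGLang sources.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat offset of one flag inside a single listener's
// signal buffer, i.e. inside signals[listener].
// Assumed per-listener layout: [2 (flag parity), num_blocks + 1, world_size].
// Slot 0 on the middle dimension is reserved for multi_gpu_barrier and
// slots 1..num_blocks belong to block_barrier (one slot per block index).
inline std::size_t signal_offset(unsigned flag, std::size_t slot, std::size_t emitter,
                                 std::size_t num_blocks, std::size_t world_size)
{
    assert(slot <= num_blocks);
    assert(emitter < world_size);
    // Alternate between two halves so consecutive barriers use different flags.
    std::size_t const parity = flag % 2;
    return (parity * (num_blocks + 1) + slot) * world_size + emitter;
}

// Flag written by block 0 of rank `emitter` for the GPU-level barrier.
inline std::size_t multi_gpu_offset(unsigned flag, std::size_t emitter,
                                    std::size_t num_blocks, std::size_t world_size)
{
    return signal_offset(flag, 0, emitter, num_blocks, world_size);
}

// Flag written by block `bidx` of rank `emitter` for the block-level barrier.
inline std::size_t block_offset(unsigned flag, std::size_t bidx, std::size_t emitter,
                                std::size_t num_blocks, std::size_t world_size)
{
    return signal_offset(flag, 1 + bidx, emitter, num_blocks, world_size);
}
```

With `num_blocks = 4` and `world_size = 8`, `block_offset(0, 0, 0, 4, 8)` is 8, so under this assumed layout the block-level flags start right after the eight flags reserved for the GPU-level barrier.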
I think oneshot only requires one multi_gpu_barrier, while twoshot requires both one multi_gpu_barrier and one block_barrier.
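To make the two synchronization points concrete, here is a minimal host-side model of a twoshot allreduce (sum) in plain C++. It is a sketch, not the TensorRT-LLM or SGLang kernel: the function name `two_shot_allreduce` is hypothetical, ranks are simulated sequentially, and the barriers appear only as comments marking where they would sit. The remark on the second barrier assumes that block b of a rank only reads data written by block b of its peers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of a two-shot allreduce (sum) over `inputs[rank][element]`.
// Assumes every rank holds the same element count, divisible by the rank count.
std::vector<std::vector<float>> two_shot_allreduce(std::vector<std::vector<float>> const& inputs)
{
    std::size_t const world_size = inputs.size();
    assert(world_size > 0);
    std::size_t const n = inputs[0].size();
    assert(n % world_size == 0);
    std::size_t const chunk = n / world_size;

    // Barrier 1 would go here: every rank's input must be in place before
    // any peer starts reading it.

    // Shot 1 (reduce-scatter): rank r sums chunk r over all ranks' inputs.
    std::vector<std::vector<float>> partial(world_size, std::vector<float>(n, 0.0f));
    for (std::size_t r = 0; r < world_size; ++r)
        for (std::size_t i = r * chunk; i < (r + 1) * chunk; ++i)
            for (std::size_t peer = 0; peer < world_size; ++peer)
                partial[r][i] += inputs[peer][i];

    // Barrier 2 would go here: a reduced chunk must be complete and visible
    // before a peer copies it. Under the assumption stated above, waiting on
    // the matching block of each peer is enough.

    // Shot 2 (all-gather): every rank copies chunk p from rank p.
    std::vector<std::vector<float>> outputs(world_size, std::vector<float>(n, 0.0f));
    for (std::size_t r = 0; r < world_size; ++r)
        for (std::size_t p = 0; p < world_size; ++p)
            for (std::size_t i = p * chunk; i < (p + 1) * chunk; ++i)
                outputs[r][i] = partial[p][i];

    return outputs;
}
```

A oneshot variant would have every rank sum all chunks itself right after the first barrier, which is why it has no second synchronization point in this model.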