-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
streamk v0.1 #619
streamk v0.1 #619
Conversation
Can you write a README to introduce the features implemented in this version of the streamK kernel? |
What does this mean? |
In this version, stream-k kernel use the persistent loop so that a WG may work on multiple output tiles, and also allowing workgroups to do part of the work for an output tile. |
But it uses atomics right? Did you mean non atomic as in does not do atomic add? |
yeah, my description is not precise, we still use atomics for spinning lock, but not atomic_add for the final output. |
done |
@@ -0,0 +1,43 @@ | |||
# streamk gemm script v0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be needed to get it to 1.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need made it ready to explore half million benchmarks, and have a comparable performance with Tensile development
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can have it comparable to tensile because that is outside of the scope of streamk. I think we can call this 0.1 until we have the wider tuning space working.
acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=acc_dtype) | ||
for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)): | ||
if EVEN_K: | ||
a = tl.load(A_BASE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we peel the masking for the last iteration when EVEN_K is False
so that only the last loop pays the price of the mask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as discussed, this will be in next PR. Thanks !
|
||
|
||
@triton.jit() | ||
def get_new_pid(current_pid, num_sms): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/s/num_sms/num_cus
# Number of XCDs | ||
num_xcds = 8 | ||
# Number of pids per XCD in the new arrangement | ||
pids_per_xcd = num_sms // num_xcds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the grid can have multiple of num_cus pids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For persistent kernel, grid has to be either num_cus or total_tiles if total_tiles < num_cus
* streamk v0.1 * remove unused variable * fix format issues * add README * fix format issue * change num_sms to num_cus
Triton stream-k gemm v0.1