[Feature]: Compiler option for expanding s_waitcnt instructions or don’t merge them in the first place #67
Comments
Examples
Case 1 : Single Counter
The following code snippet is extracted from a kernel in the Rodinia benchmark suite. A PC sample at line 13 tells us that we are waiting on instructions that increment lgkmcnt, but not which load we are waiting on. If we expand the single s_waitcnt instruction into three s_waitcnt instructions waiting for 2, 1, and 0 respectively, we should be able to tell from the PC samples which of these loads might be the bottleneck. Note that if load 1 is a high-latency load, its exposed latency will shadow some of the latency of the subsequent loads. If we observe waiting for load 1 and no waiting for loads 2 and 9, significantly reducing the latency of load 1 might just expose latency waiting for load 2 and/or load 9.
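The shadowing effect described in Case 1 can be illustrated with a small Python sketch. It is a toy model, not a description of the hardware: it assumes the loads return strictly in issue order, and the issue times, latencies, and the function name `stalls_at_expanded_waits` are all hypothetical values chosen for illustration.

```python
def stalls_at_expanded_waits(issue_times, latencies, wait_start):
    """Return the stall (in cycles) seen at each expanded wait, oldest load first.

    Toy model: loads complete in issue order, and the first expanded
    s_waitcnt is reached at cycle `wait_start`.
    """
    stalls = []
    now = wait_start
    prev_done = 0
    for issued, latency in zip(issue_times, latencies):
        # In-order return: a load cannot complete before the one ahead of it.
        done = max(prev_done, issued + latency)
        stalls.append(max(0, done - now))
        now = max(now, done)
        prev_done = done
    return stalls


# Hypothetical numbers: three loads issued back to back.
slow_first = stalls_at_expanded_waits([0, 1, 2], [500, 300, 300], wait_start=10)
fast_first = stalls_at_expanded_waits([0, 1, 2], [100, 300, 300], wait_start=10)
print(slow_first)  # [490, 0, 0]
print(fast_first)  # [90, 201, 1]
```

In the first run all of the waiting is attributed to the oldest load, because its latency hides that of the two loads behind it. Once its latency is reduced in the second run, the stall moves to the next load instead of disappearing.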
Case 2 : Independent Counters
The next example here is a little more complicated / advanced.
Case 3 : Overlapping Counters
The last example we have here is extracted from the QuickSilver application. Here the stores from line 3 to line 14 increase vmcnt, and the flat load instruction at line 36 increases both lgkmcnt and vmcnt. We are not entirely sure what the right way to expand the s_waitcnt instruction at line 38 is, but it would be nice to do it in a way that lets us tell whether we are waiting for an instruction from line 3 to line 14, or for the one at line 36. According to the document of
Tracking internally as [SWDEV-448069].
llvm#79236 overlaps with this.
@bbiiggppiigg For the 3 examples provided, could you pls (1) list the desired code with s_waitcnt inserted at desired places with the appropriate counters, and (2) list the bitcode (.ll) ? Thanks!
@jwanggit86
An example for Case 2 would be
For case 3, as we've mentioned, we are not sure what is the right expansion order.
As for the bitcode, I tried -emit-llvm to generate LLVM bitcode and llvm-dis to disassemble it into LLVM IR, but I don't see any waitcnt instructions in it.
Hi, can you check if #79236 implements what you are looking for?
Hi @bbiiggppiigg, have you been able to check whether the compiler option added in llvm#79236 is sufficient for your needs?
Hi, based on my understanding, this option seems to be an improvement over the -amdgpu-waitcnt-forcezero option, in that a waitcnt instruction is only inserted after memory instructions? I don't believe it is sufficient. What we are looking for is a way to relate a PC sample on a waitcnt instruction to the specific memory instruction that we are waiting for. The waitcnt instructions should not be placed directly after every memory instruction but right before the use of the loaded value (or at the original position determined by the compiler), so that the latency of the memory access can be hidden by computation. Assume that we get a PC sample that lands on a s_waitcnt vmcnt(0).
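A minimal Python sketch of the attribution we have in mind, assuming the waits have been expanded and that the counter retires operations in issue order (which does not hold for every instruction mix). The function names and the op labels are hypothetical; they are not part of any existing tool.

```python
def op_waited_on(outstanding_ops, wait_value):
    """Return the op an expanded wait is blocked on, or None if the wait is a no-op.

    `outstanding_ops` lists the in-flight ops for one counter, oldest first.
    A wait for value v finishes once at most v ops remain in flight, so it is
    blocked on the newest op that must complete.
    """
    if wait_value < 0:
        raise ValueError("wait_value must be non-negative")
    index = len(outstanding_ops) - wait_value - 1
    return outstanding_ops[index] if index >= 0 else None


def attribute_samples(outstanding_ops, samples_by_wait_value):
    """Map PC-sample counts keyed by wait value onto the ops being waited on."""
    attributed = {}
    for wait_value, count in samples_by_wait_value.items():
        op = op_waited_on(outstanding_ops, wait_value)
        if op is not None:
            attributed[op] = attributed.get(op, 0) + count
    return attributed


ops = ["load A", "load B", "load C"]  # hypothetical labels, oldest first
print(op_waited_on(ops, 0))  # load C
print(attribute_samples(ops, {2: 40, 1: 3, 0: 7}))
# {'load A': 40, 'load B': 3, 'load C': 7}
```

With a single merged wait on 0, every sample lands on one instruction and this mapping cannot be made; with the expanded waits, each sample points at one memory instruction.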
Suggestion Description
From the kernels we have studied, it is not uncommon to see the use of a single waitcnt instruction to wait for multiple load instructions to finish loading their values.
Once the PC-sampling feature is enabled on these kernels, we expect to see a non-negligible number of PC samples reported at the waitcnt instructions.
In order to figure out which load instruction might be a/the bottleneck, it would be nice to have a compiler option that expands a single waitcnt instruction for value N into a series of waitcnt instructions with decreasing values N+k-1, N+k-2, …, N, when the waitcnt instruction is waiting for k loads to complete.
Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero that adds an s_waitcnt(0) after every instruction, as we still want to hide the memory load latency with compute instructions as much as possible.
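The requested expansion can be sketched in Python as follows. This is only an illustration of the transformation, not compiler code: `expand_waitcnt` is a hypothetical helper, it handles a single counter, and it does not model the hardware limits on counter values. It emits one wait per awaited load, ending at N, so the three loads of Case 1 give waits on 2, 1, and 0.

```python
def expand_waitcnt(counter, n, k):
    """Expand one wait for `counter` <= n, covering k loads, into k waits.

    The emitted values decrease from n + k - 1 down to n, so each wait
    corresponds to one more load having completed.
    """
    if counter not in ("vmcnt", "lgkmcnt", "expcnt"):
        raise ValueError(f"unknown counter: {counter}")
    if n < 0 or k < 1:
        raise ValueError("n must be >= 0 and k must be >= 1")
    return [f"s_waitcnt {counter}({value})" for value in range(n + k - 1, n - 1, -1)]


for line in expand_waitcnt("lgkmcnt", 0, 3):
    print(line)
# s_waitcnt lgkmcnt(2)
# s_waitcnt lgkmcnt(1)
# s_waitcnt lgkmcnt(0)
```

The waits stay at the position of the original s_waitcnt, so the loads can still overlap with compute, unlike with -amdgpu-waitcnt-forcezero.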
Operating System
No response
GPU
MI200 / MI250 / MI300
ROCm Component
No response