Slow fine rasterization (k4) on Adreno 640 #83
I think you should at least bring 22507de in; the old clip stack has really high register pressure. Also, what's the current ALU capacity utilization (%)? On desktop GPUs this should be close to 80-90%, so that would be the goal for efficiency. For profiling, having access to the compiler output (ISA) would be quite helpful, although I think qcom's stack is too proprietary so they wouldn't provide that.
Rebasing to master (which brings in 22507de among other things) yields continued slow performance (seemingly worse than the branch I was working on). Similarly, commenting out the begin and end clip commands gives a dramatic improvement in performance (but still not as good as the snapshot). This is interesting and somewhat unexpected, as I would have expected the clip_stack[] itself to be significant. It's not entirely surprising, though, as one of the experiments was to increase the clip stack size, with the intent of forcing the compiler to spill it rather than consume registers, and that did not significantly change performance. At least we have more variables to play with. The "% Shader ALU Capacity Utilized" metric reported by AGI is around 6% for CHUNK=4 with the clip stack in place, using #82 as the measurement baseline. It's about double that with CHUNK=1 and no clip stack.
If the clip stack is such a bottleneck, I'll mention #77 as a potential optimization, both for the global memory traffic and for avoiding the sRGB conversions. It doesn't reduce register pressure, though.
I haven't exactly experimented with #77, but did tests similar to it, without encouraging results. I'm finding this issue quite challenging, but want to approach it as systematically as possible. One approach is to see if I can take a look at the asm. Another is to continue treating the GPU as a black box, and build up a synthetic workload that gradually starts to resemble the real kernel4, watching at each step to see where the performance cliff is.
I've spent a considerable amount of time investigating, and have some insight but not complete understanding. Perhaps more important, there is a way forward. Something is causing shader compilation to produce extremely deoptimized code for current master kernel4. By isolation, I find a single two-line fragment of code that can be commented out: the two write_mem calls (the clip stack writes).

It's not low numbers of workgroups concurrently scheduled (occupancy) due to register pressure, though there is a difference in occupancy. With CHUNK=1, a maximum of 18 workgroups get scheduled in the happy case, 14 in the sad. (I instrumented the code with atomics to count the number of workgroups in flight.) With CHUNK=2, these numbers are 30/24, and with CHUNK=4, 32/32. These differences are nowhere near enough to explain the performance gap.

It also appears not to be register spilling, based on the Read Total and Write Total counters from AGI; the memory traffic per pixel rendered doesn't vary substantially between runs. The fact that the performance gap is worse with CHUNK=1 than with larger CHUNK values is also not consistent with spilling. We can estimate register usage with the Radeon shader analyzer (accessed through the shader playground). Using the gfx900 asic, with CHUNK=1, vgpr used is 21, allocated is 24. At CHUNK=2, 30/38, and at CHUNK=4, 46/62. The 6xx wiki linked above suggests that <= 32 poses no problems whatsoever for occupancy. One possibility, not completely ruled out, is that memory traffic from spilling is not properly reported by AGI. (We could test this with an experiment that forces spilling.)

I'd still be very curious to look at the asm to see what's so deoptimal. It is possible to build Mesa for Android, and it is possible to get that to cough up disassembly, but I haven't done this (it seems to involve some tedious mucking about with devices and building). In addition, this might not help very much, as the freedreno driver may well have different performance characteristics.

Now for the good news. Given the focus on clip memory traffic, I tried the approach of #77, and it seems to perform reasonably well at all CHUNK values. I'm now strongly leaning toward adopting that, though I think there is more discussion to be had about the best way to handle the clip/blend stack. Obviously one shortcoming is that the size of the clip stack is hardcoded, but in my experiments even a large value (256) doesn't seem to affect performance - clearly the clip stack itself is spilled and doesn't consume vgprs, which is good overall.

We could in theory escalate the deoptimized shader compilation results to the GPU driver vendor, but at this point I'm not very eager to do that, as I think we have a good approach going forward. I am somewhat worried about some other future development triggering a similar performance cliff, but I figure we can cross that bridge when we get to it - another reason to push forward with the complete imaging model.
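For reference, the kind of atomic occupancy instrumentation described above can be done with a tiny counter buffer; here is a minimal standalone sketch (the binding number, names, and structure are assumptions, not the actual instrumentation used):

```glsl
// occupancy_probe.comp - standalone sketch, not the actual piet-gpu instrumentation.
#version 450
layout(local_size_x = 64) in;

// Binding number and names are assumptions for this sketch.
layout(std430, set = 0, binding = 0) buffer OccupancyCounters {
    uint in_flight;     // workgroups currently resident
    uint max_in_flight; // high-water mark, read back on the CPU
};

void main() {
    if (gl_LocalInvocationIndex == 0) {
        uint n = atomicAdd(in_flight, 1u) + 1u;
        atomicMax(max_in_flight, n);
    }
    barrier();

    // ... the real kernel4 work would go here ...

    barrier();
    if (gl_LocalInvocationIndex == 0) {
        atomicAdd(in_flight, 0xFFFFFFFFu); // decrement via unsigned wraparound
    }
}
```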
Why not a hybrid, as described in #77 (comment): Create two sets of clip commands, one using global memory and one using thread local memory? Is it because the mere existence of the two write_mem calls causes the performance cliff, not their execution?
Yes, it is the mere existence of the instructions here; the workload has been selected not to contain any clip commands. Btw, another clue (and another set of experiments I did but didn't report) is that whatever is causing the performance cliff, the main effect seems to be making memory loads very slow; when adding a synthetic ALU-intensive load, the ALU operations don't seem significantly slower in the sad case. This is also consistent with the variation in response to CHUNK - each subgroup within a workgroup reads its input once, so the total amount of memory read should scale as 4/CHUNK. Meanwhile, the total amount of ALU is roughly the same regardless of CHUNK. In the happy case, the total time is fairly consistent wrt CHUNK, suggesting a mostly-ALU workload, while in the sad case it seems almost directly proportional to the expected amount of memory read.
How unfortunate. Did you get to try Snapdragon Profiler? It's the platform profiler for Adreno chipsets, and may give more detailed information than the generic Android one.
An interesting experiment would be to keep the write_mems, but replace their target buffer. That is, add a separate scratch buffer binding and write to that instead.
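A sketch of what such a scratch target might look like in kernel4.comp (the binding number and names are illustrative guesses, not the original snippet):

```glsl
// Hypothetical scratch "sink" added to kernel4.comp. The binding number must
// not collide with the existing descriptor set layout; the names are made up.
layout(std430, set = 0, binding = 5) buffer ClipScratch {
    uint clip_scratch[];
};

// Then, in the clip handling, redirect the writes, e.g.:
//     clip_scratch[clip_offset + i] = value;
// instead of issuing write_mem into the shared memory buffer.
```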
I did try to use it, but ran into the same layer problem as AGI 1.0 (google/agi#760). My understanding is that the performance counters it exposes are basically the same as AGI. It's possible I could track down what's going wrong, and would do so if I had evidence I would be able to capture higher quality information from the tool.
FWIW, I don't get any noticeable speedup by removing the write_mems (or read_mems) on my Pixel 1. However, I do gain a lot of performance by removing the sRGB conversion in fillImage. Again, the mere existence of the line is enough; I removed all FillImage commands from the scene.
This will be something to watch, and I suspect we're going to be running into variants of it. My best guess is that we're running into a heuristic in the compiler, perhaps triggered by register pressure, perhaps code size, perhaps something else we don't understand yet. It may be that the ubershader approach doesn't scale on this class of hardware/drivers. This is a reason to build the entire imaging model, so we know what we're dealing with. I also had another thought about how to gain more insight here: it may be worthwhile trying to run the shader in OpenGL ES rather than Vulkan, and use glGetProgramBinary.
Doesn't vkGetPipelineCacheData include the shader program binary? Anyway, I hacked a glGetProgramBinary-as-a-service into a Gio program (the kitchen apk). It works on my Pixel 1 and on my Fedora machine with the Intel GPU.
Fascinating, and thanks! I didn't realize the pipeline cache data could potentially be used to get the program binary. I tried it and was able to recover a file (k4_pipeline), but I'm not sure how useful it is; it seems pretty obfuscated, or at least compressed. I was also able to use the kitchen apk (thanks!), and the result of that looks like it might be more tractable. It clearly has some symbol names at the end, and the bulk of it is low entropy, suggesting executable machine code. I'll see if I can get envytools to produce something intelligible.
Ok, I got envytools to run and have substantially more insight. What's attached is disassembler output from kernel4 in #82, and a similar version with the bodies of Cmd_BeginClip and Cmd_EndClip commented out.

In any case, looking at the asm, there is indeed a difference. In the k4_noclip version (happy path), all reading from the input buffers is done with the isam instruction. But in k4_clip (sad path), while some of the reads are still isam, others have become ldib.

Doing a little reading in the freedreno code, ldib is a cat6 instruction new to the Adreno 6xx series. There's a comment on emit_intrinsic_load_ssbo with some info, and there's also a doc string in the xml for the instruction that reads, cryptically, "LoaD IBo" (IBO = "index buffer object" in GL-speak). I didn't do quite as much study of the isam instruction.

So, what we know with reasonably high confidence: there's some heuristic the compiler is using to select between isam and ldib for buffer loads, and on this workload the ldib path is dramatically slower. I've skimmed over the rest of the shader code and don't see anything unexpected or very different between the two. Register usage seems comparable. There are of course lots of opportunities for experimental error here, but I'm reasonably confident I'm looking at the actual source of the slowdown.
fwiw, IBO is actually "Image Buffer Object" (I kinda made that name up.. it is what is used for both Images and SSBOs)
Ahh, interesting. That would certainly explain why writes to memory are the difference between the two cases. It doesn't explain the sRGB conversion issue observed by Elias above, but it might be something similar. This is making me wonder whether we should be more rigorous in separating read-only memory access patterns from read-write, and annotating the buffers with readonly where we can. In any case, thanks much for the insight, it's very helpful and really tells us where to look.
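For concreteness, the kind of annotation being discussed looks roughly like this in GLSL; the block names and bindings below are illustrative, not piet-gpu's actual layout:

```glsl
// A buffer this stage only reads: marking it readonly (and restrict, if it is
// never aliased by another binding) is what might let the compiler keep using
// the cached load path.
layout(std430, set = 0, binding = 1) restrict readonly buffer PtclBuf {
    uint ptcl[];
};

// A buffer the stage both reads and writes keeps full read/write semantics.
layout(std430, set = 0, binding = 2) restrict buffer ClipBuf {
    uint clip_buf[];
};
```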
Yes,
Improves kernel4 performance for a Gio scene from ~22ms to ~15ms. Updates linebender#83 Signed-off-by: Elias Naur <[email protected]>
I've added a commit with the mediump change (see the commit message referenced above).
I did a little experimentation with adding mediump qualifiers, but didn't see a comparable improvement on my Pixel 4. I'm also wondering whether VK_KHR_shader_float16_int8 is needed for the precision qualifiers to have any effect.
I'm not so sure. Gio's shader compilation pipeline goes through SPIR-V: piet-gpu shaders => SPIR-V => GL ES 3.1, and precision qualifiers survive the conversions without mentioning VK_KHR_shader_float16_int8 anywhere. My guess is that VK_KHR_shader_float16_int8 is for explicit float16 and uint8 etc. types, whereas precision qualifiers are merely hints. And my guess is that the Pixel 1 GPU is just register/ALU limited, whereas your Pixel 4 is less so.
I agree that the issue you're seeing is caused by something else. I brought up the mediump fix to eliminate the sRGB code as a culprit.
Useful for debugging shader compiler issues, such as those that may cause linebender/vello#83 Signed-off-by: Elias Naur <[email protected]>
Did you have any luck adding a second, read-only binding for the memory buffer and then assigning both bind points 0 and 1 to the same underlying buffer? I experimented with the above, but my Pixel 1 does not have a performance cliff similar to your Pixel 4, presumably because ldib is new to the Adreno 6xx series and the Pixel 1's GPU is older.
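A sketch of that aliased dual-binding setup (names and binding numbers are illustrative; on the host side, both descriptors would point at the same VkBuffer):

```glsl
// Binding 0 keeps read/write semantics for the clip/blend writes; binding 1 is
// declared readonly for the ptcl/command reads. Both descriptors would be
// pointed at the same underlying buffer on the host side.
layout(std430, set = 0, binding = 0) restrict buffer Memory {
    uint memory_rw[];
};
layout(std430, set = 0, binding = 1) restrict readonly buffer MemoryRO {
    uint memory_ro[];
};
// Note: restrict on two bindings that alias the same buffer is dubious; see
// the caveat in the next comment.
```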
note that binding the same buffer twice with the restrict keyword might result in undefined behavior.. ie. reads could be a TPL1 cache hit with stale data from before the write
The intention is to read and write dependent values from the same buffer. That is, clip state reads and writes will use the read/write memory buffer binding, while commands and path segments will be read through the read-only binding.
I suppose that could work, provided the read/write and read-only sections don't share a cache line.. but not sure I'd call that portable. Is it possible to separate the r/o data into a separate buffer? This wouldn't depend so much on details of hw implementation.
I think the solution here (which I have working locally, not yet a PR) is to port a version of #77, so there is no explicit writing to buffers at all in kernel4, only the image write to the output buffer.

Note that similar issues almost certainly exist for other pipeline stages, where the code currently does reading and writing from a single unified buffer, but the actual input regions are a read-only pattern. The performance lossage doesn't seem to be as serious, though I haven't done careful measurement yet. I'm not sure what the solution is. One approach is to revert the approach of 4de67d9 and have separate memory buffers. That would have the disadvantage of the stages not being able to share a free memory pool. Another approach is to reliably signal to the shader compiler that relaxed memory semantics are desired, but I'm not sure there is any way to do that other than fast-forwarding to a world in which the Vulkan memory model is widely adopted and there are performance tests in place to ensure that the optimizations it enables are actually exploited.

I haven't read up on "restrict" yet, but based on my knowledge of the analogous keyword in C, I am skeptical that it is a reliable, portable solution. The description in the GLSL wiki is so vague and non-normative that it seems close to useless to my eyes. The spec language is slightly better but I don't yet fully understand it.
My read: the optimization is sensible, but the guarantees are too weak to rely on across drivers. By contrast, going for a segmented, 80286-style approach to data is a pattern that is well represented by existing GPU workloads and is likely to be optimized reliably, even though a big linear 386-style memory model is very tempting, both for programmer ergonomics and because a unified free memory pool makes it less likely for an allocation to fail even when there is plenty of free space in another buffer.
Some additional insights on this: having well segmented buffers also helps desktop GPUs. Raph, you almost predicted this in your blog post :)
The thing is that AMD GPUs have two kinds of memory caches: a scalar cache and a vector cache. The former is used for uniform loads and has a smaller cache line size, while the latter is used for everything else and typically has a 128-byte cache line. The caveat is that these two kinds of cache are not coherent with each other. So if the compiler cannot prove that your accesses don't alias, you end up using vector memory instructions everywhere instead. The problem with spilling writes therefore applies to AMD GPUs as well, and we basically waste some capacity by not utilizing the scalar cache and the scalar memory controller. The effect is not as catastrophic as on Adreno, but it probably makes a sizable difference (preliminary testing shows 950us -> 850us for k4). So that's another strong motivation toward a segmented buffer model. Thanks to the Mesa developers for helping me pinpoint this; working with Mesa has always been a joy :)
Can we get away with a single backing buffer bound to, say, separate readonly and writeonly bindings in the shader? I haven't dug into the specs, but it seems to me that should free the compiler to use non-coherent memory accesses.
Wow, I was looking at AMD asm and wondering why there were vector instructions even for scalar memory access. I assumed it was because vector was at least as efficient so there was no benefit to scalar, but this explanation makes more sense.

Yes, a separate buffer for just the clip stack would also work fine - for this stage. There are other stages where there are lots of reads from input buffers and a write stage on the output. Those may be seeing slowdowns due to using a single allocation pool.

My sense of the best long term solution is to rely on memory model annotations to tell the compiler explicitly how much coherence we need. But that also feels more like a future thing; it doesn't help us with compatibility. My inclination is to fix this by having the compiler spill by using an array variable, as it's simplest from our pov. The downside is that there is a fixed maximum stack size, but I think that can be big enough (256 is what I have in my head) that it's extremely unlikely to be a problem in practice.
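A minimal sketch of the array-variable approach (the maximum depth, names, and push/pop semantics are assumptions, not the actual #77 code):

```glsl
// Thread-local clip/blend stack; the intent is that the compiler keeps it in
// registers or spills it to private/scratch memory, so kernel4 never writes
// to a storage buffer for clips. MAX_CLIP_DEPTH and the semantics are guesses.
#define MAX_CLIP_DEPTH 256

uint clip_depth = 0u;
vec4 clip_stack[MAX_CLIP_DEPTH];

void begin_clip(vec4 rgba) {
    clip_stack[clip_depth++] = rgba;
}

vec4 end_clip() {
    return clip_stack[--clip_depth];
}
```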
I haven't closed this issue because I'm not sure #77 is the right solution. I spent a little more time poking into it, and believe that scratch is pretty badly broken on the AMD 5700 XT. It has artifacts on Vulkan, and blue-screens on DX12. Talking with a number of other GPU experts, it seems like relying on a large array to be placed in scratch memory is just not something that we can expect to be implemented reliably and efficiently, though there are aspects of it that are appealing.

My gut feeling is that we'll need to implement a separate buffer for the clip spills (to avoid the performance problems from overly generous coherence) and manually spill into that. I'm wrestling with how to allocate that, as targeting the worst case might be really large. You only need an allocation that can support the maximum number of occupant workgroups. That is of course pretty GPU dependent, but I think we can make a conservative estimate (or perhaps even do a fingerprint probe on startup). I'm leaning toward that now, but also considering other possibilities.

We can go back to the pre-77 state, where the amount of scratch is computed in coarse rasterization. That's less dependent on the GPU execution model, but to my mind it just shifts the risks around - it's still very much possible to hit an out of memory condition, it's just that you're not paying for a maximum stack depth.

I'm leaning toward an atomic approach - maybe 32 * workgroup size bits, then on first push the workgroup relaxed-reads those, finds a zero bit, does an atomicOr to set it, and loops if that doesn't succeed. On exit, an atomicAnd resets the bit. I don't think this will be super complicated or expensive. There can be some CPU-side accounting of maximum stack depth and allocation of buffers to support that.

Basically, this issue needs a little more thought to figure out the best approach.
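A rough sketch of the bitmask slot-claiming idea (the sizes, binding, and names are guesses, and the memory-ordering details would need real scrutiny before use):

```glsl
// One bit per spill slot; the spill buffer is sized so that a free slot always
// exists for as many workgroups as the GPU can keep resident.
#define N_SLOT_WORDS 32

layout(std430, set = 0, binding = 3) buffer ClipSlots {
    uint slot_bits[N_SLOT_WORDS]; // bitmask of claimed slots (32 * 32 = 1024)
    uint clip_spill[];            // per-slot, fixed-depth spill area follows
};

shared uint wg_slot;

void claim_slot() {
    if (gl_LocalInvocationIndex == 0) {
        bool claimed = false;
        while (!claimed) {
            for (int i = 0; i < N_SLOT_WORDS && !claimed; i++) {
                int free_bit = findLSB(~slot_bits[i]); // relaxed read
                if (free_bit >= 0) {
                    uint mask = 1u << uint(free_bit);
                    // atomicOr returns the previous value; if the bit was still
                    // clear, this workgroup won the race for the slot.
                    if ((atomicOr(slot_bits[i], mask) & mask) == 0u) {
                        wg_slot = uint(i) * 32u + uint(free_bit);
                        claimed = true;
                    }
                }
            }
        }
    }
    barrier();
}

void release_slot() {
    barrier();
    if (gl_LocalInvocationIndex == 0) {
        atomicAnd(slot_bits[wg_slot / 32u], ~(1u << (wg_slot % 32u)));
    }
}
```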
This is the core logic for robust dynamic memory. There are changes to both shaders and the driver logic.

On the shader side, failure information is more useful and fine grained. In particular, it now reports which stage failed and how much memory would have been required to make that stage succeed.

On the driver side, there is a new RenderDriver abstraction which owns command buffers (and associated query pools) and runs the logic to retry and reallocate buffers when necessary. There's also a fairly significant rework of the logic to produce the config block, as that overlaps the robust memory. The RenderDriver abstraction may not stay. It was done this way to minimize code disruption, but arguably it should just be combined with Renderer.

Another change: the GLSL length() method on a buffer requires additional infrastructure (at least on Metal, where it needs a binding of its own), so we now pass that in as a field in the config.

This also moves blend memory to its own buffer. This worked out well because coarse rasterization can simply report the size of the blend buffer and it can be reallocated without needing to rerun the pipeline. In the previous state, blend allocations and ptcl writes were interleaved in coarse rasterization, so a failure of the former would require rerunning coarse.

This should fix #83 (finally!)

There are a few loose ends. The binaries haven't (yet) been updated (I've been testing using a hand-written test program). Gradients weren't touched so still have a fixed size allocation. And the logic to calculate the new buffer size on allocation failure could be smarter.

Closes #175
I'm getting piet-gpu running on Android (see #82 for snapshot), and running into fine rasterization being considerably slower than expected. The other stages in the pipeline seem fine. I've done some investigation but the fundamental cause remains a mystery.
Info so far: the Adreno 640 (Pixel 4) has a default subgroup size of 64 (though it can also be set to 128 using the vk subgroup size control). That should be fine, as it means that the memory read operations from the per-tile command list are amortized over a significant number of pixels even if CHUNK (the number of pixels written per invocation of kernel4.comp) is 1 or 2. If CHUNK is 4, then the workgroup and subgroup sizes are the same; any larger value results in a partially filled subgroup. There's more detail about the Adreno 6xx on the freedreno wiki.
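For context, the dispatch shape implied above looks roughly like this (the tile size and CHUNK value are assumptions based on the description, not copied from kernel4.comp):

```glsl
// Sketch of kernel4's dispatch shape. Each workgroup shades one tile; each
// invocation writes CHUNK pixels, so a workgroup has 256 / CHUNK invocations.
// With a 64-wide subgroup: CHUNK=1 -> 4 subgroups, CHUNK=2 -> 2, CHUNK=4 -> 1.
#version 450
#define TILE_WIDTH_PX 16
#define TILE_HEIGHT_PX 16
#define CHUNK 4

layout(local_size_x = TILE_WIDTH_PX, local_size_y = TILE_HEIGHT_PX / CHUNK) in;

void main() {
    // each invocation shades CHUNK pixels of its tile (chunk layout is a guess)
}
```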
There are some experiments that move the needle. Perhaps the most interesting is that commenting out the bodies of Cmd_BeginClip and Cmd_EndClip at least doubles the speed. This is true even when the clip commands are completely absent from the workload. My working hypothesis is that register pressure from accommodating the clip stack and other state is reducing the occupancy.

Another interesting set of experiments involves adding a per-pixel ALU load. The amount of time taken increases significantly with CHUNK. I'm not sure how to interpret that yet. Tweaking synthetic workloads like this may well be the best way to move forward, though I'd love to be able to see the asm from the shader compilation. I'm looking into the possibility of running this workload on the same (or similar) hardware but with a free driver such as freedreno, so that I might be able to gain more insight that way.
I've been using Android GPU Inspector (make sure to use at least 1.1) but so far it only gives me fairly crude metrics - things like % ALU capacity and write bandwidth scale with how much work gets done, and other metrics like % shaders busy and GPU % Utilization are high in all cases.
There are other things I've been able to rule out: a failure of loop unrolling by the shader compiler, and a failure to account for ptcl reads being dynamically uniform (I manually had one invocation read and broadcast the results, yielding no change).
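The broadcast experiment mentioned above can be expressed with subgroup intrinsics; a self-contained sketch under the assumption that the ptcl read is a plain SSBO load (buffer names are illustrative):

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_ballot : enable

layout(local_size_x = 64) in;

// Illustrative stand-ins for the ptcl buffer and an output buffer.
layout(std430, set = 0, binding = 0) readonly buffer Ptcl { uint ptcl[]; };
layout(std430, set = 0, binding = 1) buffer Result { uint result[]; };

void main() {
    uint cmd_ix = gl_WorkGroupID.x; // dynamically uniform across the subgroup

    // Only one invocation performs the load...
    uint tag = 0u;
    if (subgroupElect()) {
        tag = ptcl[cmd_ix];
    }
    // ...and broadcasts it, so the read is explicitly treated as uniform.
    tag = subgroupBroadcastFirst(tag);

    result[gl_GlobalInvocationID.x] = tag;
}
```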
I do have some ideas how to make things faster (including moving as much of the math as possible to f16), but the first order of business is understanding why it's slow, especially when we don't seem to be seeing similar problems on desktop GPUs.