Optimize PointLight2D shadow rendering by reducing draw calls and RD state changes #100302
This dramatically reduces the CPU time spent on rendering shadows for PointLight2Ds.
I think this fixes the remaining regression in #99420.
Fixes: #73805
Basically the problem is that certain changes in 4.4 have increased the cost of:
Rendering PointLight2D shadows creates 4 draw lists per light on screen and
4 * lights_on_screen * occluders_on_screen
draw calls. Each draw call comes with 4 other API calls. Therefore, we are checking the thread guard
5 * 4 * lights_on_screen * occluders_on_screen
times.
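To make the scaling concrete, here is a minimal sketch of the arithmetic above. The function and constant names are hypothetical, and the scene size in the usage lines is an illustrative assumption, not a measurement from #99420.

```python
# Sketch of the call-count arithmetic described above.
# All names here are hypothetical; only the factors 4 and 5 come from the text.

PASSES_PER_LIGHT = 4    # the text states 4 draw lists per light on screen
API_CALLS_PER_DRAW = 5  # the draw call itself plus the 4 other API calls


def draw_lists(lights_on_screen: int) -> int:
    """Number of draw lists created for the lights on screen."""
    return PASSES_PER_LIGHT * lights_on_screen


def draw_calls(lights_on_screen: int, occluders_on_screen: int) -> int:
    """Number of draw calls: 4 * lights_on_screen * occluders_on_screen."""
    return PASSES_PER_LIGHT * lights_on_screen * occluders_on_screen


def thread_guard_checks(lights_on_screen: int, occluders_on_screen: int) -> int:
    """Thread guard checks: one per API call, so 5 per draw call."""
    return API_CALLS_PER_DRAW * draw_calls(lights_on_screen, occluders_on_screen)


if __name__ == "__main__":
    # Assumed scene size, chosen only for illustration.
    lights, occluders = 50, 100
    print(draw_lists(lights))                      # 200
    print(draw_calls(lights, occluders))           # 20000
    print(thread_guard_checks(lights, occluders))  # 100000
```

The cost grows with the product of lights and occluders, so doubling both quadruples the number of checks.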
In #99420 this means we check the thread guard over 100,000 times. So, despite it being a very cheap operation, it ends up reducing performance in a measurable way.
Eventually we will reduce the cost of thread guards by disabling them in release builds and maybe by disabling them when RD functions are called internally. But for now, the best option is to drastically reduce the algorithmic complexity and other costs of rendering shadows. I did that with a number of things:
Most of these changes increase the cost on the GPU. However, this shader is still so simple that the GPU spends way more time waiting for commands than it does actually drawing things. So these changes have no measurable impact on GPU time.
Finally, I left one optimization on the table. We can reduce the entire draw loop to 4 draw calls per light by using one giant shared vertex buffer and using vertex pulling in the shader to read the vertex positions. I didn't implement this since:
Overall, since I got the performance gain I needed and the current code is not much more complex than it was before, I decided to leave it here.
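For readers unfamiliar with the shared-buffer idea that was left unimplemented, the following is a rough CPU-side sketch of what packing could look like. It is not code from this PR: `pack_occluders` and `pull_vertex` are hypothetical names, and `pull_vertex` only stands in for the indexed read a shader would do when vertex pulling.

```python
from typing import List, Tuple

Vertex = Tuple[float, float]


def pack_occluders(
    occluders: List[List[Vertex]],
) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Concatenate every occluder's vertices into one flat buffer.

    Returns the flat [x0, y0, x1, y1, ...] buffer and, for each occluder,
    a (first_vertex, vertex_count) range into that buffer.
    """
    flat: List[float] = []
    ranges: List[Tuple[int, int]] = []
    for verts in occluders:
        first_vertex = len(flat) // 2  # two floats per vertex
        for x, y in verts:
            flat.extend((x, y))
        ranges.append((first_vertex, len(verts)))
    return flat, ranges


def pull_vertex(flat: List[float], vertex_index: int) -> Vertex:
    """Stand-in for the shader-side fetch: read a position by vertex index."""
    return flat[2 * vertex_index], flat[2 * vertex_index + 1]


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    quad = [(2.0, 2.0), (3.0, 2.0), (3.0, 3.0), (2.0, 3.0)]
    flat, ranges = pack_occluders([triangle, quad])
    print(ranges)                # [(0, 3), (3, 4)]
    print(pull_vertex(flat, 3))  # (2.0, 2.0)
```

With all occluders in one buffer, a single draw can cover every occluder, which is where the figure of 4 draw calls per light comes from.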
Performance
In my test scene, performance goes from 330 FPS on master to 430 FPS (Windows, RX 3600, release builds).
For comparison, 4.4 dev3 was about 380 FPS. So I am confident that this already fully restores the performance from 4.3 and then some.
On an M2 MBP it goes from 160 FPS (dev3) to 400 FPS (debug build), which makes sense since it is a tiling architecture.
On a Pixel 4 it goes from 17 FPS to 60 FPS (vsync locked).
This test project is intended to be a worst-case scenario since it has so many lights and occluders on screen at once.
light2dopt.zip
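FPS differences are not linear in cost, so it can help to convert the numbers above into frame times. This is a small sketch using only the figures quoted in this description; the helper names are hypothetical.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


def saved_ms(fps_before: float, fps_after: float) -> float:
    """Frame time saved when going from fps_before to fps_after."""
    return frame_time_ms(fps_before) - frame_time_ms(fps_after)


if __name__ == "__main__":
    # (label, FPS before, FPS after), as quoted in the description.
    # The Pixel 4 result is vsync locked, so its saving is a lower bound.
    results = [
        ("Windows, release builds", 330.0, 430.0),
        ("M2 MBP", 160.0, 400.0),
        ("Pixel 4", 17.0, 60.0),
    ]
    for label, before, after in results:
        print(
            f"{label}: {frame_time_ms(before):.2f} ms -> "
            f"{frame_time_ms(after):.2f} ms "
            f"(saves {saved_ms(before, after):.2f} ms)"
        )
```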