Optimize PointLight2D shadow rendering by reducing draw calls and RD state changes #100302

Merged (1 commit) on Dec 17, 2024

@clayjohn clayjohn commented Dec 12, 2024

This dramatically reduces the CPU time spent on rendering shadows for PointLight2Ds.

I think this fixes the remaining regression in #99420
Fixes: #73805

Basically the problem is that certain changes in 4.4 have increased the cost of:

  1. Most RD API calls (this is the Thread Guard change)
  2. Ending a draw list (this is the command intersection checks)

Rendering PointLight2D Shadows creates 4 draw lists per light on screen and 4 * lights_on_screen * occluders_on_screen draw calls. And each draw call comes with 4 other API calls.

Therefore, we check the thread guard 5 * 4 * lights_on_screen * occluders_on_screen times. In #99420 this means we check the thread guard over 100,000 times. So, despite being a very cheap operation, it ends up reducing performance in a measurable way.
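To make the scaling concrete, here is a small sketch of the arithmetic above. The factors of 4 and 5 come from the description; the light and occluder counts are hypothetical, picked only to land in the same order of magnitude as the figure quoted for #99420.

```python
def thread_guard_checks(lights_on_screen: int, occluders_on_screen: int) -> int:
    """Estimate thread guard checks for the old PointLight2D shadow path."""
    draw_lists_per_light = 4    # 4 draw lists per light on screen
    api_calls_per_draw = 1 + 4  # the draw call itself plus 4 other API calls
    draw_calls = draw_lists_per_light * lights_on_screen * occluders_on_screen
    return api_calls_per_draw * draw_calls


# Hypothetical scene: 50 lights and 100 occluders on screen.
print(thread_guard_checks(50, 100))  # 100000
```

Because the count is a product of lights and occluders, trimming either factor (for example by culling occluders per light) reduces the total proportionally.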

Eventually we will reduce the cost of Thread guards by disabling them in release builds and maybe by disabling them when RD functions are called internally. But for now, the best option is just to drastically reduce the algorithmic complexity and other costs of rendering shadows. I did that with a number of things:

  1. Cull occluders against each light, so we only render occluders that matter
  2. Move the projection creation to the GPU so we transfer less data (binding the push constant is one of the more expensive operations simply because of the memory copies)
  3. Use viewport culling instead of creating a render pass for each direction
  4. Save the occluder transforms in an SSBO and reuse for all lights so we only pay the upload cost once
  5. Move culling into the fragment shader so we don't have to constantly switch pipelines
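As an illustration of the first item, culling occluders against a light can be as simple as a circle versus bounding-box test. This is only a sketch of the idea, not the engine's implementation: the `Rect` and `Light` types, their fields, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned bounding box of an occluder (hypothetical layout)."""
    x: float
    y: float
    w: float
    h: float


@dataclass
class Light:
    """Point light described by position and range (hypothetical layout)."""
    x: float
    y: float
    radius: float


def light_touches(light: Light, box: Rect) -> bool:
    """Circle vs. AABB overlap: clamp the light centre to the box, then compare distances."""
    nearest_x = min(max(light.x, box.x), box.x + box.w)
    nearest_y = min(max(light.y, box.y), box.y + box.h)
    dx, dy = light.x - nearest_x, light.y - nearest_y
    return dx * dx + dy * dy <= light.radius * light.radius


def cull_occluders(light: Light, occluders: list[Rect]) -> list[int]:
    """Return indices of the occluders that overlap the light's range."""
    return [i for i, box in enumerate(occluders) if light_touches(light, box)]


occluders = [Rect(0, 0, 10, 10), Rect(500, 500, 10, 10), Rect(40, 0, 10, 10)]
light = Light(x=20, y=5, radius=25)
print(cull_occluders(light, occluders))  # [0, 2]
```

Returning indices rather than copies fits the idea of a shared per-frame occluder buffer: each light only needs to know which entries to draw.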

Most of these changes increase the cost on the GPU. However, this shader is still so simple that the GPU spends way more time waiting for commands than it does actually drawing things. So these changes have no measurable impact on GPU time.

Finally, I left one optimization on the table. We can reduce the entire draw loop to 4 draw calls per light by using one giant shared vertex buffer and using vertex pulling in the shader to read the vertex positions. I didn't implement this since:

  1. It would add a significant amount of complexity and make the whole process harder to understand
  2. It would require a lot of bookkeeping
  3. It would be much riskier than the current changes

Overall, since I got the performance gain I needed and the current code is not much more complex than it was before, I decided to leave it here.
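For readers unfamiliar with vertex pulling, the bookkeeping mentioned above could look roughly like the following CPU-side sketch. It is not part of this PR; the function names and the (offset, count) layout are assumptions made for illustration only.

```python
def pack_occluders(polygons: list[list[tuple[float, float]]]):
    """Pack per-occluder vertex lists into one flat buffer plus (offset, count) ranges."""
    shared_vertices: list[tuple[float, float]] = []
    ranges: list[tuple[int, int]] = []
    for polygon in polygons:
        ranges.append((len(shared_vertices), len(polygon)))
        shared_vertices.extend(polygon)
    return shared_vertices, ranges


def pull_vertex(shared_vertices, ranges, occluder_index: int, local_index: int):
    """Mimic a vertex-pulling shader: fetch by offset instead of a bound vertex buffer."""
    offset, count = ranges[occluder_index]
    if not 0 <= local_index < count:
        raise IndexError("vertex index outside the occluder's range")
    return shared_vertices[offset + local_index]


polygons = [[(0, 0), (1, 0), (1, 1)], [(5, 5), (6, 5), (6, 6), (5, 6)]]
vertices, ranges = pack_occluders(polygons)
print(ranges)                              # [(0, 3), (3, 4)]
print(pull_vertex(vertices, ranges, 1, 2))  # (6, 6)
```

Every time an occluder is added, removed, or reshaped, the offsets of everything packed after it have to be kept in sync, which is where the extra bookkeeping and risk would come from.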

Performance

In my test scene, performance goes from 330 FPS in master to 430 FPS (Windows, RX 3600, release builds).
For comparison, 4.4 dev3 was about 380 FPS. So I am confident that this already fully restores the performance from 4.3 and then some.
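FPS differences are easier to compare as frame times. A quick conversion of the Windows numbers above (the helper name is just for illustration):

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps


for label, fps in [("master", 330), ("4.4 dev3", 380), ("this PR", 430)]:
    print(f"{label}: {frame_time_ms(fps):.2f} ms")
```

That works out to roughly 3.03 ms, 2.63 ms, and 2.33 ms per frame, so the change saves about 0.7 ms per frame relative to master in this scene.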

On an M2 MBP it goes from 160 FPS (dev3) to 400 FPS (debug build), which makes sense since it is a tiling architecture.

On a Pixel 4 it goes from 17 FPS to 60 FPS (vsync locked).

This test project is intended to be a worst-case scenario, since it has so many lights and occluders on screen at once:
light2dopt.zip

@stuartcarnie stuartcarnie left a comment

Those are some really nice improvements.

@akien-mga akien-mga merged commit 190ae9f into godotengine:master Dec 17, 2024
20 checks passed
@akien-mga

Thanks!

Successfully merging this pull request may close these issues.

2D optimizations issue on Adreno ("SnapDragon") GPUs