Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split Vertex Buffers Into Position and Interleaved Attribute Streams #2350

Closed
The-O-King opened this issue Feb 25, 2021 · 13 comments
Closed

Comments

@The-O-King
Copy link

The-O-King commented Feb 25, 2021

Describe the project you are working on

Godot Renderers

Describe the problem or limitation you are having in your project

Applications with large amounts of vertex data are bottlenecked by memory bandwidth, especially on mobile devices; and applications with lower amounts of vertex data still throw away a lot of power/time/cache performance

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Vertex stream splitting is a cache performance optimization which makes a difference on tile-based GPUs such as those on all Android Devices - in particular during the binning phase of the rendering process as this only requires vertex positions. It can also help improve vertex memory bandwidth usage in other situations such as shadows, where only vertex position data is needed

An initial implementation saw a 50% reduction in vertex memory bandwidth usage on Android devices

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Vertex Stream splitting will take the existing single buffer used for interleaving all vertex attributes and modify it so that this single buffer is now split into two components - a contiguous block for vertex position data appended to a contiguous block to interleave the remaining vertex attribute data

Before:
|Position1/Normal1/Tangent1/UV1/Position2/Normal2/Tangent2/UV2......|

After:
|Position1/Position2...|Normal1/Tangent1/UV1/Normal2/Tangent2/UV2...|

A test implementation of this has already been written for Godot 3.2 and both it's GLES2/GLES3 renderer. Changes that need to be made occur in

  • rasterizer_storage_gles[2|3].cpp - the mesh_add_surface() function needs to take into account the new offsets and strides in the buffer for specifying the location of vertex positions and attributes
  • visual_server.cpp - the surface_set_data() function will need to write the mesh's vertex data into the buffer properly, and the get_array_from_surface() function will need to be updated to read vertex positions/attributes properly with the correct buffer offsets

These changes should also be portable to Godot 4.0 and the Vulkan rendering backend.

If this enhancement will not be used often, can it be worked around with a few lines of script?

N/A

Is there a reason why this should be core and not an add-on in the asset library?

This is a low level change to the way that vertex data management is done in the renderer

Android Games DevTech Team @ Google

@clayjohn
Copy link
Member

I think this is already implemented in 4.0. if not, I know reduz has discussed adding it.

@reduz
Copy link
Member

reduz commented Feb 25, 2021

I did a similar split in 4.0 but not entirely sure if its the same thing.

In 4.0 I have vertex/normal/tangent (20 bytes)
Uv/color/etc.

This split is useful for depth prepass (not so useful on mobile), and skeleton/blendshape processing (done on compute in 4.0). For shadow rendering I use a separate array of only welded vertices in order.to minimize bandwidth and maximize cache usage.

@reduz
Copy link
Member

reduz commented Feb 25, 2021

Maybe on mobile vertex/nornal/tangent could be deinterlaced, but I would probably want to check performamce gain given its less data anyway.

I also still don't entirely understand the optimization, I thought the binning happens at the rop stage in TBDR, after the vertex program.

For 3.2 I guess it makes sense, though over there I can only imagine it to be an import option.

@BastiaanOlij
Copy link

@reduz sounds like they already have the work done for 3.2 so if it works I'd say hell yez let's get it added.

For 4.0 I think this is worth some speed tests. I can imagine that if we change it to vertex1/vertex2/vertex.../normal1/normal2/normal.../tangent1/tangent2/tangent.../uv1_1/uv2_1/color1/others1/uv1_2/uv2_2/color2/others2 it is possible we get best of both worlds and do away with the extra buffer. Or maybe make it a setting between the layouts and make it possible to vary depending on platform (so using this layout on mobile, the current layout on desktop, etc).

And indeed, I wonder what the magic is that happens on the GPU that allows it to run enough of the vertex shader to only read position without reading all the other properties before deciding to discard fragments. I don't know enough about how this works internally to judge at all but eager to learn more about this :)

Oh and welcome Android Games DevTech Team, awesome to have you guys chiming in!

@The-O-King
Copy link
Author

Thanks for the feedback and welcome :)

Mobile devices have an optimization where binning occurs before the full vertex shader is run; the driver takes the vertex shader and compiles a gl_Position shader (using glsl parlance) which only does the calculations necessary to compute gl_Position. This then gets run on all vertices to do binning, culling any vertices that won't contribute fragments to the image onscreen. Once the binning is complete, then the rest of the vertex shader is run on the active vertices!

So having the vertex positions being contiguous would help improve cache performance for example (especially with static meshes). One thing I have not tested is with skeletal meshes which would require bone weights along with vertex positions to compute the final gl_Position, so I'm not sure how much performance will improve in that scenario

I'm not sure how prevalent this optimization is on older devices but on my pixel 3 and pixel 4 it has made a big (50%) difference in vertex memory bandwidth usage for the test scene I created (basically just a single ~100k vertex torus haha)

And yes right now I have an initial implementation for 3.2 that I can clean up and share soon!

@clayjohn
Copy link
Member

What is the impact on non-mobile devices? Should we be creating a separate path for mobile to take advantage of this optimization?

@reduz
Copy link
Member

reduz commented Feb 26, 2021

@clayjohn I think on desktop it does not really matter a lot, so we can simply uninterleave the vertices always.

@The-O-King
Copy link
Author

I tried to get some performance data (look at overall framerate in the Godot editor) on my laptop with intel integrated and GTX 2070 Max-Q and did not see an appreciable drop in performance (I modified the scene to have 50 of these static torus meshes, each with the 100k vertices mentioned before so that it's more of a challenge on desktop haha)

I'm not as familiar with the GPU profiling tools on desktop, for future reference do you all have any preferred tools for looking at performance data on desktop platforms?

@clayjohn
Copy link
Member

@The-O-King The best tools are usually the ones maintained by your GPU hardware vendor. So for NVidia, I think NSight is likely the best tool for profiling.

@The-O-King
Copy link
Author

@clayjohn gotcha thanks for the pointer haha, I'll familiarize myself more with it!

@The-O-King
Copy link
Author

Haven't created a full PR yet but I have a feature branch on my fork of godot up for people to take a look at :)

reduz mentioned making it an import option (right now it just assumes everything wants this new buffer format so you'll have to reimport any existing assets in) I can try to get that working next week!

The-O-King/godot@47be5f1

@The-O-King
Copy link
Author

Hello everyone! Just wanted to update this with the PR that I submitted with the implementation

I added an import option for the stream splitting, however I also found that the easiest way to pass down the flag for stream splitting was to also pass down the rest of the mesh flags from the Import UI so right now I made it so that you could also select which individual vertex attributes you want to compress on import (this was actually going to be another proposal but I thought this might kill two birds with one stone haha)

This is what it currently looks like UI-wise, let me know what you think or if there's any feedback/improvements on what I've done so far

image

thanks :)

@akien-mga
Copy link
Member

akien-mga commented Jul 20, 2021

Implemented in 3.4 by godotengine/godot#46574.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants