Split Vertex Buffers Into Position and Interleaved Attribute Streams #2350
Comments
I think this is already implemented in 4.0. If not, I know reduz has discussed adding it.
I did a similar split in 4.0, but I'm not entirely sure if it's the same thing. In 4.0 I have vertex/normal/tangent (20 bytes). This split is useful for the depth prepass (not so useful on mobile) and for skeleton/blendshape processing (done on compute in 4.0). For shadow rendering I use a separate array of only welded vertices in order to minimize bandwidth and maximize cache usage.
Maybe on mobile vertex/normal/tangent could be deinterleaved, but I would probably want to check the performance gain given it's less data anyway. I also still don't entirely understand the optimization; I thought the binning happens at the ROP stage in TBDR, after the vertex program. For 3.2 I guess it makes sense, though over there I can only imagine it as an import option.
@reduz sounds like they already have the work done for 3.2, so if it works I'd say hell yes, let's get it added. For 4.0 I think this is worth some speed tests. I can imagine that if we change it to vertex1/vertex2/vertex.../normal1/normal2/normal.../tangent1/tangent2/tangent.../uv1_1/uv2_1/color1/others1/uv1_2/uv2_2/color2/others2 it is possible we get the best of both worlds and do away with the extra buffer. Or maybe make it a setting between the layouts and make it possible to vary it depending on platform (so using this layout on mobile, the current layout on desktop, etc.).

And indeed, I wonder what the magic is that happens on the GPU that allows it to run enough of the vertex shader to read only position, without reading all the other properties, before deciding to discard fragments. I don't know enough about how this works internally to judge at all, but I'm eager to learn more about this :)

Oh, and welcome Android Games DevTech Team, awesome to have you guys chiming in!
Thanks for the feedback and welcome :)

Mobile devices have an optimization where binning occurs before the full vertex shader is run: the driver takes the vertex shader and compiles a gl_Position shader (in GLSL parlance) which only does the calculations necessary to compute gl_Position. This then gets run on all vertices to do binning, culling any vertices that won't contribute fragments to the image onscreen. Once the binning is complete, the rest of the vertex shader is run on the active vertices! So having the vertex positions be contiguous helps improve cache performance, especially with static meshes. One thing I have not tested is skeletal meshes, which would require bone weights along with vertex positions to compute the final gl_Position, so I'm not sure how much performance will improve in that scenario.

I'm not sure how prevalent this optimization is on older devices, but on my Pixel 3 and Pixel 4 it has made a big (50%) difference in vertex memory bandwidth usage for the test scene I created (basically just a single ~100k vertex torus haha).

And yes, right now I have an initial implementation for 3.2 that I can clean up and share soon!
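To make the bandwidth argument more concrete, here is a minimal sketch (plain GLES calls with illustrative names, not the actual Godot renderer code) contrasting how a position-only pass would bind the two layouts: with interleaving, every position fetch strides across the whole vertex, while with the split layout the pre-pass walks a tightly packed 12-byte-per-vertex block.

```cpp
// Illustrative only: hypothetical attribute setup for a position-only pass
// (binning / depth prepass / shadows). Names and sizes are assumptions,
// not Godot's actual renderer code.
#include <GLES3/gl3.h>

void bind_positions_interleaved(GLuint vbo, GLsizei vertex_stride /* e.g. 32+ bytes */) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Position data is scattered through the buffer, so reading only
    // positions still pulls in the bytes of every other attribute.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, vertex_stride, (const void *)0);
    glEnableVertexAttribArray(0);
}

void bind_positions_split(GLuint vbo) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // Positions live in their own tightly packed block at the start of the
    // buffer (12 bytes per vertex), so the pre-pass reads contiguous memory.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (const void *)0);
    glEnableVertexAttribArray(0);
}
```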
What is the impact on non-mobile devices? Should we be creating a separate path for mobile to take advantage of this optimization? |
@clayjohn I think on desktop it does not really matter a lot, so we can simply uninterleave the vertices always. |
I tried to get some performance data (looking at overall framerate in the Godot editor) on my laptop with Intel integrated graphics and a GTX 2070 Max-Q and did not see an appreciable drop in performance (I modified the scene to have 50 of these static torus meshes, each with the 100k vertices mentioned before, so that it's more of a challenge on desktop haha).

I'm not as familiar with the GPU profiling tools on desktop; for future reference, do you all have any preferred tools for looking at performance data on desktop platforms?
@The-O-King The best tools are usually the ones maintained by your GPU hardware vendor. So for NVidia, I think NSight is likely the best tool for profiling. |
@clayjohn gotcha thanks for the pointer haha, I'll familiarize myself more with it! |
Haven't created a full PR yet, but I have a feature branch on my fork of godot up for people to take a look at :) reduz mentioned making it an import option (right now it just assumes everything wants this new buffer format, so you'll have to reimport any existing assets). I can try to get that working next week!
Hello everyone! Just wanted to update this with the PR that I submitted with the implementation.

I added an import option for the stream splitting. However, I also found that the easiest way to pass down the flag for stream splitting was to also pass down the rest of the mesh flags from the Import UI, so right now you can also select which individual vertex attributes you want to compress on import (this was actually going to be another proposal, but I thought this might kill two birds with one stone haha).

This is what it currently looks like UI-wise; let me know what you think or if there's any feedback/improvements on what I've done so far. Thanks :)
Implemented in 3.4 by godotengine/godot#46574. |
Describe the project you are working on
Godot Renderers
Describe the problem or limitation you are having in your project
Applications with large amounts of vertex data are bottlenecked by memory bandwidth, especially on mobile devices, and even applications with smaller amounts of vertex data still waste a lot of power, time, and cache performance.
Describe the feature / enhancement and how it helps to overcome the problem or limitation
Vertex stream splitting is a cache performance optimization that makes a difference on tile-based GPUs, such as those on Android devices, in particular during the binning phase of the rendering process, as binning only requires vertex positions. It can also help improve vertex memory bandwidth usage in other situations such as shadow rendering, where only vertex position data is needed.

An initial implementation saw a 50% reduction in vertex memory bandwidth usage on Android devices.
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
Vertex stream splitting takes the existing single buffer used for interleaving all vertex attributes and splits it into two components: a contiguous block of vertex position data, followed by a contiguous block of the remaining vertex attribute data, which stays interleaved.
Before:
|Position1/Normal1/Tangent1/UV1/Position2/Normal2/Tangent2/UV2......|
After:
|Position1/Position2...|Normal1/Tangent1/UV1/Normal2/Tangent2/UV2...|
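As a rough sketch of the offset math this layout implies (hypothetical helper, assuming uncompressed three-float positions; the real surface format supports more attribute types and compression options):

```cpp
// Hypothetical offsets for the split layout: a packed position block at the
// start of the buffer, followed by the interleaved block for the rest.
#include <cstddef>

struct SplitLayout {
    size_t position_stride;  // bytes per vertex in the position block
    size_t attribute_stride; // bytes per vertex in the interleaved block
    size_t attribute_base;   // byte offset where the interleaved block starts
};

SplitLayout make_split_layout(size_t vertex_count, size_t attribute_stride) {
    SplitLayout l;
    l.position_stride = 3 * sizeof(float); // assuming float32 positions
    l.attribute_stride = attribute_stride; // normals, tangents, UVs, colors...
    l.attribute_base = vertex_count * l.position_stride;
    return l;
}

// Where vertex v's data lives within the single shared buffer:
size_t position_offset(const SplitLayout &l, size_t v) {
    return v * l.position_stride;
}
size_t attribute_offset(const SplitLayout &l, size_t v, size_t offset_in_vertex) {
    return l.attribute_base + v * l.attribute_stride + offset_in_vertex;
}
```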
A test implementation of this has already been written for Godot 3.2, covering both its GLES2 and GLES3 renderers. The changes that need to be made occur in:

rasterizer_storage_gles[2|3].cpp - the mesh_add_surface() function needs to take into account the new offsets and strides in the buffer when specifying the location of vertex positions and attributes.

visual_server.cpp - the surface_set_data() function will need to write the mesh's vertex data into the buffer properly, and the get_array_from_surface() function will need to be updated to read vertex positions/attributes with the correct buffer offsets. (A sketch of what this packing could look like follows below.)

These changes should also be portable to Godot 4.0 and the Vulkan rendering backend.
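For illustration, a minimal sketch of what surface_set_data()-style packing into the split layout could look like (hypothetical code handling only float3 positions and normals; the real function deals with many more attribute types, compression, and index data):

```cpp
// Hypothetical packing of vertex data into the split layout:
// all positions first, then the remaining attributes interleaved.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Vector3 { float x, y, z; };

std::vector<uint8_t> pack_split(const std::vector<Vector3> &positions,
        const std::vector<Vector3> &normals) {
    const size_t vertex_count = positions.size();
    const size_t position_stride = 3 * sizeof(float);
    const size_t attribute_stride = 3 * sizeof(float); // only normals in this sketch
    const size_t attribute_base = vertex_count * position_stride;

    std::vector<uint8_t> buffer(vertex_count * (position_stride + attribute_stride));

    for (size_t v = 0; v < vertex_count; v++) {
        // Position block: tightly packed at the start of the buffer.
        std::memcpy(buffer.data() + v * position_stride, &positions[v], position_stride);
        // Attribute block: interleaved data (here just normals) after all positions.
        std::memcpy(buffer.data() + attribute_base + v * attribute_stride, &normals[v], attribute_stride);
    }
    return buffer;
}
```

get_array_from_surface() would do the inverse, reading positions out of the first block and the remaining attributes out of the second.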
If this enhancement will not be used often, can it be worked around with a few lines of script?
N/A
Is there a reason why this should be core and not an add-on in the asset library?
This is a low-level change to the way vertex data management is done in the renderer.
Android Games DevTech Team @ Google