Performance problems I found #7366
Thanks a lot for putting this together! It sounds like we can have pretty noticeable gains by combining these approaches.
Regarding this issue, I tried removing all instances of
In both instances (with and without …). There are surely other scenes and GPUs that could benefit from this change. I'm just puzzled why I'm not benefiting from this change in the slightest. While I have a high-end GPU, I definitely expect a significant performance improvement to be feasible (given the performance I can get in other engines).
We should implement 2D batching in Vulkan renderers, as is already done in OpenGL. It wasn't a priority due to Vulkan's lower CPU overhead, but there are a lot of cases where it's still beneficial, especially on integrated graphics.
Did you mean …?
Parts of this would likely resolve godotengine/godot#68959
Oh thanks for the scene test! I was curious. Good idea about trying it out! Note that you need to do the following if you want to measure the changes:
Once I did all that I got: Discard enabled:
Discard disabled:
This is on an AMD 6800 XT. So it DOES make a difference, just a tiny one (0.01ms - 0.02ms). This is repeatable. The problem with these monster GPUs is that:
Yes. Fixed.
I took care of the discard issue, so we can take it off the table: godotengine/godot#79865. @BastiaanOlij and I noticed the issue with the … At any rate, it'll be merged upstream soon.
Hey @darksylinc,

Great stuff man! Just some feedback from my end (and sorry if I'm doubling up on anyone, I haven't read all the reactions yet :) ).

Every pixel shader is using Discard

Me and Clay ran into this a while back but never got around to fixing it because it was unclear from the code in what situation the opaque prepass threshold actually made sense. Clay just submitted a PR :)

Depth Only passes (e.g. Depth prepass + Shadow Maps) are using pixel shaders

This was pointed out to me a little while ago too; we're relying on the GPU driver to recognise the fragment shader isn't doing any actual work and take it out. Obviously with the discard issue up above that never happens. Implementing this could be tricky because there are scenarios where the fragment shader is doing work during the prepass: when the above discard check is needed, for instance, and when certain effects are enabled (like GI, I believe) where it writes more than just depth. I do wonder if in these scenarios the depth prepass is still effective.

Shadow maps appear to have no frustum culling (or it's broken) or wrong use of depth clamp

This one is weird. I've been working a lot on shadows lately and I can confirm the culling logic is run, but indeed it comes back with far more objects than you'd expect. That said, especially for directional shadows where the shadow map is far larger than the frustum we're clipping against, you can see it working clearly when displaying the directional shadows. I haven't dived deep enough into the culling logic, but I did notice that the code that checks whether an AABB is inside the frustum is fairly optimistic; if I recall correctly it ends up performing this check: https://github.com/godotengine/godot/blob/master/servers/rendering/renderer_scene_cull.h#L201

Where bounds are calculated from an AABB in the method above.

Small objects in Shadow Maps and LOD

I'm not sure how we manage LOD at the moment for shadows, haven't fully looked into this, but indeed there isn't any code that would discard small objects based on distance. I also wonder how much impact large collections of small objects have that could be filtered out by an encapsulating AABB. This has been often discussed but never really got anywhere. My go-to example is creating a small village the player can walk around in, with each building having a fully worked out interior. Often the entire interior of a building can be skipped just by checking its encompassing AABB, but right now the culling goes through each and every tiny little object in each and every building.

Binds one VB/IB per draw call

That is cool information; I'm not sure how easy this will be to implement however. I'm not sure how we'd go about merging meshes for this purpose, especially when we're allowing materials to be overridden on instance level, and while the meshes may share vertex and index buffers, they may need completely different shader pipelines.

Sky & Tonemap are doing fullscreen quad instead of fullscreen tri

Dang, never thought about this but it makes complete sense. There is no cost to rendering a tri much larger than the screen, but one tri is better than two. I am surprised about the performance benefit you're expecting, is it really that high? That said, for sky I have a PR outstanding that actually renders a sphere because it allows moving a lot of the UV calculations into the vertex shader, especially because when stereo rendering we need to do a full unproject. This, as I understand, has less importance on desktop, but on mobile it can make a big difference (both not doing a full matrix calculation in the fragment shader and allowing for pre-fetch of the textures). Still need to do more measurements around it.

Passes are not amalgamated

This is something I changed on the mobile renderer and we may be able to pull some of that back into the forward+ renderer. On the mobile renderer we're using subpasses as much as we can. I think this may also make a huge difference on rendering shadow maps as we're starting new passes for each light. On that subject, and this was feedback I got from one of the GPU vendors, the way we start passes for rendering shadows is also bad. We're not using

Gaussian Glow performs inefficient pipeline barriers & lack of barrier solver + Lack of barrier solver

Not much to add here other than: cool to get your insights. I'm only just starting to wrap my head around how barriers are applied in Godot, especially after discussing the impact of not checking vertex and fragment shader barriers appropriately on mobile, one of the things one of the mobile GPU vendors pointed out. This is why we recently split the "raster barrier" into two; again, if I've been told correctly, this has no effect on desktop, but on mobile our current approach was preventing vertex shaders (from the next draw command) and fragment shaders from running in parallel.

Text is drawn w/ each letter as one drawcall of 6 vertices

Nothing to add here other than: sounds logical.

SPIR-V Optimizer is turned off

Clay and I looked at this earlier in the year as well. Juan was adamant that the GPU drivers did a better job. I personally am in favour of adding the SPIR-V optimizer, if only because it provides much more insight into what's left over after optimisation. Hearing that mobile drivers do a poor job makes me even more convinced this is something we should do. I believe the same suite of tools also allows us to add debug info into the shaders, which could be interesting as well.

D32S8 is used, but it's unclear if Stencil is actually needed

It should default to D24S8 if supported and only use D32S8 when not. I've got some further issues around this because raster access is denied for some of these formats and required on the mobile renderer (where we currently don't use compute). Currently stencils aren't used, but there has been a lot of demand for this, so this was added in preparation. I don't have any insight to provide on whether this was smart or whether it makes more sense to have a separate stencil buffer when effects require one. My gut feeling says that it is better to use D32 whenever possible.

Bistro: Horrible optimization

Pretty sure this is the demo, not Godot (unless the importer really messes up). Godot is perfectly capable of loading a mesh once and drawing it multiple times; it doesn't require the mesh to be re-imported for each instance. I get the feeling it's the conversion to get to the end result. I ran into this as well when importing a scene I got from Clay that was sourced from a Unity project. I'm convinced the original scene did proper instancing, I can see this from the naming of entities, but somewhere this was lost. Then again the file went from FBX to glTF to Godot. Personally I think this requires us to create more example projects that are properly crafted for Godot, but who has the time... Obviously finding a way to "deduplicate" meshes during import would be a great improvement.

In Bistro, Setup Sky clears the RTT and then does nothing

It's actually not Setup Sky; it looks weird because we don't always do a good job adding these groupings. This empty pass is clearing the directional shadow map. Juan added this because it allows parallel rendering of shadows and GI on AMD hardware, some weird trick that triggers this; I don't have the details. I ran into it when working on updating the directional shadow maps so some of the cascades are only updated every other frame, and of course this clear broke the system. So if you turn on stepped shadows it actually doesn't do this anymore and instead clears only the quadrants that are updated.

Some textures get the same name (e.g. VoxelGI Cascade LightProbe Average Texture)

This can happen when textures are created for each viewport; we don't (currently) have a way to add a viewport ID into this. We should improve that.

Use of RD::get_singleton()

Weird. For readability I do load the pointer into a variable in many places (especially accessing the storage classes), but I was under the impression it didn't matter either way. Personally I don't mind the approach of obtaining the pointer and using it, it makes the code much easier to read.
Hi!
It's understandable but please note that at least for shadow maps, a pixel shader is rarely used. This sounds like it's trivial to check (if no alpha_test_needed and shadow_casting_pass then disable shadows)
What you are describing is hierarchical culling and (unless it's very specific for a game) it has been mostly abandoned (except for raytracing via BVH) because often you end up losing more from traversing the structures in a cache-unfriendly manner (indirections... this AABB contains children AABB) than what you gain from the culling. However the distance approach can be done in bulk (all objects laid out mostly contiguous in memory, updated in parallel)
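For illustration, the "done in bulk" part can look roughly like this (a sketch only, not Godot's actual culling code; the struct and function names are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure-of-arrays layout: everything the distance check needs is contiguous,
// so the loop below is cache friendly and trivially splittable across threads.
struct CullData {
    std::vector<float>   distance_sq;   // squared distance from camera, refreshed in bulk
    std::vector<float>   threshold_sq;  // per-object squared visibility threshold
    std::vector<uint8_t> visible;       // output flags
};

void cull_by_distance(CullData &d) {
    const size_t count = d.distance_sq.size();
    for (size_t i = 0; i < count; i++)  // no pointer chasing, no hierarchy traversal
        d.visible[i] = d.distance_sq[i] <= d.threshold_sq[i] ? 1 : 0;
}
```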
With the right shader code approaches, per instance materials should be supported. From a quick glance I had, Godot should already support this. The main issue with this way of laying out memory is that it is mostly compatible with interleaved vertex formats (e.g. pos/normal/uv), and quite hostile to non-interleaved formats (e.g. pos/pos/pos then another buffer with normal/normal/normal and another with uv/uv/uv)
Yes, but note the ~10% is for that effect alone. Don't expect a 10% overall FPS increase because a few passes have been optimized 🤣
YMMV on mobile.
It is true it's a killer on mobile but TBH there are not many ways to render a shadow map ATLAS and be performant on mobile. Unless you have one RTT per shadow map it doesn't perform well. TBH I just disable shadow maps for low and mid end devices.
Of note: On Mobile the SPIRV optimizer is important due to driver crashes. Unoptimized code often ends up crashing drivers because it contains code the driver didn't like.
D32 should be the default. On AMD, D24 is emulated and is internally a D32 buffer (in fact AMD engineers get moody when you mention D24, they only support it because of emulators and other legacy uses). D24 was only faster on old HW (DX10 class HW). Nowadays D32 should be preferred (are you guys doing reverse Z? It increases depth precision A LOT)
What I meant, and I cannot stress this enough: I am experienced, so I noticed the problem in Bistro after a few hours working on it. But a newbie won't even understand what is wrong. De-duplicating meshes is often very manual work. But they don't even know duplicated meshes are a bad thing. And since most Godot users are not experts, Godot should at least inform them when things look suspicious (hence why I recommended displaying an instance / mesh ratio and documenting what it is about). Half the battle is educating the user.
Ahh, yeah that makes sense.
A global monotonic counter would be enough.
It doesn't matter if it's inlined. Consider the following code:

```cpp
RD *g_singleton = new RD(); // It's a global variable

void RD::specialFunction()
{
    g_singleton = nullptr;
}

void myFunction()
{
    g_singleton->funcA();
    g_singleton->specialFunction();
    g_singleton->funcB();
}
```

Between funcA and funcB, specialFunction() was called. The compiler must load the data from the singleton's address again, because it changed (g_singleton was set to null).

Now consider that specialFunction() does NOT set g_singleton, but the function was defined somewhere else (i.e. in a different cpp file) the compiler cannot see. The compiler doesn't know whether specialFunction() modifies g_singleton or not (in C++ lingo this is called "unknown side effects"). Therefore it must assume specialFunction() does modify it, even if it really doesn't. With LTO (Link Time Optimization), the linker gets to see what
It depends on so many things that it's not guaranteed to make a difference; however, the practice should be to prefer the local variable, at the very least on the 'hot' paths (i.e. those that get called every frame for every object).
Interesting stuff.
Heya,
It should be smart enough to tell those situations apart so indeed, what I said only applies to depth pre-pass, for shadow maps it should only need to include the fragment shader if alpha scissor is enabled imho.
Hmm interesting, I'd imagine excluding large amounts of objects from checks would win but if this has already been disproven... :)
Yes this seems like a smart improvement.
I think the structure is certainly there, it's more that right now we're sorting objects by shader and material, so objects that share vertex and index buffers wouldn't be in order.
If I'm not mistaken all data is interleaved so that should be ok.
Yup, sometimes I think our users are expecting miracles and not realising that most good-looking games on mobile only look that good on flagship phones, and even then have highly customized shaders. Definitely for my main use case, VR, you can see that nearly everything that comes out on Quest is all baked lighting and uses none of this.
Well that might explain a few issues we've been having then :) Still I think it's worth including even on desktop.
I'm not sure, there was a discussion about reverse Z some time ago; clay will know that one better than me. So how does D32S8 compare? I'm guessing D24S8 was chosen because it's a 32-bit format. I don't know what the best way forward here is, but if we can flag whether a stencil buffer is even needed, I'm for using D32 whenever possible and D32S8 if we do need a stencil. It's also important to note that most XR platforms don't allow D24S8 or D32S8 to be used anyway (there is currently an override for this in OpenXR).
Owh 100% agree with you here, I'm just saying that it most likely wasn't a conscious thing but just how we got the assets. If we have a means by which to deduplicate meshes automatically I think this is a very worthwhile improvement on the importer.
Yes, that would make the names unique, but I think it's worth going the extra mile and numbering viewports and including that in here. It would be useful for seeing if the right buffers are used.
OK, so this is a case of me being told that the optimizer is good at doing something it isn't; I'm mostly just copying the existing style here :P I think this is something we should have a quick discussion about on the chat channels. Are you on the Godot rocket chat server yet?
You can do miracles, but miracles require a lot of work and manual hand tuning :)
Nowadays D24S8 is internally D32S8 and D24 is internally D32, therefore by choosing D32S8 you get better precision at the same cost. Plus, D32 allows for reverse Z and the quality difference is GIGANTIC. Reverse Z cannot be used on the following:
I cannot see how a tool would be able to fully automate this case (because the meshes do not look equal) but I can see a way to semi automate it:
I forgot to comment on this! Can you point me to where in the code this sort is made? I want to evaluate it (at a first glance, not sorting by vertex/index buffers sounds wrong).
I'd be keen to learn more about the depth buffer things. I was working on code that resolves depth buffers for mobile and still need to add a fallback, and this is where I ran into trouble with needing to do a copy with a raster shader: So we might be able to make some format optimisation here as well. I'm also busy exposing the render buffers class to extensions, and here I've merged the MSAA generation into this class instead of having it separate for mobile and vulkan, and ran into issues with the buffer format as well. So simplifying this further would be welcome: I need to look into reverse Z, I've heard that mentioned a few times now and haven't wrapped my head around why it is an improvement, sounds interesting.
Me neither, I don't know the import system well enough, but it's worth exploring and you make some good suggestions here. That said, I don't think we should wait on this and see if we can't manually improve some of the demos by deduplicating things. It's fairly "simple" to do, if you have two
It's done here: https://github.com/godotengine/godot/blob/master/servers/rendering/renderer_rd/forward_clustered/render_forward_clustered.cpp#L1690
So all opaque objects are sorted by a key, and that key is made up of the material index, shader index, priority, and some other bits and bobs.
First, go to sleep. It's late for you already!
When it comes to Mobile, the way Godot is doing things is a mix of deferred and forward, and that is admittedly not the best for Mobile. Bigger changes will be needed. It's a bit of a struggle with tradeoffs on many fronts, so this isn't easy.
I can give you the TL;DR:
So for …
Yes, but for best results it needs to be implemented when building the Projection matrix (so the shaders don't even need to change). The change itself is easy (minus a few tries until you get it right). It is possible to create it from scratch (best precision) or to convert an existing one. The devil is in the details:
This is a nice visual writeup btw: https://developer.nvidia.com/content/depth-precision-visualized
Yeah, it sounds pretty straightforward. I think we might get tripped up where OpenGL and Vulkan code share some of the logic, as it seems we can't apply this in GLES, and there might be an impact around XR where we can get a projection matrix from the XR runtime, but it wouldn't be hard to adjust those. I think OpenXR no longer does this and we have control over the projection matrix calculation as well. I think the biggest issue we'll run into is people's custom shaders breaking due to your point no 3. Though I don't think a lot of people use this (I do for a number of tricks in XR where I need to draw things on the near plane).
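For reference, a minimal sketch of what a reverse-Z projection can look like for Vulkan's [0, 1] clip-space depth range. This is an illustrative infinite-far-plane variant, not Godot's actual Projection code; it also assumes the depth compare is flipped to GREATER and the depth buffer is cleared to 0.0:

```cpp
#include <array>
#include <cmath>

using Mat4 = std::array<std::array<float, 4>, 4>; // column-major storage, GLM-style

// Reverse-Z perspective projection with an infinite far plane (sketch).
// The near plane maps to depth 1.0 and infinity maps to depth 0.0, which is
// where the precision win on a D32 float depth buffer comes from.
Mat4 make_reverse_z_infinite_projection(float fovy_rad, float aspect, float z_near) {
    const float f = 1.0f / std::tan(fovy_rad * 0.5f);
    Mat4 m = {};       // all zeros
    m[0][0] = f / aspect;
    m[1][1] = f;
    m[2][2] = 0.0f;    // no far-plane term: z_clip becomes just z_near (see below)
    m[2][3] = -1.0f;   // w_clip = -z_view (right-handed view space, camera looks down -Z)
    m[3][2] = z_near;  // z_ndc = z_near / -z_view => 1 at the near plane, -> 0 at infinity
    return m;
}
```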
Some options:
Please keep comments on topic. This is not the place to receive performance advice for your game. While contributors are willing to comment and help out, it really pollutes the issue and makes it harder for us to use it for its intended purpose.
As the original porter of the project, I'll answer the performance complaints regarding the Bistro: Regardless, the scene is so small that anything beyond the most cursory optimization wasn't necessary. I wouldn't consider not instancing a handful of objects to be horrible optimization considering the size and nature of the scene. I appreciate the analysis given to Godot in this post, and although I'm no low-level rendering expert, the rest seems pretty reasonable. Thanks for taking the time to put it together!
How did I not know about this proposal until now? Most of these points have already been addressed, but I still wanted to add my two cents for future reference.
I agree with most of your points here, the one thing I want to comment on is the usage of discard in the early depth pass.
While the Bistro scene itself was never meant to be an example of a highly optimized scene, it is a good showcase for some larger problems. The most pressing one is that we don't have any kind of deduplicator when meshes are imported. What makes this even worse is that Blender's glTF exporter turns all instances into copies, which is absolutely terrible from a performance point of view. Depending on the CPU cost, an even better option might be to just store a mesh's hash in Godot's internal mesh format. This way we could check if an identical mesh has already been loaded during runtime, which could A) save us from having to read the mesh data from disk and B) work across multiple scenes, which could be very useful since the importer would only be able to remove duplicates within the current import. So if someone uses an asset library inside Blender and places e.g. the same column in multiple different scenes, we could detect it with this system.
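For illustration, a hash-based deduplicator at import time could look roughly like this (the types and names are hypothetical, not Godot's importer API; a real version would also compare the actual buffers on a hash hit to rule out collisions):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// 64-bit FNV-1a over the raw mesh bytes (vertex + index data).
uint64_t hash_bytes_fnv1a(const uint8_t *data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Maps mesh content hash -> index of the first mesh seen with that content, so a
// later identical mesh can be replaced by an instance of the existing one.
struct MeshDeduplicator {
    std::unordered_map<uint64_t, int> seen;

    // Returns the index of an existing identical mesh, or -1 if this one is new.
    int find_or_register(const std::vector<uint8_t> &mesh_bytes, int mesh_index) {
        const uint64_t h = hash_bytes_fnv1a(mesh_bytes.data(), mesh_bytes.size());
        auto it = seen.find(h);
        if (it != seen.end())
            return it->second; // full byte comparison on hash hit omitted in this sketch
        seen.emplace(h, mesh_index);
        return -1;
    }
};
```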
Reverse depth has been something I have been advocating for a year now. The problem was that implementing it would suddenly switch all depth comparisons, which would affect user code and thus break compat, which is something we usually try to avoid.
Note that reverse Z has been merged in 4.3 (see blog post).
I'm curious as to how to provide the best usability for culling small objects from shadow passes. There are a few things I have in mind:
I think this is fixed with Blender 4.0 and up:
I have a few comments on that based on my experience (Note: By "Shadow Map" I mean cascade or light. Basically any independent "quadrant"):
Please note that:
From a performance point of view, this means that you would need a visibility threshold per shadow map. The more thresholds you have, the higher the CPU cost to evaluate them all. Furthermore, since Godot is flexible, this would have to be handled dynamically and most likely through heap allocations, e.g.:

```
for i in objects
    for j in shadow_maps
        object[i].visibility[j] = shadow_camera[j].distance( object[i] )
```

While this can be optimized, it tends to be a headache because everything is dynamic.
From an artistic point of view, you usually want the object's visibility to be somewhat linked with the shadow casting's visibility. Unless you're making a God of War like game (the original PS2/3 trilogy) or another "on rails" game where you are in complete control of camera angles and can therefore fine tune every shadow threshold to meet performance demands, the user will want something that works in general conditions. Otherwise you may end up with big objects that for some reason are not casting shadows while everything else still is. Linking the object's visibility somewhat like this:

```cpp
// When loading mesh:
obj.visibility_threshold = 1000; // 1000 meters. Set by user.
float default_threshold = obj.visibility_threshold * 0.75f; // At least default value. Can be overridden.
if( has_overrides )
    default_threshold = override( obj );
obj.shadow_visibility_threshold = default_threshold * global_bias;
```

This means that you know the object is designed to be really small if 1000m away, and it should reasonably still be small enough to cast any visible shadow in the last 250 meters before it disappears.
While one can get fancy with these things, they increase the maintenance burden due to increased code complexity and increase CPU costs (because calculations need to be evaluated and stored somewhere), while usually the returns are too diminishing to be worth the trouble.
Btw I tested master branch 4.2.dev 6588a4a29af1621086feac0117d5d4d37af957fd which is basically from Friday.
I tested Bistro and the TPS demo.
Note: I couldn't "play" the Bistro demo (it always shows a grey screen), so I evaluated the rendering in the Editor. I could play the TPS demo.
Every pixel shader is using Discard (Fixed) godotengine/godot#79865
***YOU CANNOT USE GLSL'S DISCARD CARELESSLY.***
Almost every pixel shader has this snippet needlessly enabled:
This applies even when alpha = 1. This is very bad.
Even Alpha Blended objects have this snippet! (which makes very little sense)
I spoke with Juan. He told me this is likely a bug. If he recalls correctly this feature should only be used for things like grass and foliage (which makes sense).
Here's the thing. When a pixel shader uses discard (it doesn't matter if the shader never actually hits it, just its presence alone is enough):
The ideal rendering order would be as follows:
Depth Prepass:
Normal (Forward) pass:
Recommendation:
USE_OPAQUE_PREPASS
unless the object is opaque and alpha != 1.
Depth Only passes (e.g. Depth prepass + Shadow Maps) are using pixel shaders
When Godot only renders to depth (not colour) the PSOs should not contain a Pixel Shader.
In theory the GPU and/or the driver should be clever enough to not run a PS. So this may not do anything.
However the driver needs to analyze whether the Pixel Shader has side effects, and it may choose not to analyze the PSO and assume we are doing the efficient thing (which we are currently not). Most likely mobile falls in this category while NV and AMD do the former (I dunno what Intel is doing).
Even if the driver correctly analyzes Godot's pixel shader, it's a waste of time when the driver needs to parse the shader at PSO-creation time.
The only reasons to set a Pixel Shader in the PSO in this scenario are (aka side effects):
The Depth Prepass sometimes draws exclusively to the depth buffer (i.e. TPS demo in the main menu) but most of the time it seems to render normals to the GBuffer (this is fine and is a good practice). Obviously when the latter happens, a Pixel Shader is needed. No objections there.
Even when the driver properly disables the pixel shader, Godot shouldn't even waste CPU time compiling a shader from GLSL to SPIRV that is never going to be used. It should be a nullptr.
Shadow maps appear to have no frustum culling (or it's broken) or wrong use of depth clamp
In the TPS demo I saw an alarming number of objects being rendered behind the camera during the shadow mapping pass.
If depth clamp were to be enabled this would make sense (a common trick to increase depth precision in shadow maps). However depth clamp is turned off.
Frustum culling debugging can be tricky because the following could be happening:
This is the least worrisome issue because there may be nothing wrong (although we could increase precision if we used depth clamp) and I don't see this problem happening in scenes like Bistro.
Recommendation:
Small objects in Shadow Maps and LOD
This is a high level feature & optimization rather than a Vulkan low level one. The following is from Bistro:
We have a "something" that is just 78 vertices, occupies a few pixels, and is even rejected in the shadow map (because it's inside a building). We could get rid of this draw entirely and no visual changes would occur.
The way to solve it is:
calculated_pixels < obj_threshold * camera_bias, do not render it.
Of course this algorithm CAN cause visual changes (it's LOD after all), but the beauty of it is that the global bias can make the algorithm adapt to all sorts of hardware: use a large bias value for high end HW, use a low bias for low end HW (this mimics "draw distance" from 90's consoles).
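For illustration, a rough sketch of the screen-coverage test described above (the function and parameter names are hypothetical, not Godot's code):

```cpp
#include <cmath>

// Returns false when an object's projected footprint is too small to bother
// rendering (e.g. into a shadow map). Approximates coverage from the object's
// bounding-sphere radius for a perspective camera.
bool should_render(float bound_radius, float distance_to_camera,
                   float viewport_height_px, float fov_y_rad,
                   float obj_threshold_px, float camera_bias) {
    // Projected radius in pixels: angular size relative to the half-FOV,
    // scaled by half the viewport height.
    const float proj_radius_px =
        (bound_radius / (distance_to_camera * std::tan(fov_y_rad * 0.5f))) *
        (viewport_height_px * 0.5f);
    const float calculated_pixels = 3.14159265f * proj_radius_px * proj_radius_px;
    return calculated_pixels >= obj_threshold_px * camera_bias;
}
```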
Binds one VB/IB per draw call
This one was the most painful for me to watch.
When two (completely unrelated) objects share the exact same vertex format, they can be placed in the same vertex and index buffer.
That's what firstIndex, vertexOffset and firstInstance in vkCmdDrawIndexed are for.
Right now Godot does the following:
It should be more like:
Unfortunately this requires a lot of plumbing and I need to get a lot more familiar with the engine before we can do this.
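For illustration, this is roughly what "bind the shared buffers once, then draw with offsets" looks like in raw Vulkan (a sketch, not Godot's RenderingDevice code):

```cpp
#include <vulkan/vulkan.h>

// Two (or more) meshes that share the same vertex format live in one vertex
// buffer and one index buffer; each draw only passes offsets into them.
struct SubMesh {
    uint32_t index_count;
    uint32_t first_index;    // offset into the shared index buffer, in indices
    int32_t  vertex_offset;  // offset into the shared vertex buffer, in vertices
};

void draw_shared_meshes(VkCommandBuffer cmd, VkBuffer shared_vb, VkBuffer shared_ib,
                        const SubMesh *meshes, uint32_t mesh_count) {
    const VkDeviceSize vb_offset = 0;
    vkCmdBindVertexBuffers(cmd, 0, 1, &shared_vb, &vb_offset);
    vkCmdBindIndexBuffer(cmd, shared_ib, 0, VK_INDEX_TYPE_UINT32);

    for (uint32_t i = 0; i < mesh_count; i++) {
        // No rebinding between draws; firstIndex/vertexOffset select the sub-range.
        vkCmdDrawIndexed(cmd, meshes[i].index_count, 1,
                         meshes[i].first_index, meshes[i].vertex_offset, 0);
    }
}
```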
Sky & Tonemap are doing fullscreen quad instead of fullscreen tri (Fixed)
This is the easiest & quickest to fix. Various fullscreen passes are being done using a quad (2 triangles, 6 vertices).
It is way better to use a fullscreen triangle.
The reason is that GPUs have problems along the diagonal for two reasons: how pixels are collected for SIMD execution, and cache locality is broken.
This normally optimizes perf by 10-18% (for those passes, not overall).
I want to do this one because it's very easy for me and I want to keep getting familiar with the engine and the code.
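For reference, the oversized triangle only needs three clip-space vertices; a sketch of the data is below (the actual fix may generate them differently, e.g. from gl_VertexIndex with no vertex buffer at all):

```cpp
// One triangle spanning (-1,-1)..(3,3) in clip space: the visible viewport
// (-1..1) is fully covered and there is no diagonal seam like with a quad.
struct Vec2 { float x, y; };

static const Vec2 FULLSCREEN_TRI_POS[3] = {
    { -1.0f, -1.0f },  // bottom-left corner of the viewport
    {  3.0f, -1.0f },  // far beyond the right edge
    { -1.0f,  3.0f },  // far beyond the top edge
};

// Matching UVs: (0,0), (2,0), (0,2) interpolate to 0..1 across the visible region.
static const Vec2 FULLSCREEN_TRI_UV[3] = {
    { 0.0f, 0.0f },
    { 2.0f, 0.0f },
    { 0.0f, 2.0f },
};
```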
Passes are not amalgamated (Done)
Ideally all rendering should look like this:
Of course sometimes we have to breakup the passes for rendering reasons.
But I saw that Godot:
There is no reason for Godot not to put all 3 in the same block.
This feature doesn't just impact Mobile (mobile is the most impacted by this), but also all modern desktop GPUs after Maxwell (NVIDIA) & Navi (AMD), since they became hybrid tilers (they are not TBDRs, but they do benefit from load/store actions as they try to keep data on-chip as much as possible).
This happens because Godot is written with this clean pattern:
However in OgreNext we do the following:
Therefore another_begin() gets the chance to evaluate if what's already open needs closing, and if not, just early out and continue doing more work.
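A rough sketch of that lazy-begin idea (illustrative only, not Godot's or OgreNext's actual code):

```cpp
// Instead of begin()/end() pairs that always close the render pass, the next
// begin() decides whether the currently open pass can simply continue.
struct PassManagerSketch {
    bool pass_open = false;
    int  open_framebuffer = -1;

    void another_begin(int framebuffer) {
        if (pass_open && open_framebuffer == framebuffer)
            return;                 // same target: keep the pass open, skip store/load
        if (pass_open)
            end_pass();             // only close when the target actually changes
        begin_pass(framebuffer);
        pass_open = true;
        open_framebuffer = framebuffer;
    }

    void begin_pass(int framebuffer); // would record vkCmdBeginRenderPass (elsewhere)
    void end_pass();                  // would record vkCmdEndRenderPass   (elsewhere)
};
```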
Gaussian Glow performs inefficient pipeline barriers & lack of barrier solver (Done)
This is a follow up of the previous point.
Godot is plagued (not in a bad way) with the following pattern:
The problem right now is that in compute_list_end(), Godot will turn every texture back to a state that can be read again by pixel shaders.
Hence the following happens: each texture is transitioned from VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL to VK_IMAGE_LAYOUT_GENERAL, dispatched against, and then transitioned back to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
The result is that between compute calls we end up with 2 vkCmdPipelineBarrier calls between dispatches instead of just 1.
A solution for this could be a Barrier Solver.
Lack of barrier solver (Done)
A barrier solver is one of multiple possible solutions, but one I'm familiar with (because I wrote one for OgreNext).
A BarrierSolver is a class that performs tasks similar to what the D3D11 API does behind the scenes:
The idea is that the user of a BarrierSolver declares intent. e.g.
Therefore it's no longer the job of the compute pass to restore the texture back to a read layout.
The next pass declares that it wants to use a texture for reading and calls resolve() with the right parameters, then finally execute() once it's done with all the resolve() calls, all before it starts generating rendering commands.
If the next pass is a compute pass instead, it declares it wants to use the texture for UAV, and since it is already in that state, resolve() does nothing. And if execute() sees there's nothing to do, vkCmdPipelineBarrier isn't called either.
OgreNext's BarrierSolver header and source files.
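For illustration, a minimal sketch of the "declare intent, then resolve/execute" flow (the names and types here are made up, not Godot's or OgreNext's API):

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Callers declare the access they *intend* for each resource; execute() emits at
// most one transition per resource instead of eagerly restoring state after use.
enum class Access : uint8_t { ShaderRead, ShaderWrite, ColourWrite, DepthWrite };

struct BarrierSolver {
    std::unordered_map<uint64_t, Access> current_state;      // resource id -> state
    std::vector<std::pair<uint64_t, Access>> pending;         // transitions to emit

    void resolve(uint64_t resource_id, Access wanted) {
        auto it = current_state.find(resource_id);
        if (it != current_state.end() && it->second == wanted)
            return;                                            // already in the right state
        pending.emplace_back(resource_id, wanted);
        current_state[resource_id] = wanted;
    }

    void execute(/* command buffer */) {
        // A real implementation would batch everything in `pending` into a single
        // vkCmdPipelineBarrier call (or none, if the list is empty).
        pending.clear();
    }
};
```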
Text is drawn w/ each letter as one drawcall of 6 vertices (Done: godotengine/godot#92797)
In TPS demo, this happens for the FPS text.
I know it's debug text but really? 1 drawcall per letter?
The editor also draws the entirety of its text in the same way. This is bad.
SPIR-V Optimizer is turned off
This is a controversial topic. The GPU driver performs its own optimizations, so some say the SPIR-V optimizer is a pointless optimization pass (that may even hurt performance).
Others say the opposite is true.
There are a few facts we can analyze:
D32S8 is used, but it's unclear if Stencil is actually needed
It didn't seem like Godot is actually using the Stencil, but it does request one.
If it's not actually used, Godot should use a D32 depth buffer. This saves memory and bandwidth, and improves performance.
And have an API that asks every plugin (including user code) whether it will need a Stencil buffer for the rendering Window (and have another for every RTT that needs creating).
Bistro: Horrible optimization
I don't know if this issue comes from the original Bistro scene from Amazon or it was a human mistake when porting the scene to Godot, or if the problem lies in whatever tool(s) was used to import it into Godot.
But basically I can see the exact same furniture with the exact same geometry duplicated over and over again, placed with the transform already embedded into the vertices, instead of creating a mesh template placed via Godot nodes, each with its own transform.
This prevents any sort of batching (and wastes VRAM). Each lamppost has its own duplicate in memory:
(video attachment: 2023-07-24.12-59-40.mkv.mp4)
It can be seen in Blender by loading the glb models:
If it's human error, then the only thing I can say is that Godot needs better tools to detect the situation:
p_instances in RenderingDeviceVulkan::draw_list_draw) but this value is almost always 1 in the TPS and Bistro demos. We should report the average value per frame as a performance metric. The higher it is, the better the performance could be. Low values will be correlated with the Instance / Mesh ratio above.
In Bistro, Setup Sky clears the RTT and then does nothing
There is a Setup Sky step that performs a vkCmdCopyBuffer, then opens the pass with Clear and immediately closes.
Later on that render target is used in the depth pre-pass (with load action Load instead of load action Clear):
Suggestion:
Some textures get the same name (e.g. VoxelGI Cascade LightProbe Average Texture)
Godot assigns the exact same name to multiple textures, which makes it harder to debug.
Simply numbering them e.g. "VoxelGI Cascade LightProbe Average Texture #2" would fix the problem.
Use of RD::get_singleton()
Godot is plagued with the following pattern:
Each RD::get_singleton() call cannot be optimized by the C++ compiler because of side effects, often even with LTO. This is because:
The compiler would need to guarantee that whatever happens in callA() never alters the value in RD::singleton. If it doesn't know, then it must refetch from RD::get_singleton (it doesn't matter if the call gets inlined or not, because the problem is with the contents of the global variable RD::singleton). This would happen with every call.
Sometimes you get lucky and with LTO the compiler notices callA() never modifies RD::singleton and thus can reuse the returned value for callB(). But the more complex a call is, the less likely this is to happen.
The following pattern should be preferred:
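The original code sample is not preserved in this copy; a sketch of the preferred form follows (the RD class and the funcA/funcB/funcC calls here are stand-ins, not the real RenderingDevice API):

```cpp
// Stand-in for RenderingDevice, just enough to show the pattern.
struct RD {
    static RD *singleton;
    static RD *get_singleton() { return singleton; }
    void funcA();
    void funcB();
    void funcC();
};

void myFunction()
{
    RD *rd = RD::get_singleton(); // fetched once; `rd` can stay in a register
    rd->funcA();                  // no re-load of RD::singleton between these calls,
    rd->funcB();                  // even though funcA/funcB have unknown side effects
    rd->funcC();
}
```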
I see that Godot uses this pattern not just for RenderingDevice, but for a lot of other things. The same analysis applies.
Potentially-inefficient loops
A similar thing to RD::get_singleton() happens with loop conditions. This is a snippet from LightStorage::update_light_buffers:
This is not the same as:
The reason is that in the latter we can guarantee to the compiler that local_omni_light_count & local_spot_light_count will not change between iterations.
While in the former, we have to cross our fingers that the optimizing compiler and LTO can prove omni_light_count & spot_light_count do not change on every iteration.
(I can't even fix this immediately and submit a PR because I'd have to analyze if the algorithm actually expects to handle omni_light_count & spot_light_count changing).
For cold paths, this optimization hampers code readability more than it can optimize performance, but for hot paths this is necessary.
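A sketch of the two loop forms being compared (the names follow the text above, but this is not the actual LightStorage::update_light_buffers code):

```cpp
#include <cstdint>

struct LightStorageSketch {
    uint32_t omni_light_count = 0;

    void process_omni_light(uint32_t i); // defined elsewhere; unknown side effects

    void update_former() {
        // The condition re-reads the member every iteration: the compiler cannot
        // prove process_omni_light() leaves omni_light_count untouched.
        for (uint32_t i = 0; i < omni_light_count; i++)
            process_omni_light(i);
    }

    void update_latter() {
        // Copy the count into a local first; it is now provably loop-invariant.
        const uint32_t local_omni_light_count = omni_light_count;
        for (uint32_t i = 0; i < local_omni_light_count; i++)
            process_omni_light(i);
    }
};
```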
VkPipelineCache may inhibit multithreading compilation performance
From what I can see, right now Godot is not calling vkCreateGraphicsPipelines from multiple threads, but the intention to do this in the future exists. So this will become relevant eventually.
The explanation is here. Pasting the relevant part:
This is why vkMergePipelineCaches exists: if we have 8 threads all calling vkCreateGraphicsPipelines in parallel at the same time, and all 8 threads use the same cache, the following scenarios can happen:
Vector & LocalVector
Vector::push_back & LocalVector::push_back have controversial behaviors:
Vector::push_back performs resize(1+n). Apparently Godot calls this "tight", which is fine if you don't expect the Vector to grow much more than once or twice; but it can too easily accidentally degrade into O(N²) behavior, which is why no standard vector container implements it. A vector that pushes 1000 elements without reserve(1000) will regrow 1000 times, which means the copies will be 1 + 2 + 3 + ... + 999. This is described by the N * (N + 1) / 2 formula, which in Big O notation is just another way of saying N² with extra steps.
LocalVector::push_back performs resize(MAX(1, n*2)) unless the "tight" parameter is explicitly set, citing the comments: "Otherwise, it grows exponentially (the default and what you want in most cases)."
Unfortunately, this is also problematic. The growth factor passed to resize should approach the golden ratio.
In practice everyone uses 1.5x because it's easy to calculate and still has a nice reuse/reallocation ratio. This means resize(MAX(1,n+n/2)). That is, we grow N by 1.5x, instead of 2.0x.
This is not arbitrary, an explanation can be found in Stack Overflow and it has to do with memory fragmentation and reclaiming memory.
The following table is what happens when you grow at 2x rate:
The problem here is that when we have 32 elements, next time we will grow to 64. So far we've freed 15 bytes. We can never reuse those free blocks. We always have to ask the OS for more memory pages.
But with a 1.5x rate things change:
By the time we have 12 elements, next time we will grow to 18. So far we've freed 19 bytes. Assuming they were contiguous, we can reuse that free area and place the 18 elements there; instead of asking the OS to give us more memory.
Every 6 allocations or so (approx), a vector growing 1.5x will allow allocators to reuse old memory blocks.
Needless to say, this math works for everything; so if you can think of a GPU algorithm growing VRAM at a 2x rate when it could do 1.5x, it is time to rethink whether 1.5x is better.
The only advantage of 2x is that it requires fewer reallocations overall. But that is often negated by the need to ask for more pages to the OS and the inability to reuse freed blocks properly (which leads to fragmentation).
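A minimal sketch of a 1.5x growth policy (illustrative only; it ignores constructors, copy-on-write, and error handling, unlike Godot's real containers):

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

struct GrowingBuffer {
    int   *data     = nullptr;
    size_t size     = 0;
    size_t capacity = 0;

    void push_back(int value) {
        if (size == capacity) {
            // Grow by 1.5x (capacity + capacity/2) instead of doubling, so the sum
            // of previously freed blocks can eventually be reused by the allocator.
            const size_t new_cap = capacity < 2 ? capacity + 1 : capacity + capacity / 2;
            int *new_data = static_cast<int *>(std::malloc(new_cap * sizeof(int)));
            if (data) {
                std::memcpy(new_data, data, size * sizeof(int));
                std::free(data);
            }
            data = new_data;
            capacity = new_cap;
        }
        data[size++] = value;
    }
};
```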
UPDATE: It was pointed out to me that Vector does resize(n + 1), but it delegates into CowData::resize, which performs exponential resizing.
Suggestion
I don't know how much Godot code actually expects the Vector/LocalVector to be tight. Thus by blindly changing the behavior we could do more harm than good if the code is actually aware of this.
The least intrusive option would be to:
Cheers