Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance problems I found #7366

Open
darksylinc opened this issue Jul 24, 2023 · 32 comments
Open

Performance problems I found #7366

darksylinc opened this issue Jul 24, 2023 · 32 comments

Comments

@darksylinc
Copy link

darksylinc commented Jul 24, 2023

Btw I tested master branch 4.2.dev 6588a4a29af1621086feac0117d5d4d37af957fd which is basically from Friday.

I tested Bistro and the TPS demo.

Note: I couldn't "play" the Bistro demo (it always shows a grey screen), so I evaluated the rendering in the Editor. I could play the TPS demo.

Every pixel shader is using Discard (Fixed) godotengine/godot#79865

I wanted to start with this one because it was the biggest one that raised all red flags:

***YOU CANNOT USE GLSL'S DISCARD CARELESSLY.***

Almost every pixel shader has this snippet needlesly enabled:

#ifdef USE_OPAQUE_PREPASS
	if (alpha < scene_data.opaque_prepass_threshold) {
		discard;
	}
#endif

This applies even when alpha = 1. This is very bad.

Even Alpha Blended objects have this snippet! (which makes very little sense)

I spoke with Juan. He told me this is likely a bug. If he recalls correctly this feature should only be used for things like grass and foliage (which makes sense).

Here's the thing. When a pixel shader uses discard (it doesn't matter if the shader never actually hits it, just its presence alone is enough):

  1. Early Depth Rejection is disabled.
    • This negates any benefit from doing the depth pre-pass!
    • Desktop suffers a lot but mobile is just a no-go. Discard is very expensive for TBDRs.
    • Some advanced GPUs are able to avoid this by doing two Z evaluations: An early depth test before running the pixel shader, and then a late depth test to write the depth. This however increases ROP usage, ROP contention (which is bad if many triangles overlap on the same pixels), and increases memory Bandwidth. But at least you get Early Z to prevent expensive pixel shaders.
  2. Hi-Z (Hierarchy Z) and Z compression may be disabled (potentially forever until the next depth buffer clear; highly depends on HW). See depth in depth for an explanation of how Depth Buffers work. Fabian Giesen also talks about HiZ. TL;DR rather than storing depth per pixel, (among many other things) GPUs store the triangle's plane equation for a whole tile. They can't do that if the depth pixels are arbitrarily rejected on a per pixel basis.

The ideal rendering order would be as follows:

Depth Prepass:

  1. Render all Opaque objects
  2. Render all objects that need discard (debatable, it may be better to not render these here)

Normal (Forward) pass:

  1. Disable Depth Writes for all draws from now on
    • Godot correctly does this (yay!)
  2. Render all Opaque objects
  3. Render all objects that need discard
  4. Render transparents

Recommendation:

  1. Godot should not define USE_OPAQUE_PREPASS unless the object is opaque and alpha != 1.
  2. Another shader variant needs to be used for this.

Depth Only passes are using (e.g. Depth prepass + Shadow Maps) are using pixel shaders

When Godot only renders to depth (not colour) the PSOs should not contain a Pixel Shader.

In theory the GPU and/or the driver should be clever enough to not run a PS. So this may not do anything.

However the driver needs to analyze if the Pixel Shader has side effects; and may chose not to analyze the PSO and assume we are doing the efficient thing (which we are currently not). Most likely mobile falls in this category while NV and AMD do the former (I dunno what Intel is doing).

Even if the driver correctly analyzes Godot's pixel shader, it's a waste of time when the driver needs to parse the shader at PSO-creation time.

The only reasons to set a Pixel Shader in the PSO in this scenario are (aka side effects):

  1. The PS must use discard (e.g. foliage, grass)
  2. The PS writes to an UAV (Storage Buffer / Storage Textures in Vulkan lingo) or an ROV (needs Vulkan extension, Godot is not using them)

The Depth Prepass sometimes draws exclusively to the depth buffer (i.e. TPS demo in the main menu) but most of the time it seems to render normals to the GBuffer (this is fine and is a good practice). Obviously when the latter happens, a Pixel Shader is needed. No objections there.

Even when the driver properly disables the pixel shader, Godot shouldn't even waste CPU time compiling a shader from GLSL to SPIRV that is never going to be used. It should be a nullptr.

Shadow maps appear to have no frustum culling (or it's broken) or wrong use of deph clamp

In the TPS demo I saw an alarming number of objects being rendered behind the camera during the shadow mapping pass.

If depth clamp were to be enabled this would make sense (a common trick to increase depth precision in shadow maps). However depth clamp is turned off.

Frustum culling debugging can be tricky because the following could be happening:

  1. Godot is doing Frustum Culling on the main cameras but not on the shadow map cameras.
  2. The math/intersection tests are wrong or incomplete.
  3. The objects' AABB are oversized or they hit and edge case.

This is the least worrysome issue because there may be nothing wrong (although we could increase precision if we used depth clamp) and I don't see this problem happening in scenes like Bistro.

Recomendation:

  1. Investigate / double check frustum culling
  2. Implement depth clamp for shadow maps (it's not just about turning it on, the shadow map camera needs to be tuned for this)

Small objects in Shadow Maps and LOD

This is a high level feature & optimization rather than a Vulkan low level one. The following is from Bistro:

ShadowMapSmall 01

We have a "something" that is just 78 vertices, occupies a few pixels, and is even rejected in the shadow map (because it's inside a building). We could get rid of this draw entirely and no visual changes would occur.

The way to solve it is:

  1. All objects should have a distance_to_camera threshold (I already suggested this to Juan in 2019, he may have implemented it). If we know the camera's parameters, the object's AABB and the distance to camera, we can roughly approximate the amount of pixels it occupies.
  2. If calculated_pixels < obj_threshold * camera_bias, do not render it.
  3. Godot can autocalculate the obj_threshold given various parameters: AABB, number of vertices, and even location (an object inside a building is different from an object placed outdoors). Just trace a few rays in all directions and if many of them hit something, then it must be indoors.
  4. User can override the per-object threshold.

Of course this algorithm CAN cause visual changes (it's LOD after all), but the beauty of it is that global bias can make the algorithm adapt to all sorts of hardware: Use a large bias value for high end HW, use low bias for low end HW (this mimics "draw distance" from 90's consoles).

Binds one VB/IB per draw call

This one was the most painful for me to watch.

When two (completely unrelated) objects share the exact same vertex format, they can be placed in the same vertex and index buffer.

That's what firstIndex, vertexOffset and firstInstance in vkCmdDrawIndexed are for.

Right now Godot does the following:

vkCmdBindVertexBuffers( ... );
vkCmdBindIndexBuffer( ... );
vkCmdPushConstants( ... );
vkCmdBindDescriptorSets( ... );
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, 0, 0);

vkCmdPushConstants( ... );
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, 0, 0);

vkCmdBindVertexBuffers( ... );
vkCmdBindIndexBuffer( ... );
vkCmdPushConstants( ... );
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, 0, 0);

It should be more like:

vkCmdBindVertexBuffers( ... );
vkCmdBindIndexBuffer( ... );
vkCmdPushConstants( ... );
vkCmdBindDescriptorSets( ... );
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, baseVertex, baseInstance);
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, baseVertexB, baseInstanceB);
vkCmdDrawIndexed(dl->command_buffer, to_draw, p_instances, dl->validation.index_array_offset, baseVertexC, baseInstanceC);

Unfortunately this requires a lot of plumbing and I need to get a lot more familiar with the engine before we can do this.

Sky & Tonemap are doing fullscreen quad instead of fullscreen tri (Fixed)

This is the easiest & quickest to fix. Various fullscreen passes are being done using a quad (2 triangles, 6 vertices).

It is way better to use a fullscreen triangle.

The reason is that GPUs have problems in the diagonal for two reasons: how pixels are collected for SIMD execution and cache locality is broken.

This normally optimizes perf by 10-18% (for those passes, not overall).

I want to do this one because it's very easy for me and I want to keep getting familiar with the engine and the code.

Passes are not amalgamated (Done)

Ideally all rendering should look like this:

vkCmdBeginRenderPass( LOAD_ACTION_CLEAR );
// ... all draw calls ...
vkCmdEndRenderPass( {STORE_ACTION_STORE, STORE_ACTION_DONT_CARE} );

Of course sometimes we have to breakup the passes for rendering reasons.

But I saw that Godot:

  1. Renders opaques in its own RenderPass block.
  2. Then Sky alone in its own RenderPass block.
  3. Then 3D Transparent Pass also in its own RenderPass block.

There is no reason for Godot not to put all 3 in the same block.

This feature doesn't just impact Mobile (mobile is the most impacted by this), but also all modern desktop GPUs after Maxwell (NVIDIA) & Navi (AMD). Since they became hybrid tilers (they are not TBDRs, but they do benefit from load/store actions as they try to keep data on-chip as much as possible).

This happens because Godot is written with this clean pattern:

something_begin( cmd_list );
something_do_stuff( cmd_list );
something_end( cmd_list );

another_begin( cmd_list );
another_do_stuff( cmd_list );
another_end( cmd_list );

present();

However in OgreNext we do the following:

something_begin( cmd_list );
something_do_stuff( cmd_list );

another_begin( cmd_list ); // Will close anything currently open
another_do_stuff( cmd_list );

present(); // Will close anything currently open

Therefore another_begin() gets the chance to evaluate if what's already open needs closing, and if not, just early out and continue doing more work.

Gaussian Glow performs inefficient pipeline barriers & lack of barrier solver (Done)

This is a follow up of the previous point.

Godot is plagued (not in a bad way) with the following pattern:

void runCompute();
{
	rd->compute_list_begin();

	for( i < n ) {
		rd->compute_list_bind_compute_pipeline( compute_list, compute_pso );
		rd->compute_list_bind_uniform_set( compute_list, compute_pso, ... );
		rd->compute_list_set_push_constant(compute_list, &push_constant, sizeof(SDFGIShader::IntegratePushConstant));
		rd->compute_list_dispatch_threads( compute_list, ... );
	}

	rd->compute_list_end( RD::BARRIER_MASK_COMPUTE );
}

runCompute();
runCompute(); // Do it again

The problem right now is that in compute_list_end(), Godot will turn back every texture back to a state that can be read again for pixel shaders.

Hence the following happens:

  1. Inside runCompute (1st call), texture_A gets turned from VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL to VK_IMAGE_LAYOUT_GENERAL
  2. Before runCompute returns, texture_A gets turned back into VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.
  3. Inside runCompute (2nd call), the same thing happens as in point 1.

The result is that between compute calls we end up with 2 vkCmdPipelineBarrier between dispatches instead of just 1.

A solution for this could be a Barrier Solver.

Lack of barrier solver (Done)

A barrier solver is one of multiple possible solutions, but one I'm familiar with (because I wrote one for OgreNext).

A BarrierSolver is a class that performs similar tasks that D3D11 API does behind the scenes:

  1. Most resources don't need tracking: Normal textures, vertex buffers, etc. Copying from/into these buffers is an exception, but don't need much tracking. They don't require the BarrierSolver.
    • These resources can be numerous. There can be many thousands of these in large projects.
  2. Resources that need tracking are those that will have GPU write access at some point: Render Targets, UAVs (Storage buffers/images).
    • Fortunately these resources are few, often less than a hundred.

The idea is that the user of a BarrierSolver declares intent. e.g.

enum Layout
{
	TextureSample,
	RenderTarget,
	RenderTargetReadOnly,
	Uav,
	Present
}
enum Access
{
	Read,
	Write,
	ReadWrite
}

barrierSolver.resolve( texture_for_sample, Layout::Texture, Access::Read, VertexShader|PixelShader );
barrierSolver.resolve( texture_for_render, Layout::RenderTarget, Access::ReadWrite, PixelShader );
barrierSolver.resolve( texture_for_uav, Layout::Uav, Access::ReadWrite, PixelShader );
barrierSolver.execute(); // Actually calls vkCmdPipelineBarrier

Therefore it's no longer the job of the compute pass to restore the texture back to a read layout.

The next pass declares that it wants to use a texture for reading and calls resolve() with the right parameters. Then finally execute() once it's done with all the resolve() calls, all before start generating rendering commands.

If the next pass is a compute pass instead, it declares it wants to use the texture for UAV, and since it is already in that state, resolve() does nothing. And if execute() sees there's nothing to do, vkCmdPipelineBarrier isn't called either.

OgreNext's BarrierSolver header and source files.

Text is drawn w/ each letter as one drawcall of 6 vertices (Done: godotengine/godot#92797)

In TPS demo, this happens for the FPS text.

I know it's debug text but really? 1 drawcall per letter?

The editor also draws its entirety of text in the same way. This is bad.

SPIR-V Optimizer is turned off

This is a controversial topic. The GPU driver performs its own optimizations, so some say the SPIR-V optimizer is a pointless optimization pass (that may even hurt performance).

Others say, the opposite is true.

There are a few facts we can analyze:

  1. On mobile, the optimizer is needed. It's not just because mobile drivers are pretty bad at optimization, it's also because there are several driver bugs which the optimizer fixes.
  2. Juan didn't have to think much: he turned it off because the optimizer strips away relevant information he needed during reflection pass.
    • I met the same problem. The solution I used was to compile the shader twice: Once without the optimizer, for reflection; another with optimizations for rendering. Godot would need to rewrite a few functions to keep two sets of SPIRV blobs per shader. Non-optimized should always be available, optimized one only if cached.
    • However OgreNext most of the time doesn't have to reflect the shader (Godot always has to), so the price I pay in OgreNext is much cheaper than Godot would.

D32S8 is used, but it's unclear if Stencil is actually needed

It didn't seem like Godot is actually using the Stencil shader, but it does request for one.

If it's not actually used, Godot should use a D32 depth buffer. This saves both memory, performance and bandwidth.

And have an API where it asks every plugin (including user code) that asks whether they'll need a Stencil buffer with the rendering Window (and have another for every RTT that needs creating).

Bistro: Horrible optimization

I don't know if this issue comes from the original Bistro scene from Amazon or it was a human mistake when porting the scene to Godot, or if the problem lies in whatever tool(s) was used to import it into Godot.

But basically I can see the exact same furniture with the exact same geometry duplicated over and over again and being placed with the transform already embedded into the vertices; instead of creating a mesh template and placed via Godot nodes each with its own transform.

This prevents any sort of batching (and wastes VRAM). Each lampost has its own duplicate in memory:

2023-07-24.12-59-40.mkv.mp4

It can be seen in Blender by loading the glb models:

BistroDupeMesh

If it's human error, then the only thing I can say is that Godot needs better tools to detect the situation:

  • Better documentation explaining this is bad practice and how to fix it
  • An Instance / Mesh ratio alongside FPS, vertex count and draw call count. It gets red when the value is < 3 indicating the performance could be bad and the scene may have room to be optimized. When users google what that means, they end up in the documentation page. The higher the value, the better. A value >= 50 should be green. A value in range [3; 5] should be yellow. In between probably different shades between yellow and green.
  • Instancing average value. Godot has support for instancing (p_instances in RenderingDeviceVulkan::draw_list_draw) but this value is almost always 1 in TPS and Bistro demos. We should report the average value per frame as a performance metric. The higher it is, the better the performance could be. Low values will be correlated with the Instance / Mesh ratio above.
  • Unless it's an import tool error, de-duplicating is hard because it requires a lot of manual work. It may be possible to add automation tools on the Godot side though: If the user selects which objects are supposed to be the same; the tool can look at the vertices' difference and deduce both translation and orientation (if scale is also embedded into the vertices then AFAIK it becomes really hard).

In Bistro, Setup Sky clears the RTT and then does nothing

There is a Setup Sky step that performs a vkCmdCopyBuffer, then opens the pass with Clear and immediately closes.

Later on that render target is used in the depth pre-pass (with Load action Load instead of load action clear):

UnusedSky

Suggestion:

  • If sky is not used, skip all of its work.
  • Depth Prepass should make the clear

Some textures get the same name (e.g. VoxelGI Cascade LightProbe Average Texture)

Godot assigns the exact same name to multiple textures, which makes it harder to debug.

Simply numbering them e.g. "VoxelGI Cascade LightProbe Average Texture #2" would fix the problem.

Use of RD::get_singleton()

Godot is plagued with the following pattern:

RD::get_singleton()->compute_list_bind_uniform_set(compute_list, cascades[i].integrate_uniform_set, 0);
RD::get_singleton()->compute_list_bind_uniform_set(compute_list, gi->sdfgi_shader.integrate_default_sky_uniform_set, 1);
RD::get_singleton()->compute_list_set_push_constant(compute_list, &push_constant, sizeof(SDFGIShader::IntegratePushConstant));
RD::get_singleton()->compute_list_dispatch_threads(compute_list, probe_axis_count * probe_axis_count * SDFGI::LIGHTPROBE_OCT_SIZE, probe_axis_count * SDFGI::LIGHTPROBE_OCT_SIZE, 1);

Each RD::get_singleton() call cannot be optimized by the C++ compiler because of side effects, often even with LTO.

This is because:

RenderingDevice *RenderingDevice::singleton = nullptr;
RenderingDevice *RenderingDevice::get_singleton() {
	return singleton;
}
typedef RenderingDevice RD;

RD::get_singleton()->callA();
RD::get_singleton()->callB();

The compiler would need to guarantee that whatever happens in callA() never alters the value in RD::singleton.
If it doesn't know, then it must refetch from RD::get_singleton (it doesn't matter if the call gets inlined or not, because the problem is with the contents in the global variable RD::singleton). This would happen with every call.

Sometimes you get lucky and with LTO the compiler noticed callA() never modifies RD::singleton and thus can reuse the returned value for callB(). But the more complex a call is, the less likely this will happen.

The following pattern should be preferred:

RD *rd = RD::get_singleton();

rd->compute_list_bind_uniform_set(compute_list, cascades[i].integrate_uniform_set, 0);
rd->compute_list_bind_uniform_set(compute_list, gi->sdfgi_shader.integrate_default_sky_uniform_set, 1);
rd->compute_list_set_push_constant(compute_list, &push_constant, sizeof(SDFGIShader::IntegratePushConstant));
rd->compute_list_dispatch_threads(compute_list, probe_axis_count * probe_axis_count * SDFGI::LIGHTPROBE_OCT_SIZE, probe_axis_count * SDFGI::LIGHTPROBE_OCT_SIZE, 1);

I see that Godot uses this pattern not just for RenderingDevice, but for a lot of other things. The same analysis applies.

Potentially-inefficient loops

A similar thing to RD::get_singleton(); happens with loop conditions. This is a snippet from LightStorage::update_light_buffers:

for (uint32_t i = 0; i < (omni_light_count + spot_light_count); i++) {
	uint32_t index = (i < omni_light_count) ? i : i - (omni_light_count);
	LightData &light_data = (i < omni_light_count) ? omni_lights[index] : spot_lights[index];
	RS::LightType type = (i < omni_light_count) ? RS::LIGHT_OMNI : RS::LIGHT_SPOT;
	LightInstance *light_instance = (i < omni_light_count) ? omni_light_sort[index].light_instance : spot_light_sort[index].light_instance;
	Light *light = (i < omni_light_count) ? omni_light_sort[index].light : spot_light_sort[index].light;
	real_t distance = (i < omni_light_count) ? omni_light_sort[index].depth : spot_light_sort[index].depth;
	
	// ...
}

This is not the same as:

const uint32_t local_omni_light_count = omni_light_count;
const uint32_t local_spot_light_count = spot_light_count;

for (uint32_t i = 0; i < (local_omni_light_count + local_spot_light_count); i++) {
	uint32_t index = (i < local_omni_light_count) ? i : i - (local_omni_light_count);
	LightData &light_data = (i < local_omni_light_count) ? omni_lights[index] : spot_lights[index];
	RS::LightType type = (i < local_omni_light_count) ? RS::LIGHT_OMNI : RS::LIGHT_SPOT;
	LightInstance *light_instance = (i < local_omni_light_count) ? omni_light_sort[index].light_instance : spot_light_sort[index].light_instance;
	Light *light = (i < local_omni_light_count) ? omni_light_sort[index].light : spot_light_sort[index].light;
	real_t distance = (i < local_omni_light_count) ? omni_light_sort[index].depth : spot_light_sort[index].depth;
		
	// ...
}

The reason is that in the latter we can guarantee the compiler local_omni_light_count & local_spot_light_count will not change at all after every iteration.

While on the former, we have to cross a lot of fingers the optimizing compiler and LTO can prove omni_light_count & spot_light_count do not change on every iteration.

(I can't even fix this immediately and submit a PR because I'd have to analyze if the algorithm actually expects to handle omni_light_count & spot_light_count changing).

For cold paths, this optimization hampers code readability more than it can optimize performance, but for hot paths this is necessary.

VkPipelineCache may inhibit multithreading compilation performance

From what I can see, right now Godot is not calling vkCreateGraphicsPipelines from multiple threads but the intention do this in the future exists. So this will become relevant eventually.

The explanation is here. Pasting the relevant part:

This is why vkMergePipelineCaches exists: If we have 8 threads all of them calling vkCreateGraphicsPipelines in parallel at the same time; if all 8 threads use the same cache the following scenarios can happen:

  1. You don't set external sync flag. The driver performs a lock that lasts for the whole duration of vkCreateGraphicsPipelines. Essentially this is the same as running 1 thread because only 1 vkCreateGraphicsPipelines can execute at a time
  2. You don't set external sync flag. The driver performs a lock that only lasts as little as possible when the cache is used. This is better than the previous option, but still far from optimal (Amdahl's law).
  3. You set the external sync flag, you don't use a mutex: Extreme corruption or crashes may happen
  4. You set the external sync flag, you use a mutex to protect every vkCreateGraphicsPipelines call. Why are you using threads? This is the same situation as 1.
  5. You set the external sync flag, you use 8 VkPipelineCaches (i.e. 1 per thread). When you're done calling vkCreateGraphicsPipelines and all threads are done, you merge the caches.
    • One disadvantage is that if Thread A and Thread B meet the exact same parameters (and it was not yet in the cache), they will both independently compile a PSO and this will only get caught at the end when vkMergePipelineCaches gets called.
    • However the main purpose of Vulkan PSO caches exist mostly to reuse compilations from a previous run saved on disk.

Vector & LocalVector

Vector::push_back & LocalVector::push_back have controversial behaviors:

Vector::push_back performs resize(1+n). Apparently Godot calls this "tight" which is fine if you don't expect the Vector to grow much more than once or twice; but it can too easily accidentally degrade into O(N²) behavior which is why no standard vector container implements it.

A vector that pushes 1000 elements without reserve(1000) will regrow 1000 times, which means copies will be 1 + 2 + 3 + ... + 999. This is described by the N * (N + 1) / 2 formula which in terms of Big O notation it is another way of saying N² with extra steps.

LocalVector::push_back performs resize(MAX(1,n*2)) unless the "tight" parameter is explicitly set; citing in the comments "Otherwise, it grows exponentially (the default and what you want in most cases)."

Unfortunately, this is also problematic. The N parameter to resize should approach the golden ratio.

In practice everyone uses 1.5x because it's easy to calculate and still has a nice reuse/reallocation ratio. This means resize(MAX(1,n+n/2)). That is, we grow N by 1.5x, instead of 2.0x.

This is not arbitrary, an explanation can be found in Stack Overflow and it has to do with memory fragmentation and reclaiming memory.

The following table is what happens when you grow at 2x rate:

Allocation Freed Freed so far Next allocation will need
1 0 2
2 1 1 4
4 2 3 8
8 4 7 16
16 8 15 32
32 16 31 64
64 32 63 128
128 64 127 256
256 128 255 512

The problem here is that when we have 32 elements, next time we will grow to 64. So far we've freed 15 bytes. We can never reuse those free blocks. We always have to ask the OS for more memory pages.

But with a 1.5x rate things change:

Allocation Freed Freed so far Next allocation will need
1 0 2
2 1 1 3
3 2 3 5
5 3 6 8
8 5 11 12
12 8 19 18
18 12 31 27
27 18 49 41
41 27 76 62

By the time we have 12 elements, next time we will grow to 18. So far we've freed 19 bytes. Assuming they were contiguous, we can reuse that free area and place the 18 elements there; instead of asking the OS to give us more memory.

Every 6 allocations or so (approx), a vector growing 1.5x will allow allocators to reuse old memory blocks.

Needless to say, this math works for everything; so if you can think of a GPU algorithm growing VRAM at 2x rate when it could do at 1.5x; it is time to rethink if 1.5x is better.

The only advantage of 2x is that it requires fewer reallocations overall. But that is often negated by the need to ask for more pages to the OS and the inability to reuse freed blocks properly (which leads to fragmentation).

UPDATE: I was pointed out Vector does resize( n + 1 ) but it delegates into CowData::resize which performs exponential resize.

Suggestion

I don't know how much Godot code actually expects the Vector/LocalVector to be tight. Thus by blindly changing the behavior we could do more harm than good if the code is actually aware of this.

The least intrusive option would be to:

  1. The non-tight code should use 1.5x growth, not 2.0x
  2. Evaluate Vector using exponential growth instead of linear
  3. Add debug checks. If a tight vector grows more than 5 times, it should assert. This will catch accidentally-quadratic behaviors.

Cheers

@Calinou
Copy link
Member

Calinou commented Jul 24, 2023

Thanks a lot for putting this together! It sounds like we can have pretty noticeable gains by combining these approaches.

Every pixel shader is using Discard

Regarding this issue, I tried removing all instances of discard; as a proof of concept then running https://github.com/Calinou/godot-reflection with the following CLI arguments (after opening it once in the editor):

timeout 10 bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/godot-reflection/ --disable-vsync --print-fps --fullscreen

In both instances (with and without discard; removed), I get the exact same average FPS in 4K on a RTX 4090 (266 FPS in Forward+, 338 FPS in Mobile). I was expecting a performance difference to be there, considering I'm not CPU-bottlenecked. (The scene doesn't use alpha-tested objects, so rendering remains correct even with all discard; statements removed.)

There are surely other scenes and GPUs that could benefit from this change. I'm just puzzled why I'm not benefiting from this change in the slightest. While I have a high-end GPU, I definitely expect a significant performance improvement to be feasible (given the performance I can get in other engines).

Text is drawn w/ each letter as one drawcall of 6 vertices

We should implement 2D batching in Vulkan renderers, as is already done in OpenGL. It wasn't a priority due to Vulkan's lower CPU overhead, but there are a lot of cases where it's still beneficial, especially on integrated graphics.

RD *rd = rd;

Did you mean RD *rd = RenderingDevice::get_singleton();?

@mrjustaguy
Copy link

Parts of this would likely resolve godotengine/godot#68959

@darksylinc
Copy link
Author

darksylinc commented Jul 24, 2023

Every pixel shader is using Discard

Regarding this issue, I tried removing all instances of discard; as a proof of concept then running https://github.com/Calinou/godot-reflection with the following CLI arguments (after opening it once in the editor):

timeout 10 bin/godot.linuxbsd.editor.x86_64 --path ~/Documents/Godot/godot-reflection/ --disable-vsync --print-fps --fullscreen

In both instances (with and without discard; removed), I get the exact same average FPS in 4K on a RTX 4090 (266 FPS in Forward+, 338 FPS in Mobile). I was expecting a performance difference to be there, considering I'm not CPU-bottlenecked. (The scene doesn't use alpha-tested objects, so rendering remains correct even with all discard; statements removed.)

There are surely other scenes and GPUs that could benefit from this change. I'm just puzzled why I'm not benefiting from this change in the slightest. While I have a high-end GPU, I definitely expect a significant performance improvement to be feasible (given the performance I can get in other engines).

Oh thanks for the scene test!

I was curious. Good idea about trying it out!

Note that you need to do the following if you want to measure the changes:

  1. Modify the GLSL shaders (d'oh)
  2. Recompile Godot
  3. Delete everything under ~/.local/share/godot/app_userdata/Tesseract Reflection Map
  4. Delete godot-reflection/.godot/shader_cache

Once I did all that I got:

Discard enabled:

Project FPS: 409 (2.44 mspf)
Project FPS: 410 (2.43 mspf)
Project FPS: 409 (2.44 mspf)
Project FPS: 382 (2.61 mspf)
Project FPS: 410 (2.43 mspf)
Project FPS: 409 (2.44 mspf)
Project FPS: 410 (2.43 mspf)
Project FPS: 410 (2.43 mspf)
Project FPS: 382 (2.61 mspf)
Project FPS: 409 (2.44 mspf)
Project FPS: 410 (2.43 mspf)
Project FPS: 409 (2.44 mspf)
Project FPS: 408 (2.45 mspf)
Project FPS: 397 (2.51 mspf)
Project FPS: 392 (2.55 mspf)
Project FPS: 409 (2.44 mspf)

Discard disabled:

Project FPS: 412 (2.42 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 411 (2.43 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 383 (2.61 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 411 (2.43 mspf)
Project FPS: 383 (2.61 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 412 (2.42 mspf)
Project FPS: 411 (2.43 mspf)

This is on an AMD 6800 XT. So it DOES make a difference, just a tiny one (0.01ms - 0.02ms). This is repeatable.

The problem with these monster GPUs is that:

  1. The bottleneck is likely on anything (with measurements so low, anything moves the needle).
  2. These GPUs can do both early & late depth rejection without flinching. It's the lower end that gets affected much more (either because they just don't do early rejection when discard is present, or because they do but they don't have enough ROP or BW power to keep up)
  3. The perf is dominated by Vol. Fog and SDFGI. When I disable both I get 1.40ms(714 fps) vs 1.38ms(722 fps). It again does make a difference, but clearly rendering geometry is not a problem for these monster GPUs. It's the other stuff (SSAO, SDFGI + Vol. Fog + Glow + Tonemap)

RD *rd = rd;

Did you mean RD *rd = RenderingDevice::get_singleton();?

Yes. Fixed.

@clayjohn
Copy link
Member

I took care of the discard issue so we can take it off the table godotengine/godot#79865

@BastiaanOlij and I noticed the issue with the discard statements awhile ago while doing some performance optimization on the TPS demo, but we weren't able to measure any meaningful performance gain from commenting out the discard statements. Unfortunately, that means we put this bug on the backburner and forgot about it.

At any rate, it'll be merged upstream soon

@BastiaanOlij
Copy link

Hey @darksylinc,

Great stuff man!

Just some feedback from my end (and sorry if I'm doubling up on anyone, I haven't read all the reactions yet :) ).

Every pixel shader is using Discard

Me and Clay ran into this a while back but never got around to fixing it because it was unclear from the code in what situation the opaque prepass threshold actually made sense. Clay just submitted a PR:)

Depth Only passes are using (e.g. Depth prepass + Shadow Maps) are using pixel shaders

This was pointed out to me a little while ago too, we're relying on the GPU driver to recognise the fragment shader isn't doing any actual work and taking it out. Obviously with the discard issue up above that never happens.

Implementing this could be tricky because there are scenarios where the fragment shader is doing work during prepass, when the above discard check is needed for instance, and when certain effects are enabled (like GI I believe) where it write more than just depth.

I do wonder if in these scenarios, the depth prepass is still effective.

Shadow maps appear to have no frustum culling (or it's broken) or wrong use of depth clamp

This one is weird, I've been working a lot on shadows lately and I can confirm the culling logic is run but indeed it comes back with far more objects than you'd expect. That said, especially for directional shadows where the shadowmap is far larger than the frustrum we're clipping against you can see it working clearly when displaying the directional shadows.

I haven't dived deep enough into the culling logic but I did notice that the code that check whether an AABB is inside of the frustum is fairly optimistic, if I recall correctly it ends up performing this check: https://github.com/godotengine/godot/blob/master/servers/rendering/renderer_scene_cull.h#L201

_ALWAYS_INLINE_ bool in_frustum(const Frustum &p_frustum) const {
	// This is not a full SAT check and the possibility of false positives exist,
	// but the tradeoff vs performance is still very good.

	for (uint32_t i = 0; i < p_frustum.plane_count; i++) {
		Vector3 min(
				bounds[p_frustum.plane_signs_ptr[i].signs[0]],
				bounds[p_frustum.plane_signs_ptr[i].signs[1]],
				bounds[p_frustum.plane_signs_ptr[i].signs[2]]);

		if (p_frustum.planes_ptr[i].distance_to(min) >= 0.0) {
			return false;
		}
	}

	return true;
}

Where bounds are calculated from an AABB in the method above.

Small objects in Shadow Maps and LOD

I'm not sure how we manage LOD at the moment for shadows, haven't fully looked into this, but indeed there isn't any code that would discard small objects based on distance.

I also wonder how much impact having large collections of small objects have that could be filtered out by an encapsulating AABB. This has been often discussed but never really got anywhere.

My goto example is creating a small village the player can walk around in with each building having a fully worked out interior. Often the entire interior of a building can be skipped just by checking its encompassing AABB, but right now the culling goes through each and every tiny little object in each and every building.

Binds one VB/IB per draw call

That is cool information, I'm not sure how easy this will be to implement however. I'm not sure how we'd go about merging meshes for this purpose especially when we're allowing materials to be overriden on instance level and while the meshes may share vertex and index buffers, they may need completely different shader pipelines used.

Sky & Tonemap are doing fullscreen quad instead of fullscreen tri

Dang, never thought about this but it makes complete sense. There is no cost to rendering a tri much larger than the screen, but one tri is better than two. I am surprised about the performance benefit you're expecting, is it really that high?

That said, for sky I have a PR outstanding that actually renders a sphere because it allows moving a lot of the UV calculations into the vertex shader, especially because when stereo rendering we need to do a full unproject. This as I understand has less importance on desktop but on mobile rendering can make a big difference (both not doing a full matrix calculation in the fragment shader and allowing for pre-fetch of the textures).

Still need to do more measurements around it.

Passes are not amalgamated

This is something I changed on the mobile renderer and we may be able to pull some of that back into the forward+ renderer. On the mobile renderer we're using subpasses as much was we can.

I think this may also make a huge difference on rendering shadow maps as we're starting new passes for each light.

On that subject, and this was feedback I got from one of the GPU vendors, the way we start passes for rendering shadows is also bad. We're not using renderArea on the render pass but instead start the render pass for the whole atlas and then use viewport to restrict rendering. Supposedly this is more performant on desktop, but its a killer on mobile?

Gaussian Glow performs inefficient pipeline barriers & lack of barrier solver + Lack of barrier solver

Not much to add here other than, cool to get your insights, I'm only just starting to wrap my head around how barriers are applied in Godot especially after discussing the impact of not checking vertex and fragment shader barriers appropriately on mobile, one of the things one of the mobile GPU vendors pointed out. This is why we recently split the "raster barrier" into two, again if I've been told correctly this has no effect on desktop, but on mobile our current approach was preventing vertex shaders (from the next draw command) and fragment shaders to run in parallel.

Text is drawn w/ each letter as one drawcall of 6 vertices

Nothing to add here other than, sounds logical

SPIR-V Optimizer is turned off

Clay and I looked at this earlier in the year as well. Juan was adamant that the GPU drivers did a better job. I personally am in favour of adding the SPIV-V optimizer if only because it provides much more insight into what's left over after optimisation. Hearing that mobile drivers do a poor job makes me even more convinced this is something we should do.

I believe the same suite of tools also allows us to add debug info into the shaders which could be interesting as well.

D32S8 is used, but it's unclear if Stencil is actually needed

It should default to D24S8 if supported and only use D32S8 when not, I've got some further issues around this because raster access is denied to some of these formats and required on the mobile renderer (where we currently don't use compute).

Currently stencils aren't used but there has been a lot of demand for this, so this was added in preparation. I don't have any insight to provide whether this was smart or whether it makes more sense to have a separate stencil buffer when effects require a stencil buffer.

My gut feeling says that it is better to use D32 whenever possible.

Bistro: Horrible optimization

Pretty sure this is the demo, not Godot (unless the importer really messes up). Godot is perfectly capable of loading a mesh once and drawing it multiple times, it doesn't require the mesh to be re-imported for each instance.

I get the feeling it's the conversion to get to the end result. I ran into this as well when importing a scene I got from Clay that was sourced from a unity project. I'm convinced the original scene did proper instancing, I can see this from the naming of entities, but somewhere this was lost. Then again the file went from FBX to GLSL to Godot.

Personally I think this requires us to create more example projects that are properly crafted for Godot, but who has the time....

Obviously finding a way to "deduplicate" meshes during import would be a great improvement.

In Bistro, Setup Sky clears the RTT and then does nothing

It's actually not setup sky, it looks weird because we don't always do a good job adding these groupings.

This empty pass is clearing the directional shadow map. Juan added this because it allows parallel rendering of shadows and GI on AMD hardware, some weird trick that triggers this. I don't have the details.

I ran into it when working on updating the directional shadow maps so some of the cascades are only updated every other frame and off course this clear broke the system. So if you turn on stepped shadows it actually doesn't do this anymore and instead clears only the quadrants that are updated.
Still an outstanding PR though godotengine/godot#76291

Some textures get the same name (e.g. VoxelGI Cascade LightProbe Average Texture)

This can happen when textures are created for each viewport, we don't (currently) have a way to add a viewport ID into this. We should improve that.

Use of RD::get_singleton()

Weird, get_singleton() should be defined as inline so it should have direct access to that pointer and optimise it away that way.
It's a much used pattern in Godot.

For readability I do load the pointer into a variable in many places (especially accessing the storage classes) but I was under the impression it didn't matter either way.

Personally I don't mind the approach of obtaining the pointer and using it, makes the code much easier to read.

@darksylinc
Copy link
Author

darksylinc commented Jul 25, 2023

Hi!

Depth Only passes are using (e.g. Depth prepass + Shadow Maps) are using pixel shaders

This was pointed out to me a little while ago too, we're relying on the GPU driver to recognise the fragment shader isn't doing any actual work and taking it out. Obviously with the discard issue up above that never happens.

Implementing this could be tricky because there are scenarios where the fragment shader is doing work during prepass, when the above discard check is needed for instance, and when certain effects are enabled (like GI I believe) where it write more than just depth.

I do wonder if in these scenarios, the depth prepass is still effective.

It's understandable but please note that at least for shadow maps, a pixel shader is rarely used. This sounds like it's trivial to check (if no alpha_test_needed and shadow_casting_pass then disable shadows)

My goto example is creating a small village the player can walk around in with each building having a fully worked out interior. Often the entire interior of a building can be skipped just by checking its encompassing AABB, but right now the culling goes through each and every tiny little object in each and every building.

What you are describing is hierarchical culling and (unless it's very specific for a game) it has been mostly abandoned (except for raytracing via BVH) because often you end up losing more from traversing the structures in a cache-unfriendly manner (indirections... this AABB contains children AABB) than what you gain from the culling.

However the distance approach can be done in bulk (all objects laid out mostly contiguous in memory, updated in parallel)

Binds one VB/IB per draw call

That is cool information, I'm not sure how easy this will be to implement however. I'm not sure how we'd go about merging meshes for this purpose especially when we're allowing materials to be overriden on instance level and while the meshes may share vertex and index buffers, they may need completely different shader pipelines used.

With the right shader code approaches, per instance materials should be supported. From a quick glance I had, Godot should already support this.

The main issue with this way of laying out memory is that it is mostly compatible with interleaved vertex formats (e.g. pos/normal/uv), and quite hostile to non-interleaved formats (e.g. pos/pos/pos then another buffer with normal/normal/normal and another with uv/uv/uv)

Sky & Tonemap are doing fullscreen quad instead of fullscreen tri

Dang, never thought about this but it makes complete sense. There is no cost to rendering a tri much larger than the screen, but one tri is better than two. I am surprised about the performance benefit you're expecting, is it really that high?

Yes, but note the ~10% is for that effect alone. Don't expect a 10% overall FPS increase because a few passes have been optimized 🤣

That said, for sky I have a PR outstanding that actually renders a sphere because it allows moving a lot of the UV calculations into the vertex shader, especially because when stereo rendering we need to do a full unproject. This as I understand has less importance on desktop but on mobile rendering can make a big difference (both not doing a full matrix calculation in the fragment shader and allowing for pre-fetch of the textures).

YMMV on mobile.

On that subject, and this was feedback I got from one of the GPU vendors, the way we start passes for rendering shadows is also bad. We're not using renderArea on the render pass but instead start the render pass for the whole atlas and then use viewport to restrict rendering. Supposedly this is more performant on desktop, but its a killer on mobile?

It is true it's a killer on mobile but TBH there are not many ways to render a shadow map ATLAS and be performant on mobile. Unless you have one RTT per shadow map it doesn't perform well.

TBH I just disable shadow maps for low and mid end devices.

SPIR-V Optimizer is turned off

Clay and I looked at this earlier in the year as well. Juan was adamant that the GPU drivers did a better job. I personally am in favour of adding the SPIV-V optimizer if only because it provides much more insight into what's left over after optimisation. Hearing that mobile drivers do a poor job makes me even more convinced this is something we should do.

I believe the same suite of tools also allows us to add debug info into the shaders which could be interesting as well.

Of note: On Mobile the SPIRV optimizer is important due to driver crashes. Unoptimized code often ends up crashing drivers because it contains code the driver didn't like.

D32S8 is used, but it's unclear if Stencil is actually needed

It should default to D24S8 if supported and only use D32S8 when not, I've got some further issues around this because raster access is denied to some of these formats and required on the mobile renderer (where we currently don't use compute).

D32 should be the default. On AMD, D24 is emulated and is internally a D32 buffer (in fact AMD engineers get moody when you mention D24, they only support it because of emulators and other legacy uses).

D24 was only faster on old HW (DX10 class HW). Nowadays D32 should be preferred (are you guys doing reverse Z? It increases depth precision A LOT)

Bistro: Horrible optimization

Pretty sure this is the demo, not Godot (unless the importer really messes up). Godot is perfectly capable of loading a mesh once and drawing it multiple times, it doesn't require the mesh to be re-imported for each instance.

I get the feeling it's the conversion to get to the end result. I ran into this as well when importing a scene I got from Clay that was sourced from a unity project. I'm convinced the original scene did proper instancing, I can see this from the naming of entities, but somewhere this was lost. Then again the file went from FBX to GLSL to Godot.

Personally I think this requires us to create more example projects that are properly crafted for Godot, but who has the time....

What I meant, and I cannot stress enough on this: I am experienced, so I noticed the problem in Bistro after a few hours working on it.

But a newbie won't even understand what is wrong. De-duplicating meshes is often a very manual work. But they don't even know duplicated meshes are a bad thing.

And since most Godot users are not experts, Godot should at least inform them when things look suspicious (hence why I recommended displaying an instance / mesh ratio and documenting what it is about). Half the battle is educating the user.

This empty pass is clearing the directional shadow map. Juan added this because it allows parallel rendering of shadows and GI on AMD hardware, some weird trick that triggers this. I don't have the details.

Ahh, yeah that makes sense.

Some textures get the same name (e.g. VoxelGI Cascade LightProbe Average Texture)

This can happen when textures are created for each viewport, we don't (currently) have a way to add a viewport ID into this. We should improve that.

A global monotonic counter would be enough.

Use of RD::get_singleton()

Weird, get_singleton() should be defined as inline so it should have direct access to that pointer and optimise it away that way. It's a much used pattern in Godot.

It doesn't matter if it's inlined. Consider the following code:

RD *g_singleton = new RD(); // It's a global variable

void RD::specialFunction()
{
    g_singleton = nullptr;
}

int myFunction()
{
    g_singleton->funcA();
    g_singleton->specialFunction();
    g_singleton->funcB();
}

Between funcA and funcB, specialFunction() was called. The compiler must load the data from singleton's address again, because it changed (g_singleton was set to null).

Now consider that specialFunction() does NOT set g_singleton, however the function was defined somewhere else (i.e. in a different cpp file) it cannot see.

The compiler doesn't know if specialFunction() modifies g_singleton or not (in C++ lingo this is called "unknown side-effects"). Therefore it must assume specialFunction() does modify it, even if it really doesn't.

With LTO (Link Time Optimization), the linker gets to see what specialFunction() does. But there are limit of how thorough LTO will search.

but I was under the impression it didn't matter either way.

It depends on so many things it's not a guarantee that it will make a difference, however the practice should be to prefer the local variable. At the very least on the 'hot' paths (i.e. those that get called every frame for every object).

@ParsleighScumble
Copy link

Interesting stuff.

@BastiaanOlij
Copy link

Heya,

It's understandable but please note that at least for shadow maps, a pixel shader is rarely used. This sounds like it's trivial to check (if no alpha_test_needed and shadow_casting_pass then disable shadows)

It should be smart enough to tell those situations apart so indeed, what I said only applies to depth pre-pass, for shadow maps it should only need to include the fragment shader if alpha scissor is enabled imho.

What you are describing is hierarchical culling and (unless it's very specific for a game) it has been mostly abandoned (except for raytracing via BVH) because often you end up losing more from traversing the structures in a cache-unfriendly manner (indirections... this AABB contains children AABB) than what you gain from the culling.

Hmm interesting, I'd imagine excluding large amounts of objects from checks would win but if this has already been disproven... :)

However the distance approach can be done in bulk (all objects laid out mostly contiguous in memory, updated in parallel)

Yes this seems like a smart improvement.

With the right shader code approaches, per instance materials should be supported. From a quick glance I had, Godot should already support this.

I think the structure is certainly there, it's more that right now we're sorting objects by shader and material, so objects that share vertex and index buffers wouldn't be in order.

The main issue with this way of laying out memory is that it is mostly compatible with interleaved vertex formats (e.g. pos/normal/uv), and quite hostile to non-interleaved formats (e.g. pos/pos/pos then another buffer with normal/normal/normal and another with uv/uv/uv)

If I'm not mistaken all data is interleaved so that should be ok.

It is true it's a killer on mobile but TBH there are not many ways to render a shadow map ATLAS and be performant on mobile. Unless you have one RTT per shadow map it doesn't perform well.

TBH I just disable shadow maps for low and mid end devices.

Yup, sometimes I think our users are expecting miracles and not realising that most good looking games on mobile only look that good on flagship phones and even then have highly customized shaders.

Definately for my main use case, VR, you can see that nearly everything that comes out on Quest is all baked lighting and uses none of this.
So whether it makes sense for us to go chasing impossible goals, idk...

Of note: On Mobile the SPIRV optimizer is important due to driver crashes. Unoptimized code often ends up crashing drivers because it contains code the driver didn't like.

Well that might explain a few issues we've been having then :)

Still I think it's worth including even on desktop.

D32 should be the default. On AMD, D24 is emulated and is internally a D32 buffer (in fact AMD engineers get moody when you mention D24, they only support it because of emulators and other legacy uses).

D24 was only faster on old HW (DX10 class HW). Nowadays D32 should be preferred (are you guys doing reverse Z? It increases depth precision A LOT)

I'm not sure, there was a discussion about reverse Z some time ago, clay will know that one better then me.

So how does D32S8 compare? I'm guessing D24S8 was chosen because its a 32bit format.

I don't know what the best way forward here is, but if we can flag whether a stencil buffer is even needed I'm for using D32 whenever possible and D32S8 if we do need a stencil.

It's also important to note that most XR platforms don't allow D24S8 or D32S8 to be used anyway (there is currently an overrule for this in OpenXR).

What I meant, and I cannot stress enough on this: I am experienced, so I noticed the problem in Bistro after a few hours working on it.

But a newbie won't even understand what is wrong. De-duplicating meshes is often a very manual work. But they don't even know duplicated meshes are a bad thing.

And since most Godot users are not experts, Godot should at least inform them when things look suspicious (hence why I recommended displaying an instance / mesh ratio and documenting what it is about). Half the battle is educating the user.

Owh 100% agree with you here, I'm just saying that it most likely wasn't a conscious thing but just how we got the assets.

If we have a means by which to deduplicate meshes automatically I think this is a very worthwhile improvement on the importer.

A global monotonic counter would be enough.

Yes that would make the names unique, but I think its worth going the extra mile and numbering viewports and including that in here. It would be useful for seeing if the right buffers are used.

It doesn't matter if it's inlined. Consider the following code:
...
With LTO (Link Time Optimization), the linker gets to see what specialFunction() does. But there are limit of how thorough LTO will search.

OK so this is a case of me being told that the optimizer is good at doing something it isn't, I'm mostly just copying the existing style here :P

I think this is something we should have a quick discussion about on the chat channels, are you on the Godot rocket chat server yet?
I'm all for grabbing the pointer at the start and then just doing rd->..., it's much more readable too. If we all agree we can slowly move to this approach.

@darksylinc
Copy link
Author

It is true it's a killer on mobile but TBH there are not many ways to render a shadow map ATLAS and be performant on mobile. Unless you have one RTT per shadow map it doesn't perform well.
TBH I just disable shadow maps for low and mid end devices.

Yup, sometimes I think our users are expecting miracles and not realising that most good looking games on mobile only look that good on flagship phones and even then have highly customized shaders.

You can do miracles, but miracles require a lot of work and manual hand tunning :)

I'm not sure, there was a discussion about reverse Z some time ago, clay will know that one better then me.

So how does D32S8 compare? I'm guessing D24S8 was chosen because its a 32bit format.

Nowadays D24S8 is internally D32S8 and D24 is internally D32, therefore by choosing D32S8 you get better precision at the same cost.

Plus, D32 allows for reverse Z and the quality difference is GIGANTIC.

Reverse Z cannot be used on the following:

  • If D32 depth buffer is not supported (not an issue, even mobile ticks this checkmark)
  • On GLES2/3
  • On OpenGL versions that do not support GL_ARB_clip_control extension (in practice it's only macOS and mobile, and luckily there we just use Metal or MoltenVK).

If we have a means by which to deduplicate meshes automatically I think this is a very worthwhile improvement on the importer.

I cannot see how a tool would be able to fully automate this case (because the meshes do not look equal) but I can see a way to semi automate it:

  1. Godot makes a list of objects that look suspicious (because they have the exact same vertex & index count and the index buffer's content is exactly the same for them)
    • We need to show a render of each object so the user can see what the object looks like
  2. The user ticks which ones are supposed to be the same mesh
  3. Once we have confirmation from the user, we run an algorithm to try to de-clone them
  4. The workflow would be similar to czkawka which is a tool to deduplicate files.

@darksylinc
Copy link
Author

I think the structure is certainly there, it's more that right now we're sorting objects by shader and material, so objects that share vertex and index buffers wouldn't be in order.

I forgot to comment on this!

Can you point me out where in code this sort is made? I want to evaluate it (at a first glance, not sorting by vertex/index buffers sounds wrong).

@BastiaanOlij
Copy link

I'd be keen to learn more about the depth buffer things, I was working on code that resolves depth buffers for mobile and still need to add a fallback and this is where I ran into trouble with needing to do a copy with a raster shader:
godotengine/godot#78598

So we might be able to make some format optimisation here as well.

I'm also busy exposing the render buffers class to extensions and here I've merged the MSAA generation into this class instead of having it separate for mobile and vulkan and ran into issues with the buffer format as well. So simplifying this further would be welcome:
godotengine/godot#79142

I need to look into reverse Z, I've heard that mentioned a few times now and haven't wrapped my head around why it is an improvement, sound interesting.

I cannot see how a tool would be able to fully automate this case...

Me neither, I don't know the import system well enough but its worth exploring and you make some good suggestions here.

That said, I don't think we should wait on this and see if we can't manually improve some of the demos by deduplicating things. It's fairly "simple" to do, if you have two MeshInstance3D nodes that use the same mesh, but the mesh is a duplicate, you simply copy the mesh from on mesh instance onto the other.
If there is a difference in materials applied and it's applied on mesh level instead of mesh instance, you'll need to first copy those over to the material override entries on the MeshInstance3D

Can you point me out where in code this sort is made? I want to evaluate it (at a first glance, not sorting by vertex/index buffers sounds wrong).

It's done here: https://github.com/godotengine/godot/blob/master/servers/rendering/renderer_rd/forward_clustered/render_forward_clustered.cpp#L1690

render_list[RENDER_LIST_OPAQUE].sort_by_key();
render_list[RENDER_LIST_ALPHA].sort_by_reverse_depth_and_priority();

So all opaque objects are sorted by a key, and that key is made up of the material index, shader index, priority, and some other bits and bobs.
Transparent objects are sorted by priority and distance to camera.

@darksylinc
Copy link
Author

First, go to sleep. It's late for you already!

I'd be keen to learn more about the depth buffer things, I was working on code that resolves depth buffers for mobile and still need to add a fallback and this is where I ran into trouble with needing to do a copy with a raster shader:

When it comes to Mobile, the way Godot is doing things is a mix of deferred and forward, and that is admittedly not the best for Mobile. Bigger changes will be needed. It's a bit of a struggle and tradeoffs in many fronts, so this isn't easy.

I need to look into reverse Z, I've heard that mentioned a few times now and haven't wrapped my head around why it is an improvement, sound interesting.

I can give you the TL;DR:

  1. Regular depth maps depth to range [0; 1]
  2. Depth is calculated by doing z / w. This is standard projection math.
    • This curve is close to logarithmic in nature: When z is small, w is still big. But when z is big, z and w get very similar. Just think of x / (x + 1) kind of graph.
    • This means you get a lot of precision near the camera, and very poor precision far away. In theory this sounds reasonable.
    • The issue is so extreme that the range [near_plane; near_plane * 2) maps exactly to the interval [0; 0.5). This leaves [near_plane * 2; far_plane) to the interval [0.5; 1.0)
  3. IEEE Floating point has a similar logarithmic behavior: The closer you are from 0, the more precision you get. This is much easier to mentally visualize with 16-bit half floating point. The wikipedia entry has an excellent example.
    • So you get logarithmic behavior twice
    • In practice, if near plane is 0.1 (a very common value), you get like 3 billion different numbers for objects in the first 20 centimeters away from the camera. And the rest is for everything else. A massive waste of precision.
  4. Reverse-Z maps depth to range [1; 0] (i.e. it's the same range, but in reverse order, 1 = close to camera, 0 furthest away from camera)
    • The logarithmic nature of IEEE floating point set in reverse becomes exponential and thus cancels the near-logarithmic nature of z / w
    • As a result we end up with a much more even distribution of precision across the entire depth range.

@BastiaanOlij
Copy link

So for z simply becomes [far - near] - z and we reverse the z check? Seems easy enough to implement...

@darksylinc
Copy link
Author

darksylinc commented Jul 26, 2023

So for z simply becomes [far - near] - z and we reverse the z check? Seems easy enough to implement...

Yes but for best results it needs to be implemented when building the Projection matrix (so the shaders don't even need to change). The change itself is easy (minus a few tries until you get it right).

It is possible to create it from scratch (best precision) or to convert one.

The devil in the details:

  1. Code (CPU or shader) that assumed that 0 was closest to camera and 1 was furthest breaks (easy to fix once spotted)
  2. Code that reconstructs world- or view-space position from depth alone needs a few tweaks
  3. When you want to write something closest to camera (e.g. like drawing the sky after all opaque objects), setting z = 0 and w = 1 makes it work. However when you set it to z = 1 and w = 1; you get some dead pixels on certain GPUs. This is because due to some weird interactions in how interpolation works, 1 / 1 > 1 (i.e. 1 / 1 = 1.00001); which ends up failing GPU clipping. The solution is to simply set z = 1 - epsilon.
  4. If you apply reverse Z to Shadow Maps too (honestly I don't remember what Ogre does), obviously the shadow map sampling comparison test needs flipping. If you don't apply reverse Z to shadow maps, you still need a way to tag that you don't want reverse Z (and preserve this bit when passing the texViewProj matrix for regular rendering)

@BastiaanOlij
Copy link

This is a nice visual writeup btw: https://developer.nvidia.com/content/depth-precision-visualized

Yeah, it sounds pretty straight forward. I think we might get tripped up where OpenGL and Vulkan code share some of the logic, as it seems we can't apply this in GLES, and there might be an impact around XR where we can get a projection matrix from the XR runtime, but it wouldn't be hard to adjust those. I think OpenXR no longer does this and we have control over the projection matrix calculation as well.

I think the biggest issue we'll run into is peoples custom shaders breaking due to your point no 3. Though I don't think a lot of people use this (I do for a number of tricks in XR where I need to draw things on the near plane).

@fire
Copy link
Member

fire commented Aug 3, 2023

Obviously finding a way to "deduplicate" meshes during import would be a great improvement.

Some options:

  1. Write a gltf document extension that searches for exact arraymeshes and deduplicates.
  2. For merged meshes, cut mesh into grids

@ywmaa

This comment was marked as off-topic.

@darksylinc

This comment was marked as off-topic.

@ywmaa

This comment was marked as off-topic.

@ywmaa

This comment was marked as off-topic.

@Calinou

This comment was marked as off-topic.

@ywmaa

This comment was marked as off-topic.

@Saul2022

This comment was marked as off-topic.

@ywmaa

This comment was marked as off-topic.

@Saul2022

This comment was marked as off-topic.

@ywmaa

This comment was marked as off-topic.

@clayjohn
Copy link
Member

Please keep comments on topic. This is not the place to receive performance advice for your game. While contributors are willing to comment and help out, it really pollutes the issue and makes it harder for us to use it for its intended purpose.

@WickedInsignia
Copy link

As the original porter of the project I'll answer to the performance complaints regarding the Bistro:
It's intended as a lighting and graphics testbed, not a performance one. It should NOT be used as a performance benchmark nor used to judge Godot's performance.
There's no collision meshes nor baked occlusion culling so I thought people would assume it's not intended for performance assessment, but I'll amend the description so people definitely know.

Regardless the scene is so small that anything beyond the most cursory optimization wasn't necessary. I wouldn't consider not instancing a handful of objects to be horrible optimization considering the size and nature of the scene.
The priority was to match the exact placements of the original meshes and doing this with instances would've been time-consuming for little practical gain. Godot also just isn't very well-tooled for this process.

I appreciate the analysis given to Godot in this post, and although I'm no low-level rendering expert the rest seems pretty reasonable. Thanks for taking the time to put it together!

@Ansraer
Copy link

Ansraer commented Jan 12, 2024

How did I not know about this proposal until now?

Most of these points have already been addressed, but I still wanted to add my two cents for future reference.

Every pixel shader is using Discard

I agree with most of your points here, the one thing I want to comment on is the usage of discard in the early depth pass.
This is an intentional choice I made a year ago. The goal is to get a final depth output before we start the opaque pass.
This allows us to not only disable writing depth (as you already noticed) but to also always use COMPARE_OP_EQUAL. This means that we can use the early depth test for discard during the opaque pass and don't have to use discard in the far more heavy "full" fragment shader where it would hurt performance far more than in the tiny prepass shader.
I benchmarked this and especially in scenes with foliage this was a huge improvement compared to not having discard in the prepass.

Bistro: Horrible optimization

While the bistro scene itself was never meant to be an example of a highly optimized scene, it is a good showcase for some larger problems. The most pressing on is that we don't have any kind of deduplicator when meshes are imported. What makes this even worse is that Blender's gltf exporter turns all instances into copies, which is absolutely terrible from a performance point of view.
So a basic solution would be to compare hashes of meshes (and maybe some other information like vert count) when importing scenes to remove duplicates.

Depending on the cpu cost, an even better option might be to just store a meshes hash in godot's internal mesh format. This way we could check if an identical mesh has already been loaded during runtime, which could A) save us from having to read the mesh data from disk and B) work across multiple scenes, which could be very useful since the importer would only be able to remove duplicates within the current import. So if someone uses an assetlib inside Blender and places e.g. the same column in multiple different scenes, we could detect it with this system.

D24 was only faster on old HW (DX10 class HW). Nowadays D32 should be preferred (are you guys doing reverse Z? It increases depth precision A LOT)

Reverse depth has been something I have been advocating for a year now. The problem was that implementing it would suddenly switch all depth comparisons, which would affect user code and thus break compat, which is something we usually try to avoid.
We finally got the ok to break things from the production team yesterday, so rev z should hopefully soon be a thing (clay already changed the format today).

@Calinou
Copy link
Member

Calinou commented Jun 7, 2024

Note that reverse Z has been merged in 4.3 (see blog post).

Of course this algorithm CAN cause visual changes (it's LOD after all), but the beauty of it is that global bias can make the algorithm adapt to all sorts of hardware: Use a large bias value for high end HW, use low bias for low end HW (this mimics "draw distance" from 90's consoles).

I'm curious as to how to provide the best usability for culling small objects from shadow passes. There are a few things I have in mind:

  • "Too small for the shadow pass" depends on shadow map resolution and shadow max distance (for directional lights). It can also vary depending on the quadrant the shadow is in (for positional lights), as further quadrants have lower texel density.
    • We should make sure that when the user increases shadow resolution, smaller objects cast shadows correctly out of the box.
  • There should be a global multiplier for the minimum shadow size threshold, so that you can quickly adjust the value for low-end hardware.
  • Additionally, it might be worth adding another global value for turning off shadow casting of positional lights if they don't take much space on screen. This would be set to a very small value by default (only to cull shadows from lights that occupy an area of 2×2 or smaller on the viewport), but it could be increased for low-end hardware so that small lights don't cast shadows.

@ywmaa
Copy link

ywmaa commented Jun 7, 2024

What makes this even worse is that Blender's gltf exporter turns all instances into copies, which is absolutely terrible from a performance point of view.

I think this is fixed with blender 4.0 and up:
https://x.com/julienduroure/status/1722639934690062362?t=7WzWlWuhoqk_MnK11E0uRw&s=19

@darksylinc
Copy link
Author

I'm curious as to how to provide the best usability for culling small objects from shadow passes. There are a few things I have in mind:

I have a few comments on that based on my experience (Note: By "Shadow Map" I mean cascade or light. Basically any independent "quadrant"):

"Too small for the shadow pass" depends on shadow map resolution and shadow max distance (for directional lights). It can also vary depending on the quadrant the shadow is in (for positional lights), as further quadrants have lower texel density.

Please note that:

From a performance point of view, this means that you would need a visibility threshold per shadow map. The more thresholds you have, the higher the CPU cost to evaluate them all.
You get into a O(N*M) complexity where N is number of objects and M number of shadow maps.

Furthermore, since Godot is flexible, this would have to be handled dynamically and most likely through heap allocations e.g.

for i in objects 
   for j in shadow_maps
         object[i].visibility[j] = shadow_camera[j].distance( object[i] )

While this can be optimized, it tends to be a headache because everything is dynamic.

From an artistic point of view, you usually want the object's visibility to be somewhat linked with the shadow casting's visibility. Unless you're making a God of War like game (the original PS2/3 trilogy) or other "on rails" game where you are in complete control of camera angles and therefore can fine tune every shadow threshold to meet performance demands; the user will want something that works in general conditions.

Otherwise you may end up with big objects that for some reason are not casting shadows while everything else still is. Linking the object's visibility somewhat like this:

// When loading mesh:
obj.visibility_threshold = 1000; // 1000 meters. Set by user.
float default_threshold = obj.visibility_threshold * 0.75f; // At least default value. Can be overridden
if( has_overrides )
    default_threshold = override( obj );
obj.shadow_visibility_threshold = default_threshold * global_bias;

This means that you know the object is designed to be really small if 1000m away. And it should reasonably still be small enough to cast any visible shadow in the last 250 meters before it disappears.

Additionally, it might be worth adding another global value for turning off shadow casting of positional lights if they don't take much space on screen. This would be set to a very small value by default (only to cull shadows from lights that occupy an area of 2×2 or smaller on the viewport), but it could be increased for low-end hardware so that small lights don't cast shadows.

While one can get fancy with these things; these increase the maintenance burden due to increased code complexity, increase CPU costs (because calculations need to be evaluated and stored somewhere) while usually you get diminishing returns to be worth the trouble.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests