From c2d87d781e9a61c8a070ba3536cf56afbc1d71b1 Mon Sep 17 00:00:00 2001
From: myaaaaaaaaa <103326468+myaaaaaaaaa@users.noreply.github.com>
Date: Mon, 3 Jul 2023 13:18:12 -0400
Subject: [PATCH] Add a proposal for an "occlusion extents" function

---
 proposals/VK_EXT_occlusion_extents.adoc | 312 ++++++++++++++++++++++++
 1 file changed, 312 insertions(+)
 create mode 100644 proposals/VK_EXT_occlusion_extents.adoc

diff --git a/proposals/VK_EXT_occlusion_extents.adoc b/proposals/VK_EXT_occlusion_extents.adoc
new file mode 100644
index 000000000..c7efea132
--- /dev/null
+++ b/proposals/VK_EXT_occlusion_extents.adoc
@@ -0,0 +1,312 @@
+// Copyright 2021-2023 The Khronos Group Inc.
+//
+// SPDX-License-Identifier: CC-BY-4.0
+
+= VK_EXT_occlusion_extents
+:toc: left
+:refpage: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/
+:sectnums:
+
+This document proposes an efficient way to perform occlusion culling from within a shader.
+
+== Problem Statement
+
+With the advent of modern GPU-driven pipelines that involve very large triangle counts,
+such as tessellation shading and mesh shading,
+there is a growing need for more efficient occlusion culling methods to avoid overwhelming the GPU.
+
+Normally, occlusion culling is handled by the depth buffer.
+Unfortunately, since the depth buffer currently only operates at the level of individual triangles,
+applications that wish to perform culling at a coarser granularity
+must manage their own copy of the depth buffer and implement their own occlusion testing functionality,
+needlessly duplicating existing GPU functionality.
+
+== Solution Space
+
+In order for applications to perform coarse-grained occlusion culling more easily,
+bounding boxes must be elevated into a first-class rasterization primitive alongside triangles,
+much like they have been with ray tracing.
+That is to say, all methods of discarding triangles (depth, frustum, stencil, scissor tests, etc) must be extended to support bounding boxes as well.
+
+Doing so would allow tessellation and mesh shaders to cull entire clusters of triangles at once, before individual triangles need to be sent to the rasterizer.
+
+=== `bool isOccluded(vec3 bb_min, vec3 bb_max)`
+
+The function typically used to implement occlusion culling.
+Returns a binary pass/fail indicating the results of a conservative occlusion test.
+That is to say,
+if it returns `false`, the bounding box may or may not actually be occluded,
+but if it returns `true`, the bounding box is always safe to cull.
+
+=== `void setClipBounds(vec3 clip_min, vec3 clip_max)`
+
+Instructs the GPU to discard any fragments that land outside of the provided box,
+which implies a bounding box that the GPU can use for culling.
+
+== Proposal
+
+This document proposes the following function:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max)
+----
+
+This function behaves very similarly to `isOccluded()`,
+in that it is accessible to any pre-rasterization shader
+and that it accepts a bounding box that the application would like to cull.
+However, rather than returning a binary pass/fail,
+it returns what will henceforth be referred to as an "occlusion extents box".
+
+Implementations can return any "occlusion extents box" value they wish,
+with the only constraint being that it must fail occlusion tests.
+More specifically,
+*given the "occlusion extents box" returned into `bb_min` and `bb_max`,
+the implementation guarantees that if the rasterizer were to receive any triangle whose vertices all satisfy `bb_min \<= gl_Position.xyz \<= bb_max`,
+zero fragments would be successfully drawn to the screen.
+This can be due to depth, frustum, stencil, or scissor testing,
+or because no samples were covered by the triangle.*
+
+Ideally, implementations should return the largest "occlusion extents box"
+with the largest possible intersection with the application-provided bounding box.
+However, as long as the returned "occlusion extents box" fails occlusion tests,
+implementations have complete freedom in how they calculate it,
+even ignoring the bounding box if they wish.
+
+The properties of a valid occlusion extents box mean that any bounding box fully contained within one would also fail occlusion tests,
+and can therefore be safely culled.
+Shaders are thus able to test multiple bounding boxes with a single occlusion extents box generated by a single API call,
+allowing for partial occlusion where child nodes of a BVH can still be culled even if the parent node wasn't.
+
+This makes it friendlier to wide BVHs than any other occlusion culling method,
+with a notable example being mesh shading pipelines (which can be considered a very wide, 4-deep BVH).
+
+== Example Implementations
+
+The lax requirements intentionally provide vendors a great deal of freedom in how they would like to roll out support for this extension, both in terms of supported functionality and in terms of development timelines.
+
+=== No-Op [[noop]]
+
+Vendors can expose this extension without implementing any culling behind it:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    // return a 0-sized occlusion extents box that cannot cull any geometry
+    bb_min = bb_max;
+}
+----
+
+This allows shaders to begin using `occlusionExtents()` immediately,
+which benefits vendors as real-world code would quickly become available for tracing.
+
+=== Frustum Culling [[frustum]]
+
+A more practical implementation than the above, which can at the very least provide frustum culling:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    if (bb_max.x < -1.01) {
+        // Return a box that encloses anything past the left side of the viewport.
+        // INF is assumed to be a constant holding positive infinity; it is not a GLSL built-in.
+        bb_min = vec3(-INF, -INF, -INF);
+        bb_max = vec3(-1.01, INF, INF);
+        return;
+    }
+
+    // Repeat for the right, top, and bottom sides of the viewport. Code omitted for brevity.
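+    // Sketch of the omitted cases, mirroring the left-side test above.
+    // Illustrative only: which of +y/-y counts as "top" depends on the viewport setup,
+    // and INF is assumed to be a positive-infinity constant, as in the test above.
+    if (bb_min.x > 1.01) {
+        // Entire box is past the +x side of the viewport
+        bb_min = vec3(1.01, -INF, -INF);
+        bb_max = vec3(INF, INF, INF);
+        return;
+    }
+    if (bb_max.y < -1.01) {
+        // Entire box is past the -y side of the viewport
+        bb_min = vec3(-INF, -INF, -INF);
+        bb_max = vec3(INF, -1.01, INF);
+        return;
+    }
+    if (bb_min.y > 1.01) {
+        // Entire box is past the +y side of the viewport
+        bb_min = vec3(-INF, 1.01, -INF);
+        bb_max = vec3(INF, INF, INF);
+        return;
+    }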
+
+    // No-Op, see above
+    bb_min = bb_max;
+}
+----
+
+
+=== `isOccluded()` [[isocc]]
+
+`occlusionExtents()` can be implemented as a binary pass/fail test if this is all the hardware is capable of, mimicking the semantics of `isOccluded()`:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    // Frustum culling omitted for brevity.
+
+    if (_internal_isOccluded(bb_min, bb_max)) {
+        // Since the given bounding box is occluded, it can be passed through as a valid occlusion extents box.
+        return;
+    }
+
+    // No-Op, see above
+    bb_min = bb_max;
+}
+----
+
+=== Iterative `isOccluded()` [[iter]]
+
+The following implementation requires the hardware to be able to perform multiple occlusion tests per shader invocation:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    vec3 center = (bb_max + bb_min) * 0.5;
+    vec3 radius = (bb_max - bb_min) * 2.0; // Starting "radius": half-extent of a box 4x the input size
+
+    // Start with an occlusion extents box 4x larger than the given bounding box.
+    // Shrink it until it is fully occluded (that is, it fails the occlusion test), then return it.
+    for (int i = 0; i < 5; i++) {
+        bb_min = center - radius;
+        bb_max = center + radius;
+
+        if (_internal_isOccluded(bb_min, bb_max)) {
+            // Valid box, return it.
+            return;
+        }
+
+        // Invalid box, shrink it.
+        radius *= 0.5;
+    }
+
+    // Give up
+    bb_min = bb_max;
+}
+----
+
+The occlusion extents box returned by this particular implementation will generally be suboptimal when the input bounding box is only partially occluded.
+
+=== Mipmapped Depth Buffer [[minmax]]
+
+This resembles how an application would typically test for occlusion against a hierarchical Z-buffer (HZB). This example adapts it to return an occlusion extents box instead.
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    vec2 center = (bb_max.xy + bb_min.xy) * 0.5;
+    vec2 uv = (center + 1.0) * 0.5; // Convert from [-1, 1] space to [0, 1] space
+
+    // Start at the lowest resolution mipmap,
+    // then progressively increase resolution.
+    // Mip level MIPMAP_MAX-i is assumed to be 2^i texels wide.
+    for (int i = 0; i <= MIPMAP_MAX; i++) {
+        // Ignore the size of the bounding box and test just its center point
+        float zmax = textureLod(depthTextureMinMax, uv, MIPMAP_MAX-i).y;
+        if (bb_min.z <= zmax) {
+            // Point not occluded, go to the next mipmap level
+            continue;
+        }
+
+        // Point occluded, return the region covered by this texel.
+        // Convert back from [0, 1] space to [-1, 1] space so that the box
+        // is expressed in the same space as the input bounding box.
+        float scale = pow(2.0, float(i));
+        bb_min.xy = floor(uv * scale) / scale * 2.0 - 1.0;
+        bb_max.xy = ceil(uv * scale) / scale * 2.0 - 1.0;
+        bb_min.z = zmax;
+        bb_max.z = INF;
+        return;
+    }
+
+    // Give up
+    bb_min = bb_max;
+}
+----
+
+=== Advanced Implementation
+
+The above implementations assume that `occlusionExtents()` must be backported to hardware not designed with it in mind.
+Future hardware may allow shaders to directly access the depth buffer,
+add more levels to their hierarchical Z buffer,
+or potentially even add fixed function units dedicated to calculating occlusion extents boxes.
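+
+Whichever implementation a vendor ships, applications consume the result the same way:
+a bounding box may be culled only if it lies entirely inside the returned occlusion extents box.
+A minimal sketch of that containment test follows; the helper name `boxInside()` is hypothetical and not part of this proposal.
+
+[source,glsl]
+----
+// Hypothetical helper, for illustration only.
+// Returns true if the box [inner_min, inner_max] lies entirely within [outer_min, outer_max].
+bool boxInside(vec3 inner_min, vec3 inner_max, vec3 outer_min, vec3 outer_max) {
+    return all(lessThanEqual(outer_min, inner_min)) && all(lessThanEqual(inner_max, outer_max));
+}
+----
+
+Given the zero-sized box produced by the No-Op implementation, this test can only succeed for a degenerate input box,
+so no real geometry is culled.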
+
+== Example Usage
+
+Usage is fairly straightforward for applications:
+
+[source,glsl]
+----
+// Single-threaded for readability
+layout (local_size_x=1, local_size_y=1, local_size_z=1) in;
+
+void main() {
+    vec3 bb_min = ...;
+    vec3 bb_max = ...;
+    vec3 occ_min = bb_min;
+    vec3 occ_max = bb_max;
+
+    occlusionExtents(occ_min, occ_max);
+
+    // Cull entire task shader
+    if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
+        return;
+    }
+
+    // Cull individual meshlets
+    int out_meshlets = 0;
+    for (int i = 0; i < num_meshlets; i++) {
+        bb_min = meshlets[i].bb_min;
+        bb_max = meshlets[i].bb_max;
+
+        if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
+            continue;
+        }
+
+        // optional: call occlusionExtents(bb_min, bb_max) a second time
+
+        task_payload.meshlet_index[out_meshlets] = i;
+        out_meshlets++;
+    }
+
+    EmitMeshTasksEXT(out_meshlets, 1, 1);
+}
+----
+
+Vendors can then gradually test and optimize their implementations,
+potentially introducing new fixed function units,
+without further input from application developers.
+
+== Issues
+
+=== PROPOSED: Backporting/emulating at the driver level?
+
+Since the <<noop>> or <<frustum>> implementations are always available as last resorts,
+all GPUs can support this extension even if they don't provide the expected speedups.
+However, it would be ideal if some form of occlusion functionality were to be backported
+to hardware that supports tessellation and mesh shaders.
+
+==== Emulation via occlusion queries
+
+Upon encountering a shader that uses `occlusionExtents()`,
+it may be possible for the driver to split all subsequent draw calls into two:
+
+ . Render one (or more, see <<iter>>) quads to be occlusion queried.
+The original shader can be copied into a small vertex shader,
+where calls to `occlusionExtents()` are redirected to a function that uses the supplied bounding box to set `gl_Position`.
+ . Pass occlusion query data into the task shader.
+Calls to `occlusionExtents()` would then reference this data.
+
+This implies a fixed number of occlusion tests per invocation.
+If the shader makes too many calls to `occlusionExtents()`,
+all calls beyond the fixed limit will have to ignore their provided bounding boxes and just repeat previous occlusion extents boxes.
+This is permitted, as the box returned by `occlusionExtents()` does not need to be related to the supplied bounding box.
+
+See <<isocc>> and <<iter>> for implementation examples.
+
+==== Emulation via copying the depth buffer to a texture
+
+This is the approach typically taken by applications that employ mesh shading/compute pre-passes.
+Doing this at the driver level can streamline this process,
+and should provide superior performance since the driver can, in theory, copy the GPU's Hi-Z buffer directly without having to touch the full-resolution depth buffer.
+
+Note that implementations are free to only copy the depth buffer once during a frame,
+right before the first time a shader with `occlusionExtents()` is used.
+
+See <<minmax>> for an implementation example.
+
+=== UNRESOLVED: How does this apply to fragment and compute shaders?
+
+The semantics of this function are fairly intuitive for any shading stage that occurs before rasterization,
+since it involves the same coordinate space as `gl_Position`.
+
+It is less evident for fragment shaders, which use a different coordinate system.
+
+Compute shaders are even more troublesome as they live outside of the graphics pipeline,
+and therefore their interaction with the depth buffer is not well-defined.
+
+=== UNRESOLVED: SPIR-V semantics?
+
+This proposal currently only covers GLSL semantics.