KhronosGroup · myaaaaaaaaa · Jul 3, 2023
diff --git a/proposals/VK_EXT_occlusion_extents.adoc b/proposals/VK_EXT_occlusion_extents.adoc
@@ -0,0 +1,312 @@
+// Copyright 2021-2023 The Khronos Group Inc.
+//
+// SPDX-License-Identifier: CC-BY-4.0
+
+= VK_EXT_occlusion_extents
+:toc: left
+:refpage: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/
+:sectnums:
+
+This document proposes an efficient way to perform occlusion culling from within a shader.
+
+== Problem Statement
+
+With the advent of modern GPU-driven pipelines that involve very large triangle counts,
+such as tessellation shading and mesh shading,
+there is a growing need for more efficient occlusion culling methods to avoid overwhelming the GPU.
+
+Normally, occlusion culling is handled by the depth buffer.
+Unfortunately, since the depth buffer currently only operates at the level of individual triangles,
+applications that wish to perform culling at a coarser granularity
+must manage their own copy of the depth buffer and implement their own occlusion testing functionality,
+needlessly duplicating existing GPU functionality.
+
+== Solution Space
+
+In order for applications to perform coarse-grained occlusion culling more easily,
+bounding boxes must be elevated into a first-class rasterization primitive alongside triangles,
+much like they have been with ray tracing.
+That is to say, all methods of discarding triangles (depth, frustum, stencil, and scissor tests, etc.) must be extended to support bounding boxes as well.
+
+Doing so would allow tessellation and mesh shaders to cull entire clusters of triangles at once, before individual triangles need to be sent to the rasterizer.
+
+=== `bool isOccluded(vec3 bb_min, vec3 bb_max)`
+
+This is the function typically used to implement occlusion culling.
+It returns a binary pass/fail indicating the result of a conservative occlusion test.
+That is to say,
+if it returns `false`, the bounding box may or may not actually be occluded,
+but if it returns `true`, it is always safe to be culled.
+
+=== `void setClipBounds(vec3 clip_min, vec3 clip_max)`
+
+Instructs the GPU to discard any fragments that land outside of the provided box,
+which implies a bounding box that the GPU can use for culling.
+
+== Proposal
+
+This document proposes the following function:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max)
+----
+
+This function behaves very similarly to `isOccluded()`,
+in that it is accessible to any pre-rasterization shader
+and that it accepts a bounding box that the application would like to cull.
+However, rather than returning a binary pass/fail,
+it returns what will henceforth be referred to as an "occlusion extents box".
+
+Implementations can return any "occlusion extents box" value they wish,
+with the only constraint being that it must fail occlusion tests.
+More specifically,
+*given the "occlusion extents box" returned into `bb_min` and `bb_max`,
+the implementation guarantees that if the rasterizer were to receive any triangle whose vertices all satisfy `bb_min \<= gl_Position.xyz \<= bb_max`,
+zero fragments would be successfully drawn to the screen.
+This can be due to depth, frustum, stencil, or scissor testing,
+or because no samples were covered by the triangle.*
+
+Ideally, implementations should return the largest valid "occlusion extents box"
+that also maximizes its intersection with the application-provided bounding box.
+However, as long as the returned "occlusion extents box" fails occlusion tests,
+implementations have complete freedom in how they calculate it,
+even ignoring the bounding box if they wish.
+
+The properties of a valid occlusion extents box mean that any bounding box fully contained within one would also fail occlusion tests,
+and can therefore be safely culled.
+Shaders are thus able to test multiple bounding boxes with a single occlusion extents box generated by a single API call,
+allowing for partial occlusion where child nodes of a BVH can still be culled even if the parent node wasn't.
+
+This makes it friendlier to wide BVHs than any other occlusion culling method,
+with a notable example being mesh shading pipelines (which can be considered a very wide, 4-deep BVH).
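+
+The containment check that shaders apply to a returned box can be written as a small helper.
+This is a minimal sketch; the name `isInsideExtents` is hypothetical and not part of this proposal:
+
+[source,glsl]
+----
+// Hypothetical helper: returns true if the box [bb_min, bb_max] lies entirely
+// inside the occlusion extents box [occ_min, occ_max], in which case it is safe to cull.
+bool isInsideExtents(vec3 occ_min, vec3 occ_max, vec3 bb_min, vec3 bb_max) {
+    return all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max));
+}
+----
+
+A shader could call `occlusionExtents()` once with a parent node's bounds, then apply this helper to each child node without issuing further occlusion tests.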
+
+== Example Implementations
+
+The intentionally lax requirements give vendors a great deal of freedom in how they roll out support for this extension, both in terms of supported functionality and in terms of development timelines.
+
+=== No-Op [[noop]]
+
+Vendors can expose this extension without implementing any actual culling:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    // return a 0-sized occlusion extents box that cannot cull any geometry
+    bb_min = bb_max;
+}
+----
+
+This allows shaders to begin using `occlusionExtents()` immediately,
+which benefits vendors as real-world code would quickly become available for tracing.
+
+=== Frustum Culling [[frustum]]
+
+A more practical implementation than the above can, at the very least, provide frustum culling:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    if (bb_max.x < -1.01) {
+        // Return a box that encloses anything past the left side of the viewport
+        bb_min = vec3(-INF, -INF, -INF);
+        bb_max = vec3(-1.01, INF, INF);
+        return;
+    }
+
+    // Repeat for the right, top, and bottom sides of the viewport. Code omitted for brevity.
+
+    // No-Op, see above
+    bb_min = bb_max;
+}
+----
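+
+The omitted sides mirror the left-side test.
+As a sketch, under the same assumptions as the listing above (an `INF` constant, a threshold of 1.01, and Vulkan's convention that -Y is the top of the viewport), the full set of cases might look like this:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    if (bb_max.x < -1.01) {
+        // Box lies entirely past the left side of the viewport
+        bb_min = vec3(-INF, -INF, -INF);
+        bb_max = vec3(-1.01, INF, INF);
+        return;
+    }
+    if (bb_min.x > 1.01) {
+        // Box lies entirely past the right side of the viewport
+        bb_min = vec3(1.01, -INF, -INF);
+        bb_max = vec3(INF, INF, INF);
+        return;
+    }
+    if (bb_max.y < -1.01) {
+        // Box lies entirely past the top of the viewport
+        bb_min = vec3(-INF, -INF, -INF);
+        bb_max = vec3(INF, -1.01, INF);
+        return;
+    }
+    if (bb_min.y > 1.01) {
+        // Box lies entirely past the bottom of the viewport
+        bb_min = vec3(-INF, 1.01, -INF);
+        bb_max = vec3(INF, INF, INF);
+        return;
+    }
+
+    // No-Op: nothing can be culled
+    bb_min = bb_max;
+}
+----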
+
+
+=== `isOccluded()` [[isocc]]
+
+`occlusionExtents()` can be implemented as a binary pass/fail test if this is all the hardware is capable of, mimicking the semantics of `isOccluded()`:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    // Frustum culling omitted for brevity.
+
+    if (_internal_isOccluded(bb_min, bb_max)) {
+        // Since the given bounding box is occluded, it can be passed through as a valid occlusion extents box.
+        return;
+    }
+
+    // No-Op, see above
+    bb_min = bb_max;
+}
+----
+
+=== Iterative `isOccluded()` [[iter]]
+
+The following implementation requires the hardware to be able to perform multiple occlusion tests per shader invocation:
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    vec3 center = (bb_max + bb_min) * 0.5;
+    vec3 radius = (bb_max - bb_min) * 2.0; // Starting "radius"
+
+    // Start with an occlusion extents box 4x larger than the given bounding box.
+    // Shrink it until it fails the occlusion test, then return it.
+    for (int i = 0; i < 5; i++) {
+        bb_min = center - radius;
+        bb_max = center + radius;
+
+        if (_internal_isOccluded(bb_min, bb_max)) {
+            // Valid box, return it.
+            return;
+        }
+
+        // Invalid box, shrink it.
+        radius *= 0.5;
+    }
+
+    // Give up
+    bb_min = bb_max;
+}
+----
+
+The occlusion extents box returned by this particular implementation will generally be suboptimal when the input bounding box is only partially occluded.
+
+=== Mipmapped Depth Buffer [[minmax]]
+
+This resembles how an application would typically test against a hierarchical Z-buffer (HZB). This example adapts that approach to return an occlusion extents box instead.
+
+[source,glsl]
+----
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    vec2 center = (bb_max.xy + bb_min.xy) * 0.5;
+    vec2 uv = (center + 1.0) * 0.5; // Convert from [-1, 1] space to [0, 1] space
+
+    // Start at the lowest resolution mipmap,
+    // then progressively increase resolution.
+    for (int i = 0; i <= MIPMAP_MAX; i++) {
+        // Ignore the size of the bounding box and test just its center point
+        float zmax = textureLod(depthTextureMinMax, uv, MIPMAP_MAX-i).y;
+        if (bb_min.z <= zmax) {
+            // Point not occluded, go to the next mipmap level
+            continue;
+        }
+
+        // Point occluded, return the region covered by this texel,
+        // converted back from [0, 1] space to [-1, 1] space
+        float scale = pow(2.0, float(i));
+        bb_min.xy = floor(uv * scale) / scale * 2.0 - 1.0;
+        bb_max.xy =  ceil(uv * scale) / scale * 2.0 - 1.0;
+        bb_min.z = zmax;
+        bb_max.z = INF;
+        return;
+    }
+
+    // Give up
+    bb_min = bb_max;
+}
+----
+
+=== Advanced Implementation
+
+The above implementations assume that `occlusionExtents()` must be backported to hardware not designed with it in mind.
+Future hardware may allow shaders to directly access the depth buffer,
+add more levels to their hierarchical Z buffer,
+or potentially even add fixed function units dedicated to calculating occlusion extents boxes.
+
+== Example Usage
+
+Usage is fairly straightforward for applications:
+
+[source,glsl]
+----
+// Single-threaded for readability
+layout (local_size_x=1, local_size_y=1, local_size_z=1) in;
+
+void main() {
+    vec3 bb_min = ...;
+    vec3 bb_max = ...;
+    vec3 occ_min = bb_min;
+    vec3 occ_max = bb_max;
+
+    occlusionExtents(occ_min, occ_max);
+
+    // Cull entire task shader
+    if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
+        return;
+    }
+
+    // Cull individual meshlets
+    int out_meshlets = 0;
+    for (int i = 0; i < num_meshlets; i++) {
+        bb_min = meshlets[i].bb_min;
+        bb_max = meshlets[i].bb_max;
+
+        if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
+            continue;
+        }
+
+        // optional: call occlusionExtents(bb_min, bb_max) a second time
+
+        task_payload.meshlet_index[out_meshlets] = i;
+        out_meshlets++;
+    }
+
+    EmitMeshTasksEXT(out_meshlets, 1, 1);
+}
+----
+
+Vendors can then gradually test and optimize their implementations,
+potentially introducing new fixed function units,
+without further input from application developers.
+
+== Issues
+
+=== PROPOSED: Backporting/emulating at the driver level?
+
+Since the <<noop>> or <<frustum>> implementations are always available as last resorts,
+all GPUs can support this extension even if they don't provide the expected speedups.
+However, it would be ideal if some form of occlusion functionality were to be backported
+to hardware that supports tessellation and mesh shaders.
+
+==== Emulation via occlusion queries
+
+Upon encountering a shader that uses `occlusionExtents()`,
+it may be possible for the driver to split all subsequent draw calls into two:
+
+ . Render one (or more, see <<iter>>) quads to be occlusion queried.
+The original shader can be copied into a small vertex shader,
+where calls to `occlusionExtents()` are redirected to a function that uses the supplied bounding box to set `gl_Position`.
+ . Pass occlusion query data into the task shader. Calls to `occlusionExtents()` would then reference this data.
+
+This implies a fixed number of occlusion tests per invocation.
+If the shader makes too many calls to `occlusionExtents()`,
+all calls beyond the fixed limit will have to ignore their provided bounding boxes and just repeat previous occlusion extents boxes.
+This is permitted, as the return value of `occlusionExtents()` does not need to be related to the supplied bounding box.
+
+See <<isocc>> and <<iter>> for implementation examples.
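+
+As a rough illustration of the second step, a driver-generated replacement for `occlusionExtents()` might read the query results from a buffer.
+Everything below is hypothetical: the buffer layout, the `MAX_TESTS` limit, the binding, and the per-workgroup indexing are assumptions made for this sketch, not part of the proposal.
+
+[source,glsl]
+----
+// Hypothetical fixed limit on occlusion tests per invocation
+const uint MAX_TESTS = 4u;
+
+// Hypothetical record written by the driver's occlusion query pre-pass
+struct QueryResult {
+    vec4 box_min;   // xyz: bounding box that was rendered and queried
+    vec4 box_max;   // xyz: bounding box that was rendered and queried
+    uvec4 occluded; // x: nonzero if the query reported that no samples passed
+};
+
+layout(std430, set = 0, binding = 0) readonly buffer QueryResults {
+    QueryResult results[];
+};
+
+// Number of occlusionExtents() calls made so far by this invocation
+uint calls_made = 0u;
+
+void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
+    // Calls beyond the limit ignore their arguments and reuse the last recorded result
+    uint slot = min(calls_made, MAX_TESTS - 1u);
+    calls_made++;
+
+    QueryResult r = results[gl_WorkGroupID.x * MAX_TESTS + slot];
+    if (r.occluded.x != 0u) {
+        // The queried box was occluded, so it is a valid occlusion extents box
+        bb_min = r.box_min.xyz;
+        bb_max = r.box_max.xyz;
+        return;
+    }
+
+    // No-Op: nothing can be culled
+    bb_min = bb_max;
+}
+----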
+
+==== Emulation via copying the depth buffer to a texture
+
+This is the approach typically taken by applications that employ mesh shading/compute pre-passes.
+Doing this at the driver level can streamline this process,
+and should provide superior performance since it can theoretically copy the GPU's Hi-Z buffer directly without having to touch the full-resolution depth buffer.
+
+Note that implementations are free to only copy the depth buffer once during a frame,
+right before the first time a shader with `occlusionExtents()` is used.
+
+See <<minmax>> for an implementation example.
+
+=== UNRESOLVED: How does this apply to fragment and compute shaders?
+
+The semantics of this function are fairly intuitive for any shading stage that occurs before rasterization,
+since it involves the same coordinate space as `gl_Position`.
+
+It is less evident for fragment shaders, which use a different coordinate system.
+
+Compute shaders are even more troublesome as they live outside of the graphics pipeline,
+and therefore their interaction with the depth buffer is not well-defined.
+
+=== UNRESOLVED: SPIR-V semantics?
+
+This proposal currently only covers GLSL semantics.