Add a proposal for an "occlusion extents" function #2163

Open
wants to merge 1 commit into
base: main
312 changes: 312 additions & 0 deletions proposals/VK_EXT_occlusion_extents.adoc
@@ -0,0 +1,312 @@
// Copyright 2021-2023 The Khronos Group Inc.
//
// SPDX-License-Identifier: CC-BY-4.0

= VK_EXT_occlusion_extents
:toc: left
:refpage: https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/
:sectnums:

This document proposes an efficient way to perform occlusion culling from within a shader.

== Problem Statement

With the advent of modern GPU-driven pipelines that involve very large triangle counts,
such as tessellation shading and mesh shading,
there is a growing need for more efficient occlusion culling methods to avoid overwhelming the GPU.

Normally, occlusion culling is handled by the depth test.
Unfortunately, since the depth test currently only rejects work at the level of individual triangles, after they have already reached the rasterizer,
applications that wish to perform culling at a coarser granularity
must manage their own copy of the depth buffer and implement their own occlusion testing functionality,
needlessly duplicating existing GPU functionality.

== Solution Space

In order for applications to perform coarse-grained occlusion culling more easily,
bounding boxes must be elevated into a first-class rasterization primitive alongside triangles,
much like they have been with ray tracing.
That is to say, all methods of discarding triangles (depth, frustum, stencil, scissor tests, etc.) must be extended to support bounding boxes as well.

Doing so would allow tessellation and mesh shaders to cull entire clusters of triangles at once, before individual triangles need to be sent to the rasterizer.

=== `bool isOccluded(vec3 bb_min, vec3 bb_max)`

This is the function typically used to implement occlusion culling.
It returns a binary pass/fail result from a conservative occlusion test.
That is to say,
if it returns `false`, the bounding box may or may not actually be occluded,
but if it returns `true`, the bounding box is always safe to cull.

=== `void setClipBounds(vec3 clip_min, vec3 clip_max)`

Instructs the GPU to discard any fragments that land outside of the provided box,
which implies a bounding box that the GPU can use for culling.

== Proposal

This document proposes the following function:

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max)
----

This function behaves very similarly to `isOccluded()`,
in that it is accessible to any pre-rasterization shader
and that it accepts a bounding box that the application would like to cull.
However, rather than returning a binary pass/fail,
it returns what will henceforth be referred to as an "occlusion extents box".

Implementations can return any "occlusion extents box" value they wish,
with the only constraint being that it must fail occlusion tests.
More specifically,
*given the "occlusion extents box" returned into `bb_min` and `bb_max`,
the implementation guarantees that if the rasterizer were to receive any triangle whose vertices all satisfy `bb_min \<= gl_Position.xyz \<= bb_max`,
zero fragments would be successfully drawn to the screen.
This can be due to depth, frustum, stencil, or scissor testing,
or because no samples were covered by the triangle.*
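
As a minimal sketch of the condition in this guarantee, the helpers below check whether all three vertices of a triangle lie inside a given box, using the same component-wise comparison.
The names `pointInsideBox` and `triangleInsideBox` are hypothetical and not part of the proposal; the positions are assumed to be in the same coordinate space as `gl_Position.xyz`.

[source,glsl]
----
// Hypothetical helpers, not part of the proposal.

// Returns true if `p` lies inside the box [box_min, box_max] on every axis.
bool pointInsideBox(vec3 p, vec3 box_min, vec3 box_max) {
    return all(greaterThanEqual(p, box_min)) && all(lessThanEqual(p, box_max));
}

// Returns true if all three vertex positions lie inside the box.
// If the box is a valid occlusion extents box, such a triangle is
// guaranteed to produce no fragments on screen.
bool triangleInsideBox(vec3 p0, vec3 p1, vec3 p2, vec3 box_min, vec3 box_max) {
    return pointInsideBox(p0, box_min, box_max)
        && pointInsideBox(p1, box_min, box_max)
        && pointInsideBox(p2, box_min, box_max);
}
----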

Ideally, implementations should return an "occlusion extents box" that is as large as possible
and that overlaps as much of the application-provided bounding box as possible.
However, as long as the returned "occlusion extents box" fails occlusion tests,
implementations have complete freedom in how they calculate it,
even ignoring the bounding box if they wish.

The properties of a valid occlusion extents box mean that any bounding box fully contained within one would also fail occlusion tests,
and can therefore be safely culled.
Shaders are thus able to test multiple bounding boxes with a single occlusion extents box generated by a single API call,
allowing for partial occlusion where child nodes of a BVH can still be culled even if the parent node wasn't.
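
The containment property could be sketched as follows: one occlusion extents box, obtained from a single call, is reused to test every child of a node.
Everything here is an assumption for illustration (the `ChildBox` layout, the `MAX_CHILDREN` branching factor, and both function names); none of it is defined by the proposal.

[source,glsl]
----
// Hypothetical helper, not part of the proposal.
// Returns true if [bb_min, bb_max] is fully contained in [occ_min, occ_max],
// in which case the box can be culled.
bool boxInsideExtents(vec3 bb_min, vec3 bb_max, vec3 occ_min, vec3 occ_max) {
    return all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max));
}

// Hypothetical child-node layout, for illustration only.
struct ChildBox {
    vec3 bb_min;
    vec3 bb_max;
};

const int MAX_CHILDREN = 8; // assumed branching factor

// Returns a bitmask with bit i set if child i survives culling.
// occ_min/occ_max come from one earlier call to occlusionExtents().
uint visibleChildren(ChildBox children[MAX_CHILDREN], int num_children,
                     vec3 occ_min, vec3 occ_max) {
    uint mask = 0u;
    for (int i = 0; i < num_children && i < MAX_CHILDREN; i++) {
        if (!boxInsideExtents(children[i].bb_min, children[i].bb_max, occ_min, occ_max)) {
            mask |= 1u << uint(i);
        }
    }
    return mask;
}
----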

This makes it better suited to wide BVHs than a binary pass/fail test, which must be invoked once per node,
with a notable example being mesh shading pipelines (which can be considered a very wide, 4-deep BVH).

== Example Implementations

The lax requirements intentionally provide vendors a great deal of freedom in how they would like to roll out support for this extension, both in terms of supported functionality and in terms of development timelines.

=== No-Op [[noop]]

Vendors can expose this extension without implementing any actual culling:

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    // Return a zero-sized occlusion extents box that cannot cull any geometry
    bb_min = bb_max;
}
----

This allows shaders to begin using `occlusionExtents()` immediately,
which benefits vendors as real-world code would quickly become available for tracing.

=== Frustum Culling [[frustum]]

A more practical implementation than the above can at the very least provide frustum culling:

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    if (bb_max.x < -1.01) {
        // Return a box that encloses anything past the left side of the viewport
        bb_min = vec3(-INF, -INF, -INF);
        bb_max = vec3(-1.01, INF, INF);
        return;
    }

    // Repeat for the right, top, and bottom sides of the viewport. Code omitted for brevity.

    // No-Op, see above
    bb_min = bb_max;
}
----
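
For reference, the omitted sides might look like the sketch below, which mirrors the left-side case from the example above.
It rests on several assumptions that the proposal does not state: `INF` is defined from the IEEE 754 bit pattern for positive infinity (GLSL has no built-in infinity constant), the `1.01` guard value applies equally to all four sides, and the GLSL version provides `uintBitsToFloat()` (for example `#version 450`).
The sides are labelled by axis rather than as "top" and "bottom", since the direction of the Y axis depends on the API's clip-space convention.

[source,glsl]
----
// Assumption: INF is not a GLSL built-in, so it is built from the bit pattern of +infinity.
#define INF uintBitsToFloat(0x7F800000u)

// Assumption: the same guard value as the left-side case is used for every side.
const float GUARD = 1.01;

void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    if (bb_max.x < -GUARD) {
        // Entirely past the x = -1 side of the viewport
        bb_min = vec3(-INF, -INF, -INF);
        bb_max = vec3(-GUARD, INF, INF);
        return;
    }
    if (bb_min.x > GUARD) {
        // Entirely past the x = +1 side of the viewport
        bb_min = vec3(GUARD, -INF, -INF);
        bb_max = vec3(INF, INF, INF);
        return;
    }
    if (bb_max.y < -GUARD) {
        // Entirely past the y = -1 side of the viewport
        bb_min = vec3(-INF, -INF, -INF);
        bb_max = vec3(INF, -GUARD, INF);
        return;
    }
    if (bb_min.y > GUARD) {
        // Entirely past the y = +1 side of the viewport
        bb_min = vec3(-INF, GUARD, -INF);
        bb_max = vec3(INF, INF, INF);
        return;
    }

    // No-Op fallback: zero-sized box that cannot cull any geometry
    bb_min = bb_max;
}
----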


=== `isOccluded()` [[isocc]]

`occlusionExtents()` can be implemented as a binary pass/fail test if this is all the hardware is capable of, mimicking the semantics of `isOccluded()`:

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    // Frustum culling omitted for brevity.

    if (_internal_isOccluded(bb_min, bb_max)) {
        // Since the given bounding box is occluded, it can be passed through as a valid occlusion extents box.
        return;
    }

    // No-Op, see above
    bb_min = bb_max;
}
----

=== Iterative `isOccluded()` [[iter]]

The following implementation requires the hardware to be able to perform multiple occlusion tests per shader invocation:

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    vec3 center = (bb_max + bb_min) * 0.5;
    vec3 radius = (bb_max - bb_min) * 2.0; // Starting "radius"

    // Start with an occlusion extents box 4x larger than the given bounding box.
    // Shrink it until it fails the occlusion test, then return it.
    for (int i = 0; i < 5; i++) {
        bb_min = center - radius;
        bb_max = center + radius;

        if (_internal_isOccluded(bb_min, bb_max)) {
            // Valid box, return it.
            return;
        }

        // Invalid box, shrink it.
        radius *= 0.5;
    }

    // Give up
    bb_min = bb_max;
}
----

The occlusion extents box returned by this particular implementation will generally be suboptimal when the input bounding box is only partially occluded.

=== Mipmapped Depth Buffer [[minmax]]

This resembles how an application would typically test against a hierarchical Z-buffer (HZB) for occlusion. This example adapts that approach to return an occlusion extents box instead.

[source,glsl]
----
void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    vec2 center = (bb_max.xy + bb_min.xy) * 0.5;
    vec2 uv = (center + 1.0) * 0.5; // Convert from [-1, 1] space to [0, 1] space

    // Start at the lowest resolution mipmap,
    // then progressively increase resolution.
    for (int i = 0; i <= MIPMAP_MAX; i++) {
        // Ignore the size of the bounding box and test just its center point
        float zmax = textureLod(depthTextureMinMax, uv, float(MIPMAP_MAX - i)).y;
        if (bb_min.z <= zmax) {
            // Point not occluded, go to the next mipmap level
            continue;
        }

        // Point occluded, return the region covered by this texel.
        // The texel bounds are found in [0, 1] space, then converted back to [-1, 1] space.
        float scale = exp2(float(i));
        vec2 texel = floor(uv * scale);
        bb_min.xy = texel / scale * 2.0 - 1.0;
        bb_max.xy = (texel + 1.0) / scale * 2.0 - 1.0;
        bb_min.z = zmax;
        bb_max.z = INF;
        return;
    }

    // Give up
    bb_min = bb_max;
}
----
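
The example above assumes that a min/max depth pyramid such as `depthTextureMinMax` already exists.
One way such a pyramid might be built is sketched below as a compute pass that reduces one level into the next coarser one.
This is an illustration under stated assumptions, not a description of any driver: the image names, bindings, and workgroup size are hypothetical, the pyramid is assumed to have power-of-two dimensions, and the minimum depth is assumed to live in `.x` (the example above only establishes that the maximum is in `.y`).
The finest level would have to be filled from the depth buffer itself, with both channels set to the same depth value.

[source,glsl]
----
#version 450
// Hypothetical reduction pass: builds one level of a min/max depth pyramid
// from the next finer level. Assumes power-of-two dimensions.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout(binding = 0, rg32f) uniform readonly image2D srcLevel;  // finer level
layout(binding = 1, rg32f) uniform writeonly image2D dstLevel; // coarser level, half the size

void main() {
    ivec2 dst = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(dst, imageSize(dstLevel)))) {
        return; // Outside the destination level
    }

    // Each destination texel covers a 2x2 block of source texels.
    ivec2 src = dst * 2;
    vec2 result = imageLoad(srcLevel, src).xy; // .x = min depth, .y = max depth

    const ivec2 offsets[3] = ivec2[3](ivec2(1, 0), ivec2(0, 1), ivec2(1, 1));
    for (int i = 0; i < 3; i++) {
        vec2 texel = imageLoad(srcLevel, src + offsets[i]).xy;
        result.x = min(result.x, texel.x);
        result.y = max(result.y, texel.y);
    }

    imageStore(dstLevel, dst, vec4(result, 0.0, 0.0));
}
----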

=== Advanced Implementation

The above implementations assume that `occlusionExtents()` must be backported to hardware not designed with it in mind.
Future hardware may allow shaders to directly access the depth buffer,
add more levels to their hierarchical Z buffer,
or potentially even add fixed function units dedicated to calculating occlusion extents boxes.

== Example Usage

Usage is fairly straightforward for applications:

[source,glsl]
----
// Single-threaded for readability
layout (local_size_x=1, local_size_y=1, local_size_z=1) in;

void main() {
    vec3 bb_min = ...;
    vec3 bb_max = ...;
    vec3 occ_min = bb_min;
    vec3 occ_max = bb_max;

    occlusionExtents(occ_min, occ_max);

    // Cull entire task shader
    if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
        return;
    }

    // Cull individual meshlets
    int out_meshlets = 0;
    for (int i = 0; i < num_meshlets; i++) {
        bb_min = meshlets[i].bb_min;
        bb_max = meshlets[i].bb_max;

        if (all(lessThanEqual(occ_min, bb_min)) && all(lessThanEqual(bb_max, occ_max))) {
            continue;
        }

        // optional: call occlusionExtents(bb_min, bb_max) a second time

        task_payload.meshlet_index[out_meshlets] = i;
        out_meshlets++;
    }

    EmitMeshTasksEXT(out_meshlets, 1, 1);
}
----

Vendors can then gradually test and optimize their implementations,
potentially introducing new fixed function units,
without further input from application developers.

== Issues

=== PROPOSED: Backporting/emulating at the driver level?

Since the <<noop>> or <<frustum>> implementations are always available as last resorts,
all GPUs can support this extension even if they don't provide the expected speedups.
However, it would be ideal if some form of occlusion functionality were to be backported
to hardware that supports tessellation and mesh shaders.

==== Emulation via occlusion queries

Upon encountering a shader that uses `occlusionExtents()`,
it may be possible for the driver to split all subsequent draw calls into two:

. Render one or more quads (see <<iter>>) and issue an occlusion query for each.
The original shader can be copied into a small vertex shader,
where calls to `occlusionExtents()` are redirected to a function that uses the supplied bounding box to set `gl_Position`.
. Pass occlusion query data into the task shader. Calls to `occlusionExtents()` would then reference this data.

This implies a fixed number of occlusion tests per invocation.
If the shader makes too many calls to `occlusionExtents()`,
all calls beyond the fixed limit will have to ignore their provided bounding boxes and just repeat previous occlusion extents boxes.
This is permitted, as the return value of `occlusionExtents()` does not need to be related to the supplied bounding box.

See <<isocc>> and <<iter>> for implementation examples.
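
As a rough sketch of the second step, the redirected function in the task shader might read precomputed query results from a buffer.
Everything below is hypothetical: the buffer layout, the binding, the `MAX_TESTS` limit, and the per-workgroup indexing are assumptions made for illustration, and a real driver would be free to arrange this very differently.

[source,glsl]
----
// Hypothetical sketch of the redirected function in the second (task shader) pass.
const uint MAX_TESTS = 4u; // assumed fixed number of occlusion tests per invocation

struct QueryResult {
    vec4 bb_min;   // box that was rendered for the query (xyz used)
    vec4 bb_max;
    uint occluded; // nonzero if the query reported no passing samples
};

layout(std430, binding = 0) readonly buffer QueryResults {
    QueryResult results[];
};

uint call_index = 0u; // number of occlusionExtents() calls made so far by this invocation

void occlusionExtents(inout vec3 bb_min, inout vec3 bb_max) {
    // Calls beyond the fixed limit ignore their input and reuse the last slot,
    // which is allowed because the result need not relate to the supplied box.
    uint slot = min(call_index, MAX_TESTS - 1u);
    call_index++;

    QueryResult r = results[gl_WorkGroupID.x * MAX_TESTS + slot];
    if (r.occluded != 0u) {
        bb_min = r.bb_min.xyz;
        bb_max = r.bb_max.xyz;
        return;
    }

    // No-Op fallback: zero-sized box that cannot cull any geometry
    bb_min = bb_max;
}
----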

==== Emulation via copying the depth buffer to a texture

This is the approach typically taken by applications that employ mesh shading/compute pre-passes.
Doing this at the driver level can streamline this process,
and should provide superior performance since it can theoretically copy the GPU's Hi-Z buffer directly without having to touch the full-resolution depth buffer.

Note that implementations are free to only copy the depth buffer once during a frame,
right before the first time a shader with `occlusionExtents()` is used.

See <<minmax>> for an implementation example.

=== UNRESOLVED: How does this apply to fragment and compute shaders?

The semantics of this function are fairly intuitive for any shading stage that occurs before rasterization,
since it involves the same coordinate space as `gl_Position`.

It is less evident for fragment shaders, which use a different coordinate system.
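
To illustrate the difference in coordinate systems, the helper below maps a window-space position such as `gl_FragCoord.xyz` back to normalized device coordinates by inverting the Vulkan viewport transform.
It is not a proposed resolution to this issue; the function name and all of its parameters are hypothetical, and the viewport values are assumed to be supplied by the application.

[source,glsl]
----
// Hypothetical helper, not part of the proposal.
// Converts a window-space position (such as gl_FragCoord.xyz) to the normalized
// device coordinates obtained after the perspective divide of gl_Position.
//   vp_origin:   viewport x and y, in pixels
//   vp_size:     viewport width and height, in pixels
//   depth_range: viewport (minDepth, maxDepth); assumes minDepth != maxDepth
vec3 windowToNdc(vec3 frag_coord, vec2 vp_origin, vec2 vp_size, vec2 depth_range) {
    vec2 xy = (frag_coord.xy - vp_origin) / vp_size * 2.0 - 1.0;
    float z = (frag_coord.z - depth_range.x) / (depth_range.y - depth_range.x);
    return vec3(xy, z);
}
----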

Compute shaders are even more troublesome as they live outside of the graphics pipeline,
and therefore their interaction with the depth buffer is not well-defined.

=== UNRESOLVED: SPIR-V semantics?

This proposal currently only covers GLSL semantics.