-
Notifications
You must be signed in to change notification settings - Fork 964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DX12 Sampler Heap Requirements Overly Strict #3350
Comments
After experimenting with PR #3499 and some discussions in Matrix, we've made a more concrete plan for how we'll attempt to solve this. The naïve solution is to dedupe the samplers that are uploaded to the GPU sampler heap. The problem with this is that the root signature is currently set up to have 1 descriptor table for every bind group in the pipeline where that bind group's samplers are stored. This means that the samplers of each bind group currently need to be contiguous in the heap, since descriptor tables are contiguous ranges in memory. This is a problem for deduping, since the samplers would no longer be contiguous. One option would be to add more descriptor tables, but this would mean that we'd have one descriptor table per sample in the root signature. This is bad since the root signature is precious space, containing a max of 64 DWORDS where each descriptor table takes up 1 DWORD. As is recommended by the dx12 docs, we don't want to waste any space in the root signature as it is used for lots of stuff such as push constants. One proposed solution to this is to use root constants instead of descriptor tables to store indices into the sampler heap. The entire sampler heap will be bound as a single descriptor table pointing to an array all the samplers. Since there's a maximum of 2048 samplers in the heap, we would only need 12 bits to represent the sampler's index in the heap. This means that we could store 32 sampler indices in the root signature in 12 root constants (root constants are 1 DWORD or 32 bits each, and Another similar solution is to allocate an extra buffer which contains arrays of sampler indices. We can use that to store the list of samplers for each bind group and then store a pointer to that list in the root signature. This would add an extra indirection to fetching the samplers but would take up less space in the root signature (1 DWORD per bind group + 1 for the sampler array). It's also a simpler solution that would work for both fixed samplers and also sampler arrays. In fact, even if we use the other approach for fixed samplers, we would still need to do this to keep supporting sampler arrays. @cwfitzgerald and I are in agreement that we start with this approach for all cases and see how it goes. Some other relevant info:
|
Thanks for the investigation; I'd also say that the 2nd approach sounds like a good start! |
Nice! Also looping in @jimblandy on this |
@teoxoy If the CTS were running into problems caused by this not-great allocation system in Mozilla CI, what would the symptoms be? Basically, as long as this is an optimality problem, it's low priority; once it's a blocker to CTS health, then it's a higher priority. |
you would get a panic with the "not enough memory left" error message, as in this issue: #2857 |
WebGPU requires FL11_1 or (FL11_0 + resource binding tier 2) (ref). The table here lists FL11_1 as having a minimum resource binding tier of 3, so I thought we can assume tier 2 as a minimum but the table here lists haswell & broadwell as being FL11_1 with a resource binding tier of 1. We already dropped support for haswell on d3d12 (#4709) but I thought we could still support broadwell. Is the table on wikipedia wrong? If not, is it valid that broadwell only supports resource binding tier 1 given that the microsoft docs say FL11_1 should support tier 3? |
The table is probably wrong in some way - I'm not particularly interested in supporting it, especially when there's no good solution for samplers without at least RBT2. |
I guess dropping it would be fine since tier 1 has the additional limit of:
from https://microsoft.github.io/DirectX-Specs/d3d/ResourceBinding.html#levels-of-hardware-support which sounds even more restricting. It would be good to know if the table was wrong or it's actually the case that it only supports tier 1. |
Detailing the resolved plan a little more: Samplers are fully de-duplicated in the backend. On creation, they are added to the sampler heap. On destruction, that slot in the sampler heap is up for re-use. Sampler creation fails if there is no more slots on the heap. This is allowed because we are updating descriptors that are not actively used by the gpu. Globally, for all shaders that bind samplers, there is a single descriptor table which binds the entire sampler heap. This is not part of a bind group. For each bindgroup that contains a sampler, there is a single descriptor pointing to a buffer with u32s. Each u32 is an index into the sampler heap corresponding to that sampler in the bind group. When accessing the sampler, the appropriate position in the buffer is read and the index is used to index into the sampler descriptor set binding. When creating a bind group, this sampler index buffer is created and populated with all the appropriate sampler indexes. This is then bound as part of binding a bind group. @group(0) @binding(0) var sampler_0: sampler; // sampler buffer 0 index 0
@group(0) @binding(1) var texture_0: texture2d<f32>;
@group(0) @binding(2) var sampler_1: sampler; // sampler buffer 0 index 1
@group(1) @binding(0) var sampler_2: sampler; // sampler buffer 1 index 0
@group(1) @binding(1) var sampler_3: sampler; // sampler buffer 1 index 1
@group(1) @binding(2) var sampler_4: sampler; // sampler buffer 1 index 2
// In the HLSL output, the equiv of:
var sampler_descriptors: binding_array<sampler>; // Contains 2032 possibly invalid samplers.
@group(0) @binding(wherever) var<storage> sampler_array_bind_group_0: array<u32>; // Contains two values in this example.
@group(1) @binding(wherever) var<storage> sampler_array_bind_group_1: array<u32>; // Contains three values in this example.
// When reading a sampler, wgsl of this:
textureSample(texture_0, sampler_1, uv);
textureSample(texture_0, sampler_4, uv);
// Gets translated to:
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_0[1]], uv);
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_1[2]], uv); Note for the FutureIf we use SM6.6 bindless, we can index directly into the sampler heap, skipping the |
I suspect that some bevy demos such as https://bevyengine.org/examples-webgpu/3D%20Rendering/load-gltf/ are hitting this issue very easily. |
DX12 is totally weird about samplers. At any one time there may be exactly one sampler heap bound to a command buffer. Each sampler heap may contain no more than 2048 sampler descriptors. Currently we only ever use a single sampler heap. This means that the maximum amount of sampler bindings we can allocate in all live bind groups is 2048. This is extremely restrictive, especially when compared to other apis.
You can switch which sampler heap is bound to the command buffer at any time, but according to the command list documentation this may be expensive and cause flushes.
We need a better strategy of dealing with sampler descriptors. I suspect this will involve some kind of sampler descriptor cache where we try to re-use the same sampler heap as much as possible, only overflowing if we run out of space, then just switching sampler heaps as rarely as possible.
The text was updated successfully, but these errors were encountered: