Add `EncasedBufferVec`, a higher-performance alternative to `StorageBuffer`, and make `GpuArrayBuffer` use it. #12670
Conversation
`EncasedBufferVec` is like `BufferVec`, but it doesn't require that the type be `Pod`. Alternately, it's like `StorageBuffer<Vec<T>>`, except it doesn't allow CPU access to the data after it's been pushed. `GpuArrayBuffer` already doesn't allow CPU access to the data, so switching it to use `EncasedBufferVec` doesn't regress any functionality and offers higher performance.

Shutting off CPU access eliminates the need to copy to a scratch buffer, which results in significantly higher performance. *Note that this needs teoxoy/encase#65 from @james7132 to achieve end-to-end performance benefits*, because `encase` is rather slow at encoding data without that patch, swamping the benefits of avoiding the copy. With that patch applied, and `#[inline]` added to `encase`'s `derive` implementation of `write_into` on structs, this results in a *16% overall speedup on `many_cubes --no-frustum-culling`*.

I've verified that the generated code is now close to optimal. The only reasonable potential improvement that I see is to eliminate the zeroing in `push`. This requires unsafe code, however, so I'd prefer to leave that to a followup.
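To illustrate the core idea, here is a minimal, standalone sketch (not Bevy's actual implementation; all names are hypothetical): a push-only byte buffer that serializes each element into a growing `Vec<u8>` and hands back only an index, so the CPU never reads the data again and no scratch-buffer copy is needed.

```rust
// Simplified sketch of a push-only buffer in the spirit of
// `EncasedBufferVec` (hypothetical type, not Bevy code).
struct PushOnlyBytes {
    data: Vec<u8>,
    element_size: usize, // padded size of one element as laid out for the GPU
    len: usize,
}

impl PushOnlyBytes {
    fn new(element_size: usize) -> Self {
        Self { data: Vec::new(), element_size, len: 0 }
    }

    /// Appends the encoded bytes of one element and returns its index.
    /// No reference to the stored data is handed back: CPU access ends here.
    fn push(&mut self, bytes: &[u8]) -> usize {
        assert!(bytes.len() <= self.element_size);
        let offset = self.len * self.element_size;
        // Zero-extend, then overwrite. This is the zeroing the PR notes
        // could be eliminated with unsafe code in a followup.
        self.data.resize(offset + self.element_size, 0);
        self.data[offset..offset + bytes.len()].copy_from_slice(bytes);
        self.len += 1;
        self.len - 1
    }
}
```

In the real type, the serialization step is `encase`'s `write_into`, and `self.data` is what eventually gets uploaded to the GPU buffer in one shot.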
I'm not wedded to the name.
Nice performance improvements in conjunction with James' encase changes. I don't know what the name should be, but I don't think it's quite right. I also want us to check thoroughly that this will never produce misaligned data. I think it will be fine, but I'm not certain. For a WGSL binding containing only an array, the alignment of each element is supposed to be the alignment of `T`, and if `T` is a struct then its alignment is the max of the alignments of the members of the struct type. And the size of a struct type is the offset of the last member plus the size of the last member, rounded up to the alignment. So I think it will be fine… There is one case where it might break: WebGL2, where 16-byte-aligned sizes are always needed, or something like that.
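The round-up rule being discussed can be sketched as a tiny helper (hypothetical, not Bevy or encase code): an array element's stride is its size rounded up to its alignment, so back-to-back elements stay aligned. The WebGL2 concern corresponds to forcing the alignment up to 16.

```rust
// Sketch of the WGSL sizing rule mentioned above: round `size` up to the
// next multiple of `align` (which must be a power of two).
fn round_up(size: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (size + align - 1) & !(align - 1)
}
```

For example, a struct whose largest member has alignment 16 and whose last member ends at byte offset 20 would get size `round_up(20, 16) == 32`, so the next array element starts 16-byte aligned.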
How about renaming …
I did that renaming since nobody objected.
It looks like your PR is a breaking change, but you didn't provide a migration guide. Could you add some context on what users should update when this change gets released in a new version of Bevy?
Looks great, but I'm marking this as blocked until the …
Encase released a new version a few days ago. I think this just needs #12757 now?
Assuming CI passes, LGTM. There are just a few issues left with the renaming, I think.
I haven't done any performance testing but I trust you on that part. I've tested that everything still works as expected though.
```diff
@@ -28,7 +34,7 @@ use wgpu::BufferUsages;
 /// * [`GpuArrayBuffer`](crate::render_resource::GpuArrayBuffer)
 /// * [`BufferVec`]
 /// * [`Texture`](crate::render_resource::Texture)
-pub struct BufferVec<T: Pod> {
+pub struct RawBufferVec<T: Pod> {
```
I can't add a suggestion, but the doc comment has a few mentions of `BufferVec` that should be renamed.
```rust
// Take a slice of the new data for `write_into` to use. This is
// important: it hoists the bounds check up here so that the compiler
// can eliminate all the bounds checks that `write_into` will emit.
let mut dest = &mut self.data[offset..(offset + element_size)];
```
Oh, nice, I never thought about doing that but it's a neat trick to know.
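The trick can be shown in isolation with a toy function (assumed names, not the PR's code): taking one subslice up front performs a single bounds check, after which the compiler knows the slice length and can drop the per-element checks inside the loop.

```rust
// Toy illustration of hoisting a bounds check via a single subslice.
fn write_four(data: &mut [u8], offset: usize, values: [u8; 4]) {
    // One bounds check here; panics if the range doesn't fit.
    let dest = &mut data[offset..offset + 4];
    for (d, v) in dest.iter_mut().zip(values) {
        *d = v; // provably in-bounds: no further checks needed
    }
}
```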
```rust
/// the `BufferVec` was created, the buffer on the [`RenderDevice`]
/// is marked as [`BufferUsages::COPY_DST`](BufferUsages).
pub fn reserve(&mut self, capacity: usize, device: &RenderDevice) {
    if capacity <= self.capacity && !self.label_changed {
```
This is just for my own curiosity, nothing needs to change. Why shouldn't we try to allocate a smaller buffer if the capacity is smaller?
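One plausible reading (my interpretation, not an authoritative answer from the PR): a grow-only capacity check avoids reallocation churn when the element count fluctuates from frame to frame. A minimal sketch with hypothetical field names:

```rust
// Grow-only reservation sketch (hypothetical, simplified from the snippet
// above): reallocate only when more capacity is needed, so a buffer that
// shrinks and regrows across frames reuses its existing allocation.
struct GrowOnly {
    capacity: usize,
    reallocations: usize, // counts how often we would recreate the GPU buffer
}

impl GrowOnly {
    fn reserve(&mut self, capacity: usize) {
        if capacity <= self.capacity {
            return; // keep the existing (larger) allocation
        }
        self.capacity = capacity; // a real impl recreates the GPU buffer here
        self.reallocations += 1;
    }
}
```

With this policy, a workload that alternates between 50 and 100 elements reallocates once, not every frame.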
Co-authored-by: IceSentry <[email protected]>
Broadly comfortable with the changes here, and the performance gains are compelling evidence that it actually does what it's trying to do.
The math looks correct, and I agree with your TODO to consider using `unsafe`. Code quality and docs are, as always, high: much appreciated!
Once the mistaken doc references to the wrong `BufferVec` in `RawBufferVec` are cleaned up, you'll have my approval.
LGTM, I tested it with some examples on wasm just to check.
However, there are still a bunch of places where docs on functions that actually take `RawBufferVec` refer to `BufferVec`. It might be a good idea to grep for `BufferVec` and look at all of them.
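A quick way to run that audit (illustrative only; in the repo you would point `grep` at the `crates/` tree rather than a scratch file):

```shell
# Demo of the suggested grep audit on a scratch file. Note that
# "RawBufferVec" also contains the substring "BufferVec", so each
# match still needs a manual look to see whether it's stale.
printf '/// A BufferVec of items.\npub struct RawBufferVec;\n' > /tmp/demo.rs
grep -n "BufferVec" /tmp/demo.rs
```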
Closed as I am completely burned out and have no motivation to continue working on this.
Add `EncasedBufferVec`, a higher-performance alternative to `StorageBuffer`, and make `GpuArrayBuffer` use it. (#13199)

This is an adoption of #12670 plus some documentation fixes. See that PR for more details.

---

## Changelog

* Renamed `BufferVec` to `RawBufferVec` and added a new `BufferVec` type.

## Migration Guide

`BufferVec` has been renamed to `RawBufferVec` and a new similar type has taken the `BufferVec` name.

---------

Co-authored-by: Patrick Walton <[email protected]>
Co-authored-by: Alice Cecile <[email protected]>
Co-authored-by: IceSentry <[email protected]>
Here's `write_batched_instance_buffer` before-and-after (yellow = after, red = before) for `many_cubes --no-frustum-culling`: [profiler screenshot]