Specialize write_into for trivially memcpy-able types #53

Open
james7132 opened this issue Sep 21, 2023 · 4 comments

@james7132
Contributor

Right now, encase recursively calls write_into when serializing types into GPU memory. This typically follows the flow: copy a field/value -> advance the writer by its padding -> repeat until done.

This unfortunately breaks vectorization when copying the data. Larger structs with heavily nested types like 4x4 matrices end up with tens or hundreds of these steps when they could just be memcpy-ed directly into the target buffer.

Any type that has a fixed size at runtime and no additional padding is trivially memcpy-able into and out of GPU buffers. Similarly, arrays, slices, and Vecs of these types can be batch memcpy-ed where applicable.

This information is statically available at compile time in a type's METADATA. An if on a constant expression will have its unused branch optimized out, so this should be doable even without compiler support for specialization.
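
A minimal sketch of that idea, assuming a hypothetical HAS_PADDING constant rather than encase's actual METADATA API:

```rust
use core::mem::size_of;

// Hypothetical trait standing in for ShaderType + METADATA; the constant is
// an assumption for illustration, not encase's real API.
trait GpuLayout: Copy {
    /// True if the type contains internal padding that must be written explicitly.
    const HAS_PADDING: bool;

    /// General path: field-by-field writes that advance past padding.
    fn write_fields(&self, out: &mut Vec<u8>);

    fn write_into(&self, out: &mut Vec<u8>) {
        if Self::HAS_PADDING {
            // Slow path: recursive per-field serialization.
            self.write_fields(out);
        } else {
            // Fast path: the host layout matches the GPU layout exactly, so the
            // whole value is one contiguous block of initialized bytes.
            // SAFETY: no padding means every byte of `Self` is initialized.
            let bytes = unsafe {
                core::slice::from_raw_parts((self as *const Self).cast::<u8>(), size_of::<Self>())
            };
            out.extend_from_slice(bytes);
        }
    }
}
```

Both branches type-check for every implementor, and because Self::HAS_PADDING is a compile-time constant the dead branch is removed by the optimizer, so no specialization support is needed.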

@james7132
Contributor Author

james7132 commented Sep 22, 2023

This was discussed on the Bevy Discord (link). A summary of the codegen analysis:

  • encase already does a capacity check only once per write:
    pub fn new<T: ShaderType>(data: &T, buffer: B, offset: usize) -> Result<Self> {
  • ArrayMetadata's accessors are not being inlined, even at the most aggressive compilation settings.
    • Experimental change: Commit
    • Change in codegen: diff
    • Observed result: Metadata not being accessed causes unsupported use cases (i.e. certain structs in uniform buffers) to collapse into the panic path, resulting in a lot more codegen than necessary, though this is probably not actively impacting hot-path performance.
    • Additional note: this probably applies to the other metadata types as well, even though they're all const.
  • SliceExt::array and SliceExt::array_mut both use checked conversions into &mut [T] for the target subslice, which results in a lot of extra branching. I attempted to replace this with an unchecked conversion instead.
    • Experimental change: commit
    • Change in codegen: diff
    • Observed result: All of the branches disappeared. The copy still isn't vectorized, though it should be, but there are no unnecessary branches anymore.
    • Additional note: This implementation is unsound as there is no capacity check when converting to the array. Either the unsafe needs to be lifted out and added as an invariant (basically treating the slice as a raw pointer), or we need to find another way to avoid the branch; see the sketch after this list.
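
A rough sketch of the unchecked-conversion idea, with hypothetical names rather than encase's real SliceExt signatures, where the capacity check is lifted out and carried as an unsafe invariant:

```rust
// Hypothetical extension trait; the real SliceExt in encase differs.
trait SliceExtUnchecked {
    /// # Safety
    /// The caller must guarantee that `offset + N <= self.len()`.
    unsafe fn array_unchecked<const N: usize>(&self, offset: usize) -> &[u8; N];
}

impl SliceExtUnchecked for [u8] {
    unsafe fn array_unchecked<const N: usize>(&self, offset: usize) -> &[u8; N] {
        // No bounds check here: the single capacity check happens once in the
        // buffer wrapper, so this compiles down to a pointer offset and cast.
        &*(self.as_ptr().add(offset) as *const [u8; N])
    }
}
```

A mirror array_mut_unchecked would cover the write path; soundness then rests entirely on the wrapper performing that one up-front capacity check.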

TODO: Actually benchmark the changes here to see if the gains are significant enough to warrant these kinds of changes.

@james7132
Contributor Author

Another potential middle ground is to change the vector and matrix implementations to directly copy their bytes instead of relying on the underlying components' implementations. This would eliminate the vast majority of the branches being produced, potentially allow for vectorized copies, and avoid the need for infectious use of unsafe.
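
As a hedged illustration (hypothetical helper, and it assumes a bytemuck dependency, which encase may not actually use): a 4-component f32 vector has no internal padding, so its 16 bytes can be written in one copy instead of four component-wise writes.

```rust
// Hypothetical helper: write a Vec4-like [f32; 4] as raw bytes in one shot.
// Assumes the caller has already reserved at least 16 bytes in `out`.
fn write_vec4_into(value: &[f32; 4], out: &mut [u8]) {
    // One 16-byte copy instead of four per-component writes; no unsafe is
    // needed because bytemuck proves the byte view is valid for a Pod type.
    out[..16].copy_from_slice(bytemuck::bytes_of(value));
}
```

A single contiguous copy like this is also the shape LLVM vectorizes most readily.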

@teoxoy
Owner

teoxoy commented Nov 26, 2023

This is great stuff, thanks for looking into it!

I think the most promising optimization here would be using unchecked versions of SliceExt::array and SliceExt::array_mut, making all read and write methods unsafe, and making sure we always check the bounds in the buffer wrappers, which are the API most users will interact with anyway. A PR doing this would be welcome, but I'm curious how much perf we will get from this; we should benchmark it.
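
For concreteness, a rough sketch of that split (hypothetical types, not encase's actual writer or buffer wrappers): the public wrapper does the single bounds check, and the inner write method becomes unsafe and relies on it.

```rust
struct Writer<'a> {
    buf: &'a mut [u8],
    cursor: usize,
}

impl<'a> Writer<'a> {
    /// # Safety
    /// The caller must have verified that `self.cursor + N <= self.buf.len()`.
    unsafe fn write_unchecked<const N: usize>(&mut self, bytes: &[u8; N]) {
        core::ptr::copy_nonoverlapping(bytes.as_ptr(), self.buf.as_mut_ptr().add(self.cursor), N);
        self.cursor += N;
    }
}

// The buffer wrapper is the only place a bounds check happens.
fn write_mat4(buf: &mut [u8], mat: &[u8; 64]) -> Result<(), &'static str> {
    if buf.len() < mat.len() {
        return Err("buffer too small");
    }
    let mut writer = Writer { buf, cursor: 0 };
    // SAFETY: the check above guarantees all 64 bytes fit at offset 0.
    unsafe { writer.write_unchecked(mat) };
    Ok(())
}
```

Whether that single up-front check is enough for LLVM to vectorize the copies is exactly what a benchmark would need to confirm.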

How did you generate the codegen in the diffs above? With the unchecked change, I think those copies should have been vectorized.

Regarding const fns not being inlined, that's odd to me; I thought all const fns would be evaluated at compile time and their result inlined. If you could open a PR that adds the #[inline] attribute to all const fns, that would be great!

@james7132
Contributor Author

How did you generate the codegen in the diffs above? With the unchecked change, I think those copies should have been vectorized.

It's honestly sort of janky. I compile with cargo and use rustc flags to make it output the *.asm files, then use rustfilt to pretty them up a bit, and use git to track the diff in a nice interface. I'd use something like Godbolt, but setting up my own server with the Rust crates I want tends to be a lot more effort.

Regarding const fns not being inlined, that's odd to me; I thought all const fns would be evaluated at compile time and their result inlined. If you could open a PR that adds the #[inline] attribute to all const fns, that would be great!

If the entire expression is constant (inputs, outputs, static dependencies), the result will be computed at compile time, but when a const fn is called in a non-const context it is treated as a normal function, including not being inlined if the compiler considers it too big.
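
A minimal illustration of that distinction (generic example, not encase code):

```rust
// A const fn is only guaranteed to be evaluated at compile time when it is
// called in a const context.
pub const fn stride() -> usize {
    16
}

// Const context: `stride()` is forced to be computed at compile time.
pub const STRIDE: usize = stride();

// Runtime context: `stride()` is an ordinary call here, and across crate
// boundaries it follows the usual inlining heuristics unless marked #[inline].
pub fn padded_len(elements: usize) -> usize {
    elements * stride()
}
```

So adding #[inline] to the const accessors is still meaningful for cross-crate callers.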
