Specialize write_into for trivially memcpy-able types #53
Comments
This was discussed on the Bevy Discord (link). A summary of the codegen analysis:
TODO: Actually benchmark the changes here to see if the gains are significant enough to warrant these kinds of changes.
Another potential middle ground is to change the vector and matrix implementations to directly copy their bytes instead of relying on the underlying components' implementations. This would eliminate the vast majority of the branches being produced, potentially allow for vectorized copies, and avoid the need for infectious use of unsafe.
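A minimal sketch of that middle ground, assuming a bytemuck-style safe byte view (encase does not necessarily use bytemuck, and the function name here is hypothetical):

```rust
// Hypothetical Mat4 write path: copy all 64 bytes in one shot instead of
// recursing into 16 per-component writes. `bytemuck::bytes_of` (an assumed
// dependency, used only for illustration) reinterprets the matrix as a
// byte slice with no `unsafe` at this call site.
fn write_mat4_into(mat: &[[f32; 4]; 4], out: &mut Vec<u8>) {
    // One contiguous copy the optimizer is free to vectorize.
    out.extend_from_slice(bytemuck::bytes_of(mat));
}
```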
This is great stuff, thanks for looking into it! I think the most promising optimization here would be using unchecked versions of …. How did you generate the codegen in the diffs above? With the unchecked change, I think those copies should have been vectorized. Regarding …
It's honestly sort of janky. I compile with cargo, using rustc flags to output the *.asm files, then use rustfilt to pretty them up a bit, and use git to track the diff in a nice interface. I'd use something like Godbolt, but setting up my own server with the Rust crates I want tends to be a lot more effort.
If the entire expression is constant (inputs, outputs, static dependencies), the result will be computed at compile time; but if it's used in a non-const context, it will be treated as a normal function call, including not being inlined if the function is considered big enough.
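A minimal illustration of that behavior (the `padded_size` helper is hypothetical, just for demonstration):

```rust
// Hypothetical helper: round `size` up to a multiple of `align`
// (`align` must be a power of two).
const fn padded_size(size: usize, align: usize) -> usize {
    (size + align - 1) & !(align - 1)
}

// Const context: evaluated entirely at compile time.
const MAT4_SLOT: usize = padded_size(64, 16);

fn main() {
    // Non-const context: compiled as an ordinary call, subject to the
    // usual inlining heuristics. `black_box` keeps the input opaque.
    let size = std::hint::black_box(13);
    let at_runtime = padded_size(size, 16);
    assert_eq!(MAT4_SLOT, 64);
    assert_eq!(at_runtime, 16);
}
```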
Right now, encase recursively calls `write_into` when serializing types into GPU memory. This typically follows the flow: copy a field/value, advance the writer by its padding, repeat until done. This unfortunately breaks vectorization when copying the data. Larger structs with heavily recursive types like 4x4 matrices end up with tens or hundreds of these steps when they could just be directly memcpy-ed into the target buffer.
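As a rough illustration of that per-component flow (the `Writer` type and all names here are illustrative sketches, not encase's actual internals):

```rust
// Illustrative writer: appends bytes and skips over padding.
struct Writer {
    buf: Vec<u8>,
}

impl Writer {
    fn write(&mut self, bytes: &[u8]) {
        self.buf.extend_from_slice(bytes);
    }
    fn advance(&mut self, padding: usize) {
        self.buf.resize(self.buf.len() + padding, 0);
    }
}

// Writing a 4x4 matrix one component at a time: 16 small copies plus
// padding steps, instead of the single 64-byte memcpy the optimizer
// could vectorize.
fn write_mat4(writer: &mut Writer, mat: &[[f32; 4]; 4]) {
    for col in mat {
        for component in col {
            writer.write(&component.to_ne_bytes()); // copy value
        }
        writer.advance(0); // advance by the column's padding (none here)
    }
}
```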
Any type that has a fixed size at runtime and no additional padding is trivially memcpy-able into and out of GPU buffers. Similarly, arrays, slices, and Vecs of such types can also be batch memcpy-ed where applicable.
This information is statically available at compile time in a type's `METADATA`, and `if` statements on constant expressions have their unused branch optimized out, so this should be doable even without compiler support for specialization.
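A minimal sketch of the idea, assuming a hypothetical trait with a layout flag derived from such metadata (this is not encase's actual API):

```rust
/// Hypothetical trait: each type reports, as a compile-time constant,
/// whether its in-memory layout already matches its GPU layout
/// (fixed size, no extra padding).
trait GpuWrite: Sized {
    const TRIVIALLY_COPYABLE: bool;

    /// Slow path: per-component writes with padding handling.
    fn write_components(&self, out: &mut Vec<u8>);
}

fn write_value<T: GpuWrite>(value: &T, out: &mut Vec<u8>) {
    // `T::TRIVIALLY_COPYABLE` is a constant expression, so only one arm
    // of this branch survives codegen per type; no compiler support for
    // specialization is required.
    if T::TRIVIALLY_COPYABLE {
        // Fast path: one contiguous memcpy of the whole value.
        // Safety (sketch): sound only if TRIVIALLY_COPYABLE guarantees
        // the type has no padding or uninitialized bytes.
        let bytes = unsafe {
            core::slice::from_raw_parts(
                (value as *const T).cast::<u8>(),
                core::mem::size_of::<T>(),
            )
        };
        out.extend_from_slice(bytes);
    } else {
        value.write_components(out);
    }
}
```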