Implement batched query support #6161
Conversation
First round of feedback! Mostly docs-related.
In addition to my nits, I think this deserves a full example (ideally with easy mode-toggling for self-benchmarking) in the root-level examples/bevy_ecs. This is a cool feature, but it's going to be tricky to discover, understand and use for intermediate users.
crates/bevy_ecs/src/query/state.rs
@@ -778,6 +811,157 @@ impl<Q: WorldQuery, F: ReadOnlyWorldQuery> QueryState<Q, F> {
        );
    }
}
/// This is a batched version of `for_each_mut` that accepts a batch size, `N`, and a desired alignment for the batches, `ALIGN`.
/// Runs `f` and `f_batch` on each query result for the given [`World`]. This can be used to improve the performance of your queries using SIMD.
These docs need to explain:
- what is alignment
- what is SIMD
- when this might improve (or worsen) performance
Would it be okay if I just wrote a short blurb about these things, referring to e.g. Wikipedia or maybe some Intel blog posts for a more in-depth treatment?
Yep, that's totally fine. I just want this to be a bit less impenetrable to users who are new to low-level programming :)
Hey @alice-i-cecile, I added a short introduction to Query::for_each_mut_batched covering these concepts -- would you mind giving it a read when you get a chance? I don't want to go overboard with info, but discoverability is a good thing. Thank you!
Wow, you're fast!! I will see about getting these finished in the coming days, thanks for your patience :)
Happy Thanksgiving! Hah, I'm glad I wrote tests -- I found a few soundness issues thanks to the Miri CI we have here. You bet I'll be running that locally now too :) Fortunately the issues weren't fundamental. I'm currently working on the remaining documentation-related problems now that the soundness issues have been resolved. There are a few ways of using the API, and the current example only shows the more "advanced" way with repr(transparent). It's now possible to more easily use batches with a non-"transparent" API using the …
I haven't reviewed all of this. I don't think we should merge this with the current changes to fetch.rs, but I also think there is not much we can do until feature(generic_associated_types) is stabilized (other than removing the genericness over ALIGN and having 3 separate fns, or just having one). The next Rust release should have stable GATs, which is in ~20 days, so I don't think this is too much of an ask to wait (and it's extraordinarily unlikely imo that this PR actually gets merged in such a short timeframe anyway).
//Note: NEEDED for batching. This is the layout of the array itself, not the layout of its elements.
let array_layout = array_layout.align_to(batch::MAX_SIMD_ALIGNMENT).unwrap();
This change seems pretty sus to me: this is no longer giving the layout of an array, and it is also no longer a copy of the layout.rs fn from std. The docs and fn name ought to be changed to explain what this fn does, making it super clear that it can return a layout with stride != size, which is pretty unusual in Rust -- or alternatively (and my preference), call .pad_to_align() so we aren't doing anything "weird".
edit: apparently there is no array_layout in std..? I am wildly confused as to why this doc comment says it's taken from std when I cannot see this code there. Regardless, the point still remains that this is no longer creating the layout of an array and needs to be updated.
Sorry, I can see where I wasn't clear with this. stride == size definitely still holds -- this statement just aligns the start of the array, if that makes sense (the chunk of memory for the array itself is aligned to MAX_SIMD_ALIGNMENT). The element layout remains exactly the same as it was before. This was needed to guarantee that a batch starting at index 0 would be aligned.

For interested readers: typically, SIMD code operates in three "phases": a prologue, a "vector region", and an epilogue. The prologue and epilogue are just "scalar code paths", i.e., your typical for loops. The prologue's purpose is to process elements up until it is possible to execute vector instructions, which generally require a certain alignment to work. The vector region can only execute in batches of size N, which means that any remaining items that can't fill a batch of size N have to be processed serially by the epilogue.

As an example, consider an array of [f32; 128] aligned to a boundary of 4 (say, at address 100). Suppose you had a special vector instruction set that could operate on batches of 8 f32s at a time, the catch being that the loads have to be aligned to 32 as well (sizeof f32 being 4). Then, because of the alignment of the array, you would only be able to batch from index 7 onwards (since index 7 would be at address 128), and you would of course need a prologue to handle that "slow path". However, if the array started at address 96, you can see that no prologue would be necessary.

This line here eliminates the prologue case from Bevy by guaranteeing that batch 0 is aligned to the correct SIMD alignment in all cases, because MAX_SIMD_ALIGNMENT is the greatest alignment required of any instruction on the given system (including vector instructions). Does this make sense? Thank you!
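To make the three phases concrete, here is a minimal scalar sketch (not this PR's implementation; the batch width of 8 and the doubling operation are purely illustrative):

```rust
fn process(data: &mut [f32]) {
    const N: usize = 8; // hypothetical vector width, in f32 lanes

    // "Vector region": whole batches of N elements. A real implementation
    // would use SIMD intrinsics here; aligning batch 0 to MAX_SIMD_ALIGNMENT
    // is what makes a separate prologue unnecessary.
    let mut chunks = data.chunks_exact_mut(N);
    for batch in &mut chunks {
        for x in batch {
            *x *= 2.0;
        }
    }

    // Epilogue: the fewer-than-N leftover elements, processed serially.
    for x in chunks.into_remainder() {
        *x *= 2.0;
    }
}
```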
stride == size does not hold for the layout returned by this fn: align(MAX_SIMD_ALIGNMENT), size(2), for example, isn't stride == size but could potentially be returned from this fn. If you look at the docs for Layout::align_to, you can see that calling align_to(64) on a size-32 layout doesn't give you size 64.
Yes, that's exactly what I was after -- this array_layout is what is ultimately used to allocate memory for the entire array, after the stride has already been taken into account. It is derived from the item_layout using the repeat_layout function. The flow (as I understand it) works as follows:

1 - The item_layout of T is passed in via the BlobVec constructor and held. (Stride matters here, as you rightly point out!)

2 - The array_layout is generated from repeat_layout(layout, n). (Stride is taken into account by repeat_layout -- this gives us a Layout that can hold N item_layouts [size == stride], aligned to the alignment of item_layout.)

3 - The "SIMD aligned" array_layout is obtained by taking that array_layout and calling align_to on it with MAX_SIMD_ALIGNMENT. (Stride has already been taken into account -- we are simply telling the allocator "take this Layout and ensure that the alignment requirement is MAX_SIMD_ALIGNMENT". I.e., this is still a Layout that can hold N item_layouts [size == stride], except the chunk of memory will now be aligned to MAX_SIMD_ALIGNMENT.)

4 - It is ultimately (3) that is passed into the allocator.

Note that the item_layout is left pristine through all of this, to ensure we don't add any extra padding between elements. It is strictly just the array_layout that gets align_to(MAX_SIMD_ALIGNMENT). The actual elements of the array are still laid out like repeat_layout would do it.

This is the best of my understanding at the moment. If the code called align_to on the item_layout itself, then stride != size for sure. But since it's being called on the array_layout after repeat_layout has already done its job, the stride should already have been taken into account. Is my understanding right here?

(Apologies for the ASCII art, I'm a visual thinker :D)

1 - item_layout provided to constructor
2 - align(item_layout):        | item_layout item_layout item_layout item_layout ... (N times) |  (repeat_layout)
3 - align(MAX_SIMD_ALIGNMENT): | item_layout item_layout item_layout item_layout ... (N times) |  (array_layout)
A rust playground link illustrating it a little further: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=099b1168135ed7ed189dd3bad76477c1
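In the same spirit as the playground link, a minimal local sketch using the stable Layout::array in place of Bevy's internal repeat_layout (the 64 stands in for MAX_SIMD_ALIGNMENT; values assume f32's usual size and alignment of 4):

```rust
use std::alloc::Layout;

fn main() {
    // Layout of [f32; 8]: size 32, align 4; element stride == size.
    let array_layout = Layout::array::<f32>(8).unwrap();
    assert_eq!((array_layout.size(), array_layout.align()), (32, 4));

    // Raising the alignment of the *whole allocation* leaves the size,
    // and therefore the element stride, untouched.
    let simd_aligned = array_layout.align_to(64).unwrap();
    assert_eq!((simd_aligned.size(), simd_aligned.align()), (32, 64));
}
```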
crates/bevy_ptr/src/batch.rs
//A "batch" of ZSTs is itself a ZST. | ||
//This sounds dangerous, but won't cause harm as nothing | ||
//will actually access anything "in the slice". | ||
//ZSTs are trivially well aligned, so there is nothing to be checked. |
ZSTs are not trivially well aligned; they are the same as every other type and should also have a debug check here. For (), align is 1, so every non-null pointer is aligned, which is pretty trivial; there are, however, types that have size 0 and align >= 2.
align_offset also says that it might be implemented to unconditionally return usize::MAX, and that it is only to be used for perf, not correctness, so we also shouldn't use it here.
Definitely an oversight on my part, and I agree this needs to be done better. Thank you!
Yeah, the comment about ZSTs being trivially well aligned is wrong. I've removed that. Fortunately it didn't have much to do with the rest of the code :)
Do you have any suggestions for how to check that the pointer is aligned idiomatically? This StackOverflow answer seems to suggest that align_offset is the way to go until pointer_is_aligned is stable. I've rewritten this as a pointer-to-integer cast modulo ALIGN (with a TODO to change this when pointer_is_aligned is stable), but I'm not 100% convinced this is the best way to go either.
The latest commit I've pushed up has these changes.
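For reference, a minimal sketch of the check described above (the helper name is hypothetical, not the PR's actual code):

```rust
/// Debug-checks that `ptr` is aligned to `align` via a pointer-to-integer
/// cast. TODO per the discussion above: switch to the standard library's
/// pointer alignment check once `pointer_is_aligned` is stable.
fn debug_assert_aligned<T>(ptr: *const T, align: usize) {
    debug_assert!(align.is_power_of_two());
    debug_assert!(ptr as usize % align == 0);
}
```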
crates/bevy_ptr/src/lib.rs
if core::mem::size_of::<T>() > 0 {
    //NOTE: ZSTs may cause this "slice" to point into nothingness.
    //This sounds dangerous, but won't cause harm as nothing
    //will actually access anything "in the slice"
    debug_assert!(ptr.align_offset(batch::MAX_SIMD_ALIGNMENT) == 0);
}
Same here: we shouldn't use align_offset or skip checking align for ZSTs.
(See comment above.)
Thanks very much for your comments @BoxyUwU -- this is a learning experience in Rust for me, and I'll work at addressing yours (and the rest of @alice-i-cecile's) comments over the next few weeks. I think delaying until GATs are stable makes sense, and I can actually start readying the code for that in the meantime. Thank you for your patience :)
Next up on my list is to get one of the Bevy examples going with batched queries (in a very limited way) to demonstrate batched queries a bit better. Thanks for your comments and your patience, everyone.
Not a problem! It's been great to see this slowly progress.
@InBetweenNames can you please mark in the PR description that this …
Not a problem! I have some more changes coming down the pipeline when I get the energy to make the API a little friendlier. With the stabilization of GATs being imminent, I'd like to clean up the code with those (not a big change), and also restructure:

```rust
for_each_mut_batched::<4>(|((mut a, Align16), (b, Align32))| {
    println!("a: {:?} b: {:?}", a, b);
});
```

That is, the …

For the above example, such a variant would be equivalent to:

```rust
for_each_mut_batched::<4>(|((mut a, Unaligned), (b, Unaligned))| {
    println!("a: {:?} b: {:?}", a, b);
});
```

Simple, safe, effective :) Can't promise any timelines for that (hopefully this month), but I am working on it in the evenings here. Bevy is in a very unique position to provide HPC primitives like this :) Down the pipeline, we can look at avoiding false sharing in a type-safe and compile-time-checked manner when doing batching across multiple threads, too. That's my dream toolkit for simulations right there :)
Alright, managed to get this ported over to the new GAT way of doing things and simplified a lot in the process. However, the per-component alignment feature isn't implemented yet. I'm going to try seeing if I can use …
I took a little detour to try out what this would look like with generic_const_exprs:

```rust
q.for_each_mut_batched::<4>(
    |(mut position, velocity)| {
        //Scalar path
    },
    |(mut position, velocity)| {
        //Batched path
    });
```

Code provided here: https://github.com/inbetweennames/bevy/tree/generic_const_exprs

All alignments are automatically computed, absolutely zero work needs to be done by the user to do anything with alignments, and the autovectorizer loves it. The type is very simple to understand too: … (One can dream!)

Obviously I can't propose that for merging, but it does serve as an important design validation point: you really can make some trivial changes to the code once it is possible to evaluate const expressions in generics and just have this for free. I suppose this could be hacked in with …
Bad news :( ... even with the …
@alice-i-cecile @james7132 @BoxyUwU @TheRawMeatball I think this is now fully ready for review! For example usage, see this benchmark 😄 -- it has been included in the docs too.
Awesome :D I'll aim to do a full review by the end of next Monday: there's a ton of ECS work in the air right now.
Sounds good, no rush! :) EDIT: Hmm, weird, I guess I don't know the review request interface that well... didn't mean to remove the requests for reviews.
PR bevyengine#6161, Issue bevyengine#1990

* Only Dense queries are accelerated currently
* Code refactored to use GATs
* Batch type does not encode alignment as per discussion
* Simplified calling for_each_{mut_}batched (no longer need _ arguments)
* Documentation about SIMD and batching
* Added an example and an AVX-specific benchmark
No changes in the latest commit; just rebased. I see some new verbiage in main about "BatchedStrategies" for parallel queries. That's cool -- I'm just wondering if maybe a different name should be given to the distinct concept introduced in this PR, to avoid confusion. While similar, the intent is different: the batching introduced in this PR is about processing batches of N components at a time using SIMD, whereas parallel batch sizes appear to be more about managing overhead and synchronization costs when executing a query across multiple threads. I'm open to suggestions here :)
When #6773 is merged, I'll port this over to using the new style of iteration.
FYI: I did a little reading into the naming convention behind this concept. EnTT refers to it as "chunked iteration": skypjack/entt#462. How do we feel about using the name "chunk" instead of "batch" for this PR, to set it apart from other uses of "batch" in Bevy? Is there a better name we could use instead?
> [7:27 AM] james7132: On closer inspection, I don't think the batched queries is going to be ready for 0.11 (or even 0.13 at this rate), it's reliance on unstable rust features is probably a non-starter for it.

From @james7132 on Discord. If you can find a way to make it work without this, or convince us that the assessment is wrong / it's worth making an exception, let us know. Similarly, if you need us to make changes to the internals to allow 3rd-party crates that need unstable Rust features to implement this externally, we're also generally very open to that!
Hey Alice, this was an old comment that was addressed already. This code does not rely on any unstable rust features. Thank you!
…On Tue, Feb 28, 2023, 8:22 a.m. Alice Cecile ***@***.***> wrote:
[7:27 AM] james7132: On closer inspection, I don't think the batched
queries is going to be ready for 0.11 (or even 0.13 at this rate), it's
reliance on unstable rust features is probably a non-starter for it.
From @james7132 <https://github.com/james7132> on Discord
<https://discord.com/channels/691052431525675048/749335865876021248/1080103960943267911>.
If you can find a way to make it work without this, or convince us that the
assessment is wrong / it's worth making an exception, let us know.
Similarly, if you need us to make changes to the internals to allow 3rd
party crates that need unstable Rust features to implement this externally,
we're also generally very open to that!
Sorry about that, must have been reading an old version of the PR. Really shouldn't be doing this stuff at 4 AM, haha. A quick pass seems to suggest we can aim for a 0.11 timeline for this. I really want something like this to help round out the last bits of CPU performance we may be lacking in the ECS.
//# SAFETY: Vec3 is repr(C) for x86_64
let mx0 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(0));
let mx1 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(4));
let mx2 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(8));
let mx3 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(12));
let mx4 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(16));
let mx5 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(20));
If this is glam::Vec3, this is unsound, and repr(C) is not enough. Vec3, as opposed to Vec3A, has padding bytes. That is, its layout is:

| f32 | f32 | f32 | ? |
|---|---|---|---|
| init | init | init | poison |

so this is filling the 4th lane with data that should not be read. Vec3A is defined as __m128 already. You must use a full 16-byte type with zero padding, consistently.
Thanks for the review! Indeed, you're right: the C (and C++) standards don't provide any guarantee against arbitrary padding being added between elements or at the end of a struct, even when it wouldn't be necessary to guarantee stride alignment in sequential array accesses. It's perfectly legitimate for a conforming C compiler to lay glam::Vec3 out in exactly the way you have suggested, including using a trap representation for the padding at the end of the struct. Even if glam::Vec3 were defined in terms of a single array [f32; 3], you still can't escape padding at the end as you have indicated, unless you made it repr(transparent) anyway.

In C++, the way we'd normally deal with this is with a static assert for sizeof(Vec3) == 3*sizeof(float). It's a hack, but it works to avoid this problem. Indeed, on all major x86 ABIs, this holds. Since static asserts don't really exist in Rust, I'm not sure how to accomplish something equivalent in this example.

I'd like to mention, though, that real x86 C (and C++) compilers would never, ever add padding for the sake of adding it. Padding is something done on x86 to ensure alignment, and the alignment of Vec3 is not required to be anything higher than the alignment of f32. It could be, and that would be legal in a theoretic sense for C, but no implementation out there -- on x86 -- would ever do it. For one thing, that would contradict the psABI and break compatibility with other code on x86 systems. However, yes, one could construct a toy C compiler that would do just that, say to anticipate vectorization later on -- although I expect such a transform would be a pessimization a lot more often than not.

So, laying out Vec3 with 12 bytes of space and aligning it to the alignment of f32 meets the alignment requirements just fine -- and this is what real implementations -- on x86 -- do. Not that this excuses anything, mind you, because I agree: the code should be correct in a theoretic sense, or guarded by a compile-time guard to ward off the UB parts (as one would do in C++).

With all this said, as someone who isn't super in the know about the current Rust landscape for dealing with these things, what's the standard practice here? In C++ I'd stick a static assert on it and be done with it. How do we make this work in Rust?
Hmm, it was very late at night when I scanned through things here, and I both mislaid some mental math while reading this and flubbed what I meant to say. My apologies. Let me correct myself: the safety documentation here needs to explicate the actual safety precondition, which is the layout match of 8 * 3 / (6 * 4) == 1 and the fact that these are unaligned loads, instead of a handwave about repr(C): if (n * x) / (m * y) == 1 doesn't hold, then you have the possibility of reading the poisoned value.

You needn't particularly worry about toy C compilers, and I wasn't trying to language-lawyer you on those. Rust picks specific C compilers to be compatible with (though it gets more vague if e.g. two compilers emit "the same" ABI but actually disagree on fine details). Thus when targeting x86_64-pc-windows-msvc, it obviously emits code compatible with the Windows x64 ABI as emitted by MSVC, etc. etc. down the line. There are problems that arise when e.g. gcc and clang disagree on the ABI.

You can generate a "static assert" in Rust using the pattern:

```rust
const _: () = {
    let precondition = {
        // some code that can run in const eval
    };
    if precondition == false {
        panic!("a nicely-written message")
    };
};
```
No problem, and thanks for the feedback. I wrote a good amount of this PR myself during some late nights 😅 It sounds like we're converging on a solution. Would it be acceptable to document this expectation about the layout of Vec3 on x86 in the example, and use that pattern to statically assert it for completeness (see the sketch below)? With maybe an additional comment about how a repr(transparent) type would be preferable for code that needs to be portable?
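A minimal sketch of what that guard could look like under this thread's assumptions (glam::Vec3 on x86-64; 8 Vec3s consumed by 6 four-float unaligned loads) -- not the PR's actual code:

```rust
// Compile-time guard: 8 Vec3s must span exactly the 24 f32s that the
// 6 _mm_loadu_ps calls consume, i.e. 8 * sizeof(Vec3) == 6 * 4 * sizeof(f32).
const _: () = {
    if 8 * core::mem::size_of::<glam::Vec3>() != 6 * 4 * core::mem::size_of::<f32>() {
        panic!("layout precondition violated: glam::Vec3 must be three packed f32s");
    }
};
```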
Discussing repr(transparent) isn't needed (it's a bit of a red herring), but otherwise yes, that sounds fine. My real concern is just about someone retouching this code later, without the deep and horrifying knowledge we have of the details of the x86-64 ABI, what __m128 even is, and C layout, in a way that changes the numbers and causes the last _mm_loadu_ps to read 1 more float than actually exists.
Sorry, just for my own understanding, can you clarify your statement about repr(transparent)? My understanding is that if Vec3 were defined as a repr(transparent) struct with a single [f32; 3] member, all padding issues go away. Indeed, this is similar advice to what you'd see in the C++ space. Is there something I'm not seeing there? (A sketch of what I mean follows below.)

Unfortunately I can't provide any input on concerns about maintainability -- I think this kind of stuff just comes with the gamedev (and simulation) territory. At least this is user-provided code, not library code in Bevy, so it's not like we have ECS internals doing this kind of thing.
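For concreteness, the alternative described above might look like this (a hypothetical type, not glam's actual definition):

```rust
/// A hypothetical Vec3 alternative with no tail padding anywhere:
/// repr(transparent) guarantees the exact layout of the single field,
/// and [f32; 3] is 12 bytes with align 4 and no padding by definition.
#[repr(transparent)]
pub struct PackedVec3(pub [f32; 3]);
```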
Backlog cleanup: this one has a long history of attempts to get it across the finish line! Closing for now due to inactivity, leaving in place the original tracking issue (#1990); perhaps it could be adopted from there if still needed.
Objective
Fixes #1990
Solution
Changelog
* Query results can now be processed in batches of size N using the for_each_batched/for_each_mut_batched interfaces
* A MAX_SIMD_ALIGNMENT constant that table columns are aligned to (at minimum) (cache related)
* BlobVecs are now aligned to MAX_SIMD_ALIGNMENT to support batching use cases and better performance (aligned loads)
* A SimdAlignedVec type was added to provide MAX_SIMD_ALIGNMENT on the inner buffer, providing Vec-like semantics (aligned loads)
* An AVX-specific example and benchmark (add -Ctarget-feature=+avx to your RUSTFLAGS on x86)

A detailed discussion can be found in #1990, including benchmarks justifying why batched queries are worth implementing.
Older benchmarks can be found here: https://github.com/inbetweennames/bevy-simd-bench
Migration Guide
No breaking changes.