
Implement batched query support #6161

Closed

Conversation


@InBetweenNames InBetweenNames commented Oct 4, 2022

  • Only Dense queries are accelerated currently
  • Certain features await GATs/generic const expressions
  • An "alignment-oblivious" architecture is used for simplicity, designed for best-in-class performance on modern (post-2009) microarchitectures

Objective

Fixes #1990

Solution

  • Implement a batch-oriented view of the Bevy ECS suitable for SIMD acceleration, providing overalignment of table columns to ensure efficient aligned access where possible.

Changelog

  • Queries can be executed in batches of N using the `for_each_batched`/`for_each_mut_batched` interfaces
  • Added a `WorldQueryBatch` trait to support batched queries
  • Performance-related changes for batches:
    • Added a `MAX_SIMD_ALIGNMENT` constant that table columns are aligned to (at minimum) (cache related)
    • `BlobVec`s are now aligned to `MAX_SIMD_ALIGNMENT` to support batching use cases and better performance (aligned loads)
    • A `SimdAlignedVec` type was added to provide `MAX_SIMD_ALIGNMENT` on the inner buffer, providing `Vec`-like semantics (aligned loads)
  • Tests have been added for the new functionality
  • Docs too (with runnable examples)
  • Benchmarks provided directly in the Bevy tree (requires AVX to run! Add `-Ctarget-feature=+avx` to your `RUSTFLAGS` on x86)

A detailed discussion can be found in #1990, including benchmarks justifying why batched queries are worth implementing.
Older benchmarks can be found here: https://github.com/inbetweennames/bevy-simd-bench

Migration Guide

No breaking changes.

Fixes #6161

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times labels Oct 4, 2022
@alice-i-cecile alice-i-cecile requested review from james7132, TheRawMeatball and BoxyUwU and removed request for TheRawMeatball October 4, 2022 12:08

Member

@alice-i-cecile left a comment

First round of feedback! Mostly docs related.

In addition to my nits, I think this deserves a full example (ideally with easy mode-toggling for self-benchmarking) in the root level examples/bevy_ecs. This is a cool feature, but it's going to be tricky to discover, understand and use for intermediate users.

crates/bevy_ecs/src/change_detection.rs (review thread resolved)
crates/bevy_ecs/src/change_detection.rs (review thread resolved)
crates/bevy_ptr/src/batch.rs (review thread resolved)
crates/bevy_ptr/src/batch.rs (review thread resolved)
crates/bevy_ecs/src/query/fetch.rs (review thread resolved)
@@ -778,6 +811,157 @@ impl<Q: WorldQuery, F: ReadOnlyWorldQuery> QueryState<Q, F> {
);
}
}
/// This is a batched version of `for_each_mut` that accepts a batch size, `N`, and a desired alignment for the batches `ALIGN`.
/// Runs `f` and `f_batch` on each query result for the given [`World`]. This can be used to improve performance of your queries using SIMD.

Member

These docs need to explain:

  • what is alignment
  • what is SIMD
  • when this might improve (or worsen) performance

Author

Would it be okay if I just wrote a short blurb about these things, referring to e.g. Wikipedia or maybe some Intel blog posts for a more in-depth treatment?

Member

Yep, that's totally fine. I just want this to be a bit less impenetrable to users who are new to low-level programming :)

Author

Hey @alice-i-cecile, I added a short introduction to Query::for_each_mut_batched covering these concepts; would you mind giving it a read when you get a chance? I don't want to go overboard with info, but discoverability is a good thing. Thank you!

crates/bevy_ecs/src/query/state.rs (review thread resolved)
crates/bevy_ecs/src/query/state.rs (review thread resolved)
crates/bevy_ecs/src/query/state.rs (review thread resolved)
crates/bevy_ecs/src/storage/aligned_vec.rs (review thread resolved)
@alice-i-cecile alice-i-cecile added C-Feature A new feature, making something new possible X-Controversial There is active debate or serious implications around merging this PR labels Oct 4, 2022
@InBetweenNames
Author

Wow, you're fast!! I will see about getting these finished in the coming days, thanks for your patience :)

@InBetweenNames
Author

Happy Thanksgiving!

Hah, I'm glad I wrote tests. Found a few soundness issues thanks to the Miri CI we have here. You bet I'll be running that locally now too :) Fortunately, the issues weren't fundamental.

Currently working on the remaining documentation related problems now that the soundness issues have been resolved.

There are a few ways of using the API, and the current example only shows the more "advanced" way with repr(transparent). It's now possible to use batches more easily with a non-"transparent" API using the map and as_array functions of the AlignedBatch trait. The tradeoff is that it seems to optimize worse -- the Rust docs indicate that the performance situation is a little finicky with array maps. In a quick test, I saw loads and stores appearing in the assembly that could easily have been optimized out. The docs recommend using the iter() API instead -- maybe I'll play around with that later this week after I clean up the docs a bit more.

Member

@BoxyUwU left a comment

I haven't reviewed all of this. I don't think we should merge this with the current changes to fetch.rs, but I also think there is not much we can do until feature(generic_associated_types) is stabilized (other than removing the genericness over ALIGN and having 3 separate fns, or just having one). The next Rust release should have stable GATs, which is in ~20 days, so I don't think this is too much to ask to wait (and it's extraordinarily unlikely, imo, that this PR actually gets merged in such a short timeframe anyway).

crates/bevy_ecs/src/storage/blob_vec.rs (review thread resolved)
Comment on lines 383 to 373

//Note: NEEDED for batching. This is the layout of the array itself, not the layout of its elements.
let array_layout = array_layout.align_to(batch::MAX_SIMD_ALIGNMENT).unwrap();

Member

@BoxyUwU Oct 12, 2022

This change seems pretty sus to me: it is no longer giving the layout of an array, and is also no longer a copy of the layout.rs fn from std. The docs and fn name ought to be changed to explain what this fn does, making it super clear that it can return a layout with stride != size, which is pretty unusual in Rust -- or alternatively (and my preference) call .pad_to_align() so we aren't doing anything "weird".

edit: apparently there is no array_layout in std..? I am wildly confused as to why this doc comment says it's taken from std when I cannot see this code there. Regardless, the point still remains that this is no longer creating the layout of an array and needs to be updated.

Author

@InBetweenNames Oct 15, 2022

Sorry I can see where I wasn't clear with this. stride == size definitely still holds -- this statement just aligns the start of the array, if that makes sense (the chunk of memory for the array itself is aligned to MAX_SIMD_ALIGNMENT). The element layout remains exactly the same as it was before. This was needed to guarantee that a batch starting at index 0 would be aligned.

For interested readers, typically SIMD code operates in three "phases", a prologue, a "vector region", and an epilogue.
The prologue and epilogue are just "scalar code paths", i.e., your typical for loops. The prologue's purpose is to process elements up until it is possible to execute vector instructions, which generally require a certain alignment to work. However, the vector region can only execute in batches of size N, which means that if there are any remaining items to do that can't fit in a batch of size N, they have to be processed serially.

As an example, consider an array of [f32; 128], which is aligned to a boundary of 4 (say, address 100). Suppose you had a special vector instruction set that could operate on batches of 8 f32s at a time, the catch being, the loads have to be aligned to 32 as well (sizeof f32 being 4). Then, because of the alignment of the array, you would only be able to batch from index 7 onwards (since index 7 would be at address 128). Then you would of course need a prologue to handle that "slow path". However, if the array started at address 96, you could see that no prologue would be necessary.

This line here eliminates the prologue case from Bevy by guaranteeing that batch 0 is aligned to the correct SIMD alignment in all cases, because MAX_SIMD_ALIGNMENT is the greatest alignment required of any instruction on the given system (including vector instructions). Does this make sense? Thank you!

Member

@BoxyUwU Oct 15, 2022

stride == size does not hold for the layout returned by this fn: align(MAX_SIMD_ALIGNMENT) with size(2), for example, isn't stride == size, but could potentially be returned from this fn. If you look at the docs for Layout::align_to, you can see that calling align_to(64) on a layout of size 32 doesn't give you size 64.

Author

Yes that's exactly what I was after -- this array_layout is what is ultimately used to allocate memory for the entire array after the stride has already been taken into account. It is derived from the item_layout using the repeat_layout function. The flow (as I understand it) works as follows:

1 - The item_layout of T is passed in via the BlobVec constructor and held. (stride matters here, as you rightly point out!)
2 - The array_layout is generated from repeat_layout(layout, n) (stride is taken into account by repeat_layout -- this gives us a Layout that can hold N item_layouts [size == stride] and aligned to alignment of item_layout)
3 - The "SIMD aligned" array_layout is obtained by taking that array_layout and then calling align_to on it with MAX_SIMD_ALIGNMENT (stride has already been taken into account -- we are simply telling the allocator "take this Layout and ensure that the alignment requirement is MAX_SIMD_ALIGNMENT". I.e., this is now a Layout that can hold N item_layouts [size == stride], except that the chunk of memory will now be aligned to MAX_SIMD_ALIGNMENT)
4 - It is ultimately (3) that is passed into the allocator

Note that the item_layout is left pristine through all of this to ensure we don't add any extra padding between elements. It is strictly just the array_layout that gets align_to(MAX_SIMD_ALIGNMENT).
The actual elements of the array are still laid out like repeat_layout would do it.

This is the best of my understanding of that at the moment. If the code called align_to on the item_layout itself, then stride != size for sure. But since it's being called on array_layout after repeat_layout has already done its job, the stride should already have been taken into account. Is my understanding right here?

(Apologies for the ASCII art, I'm a visual thinker :D)


1 - item_layout provided to constructor
2 - align(item_layout):        | item_layout item_layout item_layout item_layout ... (N times) | (repeat_layout)
3 - align(MAX_SIMD_ALIGNMENT): | item_layout item_layout item_layout item_layout ... (N times) | (array_layout)

Author

//A "batch" of ZSTs is itself a ZST.
//This sounds dangerous, but won't cause harm as nothing
//will actually access anything "in the slice".
//ZSTs are trivially well aligned, so there is nothing to be checked.

Member

ZSTs are not trivially well aligned; they are the same as every other type and should also have a debug check here. For (), the align is 1, so every non-null pointer is aligned, which is fairly trivial; there are, however, types that have size 0 and align >= 2.

align_offset's docs also say that it may be implemented to unconditionally return usize::MAX, and that it is only to be used for perf, not correctness, so we shouldn't use it here either.

Author

Definitely an oversight on my part, and I agree this needs to be done better. Thank you!

Author

Yeah the comment about ZSTs being trivially well aligned is wrong. I've removed that. Fortunately it didn't have much to do with the rest of the code :)

Do you have any suggestions for how to check that the pointer is aligned idiomatically? This StackOverflow answer seems to suggest that align_offset is the way to go until pointer_is_aligned is stable. I've rewritten this as a pointer-to-integer cast modulo ALIGN (with a TODO to change this when pointer_is_aligned is stable), but I'm not 100% convinced this is the best way to go either.

The latest commit I've pushed up has these changes.

Comment on lines 224 to 229
if core::mem::size_of::<T>() > 0 {
//NOTE: ZSTs may cause this "slice" to point into nothingness.
//This sounds dangerous, but won't cause harm as nothing
//will actually access anything "in the slice"
debug_assert!(ptr.align_offset(batch::MAX_SIMD_ALIGNMENT) == 0);
}

Member

same here, we shouldn't use align_offset or skip checking align for ZSTs

Author

(See comment above)

crates/bevy_ecs/src/query/fetch.rs (review thread resolved)
@InBetweenNames
Author

Thanks very much for your comments @BoxyUwU -- this is a learning experience in Rust for me, and I'll work at addressing your comments (and the rest of @alice-i-cecile's) over the next few weeks. I think delaying until GATs are stable makes sense, and I can actually start readying the code for that in the meantime. Thank you for your patience :)

@InBetweenNames InBetweenNames force-pushed the batched-queries branch 2 times, most recently from f3f5e78 to a06c865 Compare October 19, 2022 01:20
@InBetweenNames
Author

Next up on my list is to get one of the Bevy examples going with batched queries (in a very limited way) to demonstrate batched queries a bit better. Thanks for your comments and your patience everyone.

@alice-i-cecile
Member

Not a problem! It's been great to see this slowly progress.

@james7132
Member

james7132 commented Nov 1, 2022

@InBetweenNames can you please mark in the PR description that this Fixes #6161 so that it gets auto-closed when this is merged? I'll give this a review soon. This has been quite the API hole compared to what Unity's ECS has that Bevy sorely needs.

@InBetweenNames
Author

> @InBetweenNames can you please mark in the PR description that this Fixes #6161 so that it gets auto-closed when this is merged? I'll give this a review soon. This has been quite the API hole compared to what Unity's ECS has that Bevy sorely needs.

Not a problem! I have some more changes coming down the pipeline when I get the energy to make the API a little friendlier. With the stabilization of GATs being imminent, I'd like to clean up the code with those (not a big change), and also restructure for_each_mut_batched to work similar to the following:

    for_each_mut_batched::<4>(|((mut a, Align16), (b, Align32))| {
        println!("a: {:?} b: {:?}", a, b);
    });

That is, the Align parameter now becomes per-component, which is way more flexible. As a bonus, I can provide a for_each_mut_batched variant that doesn't care about alignment -- with the usual caveats that would imply in terms of codegen.

For the above example, such a variant would be equivalent to:

    for_each_mut_batched::<4>(|((mut a, Unaligned), (b, Unaligned))| {
        println!("a: {:?} b: {:?}", a, b);
    });

Simple, safe, effective :)

Can't promise any timelines for that (hopefully this month), but I am working on it in the evenings here. Bevy is in a very unique position to provide HPC primitives like this :) Down the pipeline, we can look at avoiding false-sharing in a type-safe and compile time checked manner when doing batching across multiple threads, too. That's my dream toolkit for simulations right there :)

@InBetweenNames InBetweenNames marked this pull request as draft November 6, 2022 21:30
@InBetweenNames
Author

Alright, managed to get this ported over to the new GAT way of doing things and simplified a lot in the process. However, the per-component alignment feature isn't implemented yet. I'm going to try seeing if I can use typenum to automatically compute it, which would keep things still relatively simple while still not relying on generic const expressions (which seem to have some foundational problems). I've moved the PR over to "draft" status as I don't think it should be merged before alignments can be manually set per component or they can be automatically computed. Still, it's worth reviewing the changes so far as what is committed shouldn't change drastically after those "feature complete" changes are in.

@InBetweenNames
Author

InBetweenNames commented Nov 9, 2022

I took a little detour to try out what this would look like if generic_const_exprs were stable, and it's actually really nice:

q.for_each_mut_batched::<4>(
    |(mut position, velocity)| {
        //Scalar path
    },
    |(mut position, velocity)| {
        //Batched path
    },
);

Code provided here: https://github.com/inbetweennames/bevy/tree/generic_const_exprs

All alignments automatically computed, absolutely zero work needs to be done by the user to do anything with alignments, and the autovectorizer loves it. The type is very simple to understand too: AlignedBatch<T, N> automatically has the correct batch alignment computed using the gcd! Literally all the user needs to do is select the batch size N, provide a scalar and vector path, and the rest is taken care of for you. It really is that nice.

(One can dream!)

Obviously I can't propose that for merging, but it does serve as an important design validation point that you really can make some trivial changes to the code once it is possible to evaluate const expressions in generics and just have this for free. I suppose this could be hacked in with typenum too, maybe... might be worth a shot.

@InBetweenNames
Author

Bad news :( ... even with the typenum hackery, it seems this can't be done on stable, as it's not possible to get the size of a type T as a typenum without using generic_const_exprs. Therefore, it seems the best we can do (for now, on stable) is to implement the per-component manual alignments as described previously. That's okay, because the "unaligned" variant can then be promoted to "alignments automatically derived" without code changes once generic_const_exprs is stable, and the manually aligned version can easily be migrated off. I'll go ahead and start working on that. The API is still very usable this way :)

@InBetweenNames
Author

InBetweenNames commented Jan 16, 2023

@alice-i-cecile @james7132 @BoxyUwU @TheRawMeatball I think this is now fully ready for review!

For example usage, see this benchmark 😄 -- it has been included in the docs too.

@alice-i-cecile
Member

Awesome :D I'll aim to do a full review by the end of next Monday: there's a ton of ECS work in the air right now.

@InBetweenNames
Author

InBetweenNames commented Jan 16, 2023

Sounds good, no rush! :)

EDIT: Hmm weird, I guess I don't know the review request interface that well... didn't mean to remove the requests for reviews.

@InBetweenNames InBetweenNames requested review from BoxyUwU and removed request for james7132 and TheRawMeatball January 17, 2023 00:25
@james7132 james7132 self-requested a review January 17, 2023 22:29
PR bevyengine#6161
Issue bevyengine#1990

* Only Dense queries are accelerated currently
* Code refactored to use GATs
* Batch type does not encode alignment as per discussion
* Simplified calling for_each_{mut_}batched (no longer need _ arguments)
* Documentation about SIMD and batching
* Added an example and an AVX-specific benchmark
@InBetweenNames
Author

No changes in the latest commit; just rebased. I see some new verbiage in main about "BatchedStrategies" for parallel queries. That's cool -- just wondering if maybe a different name should be given to this different concept introduced in this PR to avoid confusion. While similar, the intent is different: the batching introduced in this PR is about processing batches of N components at a time using SIMD, whereas parallel batch sizes appear to be more about managing overhead and synchronization costs when executing a query across multiple threads. I'm open to suggestions here :)

@InBetweenNames
Author

When #6773 is merged, I'll port this over to using the new style of iteration.

@InBetweenNames
Author

FYI: did a little reading into the naming convention behind this concept. EnTT refers to it as "chunked iteration": skypjack/entt#462

How do we feel about using the name "chunk" instead of "batch" for this PR, to set it apart from other uses of "batch" in Bevy? Is there a better name we could use instead?

@alice-i-cecile alice-i-cecile removed this from the 0.11 milestone Feb 28, 2023
@alice-i-cecile alice-i-cecile added the S-Blocked This cannot move forward until something else changes label Feb 28, 2023
@alice-i-cecile
Member

[7:27 AM] james7132: On closer inspection, I don't think the batched queries are going to be ready for 0.11 (or even 0.13 at this rate); its reliance on unstable Rust features is probably a non-starter.

From @james7132 on Discord. If you can find a way to make it work without this, or convince us that the assessment is wrong / it's worth making an exception, let us know.

Similarly, if you need us to make changes to the internals to allow 3rd party crates that need unstable Rust features to implement this externally, we're also generally very open to that!

@InBetweenNames
Author

InBetweenNames commented Feb 28, 2023 via email

@alice-i-cecile alice-i-cecile removed the S-Blocked This cannot move forward until something else changes label Feb 28, 2023
@james7132
Member

james7132 commented Mar 1, 2023

Sorry about that, must have been reading an old version of the PR. Really shouldn't be doing this stuff at 4AM, haha.

Quick pass seems to suggest we can aim for a 0.11 timeline for this. I really want something like this to help round out the last bits of CPU performance we may be lacking in the ECS.

@james7132 james7132 added this to the 0.11 milestone Mar 1, 2023
Comment on lines +43 to +49
//# SAFETY: Vec3 is repr(C) for x86_64
let mx0 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(0));
let mx1 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(4));
let mx2 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(8));
let mx3 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(12));
let mx4 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(16));
let mx5 = _mm_loadu_ps((aos_inner as *const Vec3 as *const f32).offset(20));


If this is glam::Vec3, this is unsound and repr(C) is not enough. Vec3, as opposed to Vec3A, has padding bytes. That is, its layout is:

f32 f32 f32 ?
init init init poison

so this is filling the 4th lane with data that should not be read. Vec3A is defined as __m128 already. You must use a full 16-byte type with zero padding, consistently.

Author

Thanks for the review! Indeed, you're right: the C (and C++) standards don't provide any guarantee against arbitrary padding being added between elements or at the end of a struct, even in the case it wouldn't be necessary to guarantee stride alignment in sequential array accesses. It's perfectly legitimate for a conforming C compiler to lay glam::Vec3 out in exactly the way you have suggested, including using a trap representation for the padding at the end of the struct. Even if glam::Vec3 were defined in terms of a single array [f32; 3], you still can't escape padding at the end as you have indicated, unless you made it repr(transparent) anyway.

In C++, the way we'd normally deal with this is with a static assert for sizeof(Vec3) == 3*sizeof(float). It's a hack, but it works to avoid this problem. Indeed, on all major x86 ABIs, this holds. Since static asserts don't really exist in Rust, I'm not sure how to accomplish something equivalent there in this example.

I'd like to mention though, that real x86 C (and C++) compilers would never, ever add padding for the sake of adding it. Padding is something done on x86 to ensure alignment, and the alignment of Vec3 is not required to be anything higher than the alignment of f32. It could be, and that would be legal in a theoretic sense for C, but no implementation out there -- on x86 -- would ever do it. For one thing, that would contradict the psABI and break compatibility with other code on x86 systems. However yes, one could construct a toy C compiler that would do just that, say to anticipate vectorization later on -- although I expect such a transform would be a pessimization a lot more often than not.

So, laying out Vec3 with 12 bytes of space and aligning to the same alignment of f32 meets the alignment requirements just fine -- and this is what real implementations -- on x86 -- do. Not that this excuses anything, mind you, because I agree: the code should be correct from a theoretic sense, or guarded by a compile-time guard to ward off the UB parts (as one would do in C++).

With all this said, as someone that isn't super in the know of the current Rust landscape in dealing with these things, what's the standard practice here? In C++ I'd stick a static assert on it and be done with it. How do we make this work in Rust?

@workingjubilee Mar 8, 2023

Hmm, it was very late at night when I scanned through things here, and I both mislaid some mental math while reading this and flubbed what I meant to say. My apologies. Let me correct myself: the safety documentation here needs to explicate the actual safety precondition, which is the layout match of 8 * 3 / (6 * 4) == 1 and that these are unaligned loads, instead of a handwave about repr(C) -- since if (n * x) / (m * y) == 1 doesn't hold, then you have the possibility of reading the poisoned value.

You needn't particularly worry about toy C compilers, and I wasn't trying to language lawyer you on those. Rust picks specific C compilers to be compatible with (though it gets more vague if e.g. two compilers emit "the same" ABI but actually disagree on fine details). Thus when targeting x86_64-pc-windows-msvc, obviously it emits compatible code with the Windows x64 ABI as emitted by MSVC, etc. etc. down the line. There are problems that arise when e.g. gcc and clang disagree on the ABI.

You can generate a "static assert" in Rust using the pattern

const _: () = {
    let precondition = {
        // some code that can run in const eval
    };
    if precondition == false {
        panic!("a nicely-written message")
    };
};

Author

No problem and thanks for the feedback. I wrote a good amount of this PR myself during some late nights 😅 It sounds like we're converging on a solution. Would it be acceptable to document this about the expected layout of Vec3 for x86 in the example, and use that pattern to statically assert this is the case for completeness? With maybe an additional comment about how a repr(transparent) type would be preferable for code that needs to be portable?

@workingjubilee Mar 8, 2023

Discussing repr(transparent) isn't needed (it's a bit of a red herring), but otherwise yes, that sounds fine. My real concern is just about someone retouching this code later, without the deep and horrifying knowledge we have of the details of the x86-64 ABI, what __m128 even is, and C layout, in a way that changes the numbers and causes the last _mm_loadu_ps to read 1 more float than actually exists.

Author

Sorry, just for my own understanding, can you clarify your statement about repr(transparent)? My understanding is that if Vec3 were defined as a repr(transparent) struct with a single [f32; 3] member, all padding issues go away. Indeed, this is similar advice to what you'd see in the C++ space. Is there something I'm not seeing there?

Unfortunately I can't provide any input on concerns about maintainability -- I think this kind of stuff just comes with the gamedev (and simulation) territory. At least this is user provided code, not library code in Bevy, so it's not like we have ECS internals doing this kind of thing.

@alice-i-cecile alice-i-cecile removed this from the 0.11 milestone Jun 19, 2023
@hymm hymm added this to the 0.13 milestone Oct 9, 2023
@alice-i-cecile alice-i-cecile removed this from the 0.13 milestone Jan 24, 2024
@ItsDoot ItsDoot added the S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! label Aug 19, 2024
@bas-ie
Contributor

bas-ie commented Oct 6, 2024

Backlog cleanup: this one has a long history of attempts to get it across the finish line! Closing for now due to inactivity, leaving in place the original tracking issue (#1990) and perhaps it could be adopted from there if still needed.

@bas-ie bas-ie closed this Oct 6, 2024
Labels
A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible C-Performance A change motivated by improving speed, memory usage or compile times D-Complex Quite challenging from either a design or technical perspective. Ask for help! S-Adopt-Me The original PR author has no intent to complete this work. Pick me up! X-Controversial There is active debate or serious implications around merging this PR
Projects
Status: Open PR
Development

Successfully merging this pull request may close these issues.

Batched ECS Query
8 participants