-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatically batch similar commands together to improve performance and robustness #10154
Comments
Custom commands are fine since they're still commands. The batching implementation I'm investigating (used by flecs) reduces the number of moves per entity from Those would interrupt the batching or we'd have to do things in a weird order (e.g. run commands first, deferred operations last). IMO just remove The only thing to note about custom commands is that batching would be able to "move" entity commands to run before them. (edit) If (edit 2) I'm not recommending we remove |
Since commands modify the shared state of the world, I don't think they could be reordered safely, unless we know ahead of time the set of all commands and all of their interactions, which is impossible with custom commands and will make adding new commands very difficult and fragile. By not reordering the commands, the user has more control over how the world is modified, which is critical when the commands are trying to reflect in the world modifications to the user's state. It also becomes the user's responsibility to ensure commands are added in a valid order, which is much easier to control and debug when the commands are executed in the order they are created in in the user's code. |
I'm pretty sure batching is orthogonal to this. Commands in general shouldn't even have errors, and all our built-in commands should be infallible. The panic that happens when an All we need to do is make a PR that (1) changes the panic to a
It's more nuanced than this. The batching implementation I am working on can and will re-order1 the observed effects of commands, but in a specific way. Only entity commands would be re-ordered. Take this queue for example.
The batching process will link commands on each entity so that we only move it once, to its final archetype. This will "shift" all the operations on So with batching, things will happen in this order.
The reality is that custom commands should never make implicit assumptions about which entities exist and what components they may or may not have. i.e. If you pushed Also, this behavior is how Footnotes
|
After more thinking, I agree. |
I think the core question here is what exactly is the purpose of I think the most valuable part of commands is the ability to delay mutating the world, thus allowing all of the non-mutating functionality to be executed in parallel. Giving all of the responsibility of mutating the world to
Given your example queue, imagine that I think that at best, it is possible to batch non-custom commands that don't cross custom command boundaries. And even that requires the engine developer to ensure that the type of commands being batched can be reordered safely, which increases the difficulty of developing the engine. |
Today, you can already write two systems that queue You don't know what systems plugins have scheduled. And if we merge #9822 (which I want to do), you won't even know exactly when commands are applied.
I don't see room for compromise. We can't make any exceptions because we will run into this exact same problem when we try to implement our own version of That kind of immediate reactivity will put us in a situation where custom user code is potentially running right after every entity command, and if custom code creates boundaries, then we'd be back to zero batching, which is not acceptable.
Commands have no outside context. A command can't assume what happened before it or what will happen after it. And that will be even more true with the reactivity I just mentioned. From my perspective, we have a lot to gain (performance and reactivity, the lack of which are real, tangible problems) and little to lose by batching over custom code. The fact that |
I think this could lead to very unintuitive code if taken to the extreme. For example, if we have the following system: fn my_system(mut commands: Commands) {
let entity = commands.spawn_empty().id();
commands.add(MyCustomCommand::new(entity));
commands.entity(entity).insert(Transform::default());
} I would find it very surprising to discover that
My main worry comes from my own experience with #10122; The core problem comes from systems that share mutable state through a resource: these systems are unware of their own execution order, and therefore they can only generate commands based on the state that they observe when they run. They always generate the correct commands for the state, and the purpose of these commands is to reflect the state into bevy's world. But then the commands don't get applied in the same order as they are spawned, which in the best case would lead to a desync between the state and bevy's world, and in the worst case, a panic. For my personal issue, infallible commands will work, but I'm just not sure this is a solution that scales: the current bevy commands can easily become infallible, but what if in the future a new command gets added where its not trivial to make it infallible? and even worse, now all users who implement custom commands will have to be aware of the delicacies of implementing custom commands that never fail and are completely order-independent. This can lead to extremely hard-to-debug bugs. I hope batching won't be problematic and yield good performance, but I just got burnt by #10122 and I hope I can help the future implementation avoid the problems I faced. |
That's fine. The docs should definitely call these things out. It's just pointless to implement batching if we also had to enforce a rule that it must stop at custom code. That's just not an actionable path to take.
"Infallible" was a poor choice of words on my part. I was only talking about the panic that happens when entity commands just exist in the "wrong order", like
If there are situations where a user can understandably write custom commands that depend on order (unrelated to the existing panic), it'd help to highlight some examples, because I'm drawing a blank. If those patterns exist, maybe custom commands aren't the right tool, something else could be better. Also, I only said "infallible" with respect to our built-in commands. If users want their custom commands to error, by all means, but I don't think those errors would result from stuff happening between the command's dispatch and its application. |
I agree, most of the benefit would be lost this way.
I think the main example here is using commands to sync up the different parts of the state while avoiding taking a mutable reference to them to enable parallelism: Of course, this example could be fixed by either guaranteeing that custom (and Given that I don't have a better example, I think I am coming around to the idea of batching, and maybe the ultimate solution for now is to add batching with good docs that detail the potentially surprising pitfalls that could catch users by surprise. If eventually someone has a real world example where batching is unsatisfactory, it should probably become configurable during application creation for users who would rather have ordered commands instead of faster commands. |
Is there any significance to taking a mutable reference to
In this specific example, I don't think A or B would have come from a plugin if they're updating the game score. Presumably they're both the user's code, so scheduling If two systems have no discernible direct or transitive order (i.e. it's a toss-up who runs first) and both take a mutable reference to the same thing, bevy won't pick an order, but they'll be reported by the ambiguity checker.
Mutable reference or not, if two systems with an ambiguous order both run and queue commands modifying This problem seems intrinsic to parallel system execution and the constraint-based scheduling approach.
A guarantee that command queues are applied in the same order the systems ran would help users debug/identify a case of #10122, but I don't think that would address a broader thing—if systems can have a non-deterministic order and this is inherited by their command queues, then their commands also run in a non-deterministic order. Something that's come up for discussion recently is whether or not the scheduler actually improves performance. Workloads that can actually benefit from parallel system execution—ones where the time saved by running systems at the same time is more than the time lost to the added overhead—are rare (most systems finish very quickly). In many of bevy's tests, the Anyway, making serial system execution the default (and also relegating the scheduler to a targeted optimization tool) is a possible outcome that would also resolve this issue. (edit: Worth pointing out that serial execution still allows for
I think it's too early to consider this. IMO the first response should be to see if there's a way a user can revise their code to do what they want. That pattern could then be highlighted in the documentation. Just because something doesn't work how a user initially expects, doesn't mean the library needs to meet those expectations (by making it optional). I personally don't think batching should be something the user can turn off (or something they should even want to turn off). If an app was "large enough", you probably couldn't even get away with turning it off. I don't like the idea of a common piece of advice ("turn batching off") only working for, say, jam games. (edit: For reference, commands are a known time sink for bevy, especially when it comes to preparing to render, and |
The difference is that it guarantees the systems will be executed after one another, and not in parallel. It makes sure they generate appropriate commands for the state they observe. You could also imagine these are complex systems that modify the game state in some manner.
The idea is that you don't care which system runs first because they will both generate appropriate commands for the state they observe: Essentially, they use the shared state to synchronize and ensure consistency. If As for performance, I think batching commands probably also requires benchmarks to see if it improves performance. Commands are probably even simpler than most systems (though I could be wrong on this), and the time it would take to batch them might be longer than simply executing them. This is certainly a worthy exploration though. Ultimately, we probably should get benchmarks to determine the tradeoffs, and under the assumption that there are tradeoffs, I would vote for configurability. |
I don't think that's ever a safe assumption. Mutable reference or not, one system's command will run first and then the other system's command won't see the state it was "expecting".
Of course, but moving an entity is pretty expensive. To move an entity from one table (and archetype) to another, If an entity has 10 components and you add 10 more with individual Batching would always reduce this to a single round of copying (and allocation) and it really only changes the internal order of operations to do so (the traversal to arrive at the final destination table/archetype must happen either way). The main overhead I foresee is the extra iteration(s) of the command queue. |
If
I didn't know bevy was that inefficient. This is certainly a very strong reason to believe batching could lead to major performance gains. I look forward to seeing it implemented. |
You haven't addressed what I've been saying about this problem still existing if you take the same setup and just change it so that Is that scenario meaningfully different?
(bolded emphasis mine) You've been putting forward this idea that conflicting mutable access imbues systems with some kind of implicit relationship, but I disagree with it. A system can only "care about" or "have expectations of" others if those are captured as explicit dependencies, either directly (e.g. We could guarantee that system command queues will be executed in the same order that the systems started (and that's pretty orthogonal to batching btw), but that would be invariant of the systems having mutable or immutable access. That guarantee also does not make "systems with conflicting access have a non-deterministic order" any less of a goof that can cause subtle bugs. I think the simplest guideline for users writing custom commands is: Custom commands should be self-contained. Don't write a command that makes any assumptions about being preceded or followed by any other commands. (If by some stretch of imagination that can't be upheld, we should probably add a mechanism for one command to queue another.) We could formally guarantee that custom commands are not disturbed relative to each other (again, batching only cares about moving built-in commands), but if the systems dispatching them have a non-deterministic order, you can still have bugs. |
It is different: it is analogues to multiple threads mutating the same memory location with or without locks. When using When using Its similar to the problem novice programmers run into when they try to use a if locked == false {
locked = true;
// do something
} else {
// try again later
} If there are 2 threads running this code simultaneously, they might both check the The key difference is apparent when the systems use the commands specifically to sync bevy's state to the resource's state:
Notice that the systems are completely unaware of each other. They don't require any specific ordering, but they do share a lock (the
I don't think that's orthogonal: if 2 systems sent commands to insert a bundle on the same entity, shouldn't they be batched?
That is true, but mutable and immutable access would implicitly change whether or not this order is observable by the systems.
The key is that they do have some order, something that can't be said for systems with no conflicts. Of course, this could still be a breeding ground for bugs, but can also be a powerful tool when its needed. |
I don't think that is "almost certainly" true (if it's an enum, sure), but I can't speak for what everyone is doing. I'm cool with "queues should be applied in the same order that systems run", I just don't agree that it follows from all this stuff about mutable access.
Yes. Matching the order command queues apply to the order systems start is independent of whether or not those commands get batched. If we had to concatenate the queues together to batch over all of their commands, they can still lined up in that order. |
You're right, of course its gonna depend on the situation, the system, and the state, and may or may not be a different command. The key is that the command can be adapted to the new state.
You're right again, it doesn't follow. Rather, the '
That means that given 3 systems, the commands from the last system could execute before some of the commands from the second system. That's not exactly system order, but I don't think it introduces any problems we haven't resolved yet. |
That is unrelated. If you don't use |
The order of system execution can be randomized every single frame. The important part is that using |
I see. You're right that the mutable access could change at which point a system runs each frame. I read the issue you linked above and that does demonstrate your point. I don't think that either example is affected by batching since batching would affect the order of non-batched vs batched commands but not the ordering within each group. I wonder how #9822 plays into this. |
I apologize if my |
That makes sense :) |
The connection to batching is as follows: If we have 3 systems that are synchronized using |
This PR have multiple purposes, but unfortunately they highly related, so I can't split it into multiple PRs. Right for prediction crates like https://github.com/Bendzae/bevy_replicon_snap or https://github.com/RJ/bevy_timewarp we rely on overriding deserialization and remove functions. But this approach have 3 major drawbacks: 1. In order to customize serialization for interpolated or predicted components, user need to copy the code from the crate or use some crate-provided helpers. 2. Serialization function is is unsafe due to pointer manipulation. 3. If you want to override writing or removal only for some entities, it's costly to check if a component present on each writing. Marker-based API solves all of the above. Now for replication user register only `serialize`, `deserialize` and newly added `deserialize_in_place`. These functions assigned to rules as before. So instead of `ComponentFns`, we now have `SerdeFns`. And these function now only do one thing - serialization/deserialization. All other logic now represented by `CommandFns`. Unlike `SerdeFns`, this struct created for each component only once and automatically when user register a `SerdeFns` for a component. It uses default `write` and `remove` functions. To override it, user can register a marker and assign a custom functions for each component. We use markers instead of archetypes because clients do not know in advance the archetype of the entity being deserialized. Then on receive we collect a reusable `Vec<bool>` for each marker (by checking if a marker is present) and based on it pick `write`/`remove` functions for each component. We pick first marker that have a registration for the current component and present on an entity (`true` in the `Vec`). Markers are sorted by priority customizable by user. Since for deserialization we call two functions now (`write` and then `deserialize`), we can do a nice trick to remove `unsafe` from ser/de customization and make it even more flexible! Since `write` knows the component type of `deserialize`, we can store a type-erased function pointers and let `write` "restore" the type. "restoration" is done by calling `transmute`, but we abstract it inside `SerdeFns` and check the type if `debug_assertions` enabled. So `serialize` and `deserialize` now just a normal functions with a very simple signature. This also unlocks the usage of `deserialize_in_place` that fallback into `deserialize` if not overridden by user. Similar trick is done for `serialize`, except `read` is non-overridable by user and only used to remove extra unsafety from the public API. The mentioned `read` and `write` are still unsafe since it's possible to pass `SerdeFns` that was created with a different type. And lastly, instead of using `EntityWorldMut` for writing and removal, we use `EntityMut` and `Commands`. Using `EntityWorldMut` have 2 major drawbacks: - You can't borrow a component from `EntityWorldMut` for in-place deserialization and `ClientMapper` at the same time. - Each insertion creates a new archetype. With commands we can spawn new entities inside `ClientMapper` and it will be possible to batch insertions in the future, see: bevyengine/bevy#10154 Huge thanks to @NiseVoid for the idea! --------- Co-authored-by: UkoeHB <[email protected]>
This would be useful for networking. When you deserialize received components, you don't know all of them in advance. Batching adjacent commands would reduce archetype moves. |
To implement this, we need to somehow split apart commands on the basis of what they do. Our flexible "do-anything opaquely" command design is the core problem to overcome. There are three possible designs:
I prefer 3, as it's the least breaking and most flexible. I'm quite opposed to 1, as the migration path is quite painful. |
I think that observers have an overlapping use case with |
# Objective #16132 introduced entity cloning functionality, and while it works and is useful, it can be made faster. This is the promised follow-up to improve performance. ## Solution **PREFACE**: This is my first time writing `unsafe` in rust and I have only vague idea about what I'm doing. I would encourage reviewers to scrutinize `unsafe` parts in particular. The solution is to clone component data to an intermediate buffer and use `EntityWorldMut::insert_by_ids` to insert components without additional archetype moves. To facilitate this, `EntityCloner::clone_entity` now reads all components of the source entity and provides clone handlers with the ability to read component data straight from component storage using `read_source_component` and write to an intermediate buffer using `write_target_component`. `ComponentId` is used to check that requested type corresponds to the type available on source entity. Reflect-based handler is a little trickier to pull of: we only have `&dyn Reflect` and no direct access to the underlying data. `ReflectFromPtr` can be used to get `&dyn Reflect` from concrete component data, but to write it we need to create a clone of the underlying data using `Reflect`. For this reason only components that have `ReflectDefault` or `ReflectFromReflect` or `ReflectFromWorld` can be cloned, all other components will be skipped. The good news is that this is actually only a temporary limitation: once #13432 lands we will be able to clone component without requiring one of these `type data`s. This PR also introduces `entity_cloning` benchmark to better compare changes between the PR and main, you can see the results in the **showcase** section. ## Testing - All previous tests passing - Added test for fast reflect clone path (temporary, will be removed after reflection-based cloning lands) - Ran miri ## Showcase Here's a table demonstrating the improvement: | **benchmark** | **main, avg** | **PR, avg** | **change, avg** | | ----------------------- | ------------- | ----------- | --------------- | | many components reflect | 18.505 µs | 2.1351 µs | -89.095% | | hierarchy wide reflect* | 22.778 ms | 4.1875 ms | -81.616% | | hierarchy tall reflect* | 107.24 µs | 26.322 µs | -77.141% | | hierarchy many reflect | 78.533 ms | 9.7415 ms | -87.596% | | many components clone | 1.3633 µs | 758.17 ns | -45.937% | | hierarchy wide clone* | 2.7716 ms | 3.3411 ms | +20.546% | | hierarchy tall clone* | 17.646 µs | 20.190 µs | +17.379% | | hierarchy many clone | 5.8779 ms | 4.2650 ms | -27.439% | *: these benchmarks have entities with only 1 component ## Considerations Once #10154 is resolved a large part of the functionality in this PR will probably become obsolete. It might still be a little bit faster than using command batching, but the complexity might not be worth it. ## Migration Guide - `&EntityCloner` in component clone handlers is changed to `&mut ComponentCloneCtx` to better separate data. - Changed `EntityCloneHandler` from enum to struct and added convenience functions to add default clone and reflect handler more easily. --------- Co-authored-by: Alice Cecile <[email protected]> Co-authored-by: Chris Russell <[email protected]>
Has anybody discussed the alternative to batching, which is to optimize the APIs used by command implementations, such that executing the command buffer isn't the expensive part? Specifically: what if we made it so that the fundamental entity & component storage in This would allow us to keep our nice, simple, ordered, synchronous, strongly consistent APIs around commands, hooks & observers - whilst also mitigating the spurious archetype moves that arise from them. E.g.: a hypothetical body of work might be:
Of course this approach means some operations via a [1, 2]: My colleague has some ideas here, but I've run out of steam for typing it up in this message, it's long enough as-is! PS: If this idea is unpalatable, I have a completely different idea that I've thought through less, which is along the lines of changing hooks, observers and custom commands to "operate in planning space", in the same way that some planning algorithms work in "planning space" instead of "state space". I can talk more about that if people express interest, though. |
That's very interesting. The way I'm interpreting what you're proposing is to keep a second "copy" of an entity's components, in a sparse set or packed, object-like format. Not a literal copy of the components (since components are not required to implement Like, commands would populate or take away from the sparse/object copy, and then the final values can be copied to the table at the end of flush. In the meantime, when observers are reacting, the world can route lookups to the sparse/object copy. I agree that this would accomplish the same goal of compacting/accumulating the structural changes affecting an entity so that we only have to copy its components once. (I think both strategies have a lot in common with compaction in log-structured databases.) I do wonder how much internal complexity this would introduce when it comes to routing queries to the latest data, but overall, I think it's definitely worth investigating. A nice benefit of what you're suggesting would be that nothing gets re-ordered. |
# Objective bevyengine#16132 introduced entity cloning functionality, and while it works and is useful, it can be made faster. This is the promised follow-up to improve performance. ## Solution **PREFACE**: This is my first time writing `unsafe` in rust and I have only vague idea about what I'm doing. I would encourage reviewers to scrutinize `unsafe` parts in particular. The solution is to clone component data to an intermediate buffer and use `EntityWorldMut::insert_by_ids` to insert components without additional archetype moves. To facilitate this, `EntityCloner::clone_entity` now reads all components of the source entity and provides clone handlers with the ability to read component data straight from component storage using `read_source_component` and write to an intermediate buffer using `write_target_component`. `ComponentId` is used to check that requested type corresponds to the type available on source entity. Reflect-based handler is a little trickier to pull of: we only have `&dyn Reflect` and no direct access to the underlying data. `ReflectFromPtr` can be used to get `&dyn Reflect` from concrete component data, but to write it we need to create a clone of the underlying data using `Reflect`. For this reason only components that have `ReflectDefault` or `ReflectFromReflect` or `ReflectFromWorld` can be cloned, all other components will be skipped. The good news is that this is actually only a temporary limitation: once bevyengine#13432 lands we will be able to clone component without requiring one of these `type data`s. This PR also introduces `entity_cloning` benchmark to better compare changes between the PR and main, you can see the results in the **showcase** section. ## Testing - All previous tests passing - Added test for fast reflect clone path (temporary, will be removed after reflection-based cloning lands) - Ran miri ## Showcase Here's a table demonstrating the improvement: | **benchmark** | **main, avg** | **PR, avg** | **change, avg** | | ----------------------- | ------------- | ----------- | --------------- | | many components reflect | 18.505 µs | 2.1351 µs | -89.095% | | hierarchy wide reflect* | 22.778 ms | 4.1875 ms | -81.616% | | hierarchy tall reflect* | 107.24 µs | 26.322 µs | -77.141% | | hierarchy many reflect | 78.533 ms | 9.7415 ms | -87.596% | | many components clone | 1.3633 µs | 758.17 ns | -45.937% | | hierarchy wide clone* | 2.7716 ms | 3.3411 ms | +20.546% | | hierarchy tall clone* | 17.646 µs | 20.190 µs | +17.379% | | hierarchy many clone | 5.8779 ms | 4.2650 ms | -27.439% | *: these benchmarks have entities with only 1 component ## Considerations Once bevyengine#10154 is resolved a large part of the functionality in this PR will probably become obsolete. It might still be a little bit faster than using command batching, but the complexity might not be worth it. ## Migration Guide - `&EntityCloner` in component clone handlers is changed to `&mut ComponentCloneCtx` to better separate data. - Changed `EntityCloneHandler` from enum to struct and added convenience functions to add default clone and reflect handler more easily. --------- Co-authored-by: Alice Cecile <[email protected]> Co-authored-by: Chris Russell <[email protected]>
What problem does this solve or what need does it fill?
Commands are currently applied one at a time, in system graph order, and within that in the order they were sent within systems (FIFO). This causes some problems:
EntityCommands
create useless temporary archetypes #5074What solution would you like?
Automatically batch adjacent similar commands together.
What alternative(s) have you considered?
We could also reorder commands to reduce surprising errors and confusing interactions. A sample ordering might be:
This would avoid a large number of spurious failures without needing explicit command error handling or system ordering. I'm unsure if that should be done as part of this work, or even at all however.
Additional context
@maniwani has been investigating this, and thinks we should remove the general purpose
Deferred
system parameter to reduce the risk of unbatchable operations. I'm personally less convinced: wouldn't those arguments similarly apply to any custom commands?The text was updated successfully, but these errors were encountered: