Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically batch similar commands together to improve performance and robustness #10154

Open
alice-i-cecile opened this issue Oct 17, 2023 · 28 comments
Labels
A-ECS Entities, components, systems, and events A-Networking Sending data between clients, servers and machines C-Performance A change motivated by improving speed, memory usage or compile times C-Usability A targeted quality-of-life change that makes Bevy easier to use D-Complex Quite challenging from either a design or technical perspective. Ask for help!

Comments

@alice-i-cecile
Copy link
Member

alice-i-cecile commented Oct 17, 2023

What problem does this solve or what need does it fill?

Commands are currently applied one at a time, in system graph order, and within that in the order they were sent within systems (FIFO). This causes some problems:

What solution would you like?

Automatically batch adjacent similar commands together.

What alternative(s) have you considered?

We could also reorder commands to reduce surprising errors and confusing interactions. A sample ordering might be:

  • Spawn
  • Insert
  • Remove
  • Despawn

This would avoid a large number of spurious failures without needing explicit command error handling or system ordering. I'm unsure if that should be done as part of this work, or even at all however.

Additional context

@maniwani has been investigating this, and thinks we should remove the general purpose Deferred system parameter to reduce the risk of unbatchable operations. I'm personally less convinced: wouldn't those arguments similarly apply to any custom commands?

@alice-i-cecile alice-i-cecile added A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times C-Usability A targeted quality-of-life change that makes Bevy easier to use D-Complex Quite challenging from either a design or technical perspective. Ask for help! labels Oct 17, 2023
@maniwani
Copy link
Contributor

maniwani commented Oct 17, 2023

I'm personally less convinced: wouldn't those arguments similarly apply to any custom commands?

Custom commands are fine since they're still commands.

The batching implementation I'm investigating (used by flecs) reduces the number of moves per entity from N to 1. The more commands we can "batch" (i.e. all queued commands from all systems at the current sync point), the fewer moves. That kind of "bird's eye view" over all queued commands can't happen if there are deferred operations that aren't commands.

Those would interrupt the batching or we'd have to do things in a weird order (e.g. run commands first, deferred operations last). IMO just remove Deferred. Commands are general-purpose. Why do we need Deferred when custom commands exist?

The only thing to note about custom commands is that batching would be able to "move" entity commands to run before them.

(edit) If Deferred exists to avoid allocating on the heap like Commands does with CommandQueue, I intend to address that with the batching implementation. There are too many queues. I'd consolidate things so each system has a single, thread-safe command queue. No matter how many Commands params a system has, they'll all reference its one queue.

(edit 2) I'm not recommending we remove Deferred before implementing batching. I'd probably do both in the same PR.

@RedMindZ
Copy link

We could also reorder commands to reduce surprising errors and confusing interactions

Since commands modify the shared state of the world, I don't think they could be reordered safely, unless we know ahead of time the set of all commands and all of their interactions, which is impossible with custom commands and will make adding new commands very difficult and fragile.

By not reordering the commands, the user has more control over how the world is modified, which is critical when the commands are trying to reflect in the world modifications to the user's state. It also becomes the user's responsibility to ensure commands are added in a valid order, which is much easier to control and debug when the commands are executed in the order they are created in in the user's code.

@maniwani
Copy link
Contributor

maniwani commented Oct 17, 2023

We could also reorder commands to reduce surprising errors and confusing interactions.

This would avoid a large number of spurious failures without needing explicit command error handling or system ordering.

I'm pretty sure batching is orthogonal to this.

Commands in general shouldn't even have errors, and all our built-in commands should be infallible.

The panic that happens when an insert or remove comes after a despawn is something we just arbitrarily do. They should just be ignored (with a debug! warning). It isn't a bug if two different systems from two different plugins want to despawn an entity at the same time for different reasons, or if a plugin system wants to despawn an entity at the same time user code wants to give it a component, assuming it'll still exist. Batching does not add anything new to this equation.

All we need to do is make a PR that (1) changes the panic to a debug! warning and (2) privates all fields of Insert, Remove, and Despawn so that users can only make these commands through EntityCommands. You can't get an EntityCommands for an entity unless that entity is alive (or will be), so all commands created through EntityCommands must be considered valid.

Since commands modify the shared state of the world, I don't think they could be reordered safely, unless we know ahead of time the set of all commands and all of their interactions, which is impossible with custom commands and will make adding new commands very difficult and fragile.

It's more nuanced than this. The batching implementation I am working on can and will re-order1 the observed effects of commands, but in a specific way. Only entity commands would be re-ordered.

Take this queue for example.

1. spawn e1
2. spawn e2
3. add Mass to e1
4. add Velocity to e2
5. add Velocity to e1
6. <custom command 1>
7. add Mass to e2
8. <custom command 2>
9. spawn e3

The batching process will link commands on each entity so that we only move it once, to its final archetype. This will "shift" all the operations on e1 ahead of all operations on e2 and "shift" the last e2 command ahead of <custom command 1>.

So with batching, things will happen in this order.

1. spawn e1 with Mass and Velocity
2. spawn e2 with Mass and Velocity
3. <custom command 1>
4. <custom command 2>
5. spawn e3

The reality is that custom commands should never make implicit assumptions about which entities exist and what components they may or may not have. i.e. If you pushed <custom command 1> assuming that Mass won't have been added to e2 yet, that isn't safe to assume. I don't think we explicitly encouraged assuming things before, but it'd be actively discouraged after this.

Also, this behavior is how flecs (a real ECS library that is currently faster and has more features than bevy_ecs) does command batching2, so it is not impossible to implement or unbearably disruptive to users.

Footnotes

  1. (edit) Before anyone says anything about performance, when I say "re-order" and "shift", I don't mean moving the commands around in memory. The command queue will just be iterated twice. Once to link the commands for the same entity together and a second time to apply.

  2. flecs does not support custom commands, but it says the same about its "observers" (which are custom user callbacks). There are some other, minor differences due to how flecs runs systems differently, but those aren't particularly important.

@alice-i-cecile
Copy link
Member Author

I'm pretty sure batching is orthogonal to this.

After more thinking, I agree.

@RedMindZ
Copy link

I think the core question here is what exactly is the purpose of Commands:
According to the docs:
Each command can be used to modify the World in arbitrary ways

I think the most valuable part of commands is the ability to delay mutating the world, thus allowing all of the non-mutating functionality to be executed in parallel. Giving all of the responsibility of mutating the world to Commands means that commands should have the full power to mutate the world in any order, and that power is lost by reordering the commands.
It is possible to instead delegate some of the responsibility of mutating the world to exclusive systems, but that should really be a last resort, given that they 'pause the world'. In some situations, it might be that case that all of the performance gained by batching the commands is dwarfed by the time wasted waiting for an exclusive system to mutate the world. Although, applying the commands is essentially the same as executing an exclusive system, so moving the work from commands to exclusive systems might not actually have a performance difference.

The reality is that custom commands should never make implicit assumptions about which entities exist and what components they may or may not have

Given your example queue, imagine that <custom command 1> ends up reducing to add Mass to e2. This could be the case if <custom command 1> is a more complex command, that in this specific invocation only ends up adding the Mass component to e2. Without batching, the final mass of the entity will be the one specified in command 7, but with batching the mass will be the one specified in command 6. Even thought the custom command didn't assume anything about the other commands, the results end up different.

I think that at best, it is possible to batch non-custom commands that don't cross custom command boundaries. And even that requires the engine developer to ensure that the type of commands being batched can be reordered safely, which increases the difficulty of developing the engine.

@maniwani
Copy link
Contributor

maniwani commented Oct 17, 2023

Given your example queue, imagine that <custom command 1> ends up reducing to add Mass to e2. This could be the case if <custom command 1> is a more complex command, that in this specific invocation only ends up adding the Mass component to e2.

Without batching, the final mass of the entity will be the one specified by command 7, but with batching the mass will be the one specified in command 6. Even thought the custom command didn't assume anything about the other commands, the results end up different.

Today, you can already write two systems that queue add T to e1, with those commands set to be applied at the same sync point. You can't assume which one runs first unless you ordered the systems. It's already fragile. You already can't rely on a total order of commands in all cases.

You don't know what systems plugins have scheduled. And if we merge #9822 (which I want to do), you won't even know exactly when commands are applied.

I think that at best, it is possible to batch non-custom commands that don't cross custom command boundaries. And even that requires the engine developer to ensure that the type of commands being batched can be reordered safely, which increases the difficulty of developing the engine.

I don't see room for compromise. We can't make any exceptions because we will run into this exact same problem when we try to implement our own version of flecs observers (custom callbacks with the same powers as custom commands) to react to things like "T was added to an entity".

That kind of immediate reactivity will put us in a situation where custom user code is potentially running right after every entity command, and if custom code creates boundaries, then we'd be back to zero batching, which is not acceptable.

Giving all of the responsibility of mutating the world to Commands means that commands should have the full power to mutate the world in any order, and that power is lost by reordering the commands.

Commands have no outside context. A command can't assume what happened before it or what will happen after it. And that will be even more true with the reactivity I just mentioned.

From my perspective, we have a lot to gain (performance and reactivity, the lack of which are real, tangible problems) and little to lose by batching over custom code. The fact that flecs has adopted the behavior I described (it's where I got the idea from) and has never looked back is a pretty strong indicator that it won't be nearly as bad as you're worrying it will be.

@RedMindZ
Copy link

A command can't assume what happened before it or what will happen after it

I think this could lead to very unintuitive code if taken to the extreme. For example, if we have the following system:

fn my_system(mut commands: Commands) {
    let entity = commands.spawn_empty().id();
    commands.add(MyCustomCommand::new(entity));
    commands.entity(entity).insert(Transform::default());
}

I would find it very surprising to discover that Insert was executed before MyCustomCommand. If this is the path we take, there should at least be a warning in the docs of Commands that commands can execute in an arbitrary order.

The fact that flecs has adopted the behavior I described (it's where I got the idea from) and has never looked back is a pretty strong indicator that it won't be nearly as bad as you're worrying it will be

My main worry comes from my own experience with #10122; The core problem comes from systems that share mutable state through a resource: these systems are unware of their own execution order, and therefore they can only generate commands based on the state that they observe when they run. They always generate the correct commands for the state, and the purpose of these commands is to reflect the state into bevy's world. But then the commands don't get applied in the same order as they are spawned, which in the best case would lead to a desync between the state and bevy's world, and in the worst case, a panic.

For my personal issue, infallible commands will work, but I'm just not sure this is a solution that scales: the current bevy commands can easily become infallible, but what if in the future a new command gets added where its not trivial to make it infallible? and even worse, now all users who implement custom commands will have to be aware of the delicacies of implementing custom commands that never fail and are completely order-independent. This can lead to extremely hard-to-debug bugs.

I hope batching won't be problematic and yield good performance, but I just got burnt by #10122 and I hope I can help the future implementation avoid the problems I faced.

@maniwani
Copy link
Contributor

maniwani commented Oct 18, 2023

I would find it very surprising to discover that Insert was executed before MyCustomCommand. If this is the path we take, there should at least be a warning in the docs of Commands that commands can execute in an arbitrary order.

That's fine. The docs should definitely call these things out. It's just pointless to implement batching if we also had to enforce a rule that it must stop at custom code. That's just not an actionable path to take.

What if in the future a new command gets added where its not trivial to make it infallible?

"Infallible" was a poor choice of words on my part. I was only talking about the panic that happens when entity commands just exist in the "wrong order", like remove being in line after a despawn command. When we finally decide to stop panicking over that, I can't imagine a command we could possibly add that would introduce a new ordering dependency.

users will have to be aware of the delicacies of implementing custom commands

If there are situations where a user can understandably write custom commands that depend on order (unrelated to the existing panic), it'd help to highlight some examples, because I'm drawing a blank. If those patterns exist, maybe custom commands aren't the right tool, something else could be better.

Also, I only said "infallible" with respect to our built-in commands. If users want their custom commands to error, by all means, but I don't think those errors would result from stuff happening between the command's dispatch and its application.

@RedMindZ
Copy link

It's just pointless to implement batching if we also had to enforce a rule that it must stop at custom code

I agree, most of the benefit would be lost this way.

If there are situations where a user can understandably write custom commands that depend on order (unrelated to the existing panic), it'd help to highlight some examples

I think the main example here is using commands to sync up the different parts of the state while avoiding taking a mutable reference to them to enable parallelism:
Let's say we have 2 systems (A and B) that take a mutable reference to GameState. Both of these systems want to update the Score resource based on the current GameState, but they don't even take a reference to the Score to reduce the dependency between them and other systems. Now, system A correctly responds to the game state and adds the command Score += 5. Then, system B runs, correctly responds to the state, and adds the command Score *= 2. If the order of the commands from A and B gets switched, the score would end up incorrect.
In this example, we could also replace the Score commands with Insert commands of the same component, and if the order is unreliable, you may end up with the component inserted by A rather then the component inserted by B.

Of course, this example could be fixed by either guaranteeing that custom (and Insert) commands maintain their relative order, or simply by taking a mutable reference to the score. I don't think this example is perfect, but I think it hints at a potential issue.

Given that I don't have a better example, I think I am coming around to the idea of batching, and maybe the ultimate solution for now is to add batching with good docs that detail the potentially surprising pitfalls that could catch users by surprise. If eventually someone has a real world example where batching is unsatisfactory, it should probably become configurable during application creation for users who would rather have ordered commands instead of faster commands.

@maniwani
Copy link
Contributor

maniwani commented Oct 18, 2023

Let's say we have 2 systems (A and B) that take a mutable reference to GameState. Both of these systems want to update the Score resource based on the current GameState, but they don't even take a reference to the Score to reduce the dependency between them and other systems. Now, system A correctly responds to the game state and adds the command Score += 5. Then, system B runs, correctly responds to the state, and adds the command Score *= 2.

Is there any significance to taking a mutable reference to GameState in this example? The problem would still happen if both systems had a Res<GameState> instead.

If the order of the commands from A and B gets switched, the score would end up incorrect.

In this specific example, I don't think A or B would have come from a plugin if they're updating the game score. Presumably they're both the user's code, so scheduling B.after(A) would be the appropriate resolution. If you know a correct order, you have to tell the scheduler. Are you expecting app.add_systems(Foo, (A, B)) to just infer B.after(A) from the tuple order and both systems having a ResMut<GameState>?

If two systems have no discernible direct or transitive order (i.e. it's a toss-up who runs first) and both take a mutable reference to the same thing, bevy won't pick an order, but they'll be reported by the ambiguity checker.

I think the main example here is using commands to sync up the different parts of the state while avoiding taking a mutable reference to them to enable parallelism.

Mutable reference or not, if two systems with an ambiguous order both run and queue commands modifying X, then you can't know the final value of X with certainty.

This problem seems intrinsic to parallel system execution and the constraint-based scheduling approach.

Of course, this example could be fixed by either guaranteeing that custom (and Insert) commands maintain their relative order, or simply by taking a mutable reference to the score. I don't think this example is perfect, but I think it hints at a potential issue.

A guarantee that command queues are applied in the same order the systems ran would help users debug/identify a case of #10122, but I don't think that would address a broader thing—if systems can have a non-deterministic order and this is inherited by their command queues, then their commands also run in a non-deterministic order.

Something that's come up for discussion recently is whether or not the scheduler actually improves performance. Workloads that can actually benefit from parallel system execution—ones where the time saved by running systems at the same time is more than the time lost to the added overhead—are rare (most systems finish very quickly). In many of bevy's tests, the SingleThreadedExecutor performs roughly the same or slightly better than the MultiThreadedExecutor, and we lack data from actual games to know if/where that changes.

Anyway, making serial system execution the default (and also relegating the scheduler to a targeted optimization tool) is a possible outcome that would also resolve this issue.

(edit: Worth pointing out that serial execution still allows for rayon-esque parallel iteration inside the systems.)

If eventually someone has a real world example where batching is unsatisfactory, it should probably become configurable during application creation for users who would rather have ordered commands instead of faster commands.

I think it's too early to consider this. IMO the first response should be to see if there's a way a user can revise their code to do what they want. That pattern could then be highlighted in the documentation.

Just because something doesn't work how a user initially expects, doesn't mean the library needs to meet those expectations (by making it optional). I personally don't think batching should be something the user can turn off (or something they should even want to turn off). If an app was "large enough", you probably couldn't even get away with turning it off. I don't like the idea of a common piece of advice ("turn batching off") only working for, say, jam games.

(edit: For reference, commands are a known time sink for bevy, especially when it comes to preparing to render, and flecs's author shared some info about the performance impact of batching in flecs in this Discord discussion. So my speculation is the difference in performance is very significant and why sufficiently complex apps would simply be unable to turn it off.)

@RedMindZ
Copy link

Is there any significance to taking a mutable reference to GameState in this example? The problem would still happen if both systems had a Res<GameState> instead.

The difference is that it guarantees the systems will be executed after one another, and not in parallel. It makes sure they generate appropriate commands for the state they observe. You could also imagine these are complex systems that modify the game state in some manner.

Presumably they're both the user's code, so scheduling B.after(A) would be the appropriate resolution

The idea is that you don't care which system runs first because they will both generate appropriate commands for the state they observe: Essentially, they use the shared state to synchronize and ensure consistency. If A runs first and modifies the state and generates a command, B will observe the new state and will generate a command appropriate for the new state. A and B don't care about each other, but they make the (arguably) incorrect assumption that when they observe GameState, their commands will observe the equivalent bevy World. Maybe systems should not make this assumption, but I think it can be useful at times. That's why I suggested making it configurable (the same way you can turn on or off the multi-threaded executor).

As for performance, I think batching commands probably also requires benchmarks to see if it improves performance. Commands are probably even simpler than most systems (though I could be wrong on this), and the time it would take to batch them might be longer than simply executing them. This is certainly a worthy exploration though.

Ultimately, we probably should get benchmarks to determine the tradeoffs, and under the assumption that there are tradeoffs, I would vote for configurability.

@maniwani
Copy link
Contributor

The idea is that you don't care which system runs first because they will both generate appropriate commands for the state they observe.

A and B don't care about each other, but they make the (arguably) incorrect assumption that when they observe GameState, their commands will observe the equivalent bevy World. Maybe systems should not make this assumption, but I think it can be useful at times.

I don't think that's ever a safe assumption. Mutable reference or not, one system's command will run first and then the other system's command won't see the state it was "expecting".

As for performance, I think batching commands probably also requires benchmarks to see if it improves performance. Commands are probably even simpler than most systems (though I could be wrong on this), and the time it would take to batch them might be longer than simply executing them. This is certainly a worthy exploration though.

Of course, but moving an entity is pretty expensive. To move an entity from one table (and archetype) to another, bevy has to allocate a new element in each of the destination table's columns and then copy every one the entity's existing components to that new row.

If an entity has 10 components and you add 10 more with individual insert commands, that'll be 10 moves, so bevy would copy data (and potentially reallocate component vectors) 145 times (10 components, then 11, and so on), all for one entity. So lots of entities changing their composition lots of times results in a lot of time spent.

Batching would always reduce this to a single round of copying (and allocation) and it really only changes the internal order of operations to do so (the traversal to arrive at the final destination table/archetype must happen either way). The main overhead I foresee is the extra iteration(s) of the command queue.

@RedMindZ
Copy link

one system's command will run first and then the other system's command won't see the state it was "expecting"

If A runs before B, then B expects seeing the state after it has been modified by A. So if A sends the command Score = 5 and B sends the command Score *= 2, then B's commands expects to see Score = 5 when they execute, to then later set the final score to 10.

If an entity has 10 components and you add 10 more with individual insert commands, that'll be 10 moves, so bevy would copy data (and potentially reallocate component vectors) 145 times

I didn't know bevy was that inefficient. This is certainly a very strong reason to believe batching could lead to major performance gains. I look forward to seeing it implemented.

@maniwani
Copy link
Contributor

maniwani commented Oct 18, 2023

You haven't addressed what I've been saying about this problem still existing if you take the same setup and just change it so that A and B both have Res<GameState> instead of ResMut<GameState>. Same commands.

Is that scenario meaningfully different?

A and B don't care about each other.

If A runs before B, then B expects seeing the state after it has been modified by A.

(bolded emphasis mine)

You've been putting forward this idea that conflicting mutable access imbues systems with some kind of implicit relationship, but I disagree with it. A system can only "care about" or "have expectations of" others if those are captured as explicit dependencies, either directly (e.g. A.before(C)) or transitively (e.g. A.before(B), B.before(C) or (A, B, C).chain()). Only those count.

We could guarantee that system command queues will be executed in the same order that the systems started (and that's pretty orthogonal to batching btw), but that would be invariant of the systems having mutable or immutable access. That guarantee also does not make "systems with conflicting access have a non-deterministic order" any less of a goof that can cause subtle bugs.


I think the simplest guideline for users writing custom commands is: Custom commands should be self-contained. Don't write a command that makes any assumptions about being preceded or followed by any other commands. (If by some stretch of imagination that can't be upheld, we should probably add a mechanism for one command to queue another.)

We could formally guarantee that custom commands are not disturbed relative to each other (again, batching only cares about moving built-in commands), but if the systems dispatching them have a non-deterministic order, you can still have bugs.

@RedMindZ
Copy link

Is that scenario meaningfully different?

It is different: it is analogues to multiple threads mutating the same memory location with or without locks.

When using ResMut, the system knows it is the only one with access to the state, meaning that it observes the 'real', most up-to-date state. It also knows that any modifications to the state it performs will be seen by all other systems. This eliminates race conditions and guarantees that there will be some order, even if that order is random every frame. This case is analogues to acquiring a lock on the memory before reading it to ensure the read value is consistent, and the written value is consistent.

When using Res, the system doesn't know that it is the only one operating according to the current state. If 2 systems send a command to modify the state, that means that neither of them is responding to the state as it would be after the other system updated it. This case is analogues to reading a value from memory, modifying it, and then writing it back with no locks. If 2 systems use Res, this becomes a race condition.

Its similar to the problem novice programmers run into when they try to use a bool as a lock and they write something like this:

if locked == false {
    locked = true;
    // do something
} else {
    // try again later
}

If there are 2 threads running this code simultaneously, they might both check the if at the same time, see its unlocked, and they will both think they acquired the lock.

The key difference is apparent when the systems use the commands specifically to sync bevy's state to the resource's state:
One system modifies the resource, and sends a command to update bevy.
The other system will then observe the new state of the resource, and send a command appropriate for the new state. This is almost certainly a different command then the one that would have been sent if the old state was observed.
You can essentially think about the resource as a communication mechanism between threads, and consistent communication between threads requires locks, hence the ResMut. This is not an explicit communication mechanism since the systems don't mutate it with the intention to inform other systems, but rather an implicit communication mechanism.

You've been putting forward this idea that conflicting mutable access imbues systems with some kind of implicit relationship, but I disagree with it.

Notice that the systems are completely unaware of each other. They don't require any specific ordering, but they do share a lock (the ResMut). Since they do share a lock, there is an implicit relationship there that guarantees some order will exist, even if its different every frame. Res, on the other hand, is like having no lock at all, and then there is truly no relationship between the systems, and no order at all is guaranteed to exist. When I say 'no order at all', I mean that the systems can't agree on who modified the state first, since they both saw the first state, and they both 'think' they are the first to change it. In the ResMut case, all systems can agree on the order in which modifications happened, since each system saw a different state.

We could guarantee that system command queues will be executed in the same order that the systems started (and that's pretty orthogonal to batching btw) ...

I don't think that's orthogonal: if 2 systems sent commands to insert a bundle on the same entity, shouldn't they be batched?

... but that would be invariant of the systems having mutable or immutable access ...

That is true, but mutable and immutable access would implicitly change whether or not this order is observable by the systems.

... That guarantee also does not make "systems with conflicting access have a non-deterministic order" any less of a goof that can cause subtle bugs.

The key is that they do have some order, something that can't be said for systems with no conflicts. Of course, this could still be a breeding ground for bugs, but can also be a powerful tool when its needed.

@maniwani
Copy link
Contributor

One system modifies the resource, and sends a command to update bevy. The other system will then observe the new state of the resource, and send a command appropriate for the new state. This is almost certainly a different command then the one that would have been sent if the old state was observed.

Of course, this could still be a breeding ground for bugs, but can also be a powerful tool when its needed.

I don't think that is "almost certainly" true (if it's an enum, sure), but I can't speak for what everyone is doing.

I'm cool with "queues should be applied in the same order that systems run", I just don't agree that it follows from all this stuff about mutable access.

I don't think that's orthogonal: If 2 systems sent commands to insert a bundle on the same entity, shouldn't they be batched?

Yes.

Matching the order command queues apply to the order systems start is independent of whether or not those commands get batched. If we had to concatenate the queues together to batch over all of their commands, they can still lined up in that order.

@RedMindZ
Copy link

I don't think that is "almost certainly" true

You're right, of course its gonna depend on the situation, the system, and the state, and may or may not be a different command. The key is that the command can be adapted to the new state.

I'm cool with "queues should be applied in the same order that systems run", I just don't agree that it follows from all this stuff about mutable access.

You're right again, it doesn't follow. Rather, the 'ResMut is a lock' idea enables systems to take advantage of this in the first place. Without ResMut acting as a lock, systems can't agree on whose commands should run first.

If we had to concatenate the queues together to batch over all of their commands, they can still lined up in that order.

That means that given 3 systems, the commands from the last system could execute before some of the commands from the second system. That's not exactly system order, but I don't think it introduces any problems we haven't resolved yet.

@tim-blackbird
Copy link
Contributor

Without ResMut acting as a lock, systems can't agree on whose commands should run first.

That is unrelated. If you don't use .after(), .before() or .chain() then the order of command application between systems is random each run of the application regardless of them being able to run in parallel or not because of overlapping mutable access.

@RedMindZ
Copy link

RedMindZ commented Oct 19, 2023

The order of system execution can be randomized every single frame. The important part is that using ResMut allows systems to know what the order is every frame. You can simply imagine that each system writes its own name to ResMut<Vec<String>> every frame, and this way the order is known to the systems. Under normal circumstance the order will be implicit in the GameState rather than written explicitly.

@tim-blackbird
Copy link
Contributor

tim-blackbird commented Oct 19, 2023

I see. You're right that the mutable access could change at which point a system runs each frame.
I what threw me off was that your example with Score isn't affected by this (Unless I'm being silly).

I read the issue you linked above and that does demonstrate your point.
There the entity is spawned immediately in the the system A but the insert that follows is a command which ends up running after the despawn command from system B even though system A ran before system B did.

I don't think that either example is affected by batching since batching would affect the order of non-batched vs batched commands but not the ordering within each group.

I wonder how #9822 plays into this.

@RedMindZ
Copy link

I apologize if my Score example wasn't clear. There, the systems synchronize/observe the order of execution through the ResMut<GameState> while sending commands for the Score resource, which they don't even request access to.

@tim-blackbird
Copy link
Contributor

That makes sense :)
I don't see how that relates to batching though.

@RedMindZ
Copy link

The connection to batching is as follows: If we have 3 systems that are synchronized using ResMut, then batching could make some commands from the last system execute before some commands from the second system. We discussed this issue at length above, and my bottom line is that I don't think there are any issues with it as long as it is only built-in commands that can cross a system's command queue boundary (a.k.a execute before some commands from an earlier system). And custom commands can't really be batched meaningfully anyways, since we can't know what they do.

Shatur added a commit to projectharmonia/bevy_replicon that referenced this issue Apr 20, 2024
This PR have multiple purposes, but unfortunately they highly related, so I can't split it into multiple PRs.

Right for prediction crates like https://github.com/Bendzae/bevy_replicon_snap or https://github.com/RJ/bevy_timewarp we rely on overriding deserialization and remove functions. But this approach have 3 major drawbacks:

1. In order to customize serialization for interpolated or predicted components, user need to copy the code from the crate or use some crate-provided helpers.
2. Serialization function is is unsafe due to pointer manipulation.
3. If you want to override writing or removal only for some entities, it's costly to check if a component present on each writing.

Marker-based API solves all of the above. Now for replication user register only `serialize`, `deserialize` and newly added `deserialize_in_place`. These functions assigned to rules as before. So instead of `ComponentFns`, we now have `SerdeFns`. And these function now only do one thing - serialization/deserialization.

All other logic now represented by `CommandFns`. Unlike `SerdeFns`, this struct created for each component only once and automatically when user register a `SerdeFns` for a component. It uses default `write` and `remove` functions. To override it, user can register a marker and assign a custom functions for each component. We use markers instead of archetypes because clients do not know in advance the archetype of the entity being deserialized.

Then on receive we collect a reusable `Vec<bool>` for each marker (by checking if a marker is present) and based on it pick `write`/`remove` functions for each component. We pick first marker that have a registration for the current component and present on an entity (`true` in the `Vec`). Markers are sorted by priority customizable by user.

Since for deserialization we call two functions now (`write` and then `deserialize`), we can do a nice trick to remove `unsafe` from ser/de customization and make it even more flexible!
Since `write` knows the component type of `deserialize`, we can store a type-erased function pointers and let `write` "restore" the type. "restoration" is done by calling `transmute`, but we abstract it inside `SerdeFns` and check the type if `debug_assertions` enabled.

So `serialize` and `deserialize` now just a normal functions with a very simple signature. This also unlocks the usage of `deserialize_in_place` that fallback into `deserialize` if not overridden by user.
Similar trick is done for `serialize`, except `read` is non-overridable by user and only used to remove extra unsafety from the public API.

The mentioned `read` and `write` are still unsafe since it's possible to pass `SerdeFns` that was created with a different type.

And lastly, instead of using `EntityWorldMut` for writing and removal, we use `EntityMut` and `Commands`. Using `EntityWorldMut` have 2 major drawbacks:
- You can't borrow a component from `EntityWorldMut` for in-place deserialization and `ClientMapper` at the same time.
- Each insertion creates a new archetype.

With commands we can spawn new entities inside `ClientMapper` and it will be possible to batch insertions in the future, see: bevyengine/bevy#10154

Huge thanks to @NiseVoid for the idea!

---------

Co-authored-by: UkoeHB <[email protected]>
@Shatur
Copy link
Contributor

Shatur commented Apr 28, 2024

This would be useful for networking. When you deserialize received components, you don't know all of them in advance. Batching adjacent commands would reduce archetype moves.

@alice-i-cecile alice-i-cecile added the A-Networking Sending data between clients, servers and machines label Apr 28, 2024
@alice-i-cecile
Copy link
Member Author

To implement this, we need to somehow split apart commands on the basis of what they do. Our flexible "do-anything opaquely" command design is the core problem to overcome.

There are three possible designs:

  1. Only allow a fixed set of commands, and tell everyone else to use one-shot systems / observers. Remove Deferred.
  2. Limit commands to a fixed set of operations, and create a new Deferred param for all other operations.
  3. Enumerate the fixed set of operations for Commands, but add an Arbitrary branch to the metaphorical enum inside of it that supports other operations.

I prefer 3, as it's the least breaking and most flexible. I'm quite opposed to 1, as the migration path is quite painful.

@Shatur
Copy link
Contributor

Shatur commented Sep 15, 2024

I think that observers have an overlapping use case with Deferred...

github-merge-queue bot pushed a commit that referenced this issue Dec 18, 2024
# Objective

#16132 introduced entity cloning functionality, and while it works and
is useful, it can be made faster. This is the promised follow-up to
improve performance.

## Solution

**PREFACE**: This is my first time writing `unsafe` in rust and I have
only vague idea about what I'm doing. I would encourage reviewers to
scrutinize `unsafe` parts in particular.

The solution is to clone component data to an intermediate buffer and
use `EntityWorldMut::insert_by_ids` to insert components without
additional archetype moves.

To facilitate this, `EntityCloner::clone_entity` now reads all
components of the source entity and provides clone handlers with the
ability to read component data straight from component storage using
`read_source_component` and write to an intermediate buffer using
`write_target_component`. `ComponentId` is used to check that requested
type corresponds to the type available on source entity.

Reflect-based handler is a little trickier to pull of: we only have
`&dyn Reflect` and no direct access to the underlying data.
`ReflectFromPtr` can be used to get `&dyn Reflect` from concrete
component data, but to write it we need to create a clone of the
underlying data using `Reflect`. For this reason only components that
have `ReflectDefault` or `ReflectFromReflect` or `ReflectFromWorld` can
be cloned, all other components will be skipped. The good news is that
this is actually only a temporary limitation: once #13432 lands we will
be able to clone component without requiring one of these `type data`s.

This PR also introduces `entity_cloning` benchmark to better compare
changes between the PR and main, you can see the results in the
**showcase** section.

## Testing

- All previous tests passing
- Added test for fast reflect clone path (temporary, will be removed
after reflection-based cloning lands)
- Ran miri

## Showcase
Here's a table demonstrating the improvement:

| **benchmark** | **main, avg** | **PR, avg** | **change, avg** |
| ----------------------- | ------------- | ----------- |
--------------- |
| many components reflect | 18.505 µs | 2.1351 µs | -89.095% |
| hierarchy wide reflect* | 22.778 ms | 4.1875 ms | -81.616% |
| hierarchy tall reflect* | 107.24 µs | 26.322 µs | -77.141% |
| hierarchy many reflect | 78.533 ms | 9.7415 ms | -87.596% |
| many components clone | 1.3633 µs | 758.17 ns | -45.937% |
| hierarchy wide clone* | 2.7716 ms | 3.3411 ms | +20.546% |
| hierarchy tall clone* | 17.646 µs | 20.190 µs | +17.379% |
| hierarchy many clone | 5.8779 ms | 4.2650 ms | -27.439% |

*: these benchmarks have entities with only 1 component

## Considerations
Once #10154 is resolved a large part of the functionality in this PR
will probably become obsolete. It might still be a little bit faster
than using command batching, but the complexity might not be worth it.

## Migration Guide
- `&EntityCloner` in component clone handlers is changed to `&mut
ComponentCloneCtx` to better separate data.
- Changed `EntityCloneHandler` from enum to struct and added convenience
functions to add default clone and reflect handler more easily.

---------

Co-authored-by: Alice Cecile <[email protected]>
Co-authored-by: Chris Russell <[email protected]>
@omaskery
Copy link
Contributor

omaskery commented Dec 21, 2024

Has anybody discussed the alternative to batching, which is to optimize the APIs used by command implementations, such that executing the command buffer isn't the expensive part?

Specifically: what if we made it so that the fundamental entity & component storage in World was able to store pending archetype changes (entity creation, entity destruction, component additions, and component removals) in data structures better suited to random mutation (and less optimal for querying), until being flushed into the query-optimised ECS structures we know and love.

This would allow us to keep our nice, simple, ordered, synchronous, strongly consistent APIs around commands, hooks & observers - whilst also mitigating the spurious archetype moves that arise from them.

E.g.: a hypothetical body of work might be:

  • Extract entity & component storage into a new struct, WorldStorage (name TBD):

    • This works as it does now, contains all the tables & sparse data, etc.
  • Create PendingStructuralChanges (name TBD), a struct optimised for:

    • Recording new entities, and component insertion/removal
    • Requesting (im-)mutable references to specific components of a specific entity
    • Iterating the tracked entities in a deterministic order, each associated with its proposed new archetype, for zipping with query iterators to efficiently determine whether entities should be excluded or included from queries.
    • There are ways to make this data structure fairly efficient [1]
  • Create DeferredWorldStorage (name TBD), a struct containing a WorldStorage and PendingStructuralChanges:

    • It implements archetype changing operations by recording them in the PendingStructuralChanges
      • Contradictory archetype moves can "cancel out":
        • Removing then re-adding a component:
          • Remove the "removed" tracking in PendingStructuralChanges
          • Write the new value to WorldStorage
        • Adding then removing a component: just remove the "added" tracking from PendingStructuralChanges
    • It implements non-archetype changing read & write operations by:
      • First seeing if the PendingStructuralChanges can handle the read/write
      • Falls back to the underlying WorldStorage
    • It implements queries by filtering & augmenting the query results from WorldStorage:
      • As it iterates the archetypes matched by the query:
        • It omits entities for whom PendingStructuralChanges reports a different archetype
        • It insert entities for whom PendingStructuralChanges reports a matching archetype
        • There are ways to make this fairly efficient [2]
    • It provides a flush operation:
      • This drains PendingStructuralChanges into the WorldStorage whilst:
        • Batch allocating space in all relevant storage, as it knows all of the entity counts, archetypes, etc. up front.
        • Ensuring one archetype move per entity.
  • The current Bevy World would be implemented by directly containing a WorldStorage and just calling methods on it.

  • However, you would change it to use DeferredWorldStorage instead, and then:

    • All write access to the world would be as it currently is, whether via command buffer or not, except:
    • You call flush on the DeferredWorldStorage when appropriate, e.g. after applying a command buffer.
    • The simple mutable methods on the World's public API would probably still route directly to the nested WorldStorage inside the DeferredWorldStorage, it would be over the top to do book keeping in PendingStructuralChange for a single structural change followed by calling flush.

Of course this approach means some operations via a DeferredWorldStorage are slightly slower, or more noticeably slower for queries as more pending archetype changes accumulate, but a combination of usage patterns ("people might not typically do a lot of querying whilst archetype changes are pending"), and savings from removing additional archetype moves, means it might just still be better.

[1, 2]: My colleague has some ideas here, but I've run out of steam for typing it up in this message, it's long enough as-is!

PS: If this idea is unpalatable, I have a completely different idea that I've thought through less, which is along the lines of changing hooks, observers and custom commands to "operate in planning space", in the same way that some planning algorithms work in "planning space" instead of "state space". I can talk more about that if people express interest, though.

@maniwani
Copy link
Contributor

maniwani commented Dec 21, 2024

What if we made it so that the fundamental entity & component storage in World was able to store pending archetype changes (entity creation, entity destruction, component additions, and component removals) in data structures better suited to random mutation (and less optimal for querying), until being flushed into the query-optimised ECS structures we know and love?

That's very interesting.

The way I'm interpreting what you're proposing is to keep a second "copy" of an entity's components, in a sparse set or packed, object-like format. Not a literal copy of the components (since components are not required to implement Clone), but some space where commands can dump their component values.

Like, commands would populate or take away from the sparse/object copy, and then the final values can be copied to the table at the end of flush. In the meantime, when observers are reacting, the world can route lookups to the sparse/object copy.

I agree that this would accomplish the same goal of compacting/accumulating the structural changes affecting an entity so that we only have to copy its components once. (I think both strategies have a lot in common with compaction in log-structured databases.)

I do wonder how much internal complexity this would introduce when it comes to routing queries to the latest data, but overall, I think it's definitely worth investigating. A nice benefit of what you're suggesting would be that nothing gets re-ordered.

ecoskey pushed a commit to ecoskey/bevy that referenced this issue Jan 6, 2025
# Objective

bevyengine#16132 introduced entity cloning functionality, and while it works and
is useful, it can be made faster. This is the promised follow-up to
improve performance.

## Solution

**PREFACE**: This is my first time writing `unsafe` in rust and I have
only vague idea about what I'm doing. I would encourage reviewers to
scrutinize `unsafe` parts in particular.

The solution is to clone component data to an intermediate buffer and
use `EntityWorldMut::insert_by_ids` to insert components without
additional archetype moves.

To facilitate this, `EntityCloner::clone_entity` now reads all
components of the source entity and provides clone handlers with the
ability to read component data straight from component storage using
`read_source_component` and write to an intermediate buffer using
`write_target_component`. `ComponentId` is used to check that requested
type corresponds to the type available on source entity.

Reflect-based handler is a little trickier to pull of: we only have
`&dyn Reflect` and no direct access to the underlying data.
`ReflectFromPtr` can be used to get `&dyn Reflect` from concrete
component data, but to write it we need to create a clone of the
underlying data using `Reflect`. For this reason only components that
have `ReflectDefault` or `ReflectFromReflect` or `ReflectFromWorld` can
be cloned, all other components will be skipped. The good news is that
this is actually only a temporary limitation: once bevyengine#13432 lands we will
be able to clone component without requiring one of these `type data`s.

This PR also introduces `entity_cloning` benchmark to better compare
changes between the PR and main, you can see the results in the
**showcase** section.

## Testing

- All previous tests passing
- Added test for fast reflect clone path (temporary, will be removed
after reflection-based cloning lands)
- Ran miri

## Showcase
Here's a table demonstrating the improvement:

| **benchmark** | **main, avg** | **PR, avg** | **change, avg** |
| ----------------------- | ------------- | ----------- |
--------------- |
| many components reflect | 18.505 µs | 2.1351 µs | -89.095% |
| hierarchy wide reflect* | 22.778 ms | 4.1875 ms | -81.616% |
| hierarchy tall reflect* | 107.24 µs | 26.322 µs | -77.141% |
| hierarchy many reflect | 78.533 ms | 9.7415 ms | -87.596% |
| many components clone | 1.3633 µs | 758.17 ns | -45.937% |
| hierarchy wide clone* | 2.7716 ms | 3.3411 ms | +20.546% |
| hierarchy tall clone* | 17.646 µs | 20.190 µs | +17.379% |
| hierarchy many clone | 5.8779 ms | 4.2650 ms | -27.439% |

*: these benchmarks have entities with only 1 component

## Considerations
Once bevyengine#10154 is resolved a large part of the functionality in this PR
will probably become obsolete. It might still be a little bit faster
than using command batching, but the complexity might not be worth it.

## Migration Guide
- `&EntityCloner` in component clone handlers is changed to `&mut
ComponentCloneCtx` to better separate data.
- Changed `EntityCloneHandler` from enum to struct and added convenience
functions to add default clone and reflect handler more easily.

---------

Co-authored-by: Alice Cecile <[email protected]>
Co-authored-by: Chris Russell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events A-Networking Sending data between clients, servers and machines C-Performance A change motivated by improving speed, memory usage or compile times C-Usability A targeted quality-of-life change that makes Bevy easier to use D-Complex Quite challenging from either a design or technical perspective. Ask for help!
Projects
Status: Todo
Development

No branches or pull requests

6 participants