[Merged by Bors] - Start running systems while prepare_systems is running #4919
Conversation
I really like these performance wins; reducing the overhead of running systems is one of the most important areas because it makes doing the right thing compatible with doing the fast thing.
Some cleanup feedback, but the strategy, performance wins and code quality look good to me.
Co-authored-by: Alice Cecile <[email protected]>
Did some additional benchmarking to verify the results here. The results are remarkably similar to #4740's, which makes sense, since both underpin how systems are scheduled onto the scheduler's threads. Overall it seems to have a near-identical effect, decreasing the time between when a system is scheduled and when it actually starts, even if achieved via a different method. Both do seem to have a negative impact when all or most systems are not scheduled to run due to run criteria. However, since the two approaches are largely orthogonal, combining them compounds the positive gains and dampens the losses. Some of the gains of one are wiped out by the other, but in the general cases where scheduling dominates, times have been cut in half or more.

Benchmark images: Main vs this PR; Internalization vs this PR vs Both Together; Main vs Combined.
Added a few other small, probably insignificant improvements:
Some comments on code clarity. I'd like to ensure that this tricky logic stays relatively approachable, so I'm applying a heightened standard of review.
The comment about the "don't queue up tasks if the system can already start" change is the most critical; the logical flow is very convoluted there, and I think we should reshuffle this significantly in order to ensure that it's easy to read the branches.
I read through the code as thoroughly as I could, and familiarized myself with the dependencies involved (`fixedbitset` and `event-listener`). :)
Everything seems to make sense. I didn't notice anything that seemed wrong.
Got a few small ideas for further potential improvements/tweaks to explore, but I'll keep quiet so we keep this PR well-scoped and don't drag it on. 😆
@james7132 if you have a moment to rebase this I can merge it in :)
bors r+
# Objective

While using the ParallelExecutor, systems do not actually start until `prepare_systems` completes. In stages with large numbers of "empty" systems that have very little work to do, this delay adds significant overhead, which can add up over many stages.

## Solution

Immediately and synchronously signal the start of systems that can run without dependencies inside `prepare_systems`, instead of waiting for the first executor iteration after `prepare_systems` completes. Any system that depends on them still cannot run until after `prepare_systems` completes, but the base engine has a large number of unconstrained systems, so this is a general benefit in almost every case.

## Performance

This change was tested against `many_foxes` in the default configuration. As this change is sensitive to the overhead around scheduling systems, the spans for measuring system timing, system overhead, and system commands were all commented out for these measurements. The median stage timings between `main` and this PR are as follows:

|stage|main|this PR|
|:--|:--|:--|
|First|75.54 us|61.61 us|
|LoadAssets|51.05 us|42.32 us|
|PreUpdate|54.6 us|55.56 us|
|Update|61.89 us|51.5 us|
|PostUpdate|7.27 ms|6.71 ms|
|AssetEvents|47.82 us|35.95 us|
|Last|39.19 us|37.71 us|
|reserve_and_flush|57.83 us|48.2 us|
|Extract|1.41 ms|1.28 ms|
|Prepare|554.49 us|502.53 us|
|Queue|216.29 us|207.51 us|
|Sort|67.03 us|60.99 us|
|Render|1.73 ms|1.58 ms|
|Cleanup|33.55 us|30.76 us|
|Clear Entities|18.56 us|17.05 us|
|**full frame**|**11.9 ms**|**10.91 ms**|

For the first few stages, the benefit is small but cumulative. For PostUpdate in particular, this allows `parent_update` to run while `prepare_systems` is running; `parent_update` is required by the animation and transform-propagation systems, which dominate the time spent in the stage. It also frontloads the contention, since the other "empty" systems run while `parent_update` is running.

For Render, where there is just a single large exclusive system, the benefit comes from not waiting on a spuriously scheduled task on the task pool to kick off the system: it is immediately scheduled to run.
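The core idea of the change can be sketched as follows. This is a hypothetical simplification, not Bevy's actual executor code: `SystemMeta` and `prepare_systems` are illustrative names, and "starting" a system is reduced to partitioning it into a start-now list. The point is that systems with zero unmet dependencies are signaled synchronously during the prepare pass, rather than waiting for the first executor iteration.

```rust
// Hypothetical sketch of the scheduling idea, not Bevy's real API.
struct SystemMeta {
    name: &'static str,
    // Number of systems that must finish before this one may run.
    dependencies: usize,
}

/// Partition systems into those that can be signaled to start immediately
/// (no dependencies) and those that must wait for the executor loop.
fn prepare_systems(systems: &[SystemMeta]) -> (Vec<&'static str>, Vec<&'static str>) {
    let mut start_now = Vec::new();
    let mut deferred = Vec::new();
    for meta in systems {
        if meta.dependencies == 0 {
            // In the real executor this would synchronously signal the
            // system's task here, while prepare_systems is still running.
            start_now.push(meta.name);
        } else {
            deferred.push(meta.name);
        }
    }
    (start_now, deferred)
}

fn main() {
    let systems = [
        SystemMeta { name: "parent_update", dependencies: 0 },
        SystemMeta { name: "transform_propagate", dependencies: 1 },
        SystemMeta { name: "animation", dependencies: 1 },
    ];
    let (start_now, deferred) = prepare_systems(&systems);
    println!("start now: {start_now:?}, deferred: {deferred:?}");
}
```

In this toy model, `parent_update` is started while preparation continues, which mirrors the PostUpdate win described above.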
Pull request successfully merged into main. Build succeeded.
# Objective This is a follow-up to #7745. An unbounded `async_channel` occasionally allocates whenever it exceeds the capacity of the current buffer in its internal linked list. This is avoidable. This also used to be a bounded channel before stageless, which was introduced in #4919. ## Solution Use a bounded channel to avoid allocations on system completion. This shouldn't conflict with #7745, as it's impossible for the scheduler to exceed the channel capacity, even if somehow every system completed at the same time.
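The bounded-channel argument can be illustrated with a small sketch. The follow-up PR uses the `async_channel` crate; this example substitutes the standard library's `std::sync::mpsc::sync_channel` (a bounded channel) to show the same invariant: if the channel capacity equals the number of systems and each system sends exactly one completion message, sends can never block and the buffer never grows.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // The executor knows the system count up front, so a channel with that
    // capacity can never be exceeded: each system sends exactly one
    // completion message per run. A bounded channel allocates its buffer
    // once, rather than growing an internal linked list as an unbounded
    // channel does.
    let system_count = 4;
    let (tx, rx) = sync_channel::<usize>(system_count);

    let handles: Vec<_> = (0..system_count)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                // ... run the system's work here ...
                // Cannot block: capacity >= number of senders * 1 message.
                tx.send(id).unwrap();
            })
        })
        .collect();

    // The executor drains completion messages as systems finish.
    let mut completed: Vec<usize> = (0..system_count)
        .map(|_| rx.recv().unwrap())
        .collect();
    completed.sort();
    assert_eq!(completed, vec![0, 1, 2, 3]);

    for handle in handles {
        handle.join().unwrap();
    }
}
```

The same reasoning is why the PR says it cannot conflict with #7745: the capacity bound is derived from the system count, so even simultaneous completion of every system fits in the preallocated buffer.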