Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Drop for CommandQueue #10746

Merged
merged 4 commits into from
Nov 27, 2023

Conversation

bushrat011899
Copy link
Contributor

Objective

Solution

Implemented Drop for CommandQueue. This has been done entirely in the private API of CommandQueue, ensuring no breaking changes. Also added a unit test, test_command_queue_inner_drop_early, based on the reproduction steps as outlined in #10676.

Notes

I believe this can be applied to 0.12.1 as well, but I am uncertain of the process to make that kind of change. Please let me know if there's anything I can do to help with the back-porting of this change.

Copy link
Contributor Author

@bushrat011899 bushrat011899 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaving a review of my own changes just to provide some additional context to what might be some confusing choices I've made.

@@ -9,8 +9,11 @@ struct CommandMeta {
/// SAFETY: The `value` must point to a value of type `T: Command`,
/// where `T` is some specific type that was used to produce this metadata.
///
/// `world` is optional to allow this one function pointer to perform double-duty as a drop.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done this to ensure CommandMeta is still exactly 8 bytes in size. Since the inner Drop implementation for C is specific to C, this is the only way to capture it without making CommandMeta at least 16 bytes large.

/// Returns the size of `T` in bytes.
apply_command_and_get_size: unsafe fn(value: OwningPtr<Unaligned>, world: &mut World) -> usize,
consume_command_and_get_size:
unsafe fn(value: OwningPtr<Unaligned>, world: &mut Option<&mut World>) -> usize,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using &mut Option<&mut World> to allow for the return of the Option<&mut World> back to the CommandQueue after consume_command_and_get_size is done with it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the extra &mut by reborrowing the &mut World before calling consume_command_and_get_size, for example by passing world.as_deref_mut()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reduced the usage of the &mut Option<&mut World>, but I can't fully eliminate it. CommandQueue itself is only passed a mutable reference to World, and I don't see a way to convert &mut World into &mut Option<World>. In fact, I think such a transform is nonsensical.

/// Returns the size of `T` in bytes.
apply_command_and_get_size: unsafe fn(value: OwningPtr<Unaligned>, world: &mut World) -> usize,
consume_command_and_get_size:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed this field for clarity.

// Apply command to the provided world...
Some(world) => command.apply(world),
// ...or discard it.
None => drop(command),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An explicit drop call is not required, since it will be implicitly called once the scope of this closure ends. I've added it here purely for clarity of purpose.

@ItsDoot ItsDoot added C-Bug An unexpected or incorrect behavior A-ECS Entities, components, systems, and events C-Performance A change motivated by improving speed, memory usage or compile times labels Nov 26, 2023
@james7132 james7132 added the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Nov 26, 2023
@james7132
Copy link
Member

The extra branch here on each command is in the hotpath for command application. Please run the command microbenchmarks and/or the stress tests to see if this has a tangible impact on that perf.

@bushrat011899
Copy link
Contributor Author

Ok so I tried comparing this change to the main branch using the ECS benchmarks, and I'm not really confident in drawing a conclusion any which way. I had pretty extreme run-to-run variance running the bench repeatedly, which is probably just a problem with my machine (it is my general-purpose Window box, so who knows what's happening back there)

Bench Output
> cargo bench --bench ecs -- --load-baseline commandqueue_drop --baseline main .*_command
    Finished bench [optimized] target(s) in 0.20s
     Running benches/bevy_ecs/benches.rs (target\release\deps\ecs-7d4169dd95b56a01.exe)
Gnuplot not found, using plotters backend
empty_commands/0_entities
                        time:   [3.2537 ns 3.2847 ns 3.3315 ns]
                        change: [-0.0206% +0.9463% +1.8464%] (p = 0.05 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe  
spawn_commands/2000_entities
                        time:   [149.76 µs 151.23 µs 152.78 µs]
                        change: [-1.7627% -0.1753% +1.3376%] (p = 0.83 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
spawn_commands/4000_entities
                        time:   [303.65 µs 306.56 µs 309.66 µs]
                        change: [-2.2620% -0.6257% +1.1928%] (p = 0.47 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
spawn_commands/6000_entities
                        time:   [456.51 µs 461.40 µs 466.27 µs]
                        change: [-0.9910% +0.9277% +2.8949%] (p = 0.37 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
spawn_commands/8000_entities
                        time:   [629.32 µs 636.60 µs 644.87 µs]
                        change: [-0.7069% +0.9323% +2.6418%] (p = 0.28 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe  
insert_commands/insert  time:   [307.34 µs 309.23 µs 312.06 µs]
                        change: [+6.1567% +8.6578% +11.358%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
insert_commands/insert_batch
                        time:   [215.72 µs 217.29 µs 219.28 µs]
                        change: [+0.5404% +2.1128% +3.6292%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe  
fake_commands/2000_commands
                        time:   [6.6523 µs 6.7584 µs 6.8651 µs]
                        change: [+23.412% +24.754% +26.305%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  13 (13.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
fake_commands/4000_commands
                        time:   [12.901 µs 13.079 µs 13.262 µs]
                        change: [+12.367% +14.289% +16.331%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
fake_commands/6000_commands
                        time:   [19.175 µs 19.344 µs 19.536 µs]
                        change: [+17.373% +18.577% +19.704%] (p = 0.00 < 0.05)
                        Performance has regressed.
fake_commands/8000_commands
                        time:   [25.453 µs 25.731 µs 26.018 µs]
                        change: [+14.211% +15.489% +16.744%] (p = 0.00 < 0.05)
                        Performance has regressed.  
sized_commands_0_bytes/2000_commands
                        time:   [3.1955 µs 3.2128 µs 3.2332 µs]
                        change: [+4.3901% +5.7755% +7.1056%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
sized_commands_0_bytes/4000_commands
                        time:   [6.3384 µs 6.3899 µs 6.4597 µs]
                        change: [+3.5901% +4.8468% +5.8916%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
sized_commands_0_bytes/6000_commands
                        time:   [9.4952 µs 9.5870 µs 9.7428 µs]
                        change: [+0.8685% +2.0121% +3.1798%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
sized_commands_0_bytes/8000_commands
                        time:   [12.823 µs 12.917 µs 13.092 µs]
                        change: [+1.5821% +3.5291% +5.2057%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe  
sized_commands_12_bytes/2000_commands
                        time:   [4.4384 µs 4.4833 µs 4.5537 µs]
                        change: [+7.1673% +9.5000% +11.495%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
sized_commands_12_bytes/4000_commands
                        time:   [8.8486 µs 8.9161 µs 9.0135 µs]
                        change: [+10.094% +11.156% +12.235%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
sized_commands_12_bytes/6000_commands
                        time:   [13.104 µs 13.202 µs 13.340 µs]
                        change: [+7.4410% +8.7096% +10.010%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) high mild
  16 (16.00%) high severe
sized_commands_12_bytes/8000_commands
                        time:   [17.701 µs 17.827 µs 17.999 µs]
                        change: [+9.4137% +10.645% +11.941%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe  
sized_commands_512_bytes/2000_commands
                        time:   [45.140 µs 45.600 µs 46.118 µs]
                        change: [+1.1514% +2.6101% +4.5087%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
sized_commands_512_bytes/4000_commands
                        time:   [89.803 µs 90.349 µs 91.052 µs]
                        change: [-4.7781% -1.6346% +1.3122%] (p = 0.31 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
sized_commands_512_bytes/6000_commands
                        time:   [135.34 µs 135.93 µs 136.68 µs]
                        change: [-2.2616% +1.9043% +6.5246%] (p = 0.42 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
sized_commands_512_bytes/8000_commands
                        time:   [179.64 µs 180.09 µs 180.65 µs]
                        change: [-3.6188% +0.1834% +4.1426%] (p = 0.93 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

To me, it looks like fake_commands and sized_commands are the two benches that might have regressed, but I saw some benchmarks swing by plus or minus 50% after repeated runs, so I don't know if those are genuine, or just the Windows scheduler getting in the way of a good benchmark. But just looking at the worst one (fake_commands/2000), it does appear to be pretty consistently bad:

relative_pdf_small

It appears that the added branch is being falsely predicted, and the check does add an increase in execution time. If anyone else can replicate this behaviour, I'll mark this as Draft and work on an alternate solution that avoids any changes on the hot path.

If anyone has some more specific benchmarking instructions I could follow I'd really appreciate it. I'm still fairly new to Criterion, and this is my first time benchmarking Bevy as well.

@bushrat011899
Copy link
Contributor Author

Looking into the fake_commands and sized_commands benchmarks, they are both effectively no-op commands, so it makes sense that they'd show the largest relative change in performance in the command queue clearing process. Since the other commands actually perform work on the World, they barely register a change in performance, since the command queue is already such a small component of the performance bottleneck.

@alice-i-cecile alice-i-cecile added this to the 0.12.1 milestone Nov 26, 2023
@bushrat011899
Copy link
Contributor Author

Running just the fake_commands benches on my Intel laptop this morning, the performance difference appears far closer to acceptable in my opinion. I'll have another look on my AMD machine tonight after cleaning up a few things to see if it's more in-line with these measurements.

Bench Output
> cargo bench --bench ecs -- fake_commands --load-baseline impldrop --baseline main
    Finished bench [optimized] target(s) in 0.09s
     Running benches/bevy_ecs/benches.rs (target\release\deps\ecs-cc899ba4903ad805.exe)
Gnuplot not found, using plotters backend
fake_commands/2000_commands
                        time:   [3.9578 µs 3.9911 µs 4.0243 µs]
                        change: [+5.8616% +7.3377% +8.8263%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  8 (8.00%) low mild
  4 (4.00%) high mild
  11 (11.00%) high severe
fake_commands/4000_commands
                        time:   [8.2784 µs 8.3384 µs 8.3945 µs]
                        change: [+1.7114% +3.5701% +5.2884%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
fake_commands/6000_commands
                        time:   [14.614 µs 14.835 µs 15.037 µs]
                        change: [-3.9313% -1.3772% +1.1835%] (p = 0.30 > 0.05)
                        No change in performance detected.
fake_commands/8000_commands
                        time:   [19.141 µs 19.408 µs 19.686 µs]
                        change: [-7.0809% -4.7707% -2.5649%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

fake_commands/2000
relative_pdf_small

fake_commands/8000
relative_pdf_small

If this is still an unacceptable level of regression, one alternative is to increase the size of CommandMeta and include a dedicated drop function pointer. This would add 8 bytes (usize) to every Command stored in the CommandQueue however, which is why I initially avoided that solution.

@bushrat011899
Copy link
Contributor Author

Ok adding an extra function pointer to CommandMeta is an absolute no-go for the fake_commands benchmark.

A near 100% regression
> cargo bench --bench ecs -- fake_commands --load-baseline impldrop --baseline main
    Finished bench [optimized] target(s) in 0.10s
Gnuplot not found, using plotters backend
fake_commands/2000_commands
                        time:   [7.3787 µs 7.8152 µs 8.2519 µs]
                        change: [+89.014% +97.409% +106.18%] (p = 0.00 < 0.05)
                        Performance has regressed.
fake_commands/4000_commands
                        time:   [15.151 µs 15.906 µs 16.725 µs]
                        change: [+80.068% +88.055% +96.419%] (p = 0.00 < 0.05)
                        Performance has regressed.
fake_commands/6000_commands
                        time:   [21.302 µs 22.041 µs 22.863 µs]
                        change: [+51.893% +58.609% +65.993%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
fake_commands/8000_commands
                        time:   [29.114 µs 30.216 µs 31.431 µs]
                        change: [+53.369% +59.470% +64.795%] (p = 0.00 < 0.05)
                        Performance has regressed.

It makes sense why adding it tanks performance, since most of that benchmark is just reading and writing the CommandMeta to the byte array, so doubling the bytes being shuffled around (fake_command is a ZST) should pretty drastically reduce performance.

@james7132 james7132 removed the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Nov 26, 2023
Copy link
Member

@james7132 james7132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for running the benchmarks in both cases, really helps cover our bases with the options for fixing this bug. LGTM.

@james7132 james7132 requested a review from maniwani November 26, 2023 23:05
…re possible

Also removed one branch from `apply`.

Revert "Updated `CommandQueue` Changes to avoid `&mut Option<&mut World>` where possible"
@bushrat011899 bushrat011899 force-pushed the ImplementDropForCommandQueue branch from baa628f to 74de7b2 Compare November 26, 2023 23:07
@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Nov 27, 2023
@@ -105,13 +105,14 @@ impl CommandQueue {
// flush the previously queued entities
world.flush();

self.apply_maybe_world(Some(world));
self.apply_or_drop(Some(world));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, that does read a lot better!

Copy link
Member

@cart cart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I renamed apply_maybe_world to apply_or_drop for clarity and made the docs a bit more explicit. The old named seemed dangerous as callers of apply_maybe_world might assume they can pass None with no side effects.

@cart
Copy link
Member

cart commented Nov 27, 2023

Renamed again to apply_or_drop_queued for maximum clarity.

@cart cart enabled auto-merge November 27, 2023 22:22
@cart cart added this pull request to the merge queue Nov 27, 2023
Merged via the queue into bevyengine:main with commit 6e871ab Nov 27, 2023
22 checks passed
cart added a commit that referenced this pull request Nov 30, 2023
# Objective

- Fixes #10676, preventing a possible memory leak for commands which
owned resources.

## Solution

Implemented `Drop` for `CommandQueue`. This has been done entirely in
the private API of `CommandQueue`, ensuring no breaking changes. Also
added a unit test, `test_command_queue_inner_drop_early`, based on the
reproduction steps as outlined in #10676.

## Notes

I believe this can be applied to `0.12.1` as well, but I am uncertain of
the process to make that kind of change. Please let me know if there's
anything I can do to help with the back-porting of this change.

---------

Co-authored-by: Carter Anderson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ECS Entities, components, systems, and events C-Bug An unexpected or incorrect behavior C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dropping a CommandQueue leaks all contained Commands
7 participants