
QueryParIter::map_collect and similar operations #11249

Open
wants to merge 1 commit into base: main
Conversation

stepancheg
Contributor

@stepancheg stepancheg commented Jan 7, 2024

Objective

par_iter is convenient, until you need to collect the results into a Vec or HashMap.

Bevy commonly uses thread-local collectors and aggregates them afterwards, for example here:

for cell in &mut thread_queues {
visible_entities.entities.append(cell.get_mut());
}

This is inconvenient because of:

  • boilerplate thread-local setup
  • non-deterministic output
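The thread-local collector pattern quoted above can be sketched with std-only stand-ins (a simplified illustration, not Bevy's actual implementation: Bevy uses lock-free thread-local storage and its task pool rather than Mutex and std::thread, and `collect_doubled` is a hypothetical example function):

```rust
use std::sync::Mutex;
use std::thread;

// Each worker pushes into its own queue, then the queues are drained
// into one Vec -- the same shape as the `thread_queues` loop above.
fn collect_doubled(input: &[u32]) -> Vec<u32> {
    let num_chunks = (input.len() + 24) / 25;
    let thread_queues: Vec<Mutex<Vec<u32>>> =
        (0..num_chunks).map(|_| Mutex::new(Vec::new())).collect();

    thread::scope(|s| {
        for (queue, chunk) in thread_queues.iter().zip(input.chunks(25)) {
            s.spawn(move || {
                let mut local = queue.lock().unwrap();
                for &x in chunk {
                    local.push(x * 2); // each worker fills its own queue
                }
            });
        }
    });

    // Aggregation step, analogous to the loop quoted above. With work
    // stealing, which batch lands in which queue varies from run to run,
    // so in practice the concatenation order is non-deterministic.
    let mut results = Vec::new();
    for cell in &thread_queues {
        results.append(&mut cell.lock().unwrap());
    }
    results
}

fn main() {
    let out = collect_doubled(&(0..100).collect::<Vec<u32>>());
    println!("collected {} items", out.len());
}
```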

Solution

Add QueryParIter functions:

  • flat_map_collect (similar to .flat_map(...).collect())
  • filter_map_collect
  • map_collect

These functions might not be as universally applicable as for_each with thread-local aggregators, because they are more expensive, but they are:

  • easy to use
  • deterministic
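The determinism comes from collecting per-batch results and concatenating them in batch order, regardless of which thread finishes first. A std-only sketch of these semantics (a hypothetical free function for illustration; the PR implements this on Bevy's QueryParIter and task pool):

```rust
use std::thread;

// Deterministic parallel map + collect: each chunk is mapped on its own
// thread into its own Vec, then the Vecs are joined in chunk order, so
// the output order matches sequential iteration order.
fn par_map_collect<T, R, F>(items: &[T], batch: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let mut batches: Vec<Vec<R>> = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(batch)
            .map(|chunk| s.spawn(|| chunk.iter().map(&f).collect::<Vec<R>>()))
            .collect();
        // Join in batch order: concatenation order is fixed even though
        // thread completion order is not.
        batches = handles.into_iter().map(|h| h.join().unwrap()).collect();
    });
    batches.into_iter().flatten().collect()
}

fn main() {
    let doubled = par_map_collect(&[1, 2, 3, 4, 5], 2, |x| x * 2);
    println!("{:?}", doubled); // [2, 4, 6, 8, 10]
}
```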

Changelog

QueryParIter has new functions: flat_map_collect, filter_map_collect, map_collect.

@james7132 james7132 added C-Feature A new feature, making something new possible A-ECS Entities, components, systems, and events labels Jan 8, 2024
@james7132 james7132 self-requested a review January 8, 2024 02:36
@james7132 james7132 added the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Jan 8, 2024
@james7132
Member

This can be quite performance-sensitive code. Some form of profiling on our existing stress tests would be nice. I'm also hesitant to promote an option that allocates on every call, particularly if it's a Vec being recreated and regrown every tick of the app.

@stepancheg
Contributor Author

@james7132 any suggestions on how to benchmark this?

Overall, I don't believe it will be bad. Malloc is cheap. Moreover, it is quite possible it will actually increase performance, because it gives users an instrument to parallelize code that was too hard to parallelize before.

@NthTensor
Contributor

NthTensor commented Jan 8, 2024

I was helping someone who needed this yesterday. It allows us, for example, to buffer events in parallel and then dispatch them in sequence while preserving order. Seems super useful to me.

Many useful features can degrade performance if misapplied. If the docs are clear about the performance characteristics, I think allocation shouldn't be a blocking issue.

@stepancheg
Contributor Author

I did a lame benchmark.

I ran

cargo run --example many_cubes

and patched the check_visibility function, because it seemed to be a bottleneck.

visible_aabb_query.par_iter_mut().for_each(|query_item| {

By default on my laptop it outputs about 18.5 FPS.

If I replace par_iter with a non-parallel iter, FPS drops to about 15.0. So this function is indeed a bottleneck.

If I rewrite the function with .filter_map_collect(), performance stays at about the same 18.5 FPS. (diff).

I also tried running the same test with --release; performance of all three versions is roughly the same 200 FPS, so the bottleneck is elsewhere.

To get a precise perf difference, a much better benchmark is needed.

So my conclusion is that if filter_map_collect is slower, it is not significantly slower. So seems to be good enough.

github-merge-queue bot pushed a commit that referenced this pull request Jan 9, 2024
# Objective

Issue #10243: rendering multiple triangles in the same place results in
flickering.

## Solution

Considered these alternatives:
- `depth_bias` may not work: with a high number of entities, creating a
material per entity is practically not possible
- rendering at slightly different positions does not work, because when
the camera is far away, float rounding causes the same issues (edit:
assuming we have to use the same `depth_bias`)
- considered implementing a deterministic operation like
`query.par_iter().flat_map(...).collect()` to be used in the
`check_visibility` system (which would solve the issue, since query
iteration is deterministic), but could not figure out how to make it as
cheap as the current approach with thread-local collectors (#11249)

So this adds an option to sort entities after the `check_visibility`
system runs.

This should not be too bad, because after the visibility check, only a
handful of entities remain.

This is probably not the only source of non-determinism in Bevy, but
this is one I could find so far. At least it fixes the repro example.
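The idea of the fix can be sketched std-only (`Entity` here is a hypothetical stand-in for Bevy's entity id, and `make_deterministic` an illustrative helper, not the PR's actual code):

```rust
// Stand-in for Bevy's Entity id, ordered so it can be sorted.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Entity(u32);

// The parallel visibility check can emit entities in any order; a sort
// by id restores a canonical order. This is cheap because only the
// entities that survived the visibility check remain.
fn make_deterministic(mut visible: Vec<Entity>) -> Vec<Entity> {
    visible.sort_unstable();
    visible
}

fn main() {
    let v = vec![Entity(3), Entity(1), Entity(2)];
    println!("{:?}", make_deterministic(v)); // [Entity(1), Entity(2), Entity(3)]
}
```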

## Changelog

- `DeterministicRenderingConfig` option to enable deterministic
rendering

## Test

<img width="1392" alt="image"
src="https://github.com/bevyengine/bevy/assets/28969/c735bce1-3a71-44cd-8677-c19f6c0ee6bd">

---------

Co-authored-by: Alice Cecile <[email protected]>
@hymm
Contributor

hymm commented Jan 9, 2024

There is one bench that tests parallel iteration. You can run it with `cargo bench heavy` when in the `benches` folder. I get around 250-260us a run on my machine, so it is a little noisy.

@stepancheg
Contributor Author

Ok, I tried adding another fake benchmark there.

  • Reduce the iterations from the base benchmark to 10 (to make it a bit more realistic)
  • But increase the batch size 10 times
  • Copy-paste the base benchmark to map_collect (non-parallel) and par_map_collect

diff

The output is

heavy_compute/base
                        time:   [530.31 µs 531.80 µs 533.31 µs]
heavy_compute/par_map_collect
                        time:   [570.93 µs 572.52 µs 574.15 µs]
heavy_compute/map_collect
                        time:   [2.4066 ms 2.4105 ms 2.4144 ms]

In this artificial setup, parallel map_collect is somewhat more expensive than parallel for_each, and, as expected, significantly faster than the non-parallel version.

Which is exactly what is expected: the code is doing what it is supposed to do.

A more correct benchmark would be a comparison with the Local<ThreadLocal<Cell<Vec<bool>>>> shenanigans. That would be a bit more effort to write, and it is probably not what we'd recommend to users anyway.

@hymm
Contributor

hymm commented Jan 9, 2024

Can you compare heavy_compute/base to the main branch?

For checking with many_cubes, we usually use tracy: https://github.com/bevyengine/bevy/blob/main/docs/profiling.md. And show the histogram diffs for the relevant systems.

@stepancheg
Contributor Author

stepancheg commented Jan 9, 2024

I can compare against the main branch, but:

  • criterion is not a reliable benchmarking tool: it can easily show a confident 5% speedup with no changes at all
  • this PR does not change the compilation of the for_each implementation, only its syntax. fold_impl now returns Vec<()>, but
    • it has no allocations
    • it was there before, only hidden, here:

pub fn scope<'env, F, T>(&self, f: F) -> Vec<T>

@stepancheg
Contributor Author

OK, benchmark against main.

On main:

cargo bench -p benches --bench ecs -- heavy_compute --save-baseline main

on this branch:

cargo bench -p benches --bench ecs -- heavy_compute --baseline main
heavy_compute/base      time:   [564.08 µs 565.16 µs 566.09 µs]
                        change: [-0.4176% -0.0620% +0.2685%] (p = 0.73 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

@BenjaminBrienen BenjaminBrienen added D-Straightforward Simple bug fixes and API improvements, docs, test and examples S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged labels Jan 23, 2025
Labels
A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible D-Straightforward Simple bug fixes and API improvements, docs, test and examples S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged
5 participants