
QueryParIter::map_collect and similar operations #11249

Open
wants to merge 1 commit into base: main
Conversation

stepancheg
Contributor

@stepancheg stepancheg commented Jan 7, 2024

Objective

par_iter is convenient, until you need to collect the results into a Vec or HashMap.

Bevy commonly uses thread-local collectors and aggregates them afterwards, for example here:

for cell in &mut thread_queues {
visible_entities.entities.append(cell.get_mut());
}

This is inconvenient because of:

  • boilerplate thread-local setup
  • non-deterministic output
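The thread-local collector pattern quoted above can be sketched with std-only stand-ins (a simplified illustration, not Bevy's actual implementation: Bevy uses lock-free thread-local storage and its task pool rather than Mutex and std::thread, and `collect_doubled` is a hypothetical example function):

```rust
use std::sync::Mutex;
use std::thread;

// Each worker pushes into its own queue, then the queues are drained
// into one Vec -- the same shape as the `thread_queues` loop above.
fn collect_doubled(input: &[u32]) -> Vec<u32> {
    let num_chunks = (input.len() + 24) / 25;
    let thread_queues: Vec<Mutex<Vec<u32>>> =
        (0..num_chunks).map(|_| Mutex::new(Vec::new())).collect();

    thread::scope(|s| {
        for (queue, chunk) in thread_queues.iter().zip(input.chunks(25)) {
            s.spawn(move || {
                let mut local = queue.lock().unwrap();
                for &x in chunk {
                    local.push(x * 2); // each worker fills its own queue
                }
            });
        }
    });

    // Aggregation step, analogous to the loop quoted above. With work
    // stealing, which batch lands in which queue varies from run to run,
    // so in practice the concatenation order is non-deterministic.
    let mut results = Vec::new();
    for cell in &thread_queues {
        results.append(&mut cell.lock().unwrap());
    }
    results
}

fn main() {
    let out = collect_doubled(&(0..100).collect::<Vec<u32>>());
    println!("collected {} items", out.len());
}
```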

Solution

Add QueryParIter functions:

  • flat_map_collect (similar to .flat_map(...).collect())
  • filter_map_collect
  • map_collect

These functions might not be as universally applicable as for_each with thread-local aggregators, because they are more expensive, but they are:

  • easy to use
  • deterministic
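The determinism comes from collecting per-batch results and concatenating them in batch order, regardless of which thread finishes first. A std-only sketch of these semantics (a hypothetical free function for illustration; the PR implements this on Bevy's QueryParIter and task pool):

```rust
use std::thread;

// Deterministic parallel map + collect: each chunk is mapped on its own
// thread into its own Vec, then the Vecs are joined in chunk order, so
// the output order matches sequential iteration order.
fn par_map_collect<T, R, F>(items: &[T], batch: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let mut batches: Vec<Vec<R>> = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(batch)
            .map(|chunk| s.spawn(|| chunk.iter().map(&f).collect::<Vec<R>>()))
            .collect();
        // Join in batch order: concatenation order is fixed even though
        // thread completion order is not.
        batches = handles.into_iter().map(|h| h.join().unwrap()).collect();
    });
    batches.into_iter().flatten().collect()
}

fn main() {
    let doubled = par_map_collect(&[1, 2, 3, 4, 5], 2, |x| x * 2);
    println!("{:?}", doubled); // [2, 4, 6, 8, 10]
}
```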

Changelog

QueryParIter has new functions: flat_map_collect, filter_map_collect, map_collect.

@james7132 james7132 added C-Feature A new feature, making something new possible A-ECS Entities, components, systems, and events labels Jan 8, 2024
@james7132 james7132 self-requested a review January 8, 2024 02:36
@james7132 james7132 added the S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help label Jan 8, 2024
@james7132
Member

This can be quite performance-sensitive code. Some form of profiling on our existing stress tests would be nice. I'm also hesitant to promote an option that allocates on every call, particularly if it's a Vec being recreated and regrown every tick of the app.

@stepancheg
Contributor Author

@james7132 any suggestions on how to benchmark this?

Overall, I don't believe it will be bad. Malloc is cheap. Moreover, it is quite possible it will actually increase performance, because it gives users an instrument to parallelize code that was too hard to parallelize before.

@NthTensor
Contributor

NthTensor commented Jan 8, 2024

I was helping someone who needed this yesterday. It allows us, for example, to buffer events in parallel and then dispatch them in sequence while preserving order. Seems super useful to me.

Many useful features can degrade performance if misapplied. If the docs are clear about the performance characteristics, I think allocation shouldn't be a blocking issue.

@stepancheg
Contributor Author

I did a lame benchmark.

I ran

cargo run --example many_cubes

and patched the check_visibility function, because it seemed to be a bottleneck.

visible_aabb_query.par_iter_mut().for_each(|query_item| {

By default on my laptop it outputs about 18.5 FPS.

If I replace par_iter with a non-parallel iter, FPS drops to about 15.0. So this function is indeed a bottleneck.

If I rewrite the function with .filter_map_collect(), performance stays at about the same 18.5 FPS. (diff).

I also tried running the same test with --release; performance of all three versions is roughly the same 200 FPS, so the bottleneck is elsewhere.

To get a precise perf difference, a much better benchmark is needed.

So my conclusion is that if filter_map_collect is slower, it is not significantly slower. So seems to be good enough.

github-merge-queue bot pushed a commit that referenced this pull request Jan 9, 2024
# Objective

Issue #10243: rendering multiple triangles in the same place results in
flickering.

## Solution

Considered these alternatives:
- `depth_bias` may not work: with a high number of entities, creating a
material per entity is practically not possible
- rendering at slightly different positions does not work, because when
the camera is far away, float rounding causes the same issues (edit:
assuming we have to use the same `depth_bias`)
- considered implementing a deterministic operation like
`query.par_iter().flat_map(...).collect()` to be used in the
`check_visibility` system (which would solve the issue, since query
iteration is deterministic), but could not figure out how to make it as
cheap as the current approach with thread-local collectors (#11249)

So this adds an option to sort entities after the `check_visibility`
system runs.

This should not be too bad, because after the visibility check, only a
handful of entities remain.

This is probably not the only source of non-determinism in Bevy, but
this is one I could find so far. At least it fixes the repro example.
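The idea of the fix can be sketched std-only (`Entity` here is a hypothetical stand-in for Bevy's entity id, and `make_deterministic` an illustrative helper, not the PR's actual code):

```rust
// Stand-in for Bevy's Entity id, ordered so it can be sorted.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Entity(u32);

// The parallel visibility check can emit entities in any order; a sort
// by id restores a canonical order. This is cheap because only the
// entities that survived the visibility check remain.
fn make_deterministic(mut visible: Vec<Entity>) -> Vec<Entity> {
    visible.sort_unstable();
    visible
}

fn main() {
    let v = vec![Entity(3), Entity(1), Entity(2)];
    println!("{:?}", make_deterministic(v)); // [Entity(1), Entity(2), Entity(3)]
}
```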

## Changelog

- `DeterministicRenderingConfig` option to enable deterministic
rendering

## Test

<img width="1392" alt="image"
src="https://github.com/bevyengine/bevy/assets/28969/c735bce1-3a71-44cd-8677-c19f6c0ee6bd">

---------

Co-authored-by: Alice Cecile <[email protected]>
@hymm
Contributor

hymm commented Jan 9, 2024

There is one bench that tests parallel iteration. You can run it with `cargo bench heavy` when in the `benches` folder. I get around 250-260us a run on my machine, so it is a little noisy.

@stepancheg
Contributor Author

Ok, I tried adding another fake benchmark there.

  • Reduce the iterations from the base benchmark to 10 (to make it a bit more realistic)
  • But increase the batch size 10 times
  • Copy-paste the base benchmark to map_collect (non-parallel) and par_map_collect

diff

The output is

heavy_compute/base
                        time:   [530.31 µs 531.80 µs 533.31 µs]
heavy_compute/par_map_collect
                        time:   [570.93 µs 572.52 µs 574.15 µs]
heavy_compute/map_collect
                        time:   [2.4066 ms 2.4105 ms 2.4144 ms]

In this artificial setup, parallel map_collect is somewhat more expensive than parallel for_each, and, as expected, significantly faster than the non-parallel version.

Which is exactly what is expected: the code is doing what it is supposed to do.

A more correct benchmark would be a comparison with the Local<ThreadLocal<Cell<Vec<bool>>>> shenanigans. That would be a bit more effort to write, and it is probably not what we'd recommend to users anyway.

@hymm
Contributor

hymm commented Jan 9, 2024

Can you compare heavy_compute/base to the main branch?

For checking with many_cubes, we usually use tracy: https://github.com/bevyengine/bevy/blob/main/docs/profiling.md. And show the histogram diffs for the relevant systems.

@stepancheg
Contributor Author

stepancheg commented Jan 9, 2024

I can compare against the main branch, but:

  • criterion is not a reliable benchmarking tool: it can easily show a confident 5% speedup with no changes at all
  • this PR does not change the compilation of the for_each implementation, only its syntax. fold_impl now returns Vec<()>, but
    • it has no allocations
    • it was there before, only hidden, here:

pub fn scope<'env, F, T>(&self, f: F) -> Vec<T>

@stepancheg
Contributor Author

OK, benchmark against main.

On main:

cargo bench -p benches --bench ecs -- heavy_compute --save-baseline main

on this branch:

cargo bench -p benches --bench ecs -- heavy_compute --baseline main
heavy_compute/base      time:   [564.08 µs 565.16 µs 566.09 µs]
                        change: [-0.4176% -0.0620% +0.2685%] (p = 0.73 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

@BenjaminBrienen BenjaminBrienen added D-Straightforward Simple bug fixes and API improvements, docs, test and examples S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged labels Jan 23, 2025
Labels
A-ECS Entities, components, systems, and events C-Feature A new feature, making something new possible D-Straightforward Simple bug fixes and API improvements, docs, test and examples S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged
5 participants