
Iteration over continuous intervals of components #462

Closed
Raikiri opened this issue Apr 11, 2020 · 23 comments
Raikiri commented Apr 11, 2020

Consider a simple bit of code that does multi-component iteration:

registry.view<position, velocity>().each([dt](auto &pos, auto &vel){
  pos += vel * dt;
});

Now I want to vectorize it with SIMD. That has to be done on chunks that are known to be contiguous in memory. For example, this could be exposed via an API like this:

registry.view<position, velocity>().each([dt](auto *pos, auto *vel, size_t count){
  size_t offset = 0;
  // process full SIMD-width chunks of four elements
  for(; offset + 4 <= count; offset += 4)
  {
    MulAdd4(pos + offset, vel + offset, dt);
  }
  // scalar tail for the remaining elements
  for(; offset < count; offset++)
  {
    pos[offset] += vel[offset] * dt;
  }
});

Is it possible to do something like this already? If not, how complicated would it be to implement?

@skypjack skypjack self-assigned this Apr 11, 2020
@skypjack skypjack added the question open question label Apr 11, 2020

indianakernick commented Apr 11, 2020

This is possible with groups.

entt::registry reg;
auto group = reg.group<position, velocity>();
const std::size_t size = group.size();
position *pos = group.raw<position>();
velocity *vel = group.raw<velocity>();
// `pos` and `vel` are parallel arrays with `size` elements

For more information about the inner workings of groups, see this article.


Raikiri commented Apr 11, 2020

Thanks, I have already familiarized myself with groups. However, groups impose strict implicit requirements on what you can and can't do with them; for example, a component can't belong to two non-nested groups at once. That is a fair price to pay to get perfect SoA. However, I do NOT need perfect SoA, so I don't want to pay for it; I want the next best thing: erm, imperfect SoA(?) for free. In the best case it would return the entire raw arrays; in the worst case it would return chunks of size 1, but that's fine.

This is also similar to the type of access flecs provides (although I don't necessarily like their implementation):

flecs::system<Position, Velocity>(world)
        .action([](const flecs::rows& rows, 
            flecs::column<Position> p, 
            flecs::column<Velocity> v) 
        {    
            for (auto row : rows) {
                p[row].x += v[row].x;
                p[row].y += v[row].y;

                std::cout << "Moved " << rows.entity(row).name() << " to {" <<
                    p[row].x << ", " << p[row].y << "}" << std::endl;
            }
        });


skypjack commented Apr 11, 2020

That's because of how archetypes work. You always pay the cost of keeping everything grouped somehow, no matter what. The result is that... well, you have things grouped somehow (and possibly fragmented on the other side).
I went with a completely different design, which is to pay only for what you use. In this case, a sparse-set-based approach fits better than an archetype-based one: you don't pay the grouping cost all the time, but you have other constraints.

the next best thing: erm, imperfect SoA(?) for free. In the perfect scenario it would return the entire raw arrays, in the worst scenario it would return chunks of size 1, but that's fine

Can you provide more details on this? I'm a bit busy today but I'll do my best to look into it.
I like the idea of adding the next best thing if possible. 😉


Raikiri commented Apr 11, 2020

Can you provide more details on this? I'm a bit busy today but I'll do my best to look into it.

Well, imagine we don't create any groups at all (because we don't want to pay for them). We just iterate over a view:

registry.view<position, velocity>().each([dt](auto &pos, auto &vel){
  pos += vel * dt;
});

During the iteration, there can be cases where two or more consecutive calls to this lambda actually point to pos and vel components that are adjacent to each other in memory. What I'm asking for is to merge those contiguous calls into one whenever possible:

[dt](auto *pos, auto *vel, size_t count)

where (auto *pos, auto *vel, size_t count) is an SoA representation of a chunk of components that just happen to be contiguous in memory. In the worst case, a chunk will have length 1, which is identical to what the view API currently does, and the lambda will be called as many times as there are such pairs (say, N). But in the best case (if every entity that has a position also happens to have a velocity), the lambda will be called just once, and (auto *pos, auto *vel) will be arrays of size N storing all the needed components in perfect SoA fashion. For free! Yay!

This provides a low-effort way to get some of the benefits of perfect SoA without paying any cost or imposing any restrictions, just by offering an alternative way to iterate over groups/views that bundles contiguous calls together.

skypjack (Owner) commented

Oh... I must admit that your reasoning surprised me in a way.
Why did you wait so long to come and share this idea? 😄

Actually, by construction, you could spot a lot of these chunks here and there for some types (mainly low-level ones; I don't think so for gameplay stuff).
My only concern at the moment is that entities are iterated twice here: the first time to find out how long a chunk is, the second time in user space. Of course, it's not a problem if you don't use the entities on your side, since the cost is then pretty much the same.

Since you seem to know what you're talking about, I take this opportunity to ask you a question.
The only easy way that comes to mind is to look forward in search of a chunk.
Do you have any other suggestions to explore? All the other options seem to me either complicated or, in any case, bound to consume extra memory. A look-ahead approach would instead be free, from all points of view.


Raikiri commented Apr 11, 2020

The way I imagine it could be done, without any a priori knowledge of how entt works under the hood, is kinda like this (naive pseudocode, as I actually have zero knowledge of how entt works, lol):

template<typename T1, typename T2, typename Func>
void view::each(Func func)
{
  auto chunk_start = begin<T1, T2>();
  auto prev = begin<T1, T2>();
  size_t chunk_size = 1;
  for(auto it = begin<T1, T2>() + 1; it != end<T1, T2>(); it++)
  {
    //checking if the previous element is contiguous in memory with the current one
    //this could be implemented way better if the iterator itself kept track of
    //whether it's contiguous or not for each component
    if(&prev.get<0>() + 1 == &it.get<0>() && &prev.get<1>() + 1 == &it.get<1>())
      chunk_size++;
    else
    {
      func(&chunk_start.get<0>(), &chunk_start.get<1>(), chunk_size);
      chunk_size = 1;
      chunk_start = it;
    }
    prev = it;
  }
  //flush the last chunk
  func(&chunk_start.get<0>(), &chunk_start.get<1>(), chunk_size);
}

Yes, you do have to pay the price of iterating over a range twice (here and in the user function), as you said. But the benefit of manual SIMD optimization in the user function can more than offset this cost.

But you're certainly giving me way too much credit for zero reason, as I have never designed a production ECS myself. I was just surprised to find out it does not have something as obvious as detecting the contiguous chunks that are already there.

indianakernick (Contributor) commented

If you care about performance (and since you're using SIMD, I'm sure you do), then you ought to know how EnTT works. For that, I suggest reading the ECS Back and Forth articles. In particular, this one talks about sparse sets and this one talks about groups. If you want to write fast code with EnTT, knowing the implementation details will help.


Raikiri commented Apr 12, 2020

Yes, I have already read those after creating this thread. Honestly, I did not really like the black-box feeling that the official documentation left. The implementation details can be arbitrarily complex, but I think at least mentioning sparse sets and the way views and groups are iterated would remove a lot of confusion from the documentation, without overloading it with too much information.

skypjack (Owner) commented

Honestly I did not really like black-boxey feeling that the official documentation left.

Actually, there is a PR that I still have to merge, the aim of which is to provide insight into the internals of the library.
Btw, the ECS back and forth series already contains all the details afaik and can be a good starting point. 👍


Raikiri commented Apr 13, 2020

After looking up how entt works internally, it's obvious that we can replace this:

if(&prev.get<0>() + 1 == &it.get<0>() && &prev.get<1>() + 1 == &it.get<1>())

with a better implementation that does not have to do an indirection into component array memory. It can be done like this for example:

if(it.prev_entity<0>() == prev.entity() && it.prev_entity<1>() == prev.entity())

This just checks that the previous entity in all component arrays is the same as the last processed entity. If it is, it means the chunk is contiguous. This is done just by comparing entity indices stored in those dense arrays, without dereferencing into the component arrays at all.

It still means that look-ahead is necessary, but it can be done very cheaply by only accessing the dense entity arrays.

skypjack (Owner) commented

I've looked into this one. Unfortunately, we are missing something important here.
When you iterate over components A, B and C and use, let's say, the pool of A because it's the shortest one, it doesn't mean that all entities from that pool have all the components.
Therefore, we can't just do a look-ahead and eventually use get. We also need to test every entity to know whether it has all the components.
I think this ruins a good part of the benefit you could otherwise get from performing operations a chunk at a time.

That said, what you mention could be used to provide chunked access to partially-owning groups.
In that case, the assumption that all entities in the owned pool are valid holds (because at least one owned component exists), and therefore we can further optimize the iteration with the chunked approach you mentioned.

Any thoughts?


Raikiri commented Apr 14, 2020

You can look ahead, but you do have to check all the components to make sure all of them are present and contiguous. Both of those checks can be accomplished with one read from the sparse array per component. In fact, you already have to read that array to make sure the component is present; you can use the same value to tell whether it's also contiguous.

Once you find an entity that is missing one of its components OR whose component is not adjacent to the previous one (you get all of this from just the value read from the sparse array), you call the user function and use get() to find where the chunk actually starts.

PS: this whole thing is NOT designed to make iteration over entities faster. Furthermore, computing the chunks does take some additional logic. That said, the benefit comes from the user's lambda, which can process things in bulk more effectively than one by one, and the difference can be pretty significant with SIMD.

skypjack (Owner) commented

I made a test during the last days. The result is that iterating elements this way takes more or less three times the time of an each.
I don't think it's worth complicating the codebase for something that is literally unusable. The main problem is that we iterate elements twice: the first time to find the length of the chunk, the second time to visit it.
What was your expectation exactly?


skypjack commented Jul 25, 2020

Nvm. I found a way to make it faster than I expected and even better than what we discussed here in terms of functionality.
I'm pushing it tomorrow or such, as soon as the doc and all tests are ready. 👍


chengts95 commented Apr 13, 2021

Since the chunked function is removed, is there an alternative way to achieve that?

skypjack (Owner) commented

Mmm, yes and no. I'd say it's possible with a custom pool, but honestly I would discourage it.
The main problem is that chunked accesses rely on raw pointer/size pairs. On the other hand, with custom pools and the default storage that is about to become paged, these elements aren't (well, won't be) available/reliable anymore.
So, roughly speaking, we can no longer return a T* to the callback. This probably defeats the purpose of chunked access.
In theory, we could return a pair of non-raw iterators spanning a range that is valid in all pools. However, from what I get, this isn't the desired behavior.
I'd rather use a group or probably any other solution, but I'm open to suggestions/feedback.

chengts95 commented

Currently, I feel that a multi-component view or group is slow compared to perfect SoA solutions when there are many entities.

skypjack (Owner) commented

Well, technically speaking, full owning groups on fully contiguous storages are the only thing that strictly adheres to perfect SoA requirements/definition.
Though, I wouldn't care that much about it in general unless you've measured and spotted a bottleneck in an iteration.
Chunked iteration had its problems too anyway.

Are you experiencing an actual problem at the moment?

chengts95 commented

Yes. I don't know why, but I found that if I have 1000 entities with 5-6 components each, and I frequently use 3 or 4 of those components in a system, it's much slower than an OOP design or perfect SoA.
I think I need detailed profiling to identify the problem, but I have no time to do that. Currently, I am switching to an archetype-based ECS to see the difference.

skypjack (Owner) commented

skypjack commented Apr 13, 2021

It's definitely unlikely that 1000 entities and 5/6 components can cause problems.
My suggestion is to measure and find the real bottleneck.
Good luck with your migration. 👍

chengts95 commented

Looks like erasing the type of a component with poly prevents SIMD optimization. I think maybe it frequently accesses the virtual table. Anyway, thank you for your fast response.


skypjack commented Apr 13, 2021

Mmm... Not sure I get what you mean.
The poly in the registry is a non-owning wrapper around an externally managed pool.
Also, by default, the poly wrapper has an empty API and therefore is never accessed nor used internally.
So, it's unlikely that it affects any SIMD optimization, mainly because it doesn't manage any pool, nor is it used for anything by the registry.

chengts95 commented

I wrapped my component in a poly object and made the poly object own the original component, since I think it's not good to hold a reference to a component in a pool.
