
Iteration over continuous intervals of components #462

Closed
Raikiri opened this issue Apr 11, 2020 · 23 comments
Raikiri commented Apr 11, 2020

Consider a simple bit of code that does multi-component iteration:

registry.view<position, velocity>().each([dt](auto &pos, auto &vel){
  pos += vel * dt;
});

Now I want to vectorize it with SIMD. That has to be done on chunks that are known to be contiguous in memory. For example, this could be exposed via an API like this:

registry.view<position, velocity>().each([dt](auto *pos, auto *vel, size_t count){
  size_t offset = 0;
  // process full SIMD-width chunks of four elements
  for(; offset + 4 <= count; offset += 4)
  {
    MulAdd4(pos + offset, vel + offset, dt);
  }
  // scalar tail for the remaining elements
  for(; offset < count; offset++)
  {
    pos[offset] += vel[offset] * dt;
  }
});

Is it possible to do something like this already? If not, how complicated would it be to implement?

@skypjack skypjack self-assigned this Apr 11, 2020
@skypjack skypjack added the question open question label Apr 11, 2020

indianakernick commented Apr 11, 2020

This is possible with groups.

entt::registry reg;
auto group = reg.group<position, velocity>();
const std::size_t size = group.size();
position *pos = group.raw<position>();
velocity *vel = group.raw<velocity>();
// `pos` and `vel` are parallel arrays with `size` elements

For more information about the inner workings of groups, see this article.


Raikiri commented Apr 11, 2020

Thanks, I have already familiarized myself with groups. However, groups impose strict implicit requirements on what you can and can't do with them; for example, a component can't belong to two non-nested groups at once. That is a fair price to pay to get perfect SoA. However, I do NOT need perfect SoA, so I don't want to pay for it; I want the next best thing: erm, imperfect SoA(?) for free. In the best case it would return the entire raw arrays; in the worst case it would return chunks of size 1, but that's fine.

This is also similar to the type of access flecs provides (although I don't necessarily like their implementation):

flecs::system<Position, Velocity>(world)
        .action([](const flecs::rows& rows, 
            flecs::column<Position> p, 
            flecs::column<Velocity> v) 
        {    
            for (auto row : rows) {
                p[row].x += v[row].x;
                p[row].y += v[row].y;

                std::cout << "Moved " << rows.entity(row).name() << " to {" <<
                    p[row].x << ", " << p[row].y << "}" << std::endl;
            }
        });


skypjack commented Apr 11, 2020

That's because of how archetypes work. You always pay the cost of keeping everything grouped somehow, no matter what. The result is that... well, you have things grouped somehow (and possibly fragmented on the other side).
I went with a completely different design, which is to pay only for what you use. In this case, a sparse-set-based approach fits better than an archetype-based one: you don't pay the grouping cost all the time, but you have other constraints.

the next best thing: erm, imperfect SoA(?) for free. In the perfect scenario it would return the entire raw arrays, in the worst scenario it would return chunks of size 1, but that's fine

Can you provide more details on this? I'm a bit busy today but I'll do my best to look into it.
I like the idea of adding the next best thing if possible. 😉


Raikiri commented Apr 11, 2020

Can you provide more details on this? I'm a bit busy today but I'll do my best to look into it.

Well, imagine we don't create any groups at all (because we don't want to pay for them). We just iterate over a view:

registry.view<position, velocity>().each([dt](auto &pos, auto &vel){
  pos += vel * dt;
});

During the iteration, there can be cases where two or more consecutive calls to this lambda actually point to pos and vel components that are adjacent to each other in memory. What I'm asking for is to merge those contiguous calls into one whenever possible:

[dt](auto *pos, auto *vel, size_t count)

where (auto *pos, auto *vel, size_t count) is an SoA representation of a chunk of components that just happen to be contiguous in memory. In the worst case, a chunk will have length 1, which is identical to what the view API currently does, and the lambda will be called as many times as there are such pairs (say, N). But in the best case (if every entity that has a position also happens to have a velocity), the lambda will be called just once, and (auto *pos, auto *vel) will be arrays of size N storing all the needed components in perfect SoA fashion. For free! Yay!

This provides a low-effort way to get some of the benefits of perfect SoA without paying any cost or imposing any restrictions, just by offering an alternative way to iterate over groups/views that bundles contiguous calls together.

skypjack (Owner) commented

Oh... I must admit that your reasoning surprised me in a way.
Why did you wait so long to come and share this idea? 😄

Actually, by construction, you could spot a lot of these chunks here and there for some types (mainly low-level ones; I don't think so for gameplay stuff).
My only concern at the moment is that entities are iterated twice here: the first time to find out how long a chunk is, the second time in user space. Of course, it's not a problem if you don't use the entities on your side, since the cost is then pretty much the same.

Since you seem to know what you're talking about, I take this opportunity to ask you a question.
The only easy way that comes to mind is to look forward in search of a chunk.
Do you have any other suggestions to explore? All the other options seem to me either complicated or, in any case, bound to consume extra memory. A look-ahead approach would instead be free, from all points of view.


Raikiri commented Apr 11, 2020

The way I imagine it could be done, without any a priori knowledge of how entt works under the hood, is kinda like this (naive pseudocode, as I actually have zero knowledge of how entt works, lol):

template<typename T1, typename T2, typename Func>
void view::each(Func func)
{
  auto chunk_start = begin<T1, T2>();
  auto prev = begin<T1, T2>();
  size_t chunk_size = 1;
  for(auto it = begin<T1, T2>() + 1; it != end<T1, T2>(); it++)
  {
    //checking if the previous element is contiguous in memory with the current one
    //this could be implemented way better if the iterator itself kept track of
    //whether it's contiguous or not for each component
    if(&prev.get<0>() + 1 == &it.get<0>() && &prev.get<1>() + 1 == &it.get<1>())
      chunk_size++;
    else
    {
      func(&chunk_start.get<0>(), &chunk_start.get<1>(), chunk_size);
      chunk_size = 1;
      chunk_start = it;
    }
    prev = it;
  }
  //flush the last chunk
  func(&chunk_start.get<0>(), &chunk_start.get<1>(), chunk_size);
}

Yes, you do have to pay the price of iterating over a range twice (here and in the user function), as you said. But the benefit of manual SIMD optimization in the user function can more than offset this cost.

But you're certainly giving me way too much credit for zero reason, as I have never designed a production ECS myself. I was just surprised to find out it does not have something as obvious as detecting the contiguous chunks that are already there.

indianakernick (Contributor) commented

If you care about performance (and since you're using SIMD, I'm sure you do), then you ought to know how EnTT works. For that, I suggest reading the ECS Back and Forth articles. In particular, this one talks about sparse sets and this one talks about groups. If you want to write fast code with EnTT, knowing the implementation details will help.


Raikiri commented Apr 12, 2020

Yes, I have already read those after creating this thread. Honestly, I did not really like the black-box feeling that the official documentation left. The implementation details can be arbitrarily complex, but I think at least mentioning sparse sets and the way views and groups are iterated would remove a lot of confusion from the documentation, without overloading it with too much information.

skypjack (Owner) commented

Honestly I did not really like black-boxey feeling that the official documentation left.

Actually, there is a PR that I still have to merge, the aim of which is to provide insight into the internals of the library.
Btw, the ECS back and forth series already contains all the details afaik and can be a good starting point. 👍


Raikiri commented Apr 13, 2020

After looking up how entt works internally, it's obvious that we can replace this:

if(&prev.get<0>() + 1 == &it.get<0>() && &prev.get<1>() + 1 == &it.get<1>())

with a better implementation that does not have to do an indirection into component array memory. It can be done like this for example:

if(it.prev_entity<0>() == prev.entity() && it.prev_entity<1>() == prev.entity())

This just checks that the previous entity in all component arrays is the same as the last processed entity. If it is, it means the chunk is contiguous. This is done just by comparing entity indices stored in those dense arrays, without dereferencing into the component arrays at all.

It still means that look-ahead is necessary, but it can be done very cheaply by only accessing the dense entity arrays.

skypjack (Owner) commented

I've looked into this one. Unfortunately, we are missing something important here.
When you iterate over components A, B and C and use, let's say, the pool of A because it's the shortest one, it doesn't mean that all entities from that pool have all the components.
Therefore, we can't just do a look-ahead and eventually use get. We also need to test every entity to know whether it has all the components.
I think this ruins a good part of the benefit you could otherwise get from performing operations a chunk at a time.

That said, what you mention could be used to provide chunked access to partially-owning groups.
In that case, the assumption that all entities in the owned pool are valid holds (because at least one owned component exists), and therefore we can further optimize the iteration with the chunked approach you mentioned.

Any thoughts?


Raikiri commented Apr 14, 2020

You can look ahead, but you do have to check all the components to make sure all of them are present and contiguous. Both of those checks can be accomplished with one read from the sparse array per component. In fact, you already have to read that array to make sure the component is present; you can use the same value to tell whether it's also contiguous.

Once you find an entity that is missing one of its components OR whose component is not adjacent to the previous one (you get all of this from just the value read from the sparse array), you call the user function and use get() to find where the chunk actually starts.

PS: this whole thing is NOT designed to make iteration over entities faster. Furthermore, computing the chunks does take some additional logic. That said, the benefit comes from the user's lambda, which can process things in bulk more effectively than one by one, and the difference can be pretty significant with SIMD.

skypjack (Owner) commented

I made a test during the last days. The result is that iterating elements this way takes more or less three times the time of an each.
I don't think it's worth complicating the codebase for something that is literally unusable. The main problem is that we iterate elements twice: the first time to find the length of the chunk, the second time to visit it.
What was your expectation exactly?


skypjack commented Jul 25, 2020

Nvm. I found a way to make it faster than I expected and even better than what we discussed here in terms of functionality.
I'm pushing it tomorrow or such, as soon as the doc and all tests are ready. 👍


chengts95 commented Apr 13, 2021

Since the chunked function is removed, is there an alternative way to achieve that?

skypjack (Owner) commented

Mmm, yes and no. I'd say it's possible with a custom pool, but honestly I would discourage it.
The main problem is that chunked accesses rely on raw pointer/size pairs. On the other hand, with custom pools and the default storage that is about to become paged, these elements aren't (well, won't be) available/reliable anymore.
So, roughly speaking, we can no longer return a T* to the callback. This probably defeats the purpose of chunked access.
In theory, we could return a pair of non-raw iterators spanning a range that is valid in all pools. However, from what I get, this isn't the desired behavior.
I'd rather use a group or probably any other solution, but I'm open to suggestions/feedback.

chengts95 commented

Currently, I feel that a multi-component view or group is slow compared to perfect SoA solutions when there are many entities.

skypjack (Owner) commented

Well, technically speaking, full owning groups on fully contiguous storages are the only thing that strictly adheres to perfect SoA requirements/definition.
Though, I wouldn't care that much about it in general unless you've measured and spotted a bottleneck in an iteration.
Chunked iteration had its problems too anyway.

Are you experiencing an actual problem at the moment?

chengts95 commented

Yes. I don't know why, but I found that if I have 1000 entities with 5-6 components each, and I frequently use 3 or 4 of those components in a system, it's much slower than an OOP design or perfect SoA.
I think I need detailed profiling to identify the problem, but I have no time to do that. Currently, I am switching to an archetype-based ECS to see the difference.

skypjack (Owner) commented

skypjack commented Apr 13, 2021

It's definitely unlikely that 1000 entities and 5/6 components can cause problems.
My suggestion is to measure and find the real bottleneck.
Good luck with your migration. 👍

chengts95 commented

Looks like erasing the type of a component with poly prevents SIMD optimization. I think maybe it frequently accesses the virtual table. Anyway, thank you for your fast response.


skypjack commented Apr 13, 2021

Mmm... Not sure I get what you mean.
The poly in the registry is a non-owning wrapper around an externally managed pool.
Also, by default, the poly wrapper has an empty API and therefore is never accessed nor used internally.
So, it's unlikely that it affects any SIMD optimization, mainly because it doesn't manage any pool, nor is it used for anything by the registry.

chengts95 commented

I wrapped my component in a poly object and made the poly object own the original component, since I think it's not good to hold a reference to a component in a pool.
