
Evaluate parallel design for filter processing #411

Open
XAMPPRocky opened this issue Oct 4, 2021 · 13 comments
Labels
  • area/filters: Related to Quilkin's filter API.
  • area/performance: Anything to do with Quilkin being slow, or making it go faster.
  • kind/cleanup: Refactoring code, fixing up documentation, etc.
  • kind/feature: New feature or request.
  • priority/medium: Issues that we want to resolve, but don't require immediate resolution.

Comments

@XAMPPRocky
Collaborator

XAMPPRocky commented Oct 4, 2021

We've been discussing over the past couple of months moving Quilkin's internal architecture from our current iterative model (one context is created per packet, and we run that context through the filter chain until we have a result) to a parallel model, where filters run through all available packets at once, with apparent ordering (meaning if &mut X is needed by both the first and second filter in the chain, the system will give access to the first filter, then the second, etc.) when components are shared.

I initially proposed going with an actor model; however, I now think that the best approach to making this parallel is to use an "Entity Component System" (ECS) to represent the processing pipeline. There are a number of benefits specific to ECS that I think make it the stronger proposal.

Benefits

  • Moving to entities and components replaces the current ReadContext and WriteContext structs. Instead we'd have Read and Write "worlds", which contain the context as components associated with each entity that is currently "in-flight".
    • This allows filters to be parallel by default: instead of needing to split up the context, or wrap each field of the context structs in an Arc<RwLock<T>>, each filter would declare a query describing which component types it wants access to and with what mutability, and the ECS framework will resolve the order for us (see the sketch after this list). This also resolves Possible refactor: Make Packet Filtering + Content Routing separate traits #104 without needing a new trait, as filters whose queries don't conflict will just run in parallel.
    • This also enables failing early if a later filter would drop the packet, while the first is still processing.
  • ECS provides better data locality, as storing entities in an archetype or sparse set generally results in better performance due to the CPU's ability to cache the data. Obviously any PRs should be benched before being accepted; we shouldn't regress.
  • This would pair really well with Replace listen distributor task with multithreaded SO_REUSEPORT task. #410 since the entire system can run in parallel with the listen distributor task removed.
  • ECS would also replace our current dynamic_metadata abstraction (closing Replace ReadContext.metadata with typemap. #305), as filters could insert new types as components, and other filters could query for those filter-specific types rather than needing to do a lookup and cast from dyn Any.
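
To make the query idea concrete, here's a rough sketch of what a couple of filters could look like as systems, written with shipyard-style views (one of the candidate libraries under "Unresolved Questions"; the exact API details would need checking against whichever library we pick, and the component names Contents and CaptureResult are made up purely for illustration).

use shipyard::{IntoIter, View, ViewMut};

// Hypothetical components standing in for today's ReadContext fields.
struct Contents(Vec<u8>);     // the packet payload
struct CaptureResult(String); // metadata a capture-style filter might publish

// A compress-style filter becomes a system: its argument list is its query.
// It only asks for mutable access to Contents, so the scheduler knows it can
// run alongside any system that doesn't also write Contents.
fn compress(mut contents: ViewMut<Contents>) {
    for packet in (&mut contents).iter() {
        // compress packet.0 in place...
        let _ = packet;
    }
}

// A later filter can query CaptureResult directly, instead of doing a
// dynamic_metadata lookup and a cast from dyn Any.
fn route_on_token(tokens: View<CaptureResult>, contents: View<Contents>) {
    for (token, _packet) in (&tokens, &contents).iter() {
        // routing decision based on token.0...
        let _ = token;
    }
}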

Drawbacks

  • This requires changing the core Filter trait, which will subsequently require updating the existing filter implementations, as well as some of the proxy server's logic.
  • While ECS is generally more performant in practice, the exact improvements over the status quo, if any, are currently unknown.

Implementation Plan

If we go with this architecture, I believe there is an approach to implementing this that will help offset some of the drawbacks mentioned.

  1. Finish Add initial version of the test subcommand #371.
  2. Move each filter over to quilkin test in its own PR.
    • This will remove some of the work of migrating the tests over to ECS, as the end-to-end tests don't rely on Quilkin internals.
  3. Create a new ecs branch, with the initial changes that work for a single filter.
    • Open a new draft PR with it for review.
    • Once reviewed, we'd migrate over the filter implementations as PRs to the ecs branch.
    • Once all filter implementations have been reviewed and merged in the ecs branch, we'd merge that into main.

Unresolved Questions

  • Which ECS library to go with? Rust has quite a few ECS libraries, and all of them are good. The prominent libraries in my opinion are bevy-ecs, legion, and shipyard. In my current research it seems like Bevy's ECS with sparse set storage is the fastest for the type of workload we want, where we want to create, process, and destroy entities as soon as possible (see [Merged by Bors] - Bevy ECS V2 bevyengine/bevy#1525 for some benchmark comparisons).
XAMPPRocky added the kind/feature, area/performance, kind/cleanup, area/filters, and priority/medium labels on Oct 4, 2021
@markmandel
Member

This is super interesting, and I'm also finding the idea of applying ECS here quite a fascinating use case.

One thing I've never been 100% clear on:

To a parallel model, where filters run through all available packets at once, with apparent order

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

My original theory was that if you have multiple packets flowing through the proxy simultaneously, there's no net performance benefit to being able to Filter in parallel, as the system is going to be using every core to run packets in parallel (as opposed to Filters), at least in that aspect -- but I could well be missing something here, or the aim of this design is to optimise for something else (it does seem to me like it would reduce mutex locking?).

@XAMPPRocky
Collaborator Author

XAMPPRocky commented Oct 6, 2021

What benefit are we aiming at getting with parallel processing of Filters over a byte packet? Are we expecting higher throughput through this model? Something else?

Well, we get parallelism mostly for free with all ECS frameworks; we’ll of course have to measure for performance, but in the abstract running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Imagine a filter chain of a compress filter followed by a rate limit filter. With parallelism, the rate limit filter can flag the packet for deletion while the compress filter is still running. Obviously you wouldn’t want to put those two filters in that order, but it’s an example of different filters in the chain having different amounts of work. So while it might only provide minimal to no performance benefit for an already optimised configuration, it provides a more intuitive user experience and better performance for naive configurations, because there will be no performance difference between placing different amounts of work in the filter chain in different orders, e.g. (compress, rate-limit) and (rate-limit, compress).

@markmandel
Member

but in the abstract running in parallel should reduce work for larger filter chains by allowing earlier termination of the chain.

Aaah yes. That makes a lot of sense 👍🏻 Awesome, that clarifies things for me nicely. Thanks!

We should probably write some of our own benchmarks (but from review, it does sound like Sparse Storage in Bevy ECS is a winner - and also having pluggable storage seems like something that we could tweak over time as well if we have specific needs of our own based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick, so it feels to me like each system should run as a packet is received (maybe???). I'm not quite sure how that would work. What are your thoughts there?

@XAMPPRocky
Collaborator Author

XAMPPRocky commented Oct 7, 2021

We should probably write some of our own benchmarks (but from review, it does sound like Sparse Storage in Bevy ECS is a winner - and also having pluggable storage seems like something that we could tweak over time as well if we have specific needs of our own based on our use cases). Bevy's documentation isn't as nice as legion's, but the docs.rs pages aren't too bad, so we can work it out.

There’s more documentation on the master branch (a bunch was merged in recently). I’ve been talking with some of the people involved in making each of the frameworks, and right now I think we should be looking at shipyard. As it was explained to me, while Bevy’s pluggable storage provides sparse set storage, it doesn’t have the same properties as a fully sparse set backed ECS, namely being able to add/remove components and delete entities without blocking the world.

One thing I'm trying to work out when digging into Bevy's ECS is how to manually run the series of systems over entities and components?

Also, for my own edification, how would the flow work within Quilkin? A traditional ECS would execute on each frame tick, and then match that tick to systems depending on how often they should run -- but we don't have a frame tick, so it feels to me like each system should run as a packet is received (maybe???). I'm not quite sure how that would work. What are your thoughts there?

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet, but I also want to point out that a tick based approach would also work here without async; you would just have a much higher tick rate because you’re not trying to match a display or a game simulation. It’s really no different from what tokio is doing under the hood at the moment, since the async I/O APIs are poll rather than push based anyway.
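
Here's a rough sketch of what that per-packet entry point could look like (same caveat as the earlier sketch on exact shipyard API details; insert_packet, Contents, and on_packet_received are placeholder names rather than a concrete proposal). run_with_data hands its extra argument to the system as the system's first parameter.

use shipyard::{EntitiesViewMut, ViewMut, World};

struct Contents(Vec<u8>); // hypothetical payload component

// Receives the raw datagram via run_with_data and spawns an entity for it.
fn insert_packet(
    datagram: Vec<u8>,
    mut entities: EntitiesViewMut,
    mut contents: ViewMut<Contents>,
) {
    entities.add_entity(&mut contents, Contents(datagram));
}

// Hypothetical per-packet entry point: spawn an entity for the datagram,
// then run the filter-chain systems over the world.
fn on_packet_received(world: &World, datagram: Vec<u8>) {
    world.run_with_data(insert_packet, datagram);
    // world.run(compress); world.run(rate_limit); ...
}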

@XAMPPRocky
Collaborator Author

Also worth mentioning as some prior art is VPP, which @gilesheron brought up in another thread. Just from reading the documentation, it's essentially what we'd be doing here AFAICT.

https://wiki.fd.io/view/VPP/What_is_VPP%3F

@markmandel
Member

Since we also have people that aren't familiar with ECS, this is one of the article series from which I first learned about Entity Component Systems:
http://t-machine.org/index.php/2007/09/03/entity-systems-are-the-future-of-mmog-development-part-1/

Googling shows me there's a lot more content out there explaining the concept than there was back when I first started reading about the pattern 😄

This seems a bit more recent: https://ianjk.com/ecs-in-rust/ , and is likely easier to follow.

@markmandel
Member

Using shipyard’s API you would call world.run or world.run_with_data on receiving the packet

I saw that 👍🏻 Shipyard's documentation is really good 👍🏻 (I could never find the actual function in bevy-ecs to run the series of systems).

a tick based approach would also work here without async

Sounds like something we should test and compare. But it sounds like a definite path forward.

The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several worker Worlds) for the local port, and then a World per Session? 🤔 This seems like something we might want to diagram up a bit before diving into code.

@XAMPPRocky
Collaborator Author

This seems like something we might want to diagram up a bit before diving into code.

FWIW I'm still mostly using this diagram from #339 as my mental model.

[architecture diagram from #339]


The other thing I'm thinking through - how do you see us splitting up Worlds? I don't think we can do one big World instance - maybe we would do one (or several worker Worlds) for the local port, and then a World per Session?

Well, other than for logical complexity, I do think we could do one world, because with sparse storage a mutable query on T would only block other systems querying T. So I think we should separate them logically and have two worlds (one for read, and one for write). That way we can have components that are shared between the two worlds that could have slightly different meanings based on the context. For example, you could have SocketAddr refer to the downstream client in read and to the upstream server in write (not saying we should do that specifically, just an example of what could be done).
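
Structurally I'm imagining something like the following (Pipeline and PeerAddr are made-up names, purely to illustrate the shape rather than a concrete proposal):

use std::net::SocketAddr;
use shipyard::World;

// The same component type can carry a direction-dependent meaning:
// the downstream client's address in `read`, the upstream server's in `write`.
#[allow(dead_code)]
struct PeerAddr(SocketAddr);

// Two logically separate worlds, one per direction.
struct Pipeline {
    read: World,
    write: World,
}

impl Pipeline {
    fn new() -> Self {
        Self {
            read: World::new(),
            write: World::new(),
        }
    }
}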

@markmandel
Member

markmandel commented Oct 13, 2021

So I think we should separate them logically and have two worlds (one for read, and one for write)

That makes sense. I was also debating having 1 world, with a component that indicated whether something was read or write, so that Filters could batch both operations in a single loop, but I think your point re: components having different meanings depending on whether it is read or write makes a lot of sense 👍🏻

This also leans me towards a tick-based model, so that Filter operations can be batched in a loop rather than being an invocation for every packet.

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

🤔 does it make sense to have multiple worlds and distribute packets among them, to also run each one concurrently? Just thinking that if someone has fewer filters than the number of cores on a machine, it seems like we might be leaving some processing power behind?

WDYT?

@XAMPPRocky
Collaborator Author

XAMPPRocky commented Oct 14, 2021

The only other thing I still can't quite work out is that a world can only have one world.run happening at any given point in time. Do we want to be processing packets concurrently? Each World can process Filters in parallel within the ECS system, but that means we are only able to process one packet at a time for read, and the same for write, since each is only a single world.

Are you sure about that? world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access on a shared resource, they should run concurrently.

Either way though, I think the important step in having it be concurrent and parallel is separating the "processing" of the packets from packets entering and exiting. For example, we could have a Status component that determines the current state of the packet. When a packet enters the world it has the Fresh status; we then have a system that collects all of the fresh packets and marks them as Processing. The filter chain runs over all packets marked as Processing, marking each as either Drop or Pass, which determines whether the packet continues, and we have a final system that acts as the sink, discarding Drop packets and forwarding Pass packets.

enum Status {
    Fresh,
    Processing,
    Drop,
    Pass,
}
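
Here's a rough sketch of how those stages could look as systems, reusing the Status enum above (shipyard-flavoured, same caveat on exact API details; rate_limit, over_budget, and Contents are stand-ins rather than proposed names):

use shipyard::{IntoIter, View, ViewMut};

struct Contents(Vec<u8>); // hypothetical payload component

// Promote newly arrived packets into the processing stage.
fn collect_fresh(mut statuses: ViewMut<Status>) {
    for status in (&mut statuses).iter() {
        if matches!(*status, Status::Fresh) {
            *status = Status::Processing;
        }
    }
}

// A stand-in filter: anything marked Processing can be flagged for deletion
// without waiting on whatever the rest of the chain is doing.
fn rate_limit(mut statuses: ViewMut<Status>) {
    for status in (&mut statuses).iter() {
        if matches!(*status, Status::Processing) && over_budget() {
            *status = Status::Drop;
        }
    }
}

// The last step of the chain marks anything still Processing as Pass.
fn finish_chain(mut statuses: ViewMut<Status>) {
    for status in (&mut statuses).iter() {
        if matches!(*status, Status::Processing) {
            *status = Status::Pass;
        }
    }
}

// Final sink: forward Pass packets upstream and discard Drop packets.
fn sink(statuses: View<Status>, contents: View<Contents>) {
    for (status, _packet) in (&statuses, &contents).iter() {
        match status {
            Status::Pass => { /* forward _packet.0 upstream */ }
            Status::Drop => { /* discard / delete the entity */ }
            _ => {}
        }
    }
}

fn over_budget() -> bool { false } // placeholder for real rate limiting state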

@markmandel
Member

Are you sure about that?
Not 100% 😄 This was an assumption based on how I've treated ECS systems in the past.

world.run's signature is &self, and in shipyard's documentation it says it will borrow whatever "View" you provide, which to me says that if you are sharing a world across multiple threads, as long as the views don't require exclusive access on a shared resource, they should run concurrently.

Ooh, that's a good point, and easy to test 👍🏻 SGTM.

@markmandel
Member

Oh to be clear, this looks good to me, and the plan for implementation looks good to me too. The only outstanding question is one of performance comparison, which we can test once we have the initial implementation in place.

@iffyio I know you said you weren't as familiar with ECS, did you have any outstanding thoughts/questions/etc?

@iffyio
Collaborator

iffyio commented Oct 22, 2021

SGTM! nothing to add atm
