
Reduce Visitor overhead #4418

Open · asl opened this issue Feb 13, 2024 · 17 comments
Assignees: asl
Labels: core, enhancement

asl commented Feb 13, 2024

Visitors are central objects in the P4C architecture. However, the current implementation imposes a lot of overhead and has some inefficiencies.

Let me name some of those that drew my attention:

  1. They create lots of GC/malloc traffic. One of the main sources of this is tracking visited nodes. For example, on the P4CParserUnroll.switch_20160512 test case we have 50k calls to Inspector::init_apply and 138k calls to Transform::init_apply; on real-world sources I have seen cases 10x larger. Each such call creates a new map to track visited nodes (just visited plus a few bits for Inspector, and a ChangeTracker for Modifier and Transform). For some reason these maps are wrapped in shared_ptrs, though I do not see where they are copied. ForwardChildren takes a reference to it.

What matters is that the majority of these maps are small. For those 50k calls to Inspector we have 10k maps of size 1 (essentially a node with no children), another 10k maps of size 2 (a single child and nothing else), and approximately 45k (so 90% of the total) maps of size <= 16. Rehashing and the corresponding memory allocation take a visible amount of time in the overall profile. Note that std::unordered_map is quite malloc-heavy due to the restrictions and guarantees required by the C++ standard. The situation with Transforms is similar – 85% of nodes are contained in small maps (< 16 entries).

We definitely need to optimize for the small-map case, e.g. by pre-allocating space and implementing a map adapter on top of a plain vector (aka flat_map); a sketch of such an adapter is below. As the evidence shows, a linear scan over a few vector elements won't be much slower than a hash table lookup.

The number of map lookups is also quite large. We should probably switch from std::unordered_map to a more performant solution.
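For illustration, a minimal sketch of such a vector-backed map adapter (hypothetical names, not existing p4c classes; a production version would spill to a real hash map past a size threshold, and the in-tree hvec_map mentioned later in this thread targets the same problem):

```cpp
// Hypothetical sketch: a map adapter backed by a plain vector, looked up by
// linear scan. For the small sizes that dominate the visited-tracker
// statistics above, the scan is cache-friendly and avoids the per-node
// allocations of std::unordered_map.
#include <cstddef>
#include <utility>
#include <vector>

template <typename K, typename V>
class FlatMap {
    std::vector<std::pair<K, V>> storage_;  // unsorted key/value pairs

 public:
    explicit FlatMap(std::size_t expectedSize = 16) { storage_.reserve(expectedSize); }

    V *find(const K &key) {
        for (auto &entry : storage_)
            if (entry.first == key) return &entry.second;
        return nullptr;
    }

    // Returns the value for `key`, default-constructing it if absent.
    V &operator[](const K &key) {
        if (V *v = find(key)) return *v;
        storage_.emplace_back(key, V());
        return storage_.back().second;
    }

    std::size_t size() const { return storage_.size(); }
};
```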

  2. Transforms always clone, regardless of whether a node is changed or not. This should likely be changed to some lazy implementation. However, it seems that some passes rely heavily on the way nodes are currently traversed and on the implementation details there.

  3. Visitors are written in a recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.

  4. Visitors use textbook double dispatch, so every apply essentially costs 2 virtual function calls for each node.

Now this could be solved with static dispatch: with pre-defined typeids we can just switch over them, eliminating one virtual function call, and the second virtual function call could be eliminated via CRTP.
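For illustration, a minimal sketch of this dispatch scheme with hypothetical names (the real p4c IR class list and visitor interfaces are generated and look different):

```cpp
// Hypothetical sketch of static dispatch: every node carries a pre-defined
// kind tag, a switch over the tag replaces the first virtual call, and CRTP
// (static_cast to the derived visitor) replaces the second one, so empty
// preorder cases can be inlined away entirely.
enum class NodeKind { Constant, PathExpression /*, ... one entry per IR class */ };

struct Node {
    NodeKind kind;  // assigned once at construction; no virtual typeid needed
    explicit Node(NodeKind k) : kind(k) {}
};
struct Constant : Node { Constant() : Node(NodeKind::Constant) {} };
struct PathExpression : Node { PathExpression() : Node(NodeKind::PathExpression) {} };

template <class Derived>
struct StaticInspector {
    void dispatch(const Node *n) {
        auto *self = static_cast<Derived *>(this);
        switch (n->kind) {
            case NodeKind::Constant:
                self->preorder(static_cast<const Constant *>(n));
                break;
            case NodeKind::PathExpression:
                self->preorder(static_cast<const PathExpression *>(n));
                break;
        }
    }
    // Default no-op overloads; a concrete visitor shadows only the ones it needs.
    void preorder(const Constant *) {}
    void preorder(const PathExpression *) {}
};

struct CountConstants : StaticInspector<CountConstants> {
    using StaticInspector<CountConstants>::preorder;  // keep the no-op defaults visible
    int count = 0;
    void preorder(const Constant *) { ++count; }
};
```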

That said, these are not easy or isolated changes, as Visitors are everywhere :) I will be contributing patches for some of these issues. Spoiler: for some downstream code we saw a 4x reduction in compile time by removing various overheads here and there.

asl added the core and enhancement labels Feb 13, 2024
asl self-assigned this Feb 13, 2024

vlstill commented Feb 13, 2024

For some reason these maps are wrapped in shared_ptrs, though I do not see where they are copied. ForwardChildren takes a reference to it.

Could this be because of the SplitFlowVisitor that creates copies (clones) of visitors? That one is a nasty one :-(.

What matters is that the majority of these maps are small. For those 50k calls to Inspector we have 10k maps of size 1 (essentially a node with no children), another 10k maps of size 2 (a single child and nothing else), and approximately 45k (so 90% of the total) maps of size <= 16. Rehashing and the corresponding memory allocation take a visible amount of time in the overall profile. Note that std::unordered_map is quite malloc-heavy due to the restrictions and guarantees required by the C++ standard. The situation with Transforms is similar – 85% of nodes are contained in small maps (< 16 entries).

Interesting, seems like we would be able to improve it a lot.

We definitely need to optimize for the small-map case, e.g. by pre-allocating space and implementing a map adapter on top of a plain vector (aka flat_map). As the evidence shows, a linear scan over a few vector elements won't be much slower than a hash table lookup.

The number of map lookups is also quite large. We should probably switch from std::unordered_map to a more performant solution.

Definitely, some open addressing/closed hashing scheme (without overflow buckets) would likely work better for the "large" case based on my experience.
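For illustration, a rough sketch of what such a scheme could look like (hypothetical, not existing p4c code) -- a single flat slot array with linear probing, so there are no per-element bucket allocations:

```cpp
// Hypothetical sketch: an open-addressing "visited" set keyed by node
// pointers, using linear probing and a power-of-two table size.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct VisitedSet {
    std::vector<const void *> slots;  // nullptr marks an empty slot
    std::size_t used = 0;

    explicit VisitedSet(std::size_t capacity = 64) : slots(capacity, nullptr) {}

    // Returns false if `p` was already present, true if it was newly inserted.
    bool insert(const void *p) {
        if (2 * (used + 1) > slots.size()) grow();  // keep load factor <= 50%
        std::size_t mask = slots.size() - 1;
        std::size_t i = (reinterpret_cast<std::uintptr_t>(p) >> 4) & mask;
        while (slots[i] != nullptr) {
            if (slots[i] == p) return false;
            i = (i + 1) & mask;  // linear probe, wraps around
        }
        slots[i] = p;
        ++used;
        return true;
    }

 private:
    void grow() {
        VisitedSet bigger(slots.size() * 2);
        for (const void *p : slots)
            if (p != nullptr) bigger.insert(p);
        *this = std::move(bigger);
    }
};
```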

2. Transforms always clone, regardless of whether a node is changed or not. This should likely be changed to some lazy implementation. However, it seems that some passes rely heavily on the way nodes are currently traversed and on the implementation details there.

I was thinking about this too. The big problem, I think, is the forwarding of children and seeing the right children in postorder. That seems to be a reasonable requirement, but it makes the general case quite hard to implement without cloning.

We could possibly have a lightweight transform for preorder-only cases, or we would need some other way of passing the children to the {pre,post}order functions than in a clone of the current node. I was thinking about the latter for a different reason as well -- the current Transform is a pain to work with when one wants to lower IR into nodes that do not neatly fit into the abstract classes defined in the P4 IR (i.e. when one lowers statements and expressions into something else). The visitor can't really handle this case, and it is necessary to wrap the values into artificial wrappers to meet the required invariants.

So far I have not found anything that would not require a major extension of the IR hierarchies' functionality, though (something like a fold from functional programming -- i.e. a postorder function taking not only the original node but also all the processed children -- would work, but it would be quite tricky and require uglier boilerplate than the visitors).
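For illustration, a rough sketch of what such a fold-style interface could look like (hypothetical, not a proposed API; the boilerplate concern is that a real version would need to be generated per IR class):

```cpp
// Hypothetical fold sketch: the hook runs bottom-up and receives the original
// node together with the already-processed results of its children, so it can
// produce values of an arbitrary result type without cloning the parent.
#include <vector>

struct Node {
    std::vector<const Node *> children;  // stand-in for IR::Node's children
};

template <typename Result>
struct Folder {
    virtual Result fold(const Node *original, const std::vector<Result> &childResults) = 0;
    virtual ~Folder() = default;
};

// Bottom-up driver: children are folded first, then the parent sees their results.
template <typename Result>
Result foldTree(const Node *n, Folder<Result> &f) {
    std::vector<Result> childResults;
    childResults.reserve(n->children.size());
    for (const Node *c : n->children) childResults.push_back(foldTree(c, f));
    return f.fold(n, childResults);
}
```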

3. Visitors are written in a recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.

4. Visitors use textbook double dispatch, so every `apply` essentially costs 2 virtual function calls for each node.

Is this really a problem, though? The only alternative would be RTTI-based dispatch, which would have just one huge switch (or something like that), but would that be significantly faster than two virtual calls? And a virtual call would be needed to cast to the right type anyway (and to visit children).

Now this could be solved with static dispatch: with pre-defined typeids we can just switch over them, eliminating one virtual function call, and the second virtual function call could be eliminated via CRTP.

That said, these are not easy or isolated changes, as Visitors are everywhere :) I will be contributing patches for some of these issues. Spoiler: for some downstream code we saw a 4x reduction in compile time by removing various overheads here and there.

Nice, thank you. I wonder what you have. I would definitely be interested in improvements around the maps and the transform. I am a little skeptical about the double-dispatch elimination, but I will wait and see what parts you have in your codebase.

asl commented Feb 13, 2024

The big problem, I think, is the forwarding of children and seeing the right children in postorder. That seems to be a reasonable requirement, but it makes the general case quite hard to implement without cloning.

I would say even more: some of the current passes exploit the fact that children are already "finalized" (e.g. when processing the second child they expect that the first one has already been transformed and put into its proper place in the parent node). In such a case it is indeed better to have clones.

Is this really a problem, though? The only alternative would be RTTI-based dispatch, which would have just one huge switch (or something like that), but would that be significantly faster than two virtual calls? And a virtual call would be needed to cast to the right type anyway (and to visit children).

The current round-trip requires 4 function calls: two of them might be inlined and two of them are virtual. The CRTP approach would require one virtual call for the typeid (actually, we can make it non-virtual by storing the typeid in a member, but that is a separate story). Casting is static from base to immediate derived class both for Visitor and for Node, with no virtual bases involved, so there will be at least one fewer virtual call. Plus, empty cases could potentially be inlined into that switch, simplifying it a bit more.

Children are separate and orthogonal. I would say that the recursive approach produces gigantic call stacks and essentially hides the flow, making overall reasoning and profiling quite hard.

That said, there are lots of possible small improvements here and there. Some of them likely contribute only ~0.5% each, so the effect may only be visible when they are combined.

fruffy commented Feb 13, 2024

@ChrisDodd

fruffy commented Feb 13, 2024

That said, these are not easy or isolated changes, as Visitors are everywhere :)

One way to start with this is to introduce a new visitor class in visitor.h and selectively replace the existing derived visitors. This way you can isolate the fallout. It could even be a copy of the existing Transform with modifications.

asl commented Feb 13, 2024

That said, these are not easy or isolated changes, as Visitors are everywhere :)

One way to start with this is to introduce a new visitor class in visitor.h and selectively replace the existing derived visitors. This way you can isolate the fallout. It could even be a copy of the existing Transform with modifications.

Thanks for the tip! Sure, this is the way forward ;)

asl commented Feb 22, 2024

So, I tried to understand the logic behind the shared_ptr for visited trackers inside Visitors, and it seems it does nothing w.r.t. flow clones, etc. The shared_ptrs were introduced in #3168 as a way to "better track lifetime". This raises some questions, as I am likely missing something:

  1. Why was shared_ptr used and not unique_ptr, or just a plain delete on the pointer? Note that while SplitFlowVisitor pretends to do some "cloning" (though by default these are just shallow clones – we just copy the Visitor pointer), the overall logic is not correct w.r.t. the liveness of visited: we explicitly reset it after apply, so if someone else already owned this shared pointer, the behavior is undefined per the standard. It looks like the object is entirely owned by the Visitor: its lifecycle is controlled by init_apply and apply_visitor (the latter does not make sense to me; it would be better released in end_apply – currently we do an explicit ctx check to ensure that it is the "last"/top-level apply_visitor).

  2. Inspector / Modifier / Transform are not really deep-copied; instead, pointers to them fly here and there. So maybe we can be more explicit about the lifetimes of different things? Make visitors move-only (as they are stateful) and enforce proper semantics when cloning is necessary?

Here is the rationale for all this: in the majority of cases these visited trackers are small. They contain fewer than 16 nodes, and often only 1-2. Therefore, as stated above, I'd like to introduce a small-map optimization: let the visitor have pre-allocated space for, say, 16 records and switch to larger hash maps only when necessary. This way we will get rid of these extra memory allocations entirely and significantly reduce GC traffic (see the numbers above).
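A rough sketch of that idea, with hypothetical names (the real visited / ChangeTracker structures in p4c hold more per-node state than a single flag):

```cpp
// Hypothetical sketch of a small-size-optimized visited tracker: the first N
// entries live in inline storage inside the visitor (no allocation at all),
// and only trackers that outgrow N spill into a heap-allocated hash map.
#include <array>
#include <cstddef>
#include <unordered_map>

class Node;  // stand-in for IR::Node

template <std::size_t N = 16>
class VisitedTracker {
    std::array<const Node *, N> inlineSlots_{};     // inline storage
    std::size_t inlineCount_ = 0;
    std::unordered_map<const Node *, bool> spill_;  // used only past N entries

 public:
    // Returns true if the node was seen before, false if it is newly recorded.
    bool setVisited(const Node *n) {
        for (std::size_t i = 0; i < inlineCount_; ++i)
            if (inlineSlots_[i] == n) return true;
        if (!spill_.empty() && spill_.count(n) != 0) return true;
        if (inlineCount_ < N)
            inlineSlots_[inlineCount_++] = n;
        else
            spill_.emplace(n, true);
        return false;
    }
};
```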

But maybe I'm missing something and there is some backstory here?

@vlstill @ChrisDodd @grg @fruffy do you happen to know?

fruffy commented Feb 22, 2024

I personally do not know much about the choices made for the visitors, so your guess is as good as mine.

When does a visitor need to be copied, for example?

asl commented Feb 23, 2024

When does a visitor need to be copied, for example?

The first case is when we're adding a Visitor to a PassManager via const reference. The second case is the default implementation of ControlFlowVisitor::flow_clone(). In both cases Visitor::clone() is called.

However, the particular clone implementations seem to show different behaviours:

  • DoLocalCopyPropagation invokes the default copy constructor.
  • TypeInference only copies itself, not its base classes.

No particular clone is defined for Inspector / Modifier / Transform. I do not know what's downstream :)

IMO all this is very misleading and could lead to subtle issues everywhere. We certainly need to stop accepting a Visitor by const reference in PassManager; it only makes sense for stateless Inspectors and similar visitors (and it seems to be unused in the upstream codebase).

ChrisDodd commented

Lots of points here

For some reason these maps are wrapped in shared_ptrs, though I do not see where they are copied.

This comes from flow_clone/flow_merge, which clone the Visitor (pointing at the same tracker). Though the way things are structured, the first copy of the visitor (which creates the tracker) should always also be the last one alive, so perhaps that could be leveraged to avoid the shared_ptr overhead if it is significant.

Transforms always clone, regardless of whether a node is changed or not.

Yes, this is a sore point. The original C code this is based on would do that clone on the stack (creating a copy as a local variable in apply_visitor) and only copy it to the heap at the end if any changes were made. Unfortunately, there's no good way of doing that in C++. Maybe a union of all possible IR types and a placement-new clone? This has a problem if a transform preorder/postorder method tries to stash away a pointer to the clone, which should not be done.
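For illustration, a rough sketch of the stack-clone idea under simplifying assumptions (a fixed upper bound on node size and hypothetical cloneInto/equiv hooks; the real IR hierarchy would need generated support for this):

```cpp
// Hypothetical sketch: clone the node into a stack buffer and only pay for a
// heap clone at the end if the transform actually changed something. The
// stack copy must never escape -- exactly the caveat mentioned above.
#include <cstddef>
#include <new>

struct Node {
    virtual Node *cloneInto(void *storage) const = 0;  // hypothetical placement-new clone
    virtual Node *clone() const = 0;                    // ordinary heap clone
    virtual bool equiv(const Node &other) const = 0;    // structural comparison
    virtual ~Node() = default;
};

constexpr std::size_t kMaxNodeSize = 256;  // assumed upper bound on IR node sizes

template <typename TransformFn>
const Node *applyOnStack(const Node *original, TransformFn &&transform) {
    alignas(std::max_align_t) unsigned char storage[kMaxNodeSize];
    Node *copy = original->cloneInto(storage);  // lives on the stack
    transform(copy);                            // preorder/postorder mutate the stack copy
    const Node *result = copy->equiv(*original)
                             ? original         // unchanged: no heap allocation at all
                             : copy->clone();   // changed: one heap copy, made once
    copy->~Node();
    return result;
}
```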

When we were first moving to using GC, I assumed that the garbage collector's malloc would be very efficient (as it just needs to bump a pointer), so the benefit of trying to use a stack clone would be minimal.

Visitors use textbook double dispatch, so every apply essentially costs 2 virtual function calls for each node.

Yes. When this code was written, it was assumed that by the time C++17 came out, the language would have multiple dispatch and implementations would catch up with implementing it efficiently. That never happened. It's a classic problem with C++ -- lots of things never get implemented efficiently because "no one uses them", but the only reason no one uses them is that they are inefficient, so everyone comes up with hacks to work around the problem rather than just writing clean code that expresses what it is doing.

Visitors are written in a recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.

Yes. Referring back to the original C code that inspired this, it used a manually managed stack (a fixed-size array) of Context structures (which could be a local variable in apply_visitor, or allocated elsewhere for systems with limited stack space such as Windows drivers) and would only make a recursive call when that fixed array filled up. Doing the same in C++ should be possible, but non-trivial.
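A rough sketch of that scheme (hypothetical structures, not the actual Visitor::Context in p4c):

```cpp
// Hypothetical sketch of an explicit traversal stack: a fixed-size array of
// frames replaces native recursion, and a real recursive call is made only
// when that array fills up.
#include <cstddef>
#include <vector>

struct Node {
    std::vector<const Node *> children;  // stand-in for IR::Node's children
};

struct Frame {
    const Node *node;
    std::size_t nextChild;  // index of the next child to descend into
};

constexpr std::size_t kMaxDepth = 256;

template <typename Visit>
void traverse(const Node *root, Visit &&visit) {
    Frame stack[kMaxDepth];
    std::size_t top = 0;
    stack[top++] = {root, 0};
    while (top > 0) {
        Frame &f = stack[top - 1];
        if (f.nextChild == 0) visit(f.node);  // "preorder" hook on first touch
        if (f.nextChild < f.node->children.size()) {
            const Node *child = f.node->children[f.nextChild++];
            if (top < kMaxDepth)
                stack[top++] = {child, 0};    // push a frame instead of recursing
            else
                traverse(child, visit);       // overflow: fall back to real recursion
        } else {
            --top;                            // all children done, pop the frame
        }
    }
}
```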

What matters is that the majority of these maps are small. ... Note that std::unordered_map is quite malloc-heavy

Yes, this is another place where hvec_map should be an improvement.

asl commented Feb 23, 2024

When we were first moving to using GC, I assumed that the garbage collector's malloc would be very efficient (as it just needs to bump a pointer), so the benefit of trying to use a stack clone would be minimal.

Sadly, it is not. BDWGC is not the fastest malloc implementation itself. Also, there are additional overheads, as it essentially does memset(ptr, 0, size) on all allocated/freed memory (to remove possible dangling pointers)...

fruffy commented Feb 23, 2024

IMO all this is very misleading and could lead to subtle issues everywhere. We certainly need to stop accepting a Visitor by const reference in PassManager

Intuitively I would say a PassManager should own any visitor passed to it. I can't think of a sensible scenario where a visitor is copied into a pass manager and still used independently.

asl commented Feb 23, 2024

Intuitively I would say a PassManager should own any visitor passed to it. I can't think of a sensible scenario where a visitor is copied into a pass manager and still used independently.

There are cases when an Inspector is used in a pipeline and later the collected results are reused. See e.g. https://github.com/p4lang/p4c/blob/main/frontends/p4/frontend.cpp#L238

fruffy commented Feb 23, 2024

There are cases when an Inspector is used in a pipeline and later the collected results are reused. See e.g. https://github.com/p4lang/p4c/blob/main/frontends/p4/frontend.cpp#L238

Ah, I definitely have used this pattern before. Although I question whether doing that within a PassManager is a good idea. The VisitorRef implementation seems a little muddy, especially considering the unclear semantics of clone as you pointed out.

asl commented Feb 23, 2024

Ah, I definitely have used this pattern before. Although I question whether doing that within a PassManager is a good idea. The VisitorRef implementation seems a little muddy, especially considering the unclear semantics of clone as you pointed out.

And actually clone there is never used. At least upstream.

asl commented Nov 3, 2024

To add a bit more on the topic of excessive clones: the majority of these clones are pointless. For the P4CParserUnroll.switch_20160512 test from gtest there are 8,430,214 node clones overall, of which only 303,880 are used further on – so less than 4%. I guess for larger apps the ratio is even lower.

ChrisDodd commented

I think the questions of Visitor::clone vs. Node::clone are completely orthogonal. Both have opportunities for improvement, but in completely different ways. I still have some ideas kicking around to reduce Node cloning by implementing a smart copy-on-write pointer and using that to reimplement Transform, but it's pretty low on my priority list.
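For illustration, a rough sketch of what such a copy-on-write pointer might look like (hypothetical; a real version would have to integrate with IR::Node::clone, the visitor context, and the GC):

```cpp
// Hypothetical sketch of a copy-on-write node pointer: reads go through the
// original const node, and the first write clones it, so untouched subtrees
// are never copied at all.
template <typename T>  // T is assumed to provide `T *clone() const`
class CowPtr {
    const T *original_;
    T *mutableCopy_ = nullptr;  // allocated lazily on first write

 public:
    explicit CowPtr(const T *n) : original_(n) {}

    const T *get() const { return mutableCopy_ ? mutableCopy_ : original_; }
    const T *operator->() const { return get(); }

    // First mutable access pays for the clone; later ones reuse the copy.
    T *mutate() {
        if (mutableCopy_ == nullptr) mutableCopy_ = original_->clone();
        return mutableCopy_;
    }

    bool modified() const { return mutableCopy_ != nullptr; }
};
```

A Transform reimplemented on top of something like this would hand back `original_` untouched whenever no preorder/postorder hook ever called `mutate()`, which would avoid the >96% of clones measured above that are never used.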

asl commented Nov 4, 2024

I think the questions of Visitor::clone vs. Node::clone are completely orthogonal. Both have opportunities for improvement, but in completely different ways. I still have some ideas kicking around to reduce Node cloning by implementing a smart copy-on-write pointer and using that to reimplement Transform, but it's pretty low on my priority list.

Well, yes. Though I still consider node cloning part of the Visitor overhead, as this is how Transform / Modifier are implemented. I'm currently playing with a COW approach, as excessive memory consumption / compile time of p4c recently bumped this problem up my todo list.
