Reduce Visitor overhead #4418
Could this be because of the
Interesting, seems like we would be able to improve it a lot.
Definitely, some open addressing/closed hashing scheme (without overflow buckets) would likely work better for the "large" case based on my experience.
I was thinking about this too. The big problem, I think, is the children forwarding and seeing the right children in postorder. This seems to be a reasonable requirement, but it makes the general case quite hard to implement without cloning. We could possibly have a lightweight transform for preorder-only cases, or we would need some other way of passing the children to the So far I have not found anything that would not require a major extension of the IR hierarchies' functionality though (something like a fold from functional programming -- i.e. a
Is this really a problem though? The only alternative would be RTTI-based dispatch, that would have just one huge switch (or something like that), but would that be significantly faster than two virtual calls? And virtual call would be needed to cast to the right type anyway (and to visit children).
Nice, thank you. I wonder what you have. I would definitely be interested in improvements around the maps and transform. I am a little skeptical about the double-dispatch elimination, but I will wait and see what parts you have in your codebase.
I would say even more. Some of the current passes exploit the fact that even children are "finalized" (e.g. when processing the second child they expect that the first one has already been transformed and put into the proper place in the parent node). But in that case it would indeed be better to have clones.
The current round-trip requires 4 function calls; two of them might be inlined and two of them are virtual. The CRTP approach would require one virtual call for the typeid (actually we can make it non-virtual by storing the typeid in a member, but this is a separate story). Casting is static from base to immediate derived, both for Visitor and for Node, with no virtual bases involved, so there will be at least one virtual call fewer. Plus, empty cases could potentially be inlined into that switch, simplifying it a bit more. Children are separate and orthogonal. I would say that the recursive manner produces gigantic call stacks and essentially hides the flow, making overall reasoning and profiling quite hard. That said, there are lots of possible small improvements here and there. Some of them likely contribute ~0.5% each, so the effect may only be visible combined.
One thing to start with this is to introduce a new visitor class in
Thanks for the tip! Sure, this is the way forward ;)
So, I tried to understand the logic behind
Here is the rationale for all this: in the majority of cases these But maybe I'm missing something and there is some backstory here? @vlstill @ChrisDodd @grg @fruffy do you happen to know?
I personally do not know much about the choices made for the visitors, so your guess is as good as mine. When does a visitor need to be copied, for example?
First case when we're adding However, the particular
No particular IMO all this is very misleading and could lead to subtle issues everywhere. We certainly need to stop accepting
Lots of points here
This comes from flow_clone/flow_merge, which clone the Visitor (pointing at the same tracker). Though, the way things are structured, the first copy of the visitor (which creates the tracker) should always also be the last one alive, so perhaps that could be leveraged to avoid the shared_ptr overhead if it is significant.
Yes, this is a sore point. The original C code this is based on would do that clone on the stack (creating a copy as a local var in When we were first moving to using GC, I assumed that the garbage collector's malloc would be very efficient (as it just needs to bump a pointer), so the benefit of trying to use a stack clone would be minimal.
Yes. When this code was written, it was assumed that by the time C++17 came out, the language would have multiple dispatch and implementations would catch up with implementing it efficiently. That never happened. Classic problem with C++ -- lots of things never get implemented efficiently as "no one uses them", but the only reason no one uses them is that they are inefficient, so everyone comes up with hacks to work around the problem rather than just writing clean code that expresses what it is doing.
Yes. Referring back to the original C code that inspired this code, it used a manually managed stack (fixed size array) of
Yes, this is another place where hvec_map should be an improvement.
Sadly, it is not. BDWGC is not the fastest malloc implementation itself. Also, there are additional overheads as it essentially
Intuitively I would say a PassManager should own any visitor passed to it. I can't think of a sensible scenario where a visitor is copied into a pass manager and still used independently.
There are cases when an
Ah, I definitely have used this pattern before. Although I question whether doing that within a PassManager is a good idea. The VisitorRef implementation seems a little muddy, especially considering the unclear semantics of
And actually
To add a bit more to the topic of excessive clones. The majority of these clones are pointless. For
I think the questions of Visitor::clone vs Node::clone are completely orthogonal. Both have opportunities for improvement, but in completely different ways. I still have some ideas kicking around to reduce Node cloning by implementing a smart copy-on-write pointer and using that to reimplement Transform, but it's pretty low on my priority list.
Well, yes. Though I still consider node cloning a part of the Visitor overhead, as this is how Transform / Modify are implemented. I'm currently playing with a COW approach, as this problem recently moved up my todo list due to the excessive memory consumption / compile time of p4c.
Visitors are central objects in the P4C architecture. However, the current implementation imposes lots of overhead and some inefficiencies. I would name some of those that drew my attention:
There are ~50k calls to `Inspector::init_apply` and 138k calls to `Transform::init_apply`. On real-world sources I had cases 10x larger. Each such call creates a new map to track visited nodes (just visited plus a few bits for `Inspector`, and `ChangeTracker` for `Modifier` and `Transform`). For some reason these maps are wrapped into shared_ptr's, though I do not see where they are copied; `ForwardChildren` takes a reference to it.

What does matter is that the majority of these maps are small. For those 50k calls to `Inspector` we have 10k maps of size 1 (essentially a node with no children), another 10k maps of size 2 (a single child and nothing else), and approx 45k (90% of the total) maps of size <= 16. Rehashing and the corresponding memory allocation take a visible amount of time in the overall profile. Note that `std::unordered_map` is quite malloc-heavy due to the restrictions and guarantees required by the C++ standard. The situation with `Transform`s is similar: 85% of nodes are contained in small maps (< 16 entries).

We definitely need to optimize for the small-map case, e.g. having space pre-allocated and implementing a map adapter on top of a plain vector (aka `flat_map`). A linear scan over a few vector elements won't be much slower than a hash table lookup, as the evidence shows. The number of map lookups is also quite large. We'd probably switch from `std::unordered_map`
to some more performant solution.

Transforms always clone, regardless of whether the node is changed or not. Likely this should be changed to some lazy implementation. However, it seems that some passes rely heavily on the way nodes are currently traversed and on the implementation details there.
Visitors are written in a recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.
Visitors use textbook double-dispatch, so every `apply` essentially costs two virtual function calls per node. This could be addressed with static dispatch: with pre-defined typeids we can just switch over them, eliminating one virtual function call. The second virtual function call could be eliminated via CRTP.
That said, these are not easy or isolated changes, as Visitors are everywhere :) I will be contributing patches for some of these issues. Spoiler: for some downstream code we saw a 4x reduction in compile time by removing various overheads here and there.