-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BYOC] Two helper passes for external codegen using RelayToTIR custom pass machinery #11474
Conversation
f56e498
to
3bb37aa
Compare
ci failures are unrelated |
52dcc90
to
5ece9dd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the great PR, @mbs-octoml!
It looks good to me. Especially, I like to change towards Extern
.
Just a few clarification question.
} | ||
|
||
def @tvmgen_default_cutlass_main_0(%y_0_i0: Tensor[(1600, 768), float16], %y_0_i1: Tensor[(2304, 768), float16], %y_0_i2: Tensor[(2304), float16], | ||
Inline=1, Compiler="cutlass", global_symbol="tvmgen_default_cutlass_main_0", Primitive=1) -> Tensor[(1600, 2304), float16] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this Inline=1
mean? Just want to double check if the outlining should unset this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIKT only PartitionGraph sets it, and it triggers Inline to actually do the inline. I guess at the time folks were thinking of supporting a kind of inline profit analysis separate from the inline pass itself?
The 'outlining' pass will preserve the attribute, along with all the other attributes. So, nothing stopping someone running Inline to push them back in again if that's what they really want!
The 'make extern' pass however intentionally drops all attributes since these functions are now just stubs and the body is no longer meaningful.
In an ideal world I think:
- "Compiler" only makes sense on global functions
- The graph and aot flows would respect the entire IRModule and not be restricted to seeing just 'main', hence there'd never be a need to inline anything other than for perf.
- IRModule would support the notion of extern functions (along with extern/defined constants, and runtime modules)
But I'm trying to work with what we have :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you for the detailed comment! I think it is very good to know and your approach makes sense to me.
One concern is the confusion from the name.
Especially from the people whom are not familiar with these context (like myself) might wonder why outlining does not matter with Inline
attribute.
I'd like to suggest changing either one of the names or clearly document somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sung, sorry I only just noticed your responses after the merge.
I think a comment saying 'will outline functions even if they have Inline=1' would address your concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! It's just a nit. that would be nice. Thank you!
Thanks for taking a look Sung, hope those answers help. I've had to reverse engineer this stuff but think I have it right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @mbs-octoml added some comments
… pass machinery (See https://discuss.tvm.apache.org/t/byoc-supporting-cutlass-byoc-with-collage/12796/6 for context, which in turn is part of Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md). For reasons explained in the above thread I'm moving CUTLASS to be IRModule-at-a-time external codegen using a custom RelayToTIR pass instead of the traditional function-at-a-time external codegen using a relay.ext.cutlass registered function. This means some of the rewriing done on-the-fly by LowerTEPass now needs to be done by the custom pass directly. This PR supplies two passes which ease that burden: - Before starting the CUTLASS-specific processing, make sure all "Compiler" attributed functions have unique global definitions (ie are outlined). Though functions start in this form after BYOC partitioning, under Graph and AOT compilation flows those functions are then inlined to pass through the 'codegen' keyhole which assumes the whole model is just one self-contained main function. This pass will undo that. (I gave up trying to just remove the inlining in the first place.) - After the CUTLASS-specific processing the now compiled "Compiler" attributed functions need to marked as 'extern'. The te_compiler.cc uses the "ExternalSymbol" attribute for that, but since a) the symbol name is never needed, on the presense of the attribute is significant downstream and b) "ExternalSymbol" is easy to confuse with "global_symbol", I just replaced "ExternalSymbol" with "Extern" with an Integer(1) (cf "Primitive"). The outlining pass is a little more general than necessary because it (will also) be used by Collage to rewrite the IRModule into optimally partitioned form while making maximal reuse of partition functions. Hence the abstract GlobalSymbolCache.
5ece9dd
to
f88b5a8
Compare
Thanks @areusch, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @mbs-octoml !
@tvm-bot merge |
Cannot merge, these CI jobs are not successful on f88b5a8: |
…hape functions In apache#11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps.
…hape functions In apache#11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps.
…hape functions In apache#11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps.
…hape functions In apache#11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps. Add a basic unit test.
…hape functions (#11619) In #11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps. Add a basic unit test.
…hape functions (apache#11619) In apache#11474 I got ready to switch CUTLASS from function-at-a-time to IRModule-at-a-time compilation. However my approach didn't handle dynamic shape functions, so I adjust it here. The idea is still that such passes will leave behind calls to 'extern' functions. However, converting those calls to 'call_lowered' form in MarkCompilerFunctionsAsExtern is too soon since only the TECompiler knows how to capture all the attributes necessary to support dynamic shape functions. So stop doing that in MarkCompilerFunctionsAsExtern and instead support this case properly in the TECompiler. While there try to chip away at the chronic lack of structure in te_compiler.cc. Every little bit helps. Add a basic unit test.
(See https://discuss.tvm.apache.org/t/byoc-supporting-cutlass-byoc-with-collage/12796/6 for
context, which in turn is part of Collage (https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md).
For reasons explained in the above thread I'm moving CUTLASS to be IRModule-at-a-time external codegen
using a custom RelayToTIR pass instead of the traditional function-at-a-time external codegen using
a relay.ext.cutlass registered function. This means some of the rewriting done on-the-fly by LowerTEPass now
needs to be done by the custom pass directly. This PR supplies two passes which ease that burden:
unique global definitions (ie are outlined). Though functions start in this form after BYOC partitioning,
under Graph and AOT compilation flows those functions are then inlined to pass through the 'codegen' keyhole
which assumes the whole model is just one self-contained main function. This pass will undo that. (I gave up
trying to just remove the inlining in the first place.)
'extern'. The te_compiler.cc uses the "ExternalSymbol" attribute for that. But since a) the symbol name
on that attribute is never needed, only the presence of the attribute is significant downstream and b) "ExternalSymbol"
is easy to confuse with "global_symbol", I also replace "ExternalSymbol" with "Extern" with an Integer(1) value
(cf "Primitive").
The outlining pass is a little more general than necessary because it (will also) be used by Collage to
rewrite the IRModule into optimally partitioned form while making maximal reuse of structurally equal partitioned
functions. Hence the abstract GlobalSymbolCache.