-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Relay] Remove DeviceMap from LowerTE #8788
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay did a full first pass and left some comments.
docs/langref/relay_op.rst
Outdated
@@ -215,7 +215,7 @@ This level support backpropagation of broadcast operators. It is temporary. | |||
tvm.relay.ndarray_size | |||
tvm.relay.layout_transform | |||
tvm.relay.device_copy | |||
tvm.relay.annotation.on_device | |||
tvm.relay.annotation.on_device_op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for renaming the op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that was a clion rename that went too far -- ignore.
include/tvm/runtime/logging.h
Outdated
* TVM_LOG_DEBUG="1;foo.cc=2;bar.cc=0;" | ||
* \endcode | ||
*/ | ||
bool VerboseLoggingEnabled(const char* filename, int level); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that we need Verbose logging vs. using DLOG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only advantage is you can switch it on and off per file. Eg I had to disable some existing DLOGs to not get overwhelmed, and found DLOGs I did add during development I then removed before checking since they were similarly verbose. This lets you add debugging logging at whatever detail you need to develop or fix and leave them there for the next time. Not committed, just testing if it speeds things up for me.
// This first phase moves from implicit use of compile engine, | ||
// to instead explicitly lowering the incoming IRModule, and then | ||
// performing the preexisting AOT executor code generation phase. | ||
IRModule mod = IRModule::FromExpr(func); | ||
|
||
// Make explicit which device each expression should be executed on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code will get folded away in future refactors as we flow the target/device information around?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally all the backends will treat MakeDevicesExplicit as the VM compiler does and it's just another phase. I think the 'choose the default device' should also be common (hopefully can do that in this PR, I think it's just N ways of doing the same thing over the various backends). That leaves the memory planning which I think we should tackle next.
dev.device_id = 0; | ||
dev.device_type = device_types[0]; | ||
device_context_map.insert({expr, dev}); | ||
// Make explicit which device each expression should be executed on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same Q as above, also can we put this code in some kind of helper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this device map hacking has gone away now.
@@ -452,7 +469,7 @@ class StorageAllocator : public StorageAllocaBaseVisitor { | |||
// all the storage resources available | |||
std::vector<StorageToken*> data_; | |||
/*! \brief internal prototype token map */ | |||
std::unordered_map<const ExprNode*, std::vector<StorageToken*> > prototype_; | |||
std::unordered_map<const ExprNode*, std::vector<StorageToken*>> prototype_; | |||
}; | |||
|
|||
StaticMemoryPlan GraphPlanMemory(const Function& func) { return StorageAllocator().Plan(func); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A related question should we do the Array of struct vs. Struct of array swap here? the memory plan data structure is actually kind of funky right now due to the fact it maps ops to list of storages per device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this one :-) But I had to go through all of this pretty closely to debug a breakage so it doesn't look intimidating at all.
@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device, | |||
// We only have one device-specific target. | |||
tec::TargetMap targets = {{device.device_type, target}}; | |||
|
|||
// All calls to primitives will use the unique target. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
src/relay/backend/te_compiler.cc
Outdated
// The target corresponding to the call_node expression's annotation. | ||
device = GetDevice(expr); | ||
// TODO(mbs): device id | ||
target = GetTargetFromInteger(device.device_type, targets_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will become a lookup after we refactor the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this one will go away with @mikepapadim 's #9134
src/relay/backend/vm/compiler.cc
Outdated
@@ -1041,13 +1013,31 @@ IRModule VMCompiler::OptimizeModule(IRModule mod, const TargetsMap& targets_arg, | |||
|
|||
Array<Pass> pass_seqs = relay::backend::GetPassPrefix(targets, true); | |||
|
|||
Device default_device; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for us to compute this outside the OptimizeModule, ideally this should just build the optimization pipeline and have minimal logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alas it's still there, I've run out of steam and we should centralize this and the (now reigged) default device logic in build_module.cc
TVM_REGISTER_NODE_TYPE(OnDeviceAttrs); | ||
|
||
TVM_REGISTER_GLOBAL("relay.op.annotation._make.on_device") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above Q.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bogus rename, ignore.
@@ -320,191 +326,11 @@ class AnnotatationVisitor : private ExprVisitor { | |||
Map<Expr, Integer> annotations_; | |||
}; | |||
|
|||
/* | |||
* \brief Return device allocation map based on the post order traversed graph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we replace this with a similar updated comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all dead now.
bda53e2
to
d0c7a31
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are a few comments I forgot to post from a few days ago, I'm planning on taking another look next week
"FirstOrderGradient module pass", | ||
DeprecationWarning, | ||
) | ||
warnings.warn("using transform.gradient for first-order AD is deprecated, please use the" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you got a lot of reformatting changes in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed I think.
for (const auto& a : c->args) { | ||
Depend(n, a); | ||
if (c->op == on_device_op_) { | ||
DependencyGraph::Node* body = NewNode(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this is adding a new scope if the op is on_device_
. Can you add a comment explaining this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up reverting this -- it was a bit too subtle an approach. All the "on_device" special casing is now handled explicitly by transforms which derive from one of the three (!) device aware visitor classes.
}; | ||
|
||
std::ostream& operator<<(std::ostream& os, StorageToken tok) { | ||
return os << "StorageToken: " << std::endl | ||
<< "ref_counter: " << tok.ref_counter << std::endl | ||
<< "max_bytes: " << tok.max_bytes << std::endl | ||
<< "tttype: " << tok.ttype | ||
<< "tttype: " << tok.ttype << std::endl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are printing tttype
2x here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zapped.
@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device, | |||
// We only have one device-specific target. | |||
tec::TargetMap targets = {{device.device_type, target}}; | |||
|
|||
// All calls to primitives will use the unique target. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
a4242b1
to
e7b393a
Compare
24fee01
to
fa46888
Compare
e445176
to
651b848
Compare
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
651b848
to
f5e2448
Compare
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
4fc104d
to
df99244
Compare
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
df99244
to
77477d1
Compare
…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game, I'll be peeling PRs out of that.
77477d1
to
607a49f
Compare
…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game, I'll be peeling PRs out of that.
607a49f
to
ed31ec7
Compare
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game, I'll be peeling PRs out of that.
ed31ec7
to
1bffecd
Compare
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.
3c917f0
to
dc77562
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is by far my favourite change proposed in TVM 😸 I left a few questions but I'm excited to see this landed.
* \param expr the graph. | ||
* | ||
* \return The transformed program. | ||
*/ | ||
TVM_DLL Expr ToANormalForm(const Expr& expr); | ||
TVM_DLL Expr ToANormalForm(const Optional<IRModule>& maybe_mod, const Expr& expr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optional argument should probably be second to allow for the default behaviour to be just passing expr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it again I don't need this argument -- if during conversion to ANF the pass asks for the device for a global var then it is only for the purposes of maybe wrapping the same global var with an "on_device" annotation, but that is always a no-op. I''ll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably change this to be a real pass?
@@ -1028,7 +1029,7 @@ Pass FuseOps(int fuse_opt_level) { | |||
auto max_fuse_depth = pc->GetConfig("relay.FuseOps.max_depth", Integer(kMaxFusedOps)); | |||
return Downcast<Function>(FuseOps(f, opt_level, max_fuse_depth.value(), m)); | |||
}; | |||
return CreateFunctionPass(pass_func, 1, "FuseOps", {"InferType"}); | |||
return CreateFunctionPass(pass_func, 0, "FuseOps", {"InferType"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be 1
as it's an optimisation in itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woops, I forgot to explain this change.
FuseOps is called directly in quite a few places which means it bypasses the 'skip pass' logic in the SequentialPass combinator. As part of this change I ended up just slotting FuseOps in with the rest of the passes, but that broke things since, eg, the VM and Interpreter (at least) assume FuseOps have been run. So it seemed this pass is considered opt level 0 by default.
Thanks for the encouragement @Mousius, this one has been a bit of a long slog! |
I am going to land this for sake of future pain, feel free to follow up if any other changes need to happen after this. |
We don't need the Optional<IRModule> on ToANormalForm and friends.
…ecutor This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.
…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.
apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA).
apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)
…nts (#9345) #8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)
…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in
* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic
We don't need the Optional<IRModule> on ToANormalForm and friends.
…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.
…nts (apache#9345) apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)
…che#9012) * [Relay] VLOG for finer grained control of hyper-detailed logging I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose. * [checkpoint] lints * [checkpoint] missing \n lint Gotta get my docker setup going * [checkpoint] woops, we don't support gmock.h * [checkpoint] Address Hua Jiang's comments. * [checkpoint] strlen not avail on all toolchains? * [checkpoint] Rework TVM_LOG_DEBUG spec and parser * [checkpoint] woops, forgot the static modifier on map * [checkpoint] * -> DEFAULT for wildcard. Andrew pointed out *=9 suggests foo/*.cc=9 would work but it is not supported. * [checkpoint] constexpr length * [checkpoint] length is not constexpr on all targets, reverting * [checkpoint] minimize VLOG overhead when in legacy DLOG-only mode
…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in
* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic
We don't need the Optional<IRModule> on ToANormalForm and friends.
…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.
…nts (apache#9345) apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)
[Relay] This PR switches the graph, VM and AOT executors to use the merged
device_planner.cc from #9038, thus finally removing DeviceMap from the LowerTE Pass.
transforms/device_annotation.cc (and their tests). That
includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
ContextAnalysis entry points. These are all subsumed by the
PlanDevices pass and the device aware visitors.
recover the device for every Relay sub-expression:
encoded by the device planner within "on_device" annotations and
"param_device_types"/"result_device_type" function attributes:
in favor of just asking for the device. Also removed a lot of ad-doc
encodings of the 'default' device.
lowering). Device planning is also decoupled from memory planning.
(which was the problem which kicked this series of PRs off in the first place).