[Relay] Remove DeviceMap from LowerTE #8788

mbs-octoml · 2021-08-19T00:11:04Z

[Relay] This PR switches the graph, VM and AOT executors to use the merged
device_planner.cc from #9038, thus finally removing DeviceMap from the LowerTE Pass.

We retire analysis/context_analysis.cc and
transforms/device_annotation.cc (and their tests). That
includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
ContextAnalysis entry points. These are all subsumed by the
PlanDevices pass and the device aware visitors.
The following passes now use the new 'Device Aware' visitors to
recover the device for every Relay sub-expression:
- backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
- backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
- backend/te_compiler.cc (LowerTensorExprMutator)
- transforms/memory_alloc.cc (DialectRewriter)
- backend/vm/compiler.cc (VMFunctionCompiler)
The following passes/utils must maintain the device information
encoded by the device planner within "on_device" annotations and
"param_device_types"/"result_device_type" function attributes:
- backend/vm/lambda_lift.cc (LambdaLifter)
- transforms/to_a_normal_form.cc (Fill)
- ir/expr_functior.cc (Bind)
Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
in favor of just asking for the device. Also removed a lot of ad-doc
encodings of the 'default' device.
We no longer need to run device-planning twice (before and after
lowering). Device planning is also decoupled from memory planning.
The LowerTE Pass no longer needs an expression-to-device side table
(which was the problem which kicked this series of PRs off in the first place).

jroesch

Okay did a full first pass and left some comments.

jroesch · 2021-08-20T17:37:39Z

docs/langref/relay_op.rst

@@ -215,7 +215,7 @@ This level support backpropagation of broadcast operators. It is temporary.
   tvm.relay.ndarray_size
   tvm.relay.layout_transform
   tvm.relay.device_copy
-   tvm.relay.annotation.on_device
+   tvm.relay.annotation.on_device_op


What is the reason for renaming the op?

Sorry that was a clion rename that went too far -- ignore.

jroesch · 2021-08-20T17:39:26Z

include/tvm/runtime/logging.h

+ * TVM_LOG_DEBUG="1;foo.cc=2;bar.cc=0;"
+ * \endcode
+ */
+bool VerboseLoggingEnabled(const char* filename, int level);


Is there a reason that we need Verbose logging vs. using DLOG?

The only advantage is you can switch it on and off per file. Eg I had to disable some existing DLOGs to not get overwhelmed, and found DLOGs I did add during development I then removed before checking since they were similarly verbose. This lets you add debugging logging at whatever detail you need to develop or fix and leave them there for the next time. Not committed, just testing if it speeds things up for me.

jroesch · 2021-08-20T23:52:46Z

src/relay/backend/aot_executor_codegen.cc

    // This first phase moves from implicit use of compile engine,
    // to instead explicitly lowering the incoming IRModule, and then
    // performing the preexisting AOT executor code generation phase.
    IRModule mod = IRModule::FromExpr(func);
+
+    // Make explicit which device each expression should be executed on.


This code will get folded away in future refactors as we flow the target/device information around?

Ideally all the backends will treat MakeDevicesExplicit as the VM compiler does and it's just another phase. I think the 'choose the default device' should also be common (hopefully can do that in this PR, I think it's just N ways of doing the same thing over the various backends). That leaves the memory planning which I think we should tackle next.

jroesch · 2021-08-20T23:54:43Z

src/relay/backend/graph_executor_codegen.cc

-      dev.device_id = 0;
-      dev.device_type = device_types[0];
-      device_context_map.insert({expr, dev});
+    // Make explicit which device each expression should be executed on.


Same Q as above, also can we put this code in some kind of helper?

Most of this device map hacking has gone away now.

jroesch · 2021-08-20T23:55:57Z

src/relay/backend/graph_plan_memory.cc

@@ -452,7 +469,7 @@ class StorageAllocator : public StorageAllocaBaseVisitor {
  // all the storage resources available
  std::vector<StorageToken*> data_;
  /*! \brief internal prototype token map */
-  std::unordered_map<const ExprNode*, std::vector<StorageToken*> > prototype_;
+  std::unordered_map<const ExprNode*, std::vector<StorageToken*>> prototype_;
 };

 StaticMemoryPlan GraphPlanMemory(const Function& func) { return StorageAllocator().Plan(func); }


A related question should we do the Array of struct vs. Struct of array swap here? the memory plan data structure is actually kind of funky right now due to the fact it maps ops to list of storages per device.

Not in this one :-) But I had to go through all of this pretty closely to debug a breakage so it doesn't look intimidating at all.

jroesch · 2021-08-20T23:56:24Z

src/relay/backend/interpreter.cc

@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device,
  // We only have one device-specific target.
  tec::TargetMap targets = {{device.device_type, target}};

-  // All calls to primitives will use the unique target.


This is great.

jroesch · 2021-08-20T23:57:32Z

src/relay/backend/te_compiler.cc

+      // The target corresponding to the call_node expression's annotation.
+      device = GetDevice(expr);
+      // TODO(mbs): device id
+      target = GetTargetFromInteger(device.device_type, targets_);


This will become a lookup after we refactor the map?

I think this one will go away with @mikepapadim 's #9134

jroesch · 2021-08-21T00:03:15Z

src/relay/backend/vm/compiler.cc

@@ -1041,13 +1013,31 @@ IRModule VMCompiler::OptimizeModule(IRModule mod, const TargetsMap& targets_arg,

  Array<Pass> pass_seqs = relay::backend::GetPassPrefix(targets, true);

+  Device default_device;


Is it possible for us to compute this outside the OptimizeModule, ideally this should just build the optimization pipeline and have minimal logic.

Alas it's still there, I've run out of steam and we should centralize this and the (now reigged) default device logic in build_module.cc

jroesch · 2021-08-21T00:06:09Z

src/relay/op/annotation/annotation.cc

 TVM_REGISTER_NODE_TYPE(OnDeviceAttrs);

-TVM_REGISTER_GLOBAL("relay.op.annotation._make.on_device")


Same as above Q.

Bogus rename, ignore.

jroesch · 2021-08-21T00:10:43Z

src/relay/transforms/device_annotation.cc

@@ -320,191 +326,11 @@ class AnnotatationVisitor : private ExprVisitor {
  Map<Expr, Integer> annotations_;
 };

-/*
- * \brief Return device allocation map based on the post order traversed graph.


Should we replace this with a similar updated comment?

This is all dead now.

electriclilies

Here are a few comments I forgot to post from a few days ago, I'm planning on taking another look next week

electriclilies · 2021-08-24T18:59:29Z

python/tvm/relay/transform/transform.py

-            "FirstOrderGradient module pass",
-            DeprecationWarning,
-        )
+        warnings.warn("using transform.gradient for first-order AD is deprecated, please use the"


Looks like you got a lot of reformatting changes in this file

Fixed I think.

electriclilies · 2021-08-24T19:12:26Z

src/relay/analysis/dependency_graph.cc

-    for (const auto& a : c->args) {
-      Depend(n, a);
+    if (c->op == on_device_op_) {
+      DependencyGraph::Node* body = NewNode(true);


If I understand correctly, this is adding a new scope if the op is on_device_. Can you add a comment explaining this?

Ended up reverting this -- it was a bit too subtle an approach. All the "on_device" special casing is now handled explicitly by transforms which derive from one of the three (!) device aware visitor classes.

electriclilies · 2021-08-24T20:08:37Z

src/relay/backend/graph_plan_memory.cc

 };

 std::ostream& operator<<(std::ostream& os, StorageToken tok) {
  return os << "StorageToken: " << std::endl
            << "ref_counter: " << tok.ref_counter << std::endl
            << "max_bytes: " << tok.max_bytes << std::endl
-            << "tttype: " << tok.ttype
+            << "tttype: " << tok.ttype << std::endl


Looks like you are printing tttype 2x here

electriclilies · 2021-08-24T22:54:56Z

src/relay/backend/interpreter.cc

@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device,
  // We only have one device-specific target.
  tec::TargetMap targets = {{device.device_type, target}};

-  // All calls to primitives will use the unique target.


I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.

…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game, I'll be peeling PRs out of that.

I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.

…tation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game, I'll be peeling PRs out of that.

I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose.

Mousius

This is by far my favourite change proposed in TVM 😸 I left a few questions but I'm excited to see this landed.

Mousius · 2021-10-04T09:25:50Z

include/tvm/relay/transform.h

 * \param expr the graph.
 *
 * \return The transformed program.
 */
-TVM_DLL Expr ToANormalForm(const Expr& expr);
+TVM_DLL Expr ToANormalForm(const Optional<IRModule>& maybe_mod, const Expr& expr);


The optional argument should probably be second to allow for the default behaviour to be just passing expr ?

Now that I think about it again I don't need this argument -- if during conversion to ANF the pass asks for the device for a global var then it is only for the purposes of maybe wrapping the same global var with an "on_device" annotation, but that is always a no-op. I''ll remove it.

We should probably change this to be a real pass?

Mousius · 2021-10-04T10:00:34Z

src/relay/transforms/fuse_ops.cc

@@ -1028,7 +1029,7 @@ Pass FuseOps(int fuse_opt_level) {
        auto max_fuse_depth = pc->GetConfig("relay.FuseOps.max_depth", Integer(kMaxFusedOps));
        return Downcast<Function>(FuseOps(f, opt_level, max_fuse_depth.value(), m));
      };
-  return CreateFunctionPass(pass_func, 1, "FuseOps", {"InferType"});
+  return CreateFunctionPass(pass_func, 0, "FuseOps", {"InferType"});


Shouldn't this be 1 as it's an optimisation in itself?

Woops, I forgot to explain this change.
FuseOps is called directly in quite a few places which means it bypasses the 'skip pass' logic in the SequentialPass combinator. As part of this change I ended up just slotting FuseOps in with the rest of the passes, but that broke things since, eg, the VM and Interpreter (at least) assume FuseOps have been run. So it seemed this pass is considered opt level 0 by default.

mbs-octoml · 2021-10-04T16:22:13Z

Thanks for the encouragement @Mousius, this one has been a bit of a long slog!

jroesch · 2021-10-04T19:17:58Z

I am going to land this for sake of future pain, feel free to follow up if any other changes need to happen after this.

mbs-octoml · 2021-10-04T19:47:22Z

Thanks @jroesch. @Mousius I didn't push the changes for your comments before the merge but will get them in somehow.

We don't need the Optional<IRModule> on ToANormalForm and friends.

…ecutor This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.

…#9227) This tutorial disabled the FuseOps pass, but before #8788 that was ignored since FuseOps was applied directly.

…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.

apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA).

apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)

…nts (#9345) #8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)

…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in

* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic

We don't need the Optional<IRModule> on ToANormalForm and friends.

…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.

…nts (apache#9345) apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)

…che#9012) * [Relay] VLOG for finer grained control of hyper-detailed logging I've been making very heavy use of DLOG and PrettyPrint to trace, understand and debug Relay transforms. I've found myself deleting log statements to reduce the output verbosity, only to have to re-add them a few days later. Better would be to support leaving all those statements in place but have finer control over when they are enabled. This PR introduces a 'VLOG(level)' macro to that end. The log is ignored unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the the current verbosity level is >= level. The current verbosity level can be set globally and/or overridden per source file (see 'VerboseLoggingEnabled'). (Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.) I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal per-thread stack. Each VLOG message includes that stack as its prefix, which is a very handy way to keep track of the (often recursive) program context in which each VLOG is executed. I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them alone for now. See the draft apache#8788 for use in the wild. I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so fixed that. I've also made changes to the Relay text printer to dump attributes in a human readable format rather than the indirect but machine readable 'meta' representation. This is gated by the show_meta_data_ flag, and I think this use is consistent with it's original purpose. * [checkpoint] lints * [checkpoint] missing \n lint Gotta get my docker setup going * [checkpoint] woops, we don't support gmock.h * [checkpoint] Address Hua Jiang's comments. * [checkpoint] strlen not avail on all toolchains? * [checkpoint] Rework TVM_LOG_DEBUG spec and parser * [checkpoint] woops, forgot the static modifier on map * [checkpoint] * -> DEFAULT for wildcard. Andrew pointed out *=9 suggests foo/*.cc=9 would work but it is not supported. * [checkpoint] constexpr length * [checkpoint] length is not constexpr on all targets, reverting * [checkpoint] minimize VLOG overhead when in legacy DLOG-only mode

…tation.cc (apache#9038) * [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it depends on a side-input DeviceMap. We'd like to remove that side-input, and instead recover the Device (and, ultimately, Target) for each (fused) primitive call from the AST alone. By doing so we also avoid needing to perform device planning twice: - It needs to be done before lowering so we know which primitives need to be compiled for which devices. - It then needs to be re-done after lowering and optimization as a prelude to memory planning. By baking the device plan into the AST we can simply do device planning before lowering, and run memory planning later, both as ordinary passes. While working on that issue we realized we currently have 3 'device planners': - transforms/device_annotation.cc, which supports only a small subset of Relay and uses a simple top-down algorithm to assign a device to every sub-expression. - analysis/context_analysis.cc, which makes a galant effort to support most of Relay, is based on unification rather than a top-down algorithm, but handles higher order functions by ad-hoc and fragile inlining. - transforms/annotate_target.cc, which works on Targets instead of Devices, but is otherwise like 'device planning'. We'd like to bring these together. In this PR we introduce a new transforms/device_planner.cc intended to replace transforms/device_annotation.cc and analysis/context_analysis.cc. We don't delete those two just yet since we need to switch all users off of them in the next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC to bring devices and targets together sensibly, but have it firmly in our sights. transforms/device_planner.cc is based on analysis/context_analysis.cc, but is heavily reworked to: 1. Handle starting from existing "on_device" annotations as well as existing "device_copy" calls. 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine' device planning to account for storge scopes. 3. Robustly handle all of Relay, particularly higher-order functions. For that we replace the inlining approach in analysis/context_analysis.cc with a higher-order unification domain. 4. Be a little more systematic with defaulting. 5. Capture the result of the analysis within the AST as new "device_copy" calls at device boundaries, and new/replaced "on_device" calls wherever the device for a sub-expression is not already 'obvious' from the sub-expression's lexical scope. 6. Provide helper visitors for passes which need to ask for the device for any sub-expression they are processing and/or preserve device information on rewrites. Those passes include: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/memory_alloc.cc (DialectRewriter) - transforms/to_a_normal_form.cc (Fill) - backend/vm/compiler.cc (VMFunctionCompiler) However we won't change any of those in this PR. See the draft apache#8788 for the end game. * [checkpoint] Use Relay script for all unit tests. * [checkpoint] Hoist out DeviceDomain and DeviceDomains. * [checkpoint] Hoist out visitors * [checkpoint] Woops, left debug-only code in

* [Relay] Switch the graph, VM and AOT executors to use the merged device_planner.cc from apache#9038, and finally remove DeviceMap from the LowerTE Pass. - We retire analysis/context_analysis.cc and transforms/device_annotation.cc (and their tests). That includes the CollectDeviceInfo, CollectDeviceAnnotationOps and ContextAnalysis entry points. These are all subsumed by the PlanDevices pass and the device aware visitors. - The following passes now use the new 'Device Aware' visitors to recover the device for every Relay sub-expression: - backend/aot_executor_codegen.cc (AOTOnDemandAllocator) - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc) - backend/te_compiler.cc (LowerTensorExprMutator) - transforms/memory_alloc.cc (DialectRewriter) - backend/vm/compiler.cc (VMFunctionCompiler) - The following passes/utils must maintain the device information encoded by the device planner within "on_device" annotations and "param_device_types"/"result_device_type" function attributes: - backend/vm/lambda_lift.cc (LambdaLifter) - transforms/to_a_normal_form.cc (Fill) - ir/expr_functior.cc (Bind) - Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals in favor of just asking for the device. Also removed a lot of ad-doc encodings of the 'default' device. - We no longer need to run device-planning twice (before and after lowering). Device planning is also decoupled from memory planning. - The LowerTE Pass no longer needs an expression-to-device side table (which was the problem which kicked this series of PRs off in the first place). * [checkpoint] Revert unnecessary changes - Started down multi-target handling in interpreter but didn't finish - Some one-off debug stuff * [checkpoint] TODO's for default device logic

We don't need the Optional<IRModule> on ToANormalForm and friends.

…ecutor (apache#9227) This tutorial disabled the FuseOps pass, but before apache#8788 that was ignored since FuseOps was applied directly.

…nts (apache#9345) apache#8788 introduced a perf regression since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing due to the extra `on_device` annotation on the constant. Fixed that, and introduced some helpers to make this situation easier to deal with. (This is CORE-102 in OctoML JIRA). (Second try -- test_crp.py failure seems unrelated)

jroesch reviewed Aug 21, 2021

View reviewed changes

mbs-octoml force-pushed the mbs-memory-planning branch 6 times, most recently from bda53e2 to d0c7a31 Compare August 27, 2021 23:13

electriclilies reviewed Aug 27, 2021

View reviewed changes

mbs-octoml force-pushed the mbs-memory-planning branch 2 times, most recently from a4242b1 to e7b393a Compare September 2, 2021 19:48

mbs-octoml force-pushed the mbs-memory-planning branch 2 times, most recently from 24fee01 to fa46888 Compare September 10, 2021 17:08

mbs-octoml force-pushed the mbs-memory-planning branch from e445176 to 651b848 Compare September 14, 2021 18:54

mbs-octoml mentioned this pull request Sep 14, 2021

[Relay] VLOG for finer grained control of hyper-detailed logging #9012

Merged

mbs-octoml force-pushed the mbs-memory-planning branch from 651b848 to f5e2448 Compare September 15, 2021 21:16

mbs-octoml force-pushed the mbs-memory-planning branch from 4fc104d to df99244 Compare September 16, 2021 22:56

mbs-octoml force-pushed the mbs-memory-planning branch from df99244 to 77477d1 Compare September 17, 2021 15:42

mbs-octoml mentioned this pull request Sep 17, 2021

[Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc #9038

Merged

mbs-octoml force-pushed the mbs-memory-planning branch from 77477d1 to 607a49f Compare September 17, 2021 21:40

mbs-octoml force-pushed the mbs-memory-planning branch from 607a49f to ed31ec7 Compare September 17, 2021 23:52

mbs-octoml force-pushed the mbs-memory-planning branch from ed31ec7 to 1bffecd Compare September 20, 2021 17:49

mbs-octoml force-pushed the mbs-memory-planning branch from 3c917f0 to dc77562 Compare October 2, 2021 13:32

Mousius approved these changes Oct 4, 2021

View reviewed changes

jroesch approved these changes Oct 4, 2021

View reviewed changes

jroesch merged commit 779a506 into apache:main Oct 4, 2021

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 4, 2021

Address Christopher's comments from apache#8788

17aff1c

We don't need the Optional<IRModule> on ToANormalForm and friends.

mbs-octoml mentioned this pull request Oct 5, 2021

[Relay] Remove unnecessary Optional<IRModule> argument to ToANormalForm and friends #9197

Merged

masahi pushed a commit that referenced this pull request Oct 6, 2021

Address Christopher's comments from #8788 (#9197)

01771ab

We don't need the Optional<IRModule> on ToANormalForm and friends.

mbs-octoml mentioned this pull request Oct 7, 2021

BUG #9216: Don't disable FuseOps pass since required by GraphExecutor #9227

Merged

masahi pushed a commit that referenced this pull request Oct 8, 2021

BUG #9216: Don't disable FuseOps pass since required by GraphExecutor (…

88b2be8

…#9227) This tutorial disabled the FuseOps pass, but before #8788 that was ignored since FuseOps was applied directly.

mbs-octoml mentioned this pull request Oct 8, 2021

Documentation Refactor #9203

Merged

mbs-octoml mentioned this pull request Oct 21, 2021

BUG: Look through on_device annotations when looking for shape constants #9345

Merged

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022

Address Christopher's comments from apache#8788 (apache#9197)

647ec47

We don't need the Optional<IRModule> on ToANormalForm and friends.

ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022

Address Christopher's comments from apache#8788 (apache#9197)

72bbca1

We don't need the Optional<IRModule> on ToANormalForm and friends.

		@@ -1041,13 +1013,31 @@ IRModule VMCompiler::OptimizeModule(IRModule mod, const TargetsMap& targets_arg,

		Array<Pass> pass_seqs = relay::backend::GetPassPrefix(targets, true);

		Device default_device;

		TVM_REGISTER_NODE_TYPE(OnDeviceAttrs);

		TVM_REGISTER_GLOBAL("relay.op.annotation._make.on_device")

[Relay] Remove DeviceMap from LowerTE #8788

[Relay] Remove DeviceMap from LowerTE #8788

Conversation

mbs-octoml commented Aug 19, 2021 • edited Loading

jroesch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

electriclilies left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Mousius left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbs-octoml Oct 4, 2021 • edited Loading

Choose a reason for hiding this comment

mbs-octoml commented Oct 4, 2021

jroesch commented Oct 4, 2021

mbs-octoml commented Oct 4, 2021

mbs-octoml commented Aug 19, 2021 •

edited

Loading

mbs-octoml Oct 4, 2021 •

edited

Loading