Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Relay] Remove DeviceMap from LowerTE #8788

Merged
merged 3 commits into from
Oct 4, 2021

Conversation

mbs-octoml
Copy link
Contributor

@mbs-octoml mbs-octoml commented Aug 19, 2021

[Relay] This PR switches the graph, VM and AOT executors to use the merged
device_planner.cc from #9038, thus finally removing DeviceMap from the LowerTE Pass.

  • We retire analysis/context_analysis.cc and
    transforms/device_annotation.cc (and their tests). That
    includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
    ContextAnalysis entry points. These are all subsumed by the
    PlanDevices pass and the device aware visitors.
  • The following passes now use the new 'Device Aware' visitors to
    recover the device for every Relay sub-expression:
    • backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
    • backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
    • backend/te_compiler.cc (LowerTensorExprMutator)
    • transforms/memory_alloc.cc (DialectRewriter)
    • backend/vm/compiler.cc (VMFunctionCompiler)
  • The following passes/utils must maintain the device information
    encoded by the device planner within "on_device" annotations and
    "param_device_types"/"result_device_type" function attributes:
    • backend/vm/lambda_lift.cc (LambdaLifter)
    • transforms/to_a_normal_form.cc (Fill)
    • ir/expr_functior.cc (Bind)
  • Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
    in favor of just asking for the device. Also removed a lot of ad-doc
    encodings of the 'default' device.
  • We no longer need to run device-planning twice (before and after
    lowering). Device planning is also decoupled from memory planning.
  • The LowerTE Pass no longer needs an expression-to-device side table
    (which was the problem which kicked this series of PRs off in the first place).

Copy link
Member

@jroesch jroesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay did a full first pass and left some comments.

@@ -215,7 +215,7 @@ This level support backpropagation of broadcast operators. It is temporary.
tvm.relay.ndarray_size
tvm.relay.layout_transform
tvm.relay.device_copy
tvm.relay.annotation.on_device
tvm.relay.annotation.on_device_op
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for renaming the op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that was a clion rename that went too far -- ignore.

* TVM_LOG_DEBUG="1;foo.cc=2;bar.cc=0;"
* \endcode
*/
bool VerboseLoggingEnabled(const char* filename, int level);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that we need Verbose logging vs. using DLOG?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only advantage is you can switch it on and off per file. Eg I had to disable some existing DLOGs to not get overwhelmed, and found DLOGs I did add during development I then removed before checking since they were similarly verbose. This lets you add debugging logging at whatever detail you need to develop or fix and leave them there for the next time. Not committed, just testing if it speeds things up for me.

// This first phase moves from implicit use of compile engine,
// to instead explicitly lowering the incoming IRModule, and then
// performing the preexisting AOT executor code generation phase.
IRModule mod = IRModule::FromExpr(func);

// Make explicit which device each expression should be executed on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will get folded away in future refactors as we flow the target/device information around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally all the backends will treat MakeDevicesExplicit as the VM compiler does and it's just another phase. I think the 'choose the default device' should also be common (hopefully can do that in this PR, I think it's just N ways of doing the same thing over the various backends). That leaves the memory planning which I think we should tackle next.

dev.device_id = 0;
dev.device_type = device_types[0];
device_context_map.insert({expr, dev});
// Make explicit which device each expression should be executed on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same Q as above, also can we put this code in some kind of helper?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of this device map hacking has gone away now.

@@ -452,7 +469,7 @@ class StorageAllocator : public StorageAllocaBaseVisitor {
// all the storage resources available
std::vector<StorageToken*> data_;
/*! \brief internal prototype token map */
std::unordered_map<const ExprNode*, std::vector<StorageToken*> > prototype_;
std::unordered_map<const ExprNode*, std::vector<StorageToken*>> prototype_;
};

StaticMemoryPlan GraphPlanMemory(const Function& func) { return StorageAllocator().Plan(func); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A related question should we do the Array of struct vs. Struct of array swap here? the memory plan data structure is actually kind of funky right now due to the fact it maps ops to list of storages per device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in this one :-) But I had to go through all of this pretty closely to debug a breakage so it doesn't look intimidating at all.

@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device,
// We only have one device-specific target.
tec::TargetMap targets = {{device.device_type, target}};

// All calls to primitives will use the unique target.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

// The target corresponding to the call_node expression's annotation.
device = GetDevice(expr);
// TODO(mbs): device id
target = GetTargetFromInteger(device.device_type, targets_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will become a lookup after we refactor the map?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one will go away with @mikepapadim 's #9134

@@ -1041,13 +1013,31 @@ IRModule VMCompiler::OptimizeModule(IRModule mod, const TargetsMap& targets_arg,

Array<Pass> pass_seqs = relay::backend::GetPassPrefix(targets, true);

Device default_device;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for us to compute this outside the OptimizeModule, ideally this should just build the optimization pipeline and have minimal logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas it's still there, I've run out of steam and we should centralize this and the (now reigged) default device logic in build_module.cc

TVM_REGISTER_NODE_TYPE(OnDeviceAttrs);

TVM_REGISTER_GLOBAL("relay.op.annotation._make.on_device")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above Q.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bogus rename, ignore.

@@ -320,191 +326,11 @@ class AnnotatationVisitor : private ExprVisitor {
Map<Expr, Integer> annotations_;
};

/*
* \brief Return device allocation map based on the post order traversed graph.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we replace this with a similar updated comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all dead now.

@mbs-octoml mbs-octoml force-pushed the mbs-memory-planning branch 6 times, most recently from bda53e2 to d0c7a31 Compare August 27, 2021 23:13
Copy link
Contributor

@electriclilies electriclilies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few comments I forgot to post from a few days ago, I'm planning on taking another look next week

"FirstOrderGradient module pass",
DeprecationWarning,
)
warnings.warn("using transform.gradient for first-order AD is deprecated, please use the"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you got a lot of reformatting changes in this file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed I think.

for (const auto& a : c->args) {
Depend(n, a);
if (c->op == on_device_op_) {
DependencyGraph::Node* body = NewNode(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is adding a new scope if the op is on_device_. Can you add a comment explaining this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up reverting this -- it was a bit too subtle an approach. All the "on_device" special casing is now handled explicitly by transforms which derive from one of the three (!) device aware visitor classes.

};

std::ostream& operator<<(std::ostream& os, StorageToken tok) {
return os << "StorageToken: " << std::endl
<< "ref_counter: " << tok.ref_counter << std::endl
<< "max_bytes: " << tok.max_bytes << std::endl
<< "tttype: " << tok.ttype
<< "tttype: " << tok.ttype << std::endl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you are printing tttype 2x here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zapped.

@@ -913,17 +917,14 @@ std::pair<IRModule, Map<String, IRModule>> Prepare(IRModule mod, Device device,
// We only have one device-specific target.
tec::TargetMap targets = {{device.device_type, target}};

// All calls to primitives will use the unique target.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay!

@mbs-octoml mbs-octoml force-pushed the mbs-memory-planning branch 2 times, most recently from a4242b1 to e7b393a Compare September 2, 2021 19:48
@mbs-octoml mbs-octoml force-pushed the mbs-memory-planning branch 2 times, most recently from 24fee01 to fa46888 Compare September 10, 2021 17:08
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 14, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 14, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 16, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 17, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 17, 2021
…tation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game, I'll
be peeling PRs out of that.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 17, 2021
…tation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game, I'll
be peeling PRs out of that.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 18, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 20, 2021
…tation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game, I'll
be peeling PRs out of that.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Sep 20, 2021
I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.
@mbs-octoml mbs-octoml force-pushed the mbs-memory-planning branch from 3c917f0 to dc77562 Compare October 2, 2021 13:32
Copy link
Member

@Mousius Mousius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by far my favourite change proposed in TVM 😸 I left a few questions but I'm excited to see this landed.

* \param expr the graph.
*
* \return The transformed program.
*/
TVM_DLL Expr ToANormalForm(const Expr& expr);
TVM_DLL Expr ToANormalForm(const Optional<IRModule>& maybe_mod, const Expr& expr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optional argument should probably be second to allow for the default behaviour to be just passing expr ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it again I don't need this argument -- if during conversion to ANF the pass asks for the device for a global var then it is only for the purposes of maybe wrapping the same global var with an "on_device" annotation, but that is always a no-op. I''ll remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably change this to be a real pass?

@@ -1028,7 +1029,7 @@ Pass FuseOps(int fuse_opt_level) {
auto max_fuse_depth = pc->GetConfig("relay.FuseOps.max_depth", Integer(kMaxFusedOps));
return Downcast<Function>(FuseOps(f, opt_level, max_fuse_depth.value(), m));
};
return CreateFunctionPass(pass_func, 1, "FuseOps", {"InferType"});
return CreateFunctionPass(pass_func, 0, "FuseOps", {"InferType"});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 1 as it's an optimisation in itself?

Copy link
Contributor Author

@mbs-octoml mbs-octoml Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, I forgot to explain this change.
FuseOps is called directly in quite a few places which means it bypasses the 'skip pass' logic in the SequentialPass combinator. As part of this change I ended up just slotting FuseOps in with the rest of the passes, but that broke things since, eg, the VM and Interpreter (at least) assume FuseOps have been run. So it seemed this pass is considered opt level 0 by default.

@mbs-octoml
Copy link
Contributor Author

Thanks for the encouragement @Mousius, this one has been a bit of a long slog!

@jroesch
Copy link
Member

jroesch commented Oct 4, 2021

I am going to land this for sake of future pain, feel free to follow up if any other changes need to happen after this.

@jroesch jroesch merged commit 779a506 into apache:main Oct 4, 2021
@mbs-octoml
Copy link
Contributor Author

Thanks @jroesch. @Mousius I didn't push the changes for your comments before the merge but will get them in somehow.

mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 4, 2021
We don't need the Optional<IRModule> on ToANormalForm and friends.
masahi pushed a commit that referenced this pull request Oct 6, 2021
We don't need the Optional<IRModule> on ToANormalForm and friends.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 7, 2021
…ecutor

This tutorial disabled the FuseOps pass, but before apache#8788 that was
ignored since FuseOps was applied directly.
masahi pushed a commit that referenced this pull request Oct 8, 2021
…#9227)

This tutorial disabled the FuseOps pass, but before #8788 that was
ignored since FuseOps was applied directly.
masahi pushed a commit to Laurawly/tvm-1 that referenced this pull request Oct 14, 2021
…ecutor (apache#9227)

This tutorial disabled the FuseOps pass, but before apache#8788 that was
ignored since FuseOps was applied directly.
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 21, 2021
apache#8788 introduced a perf regression
since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing
due to the extra `on_device` annotation on the constant. Fixed that,
and introduced some helpers to make this situation easier to deal with.

(This is CORE-102 in OctoML JIRA).
mbs-octoml added a commit to mbs-octoml/mbs-tvm that referenced this pull request Oct 22, 2021
apache#8788 introduced a perf regression
since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing
due to the extra `on_device` annotation on the constant. Fixed that,
and introduced some helpers to make this situation easier to deal with.

(This is CORE-102 in OctoML JIRA).

(Second try -- test_crp.py failure seems unrelated)
masahi pushed a commit that referenced this pull request Oct 22, 2021
…nts (#9345)

#8788 introduced a perf regression
since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing
due to the extra `on_device` annotation on the constant. Fixed that,
and introduced some helpers to make this situation easier to deal with.

(This is CORE-102 in OctoML JIRA).

(Second try -- test_crp.py failure seems unrelated)
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…tation.cc (apache#9038)

* [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game.

* [checkpoint] Use Relay script for all unit tests.

* [checkpoint] Hoist out DeviceDomain and DeviceDomains.

* [checkpoint] Hoist out visitors

* [checkpoint] Woops, left debug-only code in
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* [Relay] Switch the graph, VM and AOT executors to use the merged
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).

* [checkpoint] Revert unnecessary changes

- Started down multi-target handling in interpreter but didn't finish
- Some one-off debug stuff

* [checkpoint] TODO's for default device logic
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
We don't need the Optional<IRModule> on ToANormalForm and friends.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…ecutor (apache#9227)

This tutorial disabled the FuseOps pass, but before apache#8788 that was
ignored since FuseOps was applied directly.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
…nts (apache#9345)

apache#8788 introduced a perf regression
since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing
due to the extra `on_device` annotation on the constant. Fixed that,
and introduced some helpers to make this situation easier to deal with.

(This is CORE-102 in OctoML JIRA).

(Second try -- test_crp.py failure seems unrelated)
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…che#9012)

* [Relay] VLOG for finer grained control of hyper-detailed logging

I've been making very heavy use of DLOG and PrettyPrint to trace, understand and
debug Relay transforms. I've found myself deleting log statements to reduce the
output verbosity, only to have to re-add them a few days later. Better would be to
support leaving all those statements in place but have finer control over when they
are enabled.

This PR introduces a 'VLOG(level)' macro to that end. The log is ignored
unless TVM_LOG_DEBUG is enabled (both as #define and an environment var), and the
the current verbosity level is >= level. The current verbosity level can be set
globally and/or overridden per source file (see 'VerboseLoggingEnabled').

(Those familiar with the origin of the LOG and DLOG family will also recognize VLOG.)

I also introduce a 'VLOG_CONTEXT' macro which pushes a string onto an internal
per-thread stack. Each VLOG message includes that stack as its prefix, which is a
very handy way to keep track of the (often recursive) program context in which each
VLOG is executed.

I've rejigged some existing DLOGs to VLOGs to illustrate, but left most of them
alone for now.  See the draft apache#8788 for use in the
wild.

I noticed the DCHECK macros *disabled* instead enabled with TVM_LOG_DEBUG defined, so
fixed that.

I've also made changes to the Relay text printer to dump attributes in a human
readable format rather than the indirect but machine readable 'meta' representation.
This is gated by the show_meta_data_ flag, and I think this use is consistent with it's
original purpose.

* [checkpoint] lints

* [checkpoint] missing \n lint

Gotta get my docker setup going

* [checkpoint] woops, we don't support gmock.h

* [checkpoint] Address Hua Jiang's comments.

* [checkpoint] strlen not avail on all toolchains?

* [checkpoint] Rework TVM_LOG_DEBUG spec and parser

* [checkpoint] woops, forgot the static modifier on map

* [checkpoint] * -> DEFAULT for wildcard.

Andrew pointed out *=9 suggests foo/*.cc=9 would work but it is not supported.

* [checkpoint] constexpr length

* [checkpoint] length is not constexpr on all targets, reverting

* [checkpoint] minimize VLOG overhead when in legacy DLOG-only mode
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…tation.cc (apache#9038)

* [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc

Currently LowerTEPass (backend/te_compiler.cc) is a 'special' pass because it
depends on a side-input DeviceMap. We'd like to remove that side-input, and
instead recover the Device (and, ultimately, Target) for each (fused) primitive
call from the AST alone.

By doing so we also avoid needing to perform device planning twice:
 - It needs to be done before lowering so we know which primitives need
   to be compiled for which devices.
 - It then needs to be re-done after lowering and optimization as a prelude
   to memory planning.
By baking the device plan into the AST we can simply do device planning before
lowering, and run memory planning later, both as ordinary passes.

While working on that issue we realized we currently have 3 'device planners':
 - transforms/device_annotation.cc, which supports only a small subset of Relay
   and uses a simple top-down algorithm to assign a device to every
   sub-expression.
 - analysis/context_analysis.cc, which makes a galant effort to support most of
   Relay, is based on unification rather than a top-down algorithm, but handles
   higher order functions by ad-hoc and fragile inlining.
 - transforms/annotate_target.cc, which works on Targets instead of Devices, but
   is otherwise like 'device planning'.
We'd like to bring these together.

In this PR we introduce a new transforms/device_planner.cc intended to replace
transforms/device_annotation.cc and analysis/context_analysis.cc. We don't
delete those two just yet since we need to switch all users off of them in the
next PR. We also leave transforms/annotate_target.cc alone pending a proper RFC
to bring devices and targets together sensibly, but have it firmly in our
sights.

transforms/device_planner.cc is based on analysis/context_analysis.cc, but
is heavily reworked to:
 1. Handle starting from existing "on_device" annotations as well as existing
    "device_copy" calls.
 2. Be idempotent, with the idea we'll probably need to re-run it to 'refine'
    device planning to account for storge scopes.
 3. Robustly handle all of Relay, particularly higher-order functions. For that
    we replace the inlining approach in analysis/context_analysis.cc with a
    higher-order unification domain.
 4. Be a little more systematic with defaulting.
 5. Capture the result of the analysis within the AST as new "device_copy" calls
    at device boundaries, and new/replaced "on_device" calls wherever the device
    for a sub-expression is not already 'obvious' from the sub-expression's
    lexical scope.
 6. Provide helper visitors for passes which need to ask for the device for
    any sub-expression they are processing and/or preserve device information
    on rewrites. Those passes include:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/memory_alloc.cc (DialectRewriter)
     - transforms/to_a_normal_form.cc (Fill)
     - backend/vm/compiler.cc (VMFunctionCompiler)
    However we won't change any of those in this PR.

See the draft apache#8788 for the end game.

* [checkpoint] Use Relay script for all unit tests.

* [checkpoint] Hoist out DeviceDomain and DeviceDomains.

* [checkpoint] Hoist out visitors

* [checkpoint] Woops, left debug-only code in
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* [Relay] Switch the graph, VM and AOT executors to use the merged
device_planner.cc from apache#9038, and finally remove DeviceMap from the
LowerTE Pass.

- We retire analysis/context_analysis.cc and
  transforms/device_annotation.cc (and their tests). That
  includes the CollectDeviceInfo, CollectDeviceAnnotationOps and
  ContextAnalysis entry points. These are all subsumed by the
  PlanDevices pass and the device aware visitors.
- The following passes now use the new 'Device Aware' visitors to
  recover the device for every Relay sub-expression:
     - backend/aot_executor_codegen.cc (AOTOnDemandAllocator)
     - backend/graph_plan_memory.cc (StorageAllocaBaseVisitor etc)
     - backend/te_compiler.cc (LowerTensorExprMutator)
     - transforms/memory_alloc.cc (DialectRewriter)
     - backend/vm/compiler.cc (VMFunctionCompiler)
- The following passes/utils must maintain the device information
  encoded by the device planner within "on_device" annotations and
  "param_device_types"/"result_device_type" function attributes:
     - backend/vm/lambda_lift.cc (LambdaLifter)
     - transforms/to_a_normal_form.cc (Fill)
     - ir/expr_functior.cc (Bind)
- Remove a lot ad-hoc 'homogeneous' vs 'hetrogeneous' conditionals
  in favor of just asking for the device. Also removed a lot of ad-doc
  encodings of the 'default' device.
- We no longer need to run device-planning twice (before and after
  lowering). Device planning is also decoupled from memory planning.
- The LowerTE Pass no longer needs an expression-to-device side table
  (which was the problem which kicked this series of PRs off in the first place).

* [checkpoint] Revert unnecessary changes

- Started down multi-target handling in interpreter but didn't finish
- Some one-off debug stuff

* [checkpoint] TODO's for default device logic
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
We don't need the Optional<IRModule> on ToANormalForm and friends.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…ecutor (apache#9227)

This tutorial disabled the FuseOps pass, but before apache#8788 that was
ignored since FuseOps was applied directly.
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
…nts (apache#9345)

apache#8788 introduced a perf regression
since a `shape.as<ConstantNode>` in `alloc_tensor` was always failing
due to the extra `on_device` annotation on the constant. Fixed that,
and introduced some helpers to make this situation easier to deal with.

(This is CORE-102 in OctoML JIRA).

(Second try -- test_crp.py failure seems unrelated)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants