diff --git a/rfcs/0062-collage.md b/rfcs/0062-collage.md new file mode 100644 index 00000000..0e7a5c47 --- /dev/null +++ b/rfcs/0062-collage.md @@ -0,0 +1,1041 @@ +``` +Feature Name: Collage [Draft 0.81] +Start Date: Mar 2022 +Authors: Mark Shields (mbs@octoml.ai) +RFC PR: https://github.com/apache/tvm-rfcs/pull/62 + +History: +- v0.7: First draft. +- v0.8: Rework to emphasise 'partitioning' (quite early in pipeline) instead of 'fusion' (quite late in pipeline). +``` + +# Summary + +This design doc (with accompanying +['v2' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-sketch)) +shows how to bring tuning to TVM's BYOC partitioning passes. The tuning search explores the choice of sub-graphs (aka ' +partitions') and toolchains (aka 'backends') so as to minimize the expected model inference latency. Both 'graph +style' (eg TensorRT) and 'library style' (eg DNNL) BYOC integrations are supported. We call the result an 'optimal +partitioning'. This new tuning layer complements the tuning traditionally done by TVM and other toolchains during +lowering. It can also complement any global tuning, for example to explore the choice of layout convention or device +assignment. + +The approach is based on the [preprint](https://arxiv.org/pdf/2111.00655.pdf): + +> *Collage: Automated Integration of Deep Learning Backends* +> Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, Zhihao Jia + +(See Appendix A for a comparison of this proposal and the paper's implementation. See Appendix D for TODO items in the ' +v2' prototype.) + +When Collage is enabled it subsumes the existing `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/ +`PartitionGraph` passes embedded within each `partition_for_` function with a single new +`CollagePartitioner` pass. The pass is guided by the list of available `Target`s and three existing sources: + +1. 
The `"TOpPattern"` attributes provided for every Relay operator and used by TVM's built-in `FuseOps`. +2. The BYOC `"target."` operator predicates provided for some operator/toolchain pairs by + 'operator-based' BYOC integrations. + TODO(mbs): Consider removing predicate based BYOC integrations once TensorRT has been transitioned to be + pattern based. +3. The BYOC operator pattern/predicates (usually) registered in the pattern table by 'pattern-based' BYOC integrations. + +The pass is run as early in the compilation flow as possible (see Appendix C). + +Only some boilerplate aspects of existing BYOC integrations need to be adjusted to support Collage: patterns must +be registered in the standard pattern table, 'preamble' passes need to be split out as per Appendix C, and any +mandatory post lowering helpers must be folded into the custom lowering function. We'll make sure these changes are +part of or coordinated with the UMA project. However Collage may require more robustness from the BYOC integrations, +see Appendix F. + +Note however that we are **not** proposing to deprecate the existing `partition_for_` operations (or their +UMA equivalent). This is mostly because Collage is inherently a tuning-based system which is not practical for users who +need a stand-alone compiler. But it is also because of challenges with establishing a common pass ordering which will +work for both TVM and all BYOC toolchains (see Appendix C for more details). + +# Motivation + +This tuning approach contrasts with TVM's existing "greedy" and "manual" approaches to partitioning: + +- Greedy: Currently only the largest possible supported sub-graphs are used for partitions, irrespective of their + execution time. With Collage many more candidate sub-graphs are explored, and it is possible for two smaller + sub-graphs to yield better overall latency than one large sub-graph if they mix toolchains. 
+- Manual: Currently the TVM user must commit to a BYOC toolchain and invoke the corresponding + `partition_for_` function before the main TVM compilation flow begins. With Collage the choice of toolchain + can be automated based on measured latency. Collage will also explore mixing and matching between multiple BYOC + toolchains as well as TVM's native backend. + + +Collage offers three advantages: + +- **Latency**: Overall model latency may be reduced compared to TVM native, TVM with a single + `partition_for_` call, or a non-TVM stand-alone compiler such as TensorRT. +- **Automation**: The choice of which BYOC toolchains to enable can be automated. +- **Economy and modularity of implementation**: Four standalone passes using two separate mechanisms for expressing + partitioning rules/algorithms can be replaced with one, which itself is built from compositional primitives. (The + machinery is also reusable for the very similar problem of choosing TVM fusion kernels, which we'll tackle in the + future.) + +See Appendix H for some frequently asked questions. + +# Success Metrics + +1. Collage offers at least a 10% latency improvement for a selection of standard ONNX models and NVIDIA hardware using + targets which include the CuDNN and CuBlas libraries, the CUTLASS library (with tuning, via BYOC), the TensorRT + compiler (via BYOC), and (obviously!) TVM native. +2. Collage does not require new per-target or per-model patterns or rules to be implemented independently of the BYOC + integrations. +3. Collage with a `Target` list enabling just one BYOC toolchain is never worse than using the existing + `partition_for_` function directly. (Since partitioning for multiple toolchains in sequence should never + improve the result for any single toolchain we consider just the single BYOC case.) + + +# Project Milestones + +- [Done] M0: Port paper prototype to recent TVM main and validate paper results. +- [Done] M1: Internal design doc. 
+- [Done] M2: Use 'v2' prototype to test design doc, and rework ready for TVM community. +- [In progress] M3: RFC +- [2022Q1] M4: Re-validate results on 'v2' prototype for larger models (eg GPT2) and more NVIDIA targets. +- [2022Q2] M5: Implementation in TVM main, including 'sub-projects' listed below. +- [OctoML internal] M6: Estimator integrated into OctoML platform, validation against OctoML test suite. +- [OctoML internal] M7: Productionization for OctoML. + +# Check-in plan + +Though the 'v2' prototype is in a personal branch we'd like to transition to main ASAP and rely on directory/namespace +separation, maintaining backwards compat, and a new `PassConfig` flag to isolate all Collage changes from the rest of +TVM. A rough PR progression is: + +- TensorRT and CUTLASS BYOC changes are backwards compat. The existing `partition_for_` functions remain. The + CUTLASS-specific tuning and codegen functions will either continue to be supported or we'll work with users to account + for them being folded into the function-at-a-time `relay.ext.cutlass` codegen function. +- The `DFPattern` and friends changes are all mostly just for improving the robustness of the + `IndexedGraph` class and can go into main independently. +- Some basic `Expr` improvements can go into main independently. +- The design allows for multiple `Target`s for the same `DLDeviceType`. That requires the various + `build` interfaces which currently accept `Union[Target,Dict]` to also accept a list of `Target`s, and can be + backwards compat. +- The new Collage code can go in bottom-up as we develop unit tests: + - Support utils, including `NameSupply`, `IndexSet`, `PriorityQueue`, `Cost`, `CostEstimator`. + - The core `SubGraph` datatype. + - `CandidatePartition`. + - The `PartitionRule` class hierarchy, as a series of PRs, ending with `PartitionSpec`. + - `GatherPartitionSpecs` helper for bridging the existing BYOC world with the Collage `PartitionRule` world. 
+ - The `CollagePartitioner` driver pass itself. + +# Guide-level explanation + +Collage allows the choice and partitioning for BYOC toolchains to be determined automatically +so as to minimize overall (expected) model execution latency. + +To compile with Collage it's necessary to set a `PassContext` flag, and include +'Collage aware' `Targets` in the build's `target` argument. + + +For example, assume `mod` is bound to [MNIST](https://github.com/onnx/models/tree/main/vision/classification/mnist): + +``` +def @main(%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] { + %0 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]); + %1 = nn.conv2d(%0, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/, + padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]); + %2 = add(%1, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/); + %3 = nn.relu(%2); + %4 = nn.max_pool2d(%3, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]); + %5 = nn.pad(%4, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]); + %6 = nn.conv2d(%5, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/, + padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]); + %7 = add(%6, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/); + %8 = nn.relu(%7); + %9 = nn.max_pool2d(%8, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]); + %10 = reshape(%9, newshape=[1, 256]); + %11 = nn.dense(%10, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, units=None, out_dtype="float32"); + add(%11, meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/) +} +``` + +We can compile this with Collage enabled for a variety of NVIDIA toolchains/libraries with +the following fragment: + +``` +with tvm.transform.PassContext(config={"relay.fallback_device_type": 2, "relay.collage.enable_collage": True}): + host_target = tvm.target.Target("llvm") + generic_target = tvm.target.Target("cuda", host_target) + cutlass_target = tvm.target.Target("cuda -compiler=cutlass", host_target) + 
tensorrt_target = tvm.target.Target("cuda -compiler=tensorrt", host_target) + cudnn_target = tvm.target.Target("cuda -compiler=cudnn", host_target) + cublas_target = tvm.target.Target("cuda -compiler=cublas", host_target) + targets = [generic_target, cutlass_target, tensorrt_target, cudnn_target, cublas_target] + exe = tvm.relay.vm.compile(mod, target=targets) +``` + +(Note that `cudnn` and `cublas` are not yet supported in the 'v2' prototype, see Appendix B.) + +After the `CollagePartitioner` pass, the intermediate `"main"` global function could resemble the following +(though we've modified this "optimal" partitioning by hand for illustration so don't take it as representative of actual +performance): + +``` +def @main(%x: Tensor[(1, 1, 28, 28), float32]) -> Tensor[(1, 10), float32] { + # Operators left behind in the function body are intended for TVM. + # The usual Relay passes may rewrite them, then FuseOps will push them + # into "Primitive" functions (without any "Compiler" attribute) ready + # for TVM lowering. + %4 = nn.pad(%x, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]); + # This conv2d will be offloaded to cudnn. However the main TVM compilation + # flow is responsible for emitting the call. + %6 = fn (%FunctionVar_5: Tensor[(1, 1, 32, 32), float32], + Composite="cudnn.conv2d") -> Tensor[(1, 8, 28, 28), float32] { + nn.conv2d(%FunctionVar_5, meta[relay.Constant][0] /*Tensor[(8, 1, 5, 5), float32]*/, + padding=[0, 0, 0, 0], channels=8, kernel_size=[5, 5]) + }; + # Back to vanilla TVM. + %7 = %6(%4); + %3 = add(%7, meta[relay.Constant][1] /*Tensor[(8, 1, 1), float32]*/); + %9 = nn.relu(%3); + %11 = nn.max_pool2d(%9, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]); + %13 = nn.pad(%11, 0f, pad_width=[[0, 0], [0, 0], [2, 2], [2, 2]]); + # Use TensorRT. The "Primitive" function delineates the partition. 
+ %14 = fn (%FunctionVar_03: Tensor[(1, 8, 18, 18), float32], + %FunctionVar_11: Tensor[(16, 1, 1), float32], + Primitive=1, + Compiler="tensorrt", + global_symbol="collage_nn_conv2d_add_nn_relu_1") -> Tensor[(1, 16, 14, 14), float32] { + %1 = nn.conv2d(%FunctionVar_03, meta[relay.Constant][2] /*Tensor[(16, 8, 5, 5), float32]*/, + padding=[0, 0, 0, 0], channels=16, kernel_size=[5, 5]); + %2 = add(%1, %FunctionVar_11); + nn.relu(%2) + }; + %15 = %14(%13, meta[relay.Constant][3] /*Tensor[(16, 1, 1), float32]*/); + # Back to vanilla TVM. + %17 = nn.max_pool2d(%15, pool_size=[3, 3], strides=[3, 3], padding=[0, 0, 0, 0]); + %19 = reshape(%17, newshape=[1, 256]); + # Use CUTLASS. Note the double function nesting: the outer "Primitive" function + # delineates the partition and the inner "Composite" function maps the original + # Relay operators to a tag to be used during compilation/build/lowering with the + # CUTLASS BYOC integration. + %20 = fn (%FunctionVar_0: Tensor[(1, 256), float32], + %FunctionVar_1: Tensor[(10, 256), float32], + %FunctionVar_2: Tensor[(1, 10), float32], + Primitive=1, + Compiler="cutlass", + global_symbol="collage_cutlass_dense_bias_nn_dense_add") -> Tensor[(1, 10), float32] { + %1 = fn (%FunctionVar_01: Tensor[(1, 256), float32], + %FunctionVar_11: Tensor[(10, 256), float32], + %FunctionVar_21: Tensor[(1, 10), float32], + Composite="cutlass.dense_bias") -> Tensor[(1, 10), float32] { + %0 = nn.dense(%FunctionVar_01, %FunctionVar_11, units=None, out_dtype="float32"); + add(%0, %FunctionVar_21) + }; + %1(%FunctionVar_0, %FunctionVar_1, %FunctionVar_2) + }; + %20(%19, meta[relay.Constant][4] /*Tensor[(10, 256), float32]*/, + meta[relay.Constant][5] /*Tensor[(1, 10), float32]*/) +} +``` + +The remainder of the compilation will respect the partitioning found by Collage without +any further user involvement. 
+ +# Reference-level explanation + +The implementation is mostly under `src/relay/collage/...` (namespace `tvm::relay::collage`), with just a few Python +helper functions under `python/tvm/relay/collage`. + +If the `relay.collage.enable_collage` `PassConfig` attribute is true then a new `CollagePartitioner` pass is inserted +before all other Relay passes. The result of the pass is: + +- All Relay sub-graphs in all global functions which are to be handed off to a BYOC toolchain are replaced by calls to + an inline `"Primitive"` function with `"Compiler"` and `"global_symbol"` attributes. +- Relay operators, or groups of operators, which are to be translated to a particular library or BYOC-supplied function + are replaced by calls to an inline `"Composite"` function. (This encoding is supported for both BYOC and external + libraries.) + +TODO(mbs): We need to also support +[RFC10](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0010-target-registered-compiler-flow-customisation.md) style BYOC extensions in the partitioning encoding. + +Note that no `"Primitive"` functions denoting TVM kernels are produced -- the existing `FuseOps` pass is still required. + +The `CollagePartitioner` pass has four phases: + +- **Phase 1**: The available `Target`s are scanned to build a list of rules describing how to find possible partitions ( + see `PartitionSpec` and `PartitionRule` below). Depending on the `Target` the rules may incorporate entries from the + BYOC pattern table. (The remaining phases execute on each global function separately.) +- **Phase 2**: A dataflow graph is constructed for the global function (which is just an `IndexedGraph`). The + available rules from phase 1 are evaluated on the dataflow graph to yield a (possibly overlapping) set of candidate + partitions for each target (see `CandidatePartition` below). Each candidate efficiently describes a sub-graph of the + global function's body without the need to construct any new expressions (see `SubGraph` below). 
+- **Phase 3**: A least cost path is found in the following (implicit and lazily constructed) search graph: + - Search Node: Each node represents the set of 'covered' dataflow nodes which have been assigned to a + candidate partition on every path to the node from the starting node. + - Starting node: The search node with empty 'covered' set. + - Ending node: The search node with every dataflow node in the 'covered' set. + - Search Edge X->Y: A candidate partition P does not overlap X's 'covered' nodes. Y's 'covered' nodes are + those of X union P. To avoid an unnecessary search space explosion the candidate must also include the + next yet-to-be-covered dataflow node in X. + - Edge cost: The estimated latency of the candidate partition, plus a partition transition penalty. Note + that though we need to be able to extract the candidate's sub-graph in order to build a function + representing the candidate to measure with, we do not yet need to partition the overall function body + expression. + + Other search algorithms are certainly possible, eg the paper uses an evolutionary search to refine the partitioning + found by the dynamic-programming search. We can easily abstract away the search interface to support multiple + implementations in the future. +- **Phase 4**: The function body is partitioned according to the candidate kernels on the shortest path. This phase + can be run independently of the first three so that additional inspection or optimization may be applied to + the intermediate optimal partitioning. + +In the following we introduce the new datatypes, then expand on the phases. + +### Util Datatypes + +- `PostDfsIndex`: The integer index of a Relay sub-expression in a post-dfs traversal of the overall Relay expression. + If index i is less than index j then we know the sub-expression for j cannot influence the value of the sub-expression + for i. 
+- `DataflowGraph`: An alias for the existing `IndexedGraph` from the `DFPatternMatcher` suite (which in turn is a + reworked copy of the `IndexedGraph` private to `fuse_ops.cc`). It is used throughout to manage the three-way bijection + from Relay `ExprNode`s to `PostDfsIndex`s to + `DataflowGraph::Node`s. Each `DataflowGraph::Node` describes the sub-expression's dataflow inputs, outputs, dominator + and inverse-dominators. +- `IndexSet`: A bit vector indexed by `PostDfsIndex`s. These are used as a compact representation for an arbitrary set + of dataflow nodes in a dataflow graph. +- `Cost`: A `double` representing a candidate partition (or kernel) 'cost', which currently is just mean execution + latency in seconds. Collage only cares that costs are additive and a total order, so in the future we could support + cost functions which balance execution time against high memory watermark or other measures. Costs may be `Unknown` + (ie NaN) to signal some other heuristic should be used to compare kernel costs. Costs may be `Invalid` (ie +inf) + to signal the toolchain could not compile and run a candidate kernel. + +### SubGraph + +A `SubGraph` is an `IndexSet` of the `PostDfsIndex`s of all dataflow nodes 'inside' an arbitrary sub-graph of the +overall dataflow graph. This and `PartitionRule` below are the core Collage datatypes. The following illustrates +the dataflow graph, indexes and one sub-graph for 'mini' MNIST (MNIST with the second layer removed): + +![dataflow graphs and sub-graphs](assets/0062/dataflow_graphs_and_sub_graphs.png) + +Sub-graphs can be used to represent partitions/kernels/composite functions without having to pay the cost of +constructing or rewriting any expressions. We also allow 'extracting' a function to use for measuring a +partition/kernel's latency independently from 'rewriting' the overall Relay expression since only a tiny subset of +candidate partitions will end up being needed after Collage has completed its search. 
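To make the `PostDfsIndex`/`IndexSet`/`SubGraph` relationship concrete, here is a minimal Python sketch. It is not the prototype's C++ code: the names `post_dfs_indexes` and `IndexSet` (as a Python class), the dictionary encoding of the dataflow graph, and the toy five-operator graph are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List


def post_dfs_indexes(inputs: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Number every node reachable from `root` in post-order (inputs first)."""
    index: Dict[str, int] = {}

    def visit(node: str) -> None:
        if node in index:
            return
        for arg in inputs.get(node, []):
            visit(arg)
        index[node] = len(index)

    visit(root)
    return index


class IndexSet:
    """A set of post-dfs indexes stored as the bits of one Python int."""

    def __init__(self, size: int, indexes: Iterable[int] = ()):
        self.size = size
        self.bits = 0
        for i in indexes:
            self.bits |= 1 << i

    def __contains__(self, i: int) -> bool:
        return bool((self.bits >> i) & 1)

    def is_disjoint(self, other: "IndexSet") -> bool:
        return (self.bits & other.bits) == 0

    def union(self, other: "IndexSet") -> "IndexSet":
        result = IndexSet(self.size)
        result.bits = self.bits | other.bits
        return result

    def indexes(self) -> List[int]:
        return [i for i in range(self.size) if i in self]


# Hypothetical toy graph: relu(add(conv2d(pad(x), w), b)).
inputs = {
    "pad": ["x"],
    "conv2d": ["pad", "w"],
    "add": ["conv2d", "b"],
    "relu": ["add"],
}
index = post_dfs_indexes(inputs, "relu")

# A sub-graph is just the set of 'inside' indexes; no expression is rebuilt.
kernel = IndexSet(len(index), [index[n] for n in ("conv2d", "add", "relu")])
pad_only = IndexSet(len(index), [index["pad"]])
```

Under these assumptions the candidate `{conv2d, add, relu}` costs a handful of bits to represent, and checking overlap or forming a disjoint union with `{pad}` is a single bitwise operation, which is the property that makes thousands of in-flight candidates affordable.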
+ +We expect O(thousands) of sub-graphs to be in flight while processing a given model, so are mindful of space overhead. + +A sub-graph classifies every dataflow node of the overall expression as either 'inside' or +'outside' the sub-graph. Obviously not all such divisions make sense, for example it is not valid for an inside node to +feed into another inside node via outside nodes. We provide an +`IsValid` method to check for validity, and `SubGraphConfig` to control which validity rules apply (such as maximum +depth). + +We generally work with the `DataflowGraph` representation of the overall Relay expression rather than the expression +itself. We use the post-dfs visit index to uniquely refer to expression nodes. + +As well as 'inside' and 'outside' we have four other flavors of dataflow nodes, all uniquely determined from the ' +inside' nodes: + +- 'entry' nodes are those inside with at least one dataflow input outside. +- 'exit' nodes are those inside with at least one dataflow output outside, or which are considered 'external' in the + underlying dataflow graph (eg because they represent the result of the overall function). +- 'input' nodes are those outside with at least one dataflow output inside. +- 'output' nodes are those outside with at least one dataflow input inside. + +Index sets for these are cached with the sub-graph for performance. + +It is valid to have multiple entry nodes (we can bind a parameter for each). It may be valid to have multiple exit +nodes (we can build a tuple of all such). It may be valid to have exit nodes which also contribute to other inside +nodes (ie represent a 'tap' on an intermediate result). + +Sub-graphs are closed under: + +- Disjoint union. +- Wrapping by a function with given attributes. This can be used to encode "Composite" functions, or to represent a + candidate kernel within a "Primitive" function. 
(By combining 'wrapping' with + 'union' we can encode, eg, 'this sub-graph should be placed inside a primitive function which itself may have calls to + composite functions'). +- Substitution, which allows a sub-graph w.r.t. one dataflow graph to be transformed to match some other (typically + smaller) dataflow graph. + +Note that the Relay `PatternPartitioner` goes directly from `Expr` to partitioned `Expr` without stopping at any +intermediate representation. It may be worth 'promoting' `SubGraph` out of Collage and into the standard `DFPattern` +suite; we leave that to future work. + +### CandidatePartition + +A `CandidatePartition` pairs a `SubGraph` with a `Target`. All Collage search and measurement is in terms of candidate +partitions. + +### PartitionRule + +A `PartitionRule` describes how to find a set of `CandidatePartition`s for a `DataflowGraph`. This and `SubGraph` +above are the core Collage datatypes. All partition rules implement the method: + +``` +virtual std::vector<CandidatePartition> AllCandidates(const DataflowGraph& dataflow_graph, + const PartitionSpec& spec) const; +``` + +The candidates are allowed to overlap, and ultimately it is the job of the Collage searcher to find a selection of +candidates which cover the whole Relay expression without overlap. There may be many thousands of candidates in flight +during the Collage search. + +We currently have three distinct flavors of partitions: + +- For pattern-based BYOC integrations, individual `DFPattern`s are used to select the `"Composite"` functions to + offload, and those are grouped into a `"Primitive"` Relay function with a `"Compiler"` attribute. +- For operator-based BYOC integrations, per-operator predicates indicate operators to offload, and again those are + grouped into a `"Primitive"` Relay function with a `"Compiler"` attribute. + TODO(mbs): Consider removing predicate based BYOC integrations once TensorRT has been transitioned to be + pattern based. 
+- For TVM, obviously all of Relay can go into a single partition. However, for search efficiency the partitions should + roughly mimic the Relay `FuseOps`. That pass uses the `"TOpPattern"` (of type `OpPatternKind`) attribute on all Relay + operators, and rules for when operators of one kind can be folded into another (typically by moving scalar ops from + elementwise operators into the output position of an earlier operator). This is implemented as a + stand-alone analysis which encodes its result using `"Primitive"` functions. + +Two new flavors are also showing up: + +- For easy external library integration we would like to borrow the `DFPattern`-with-composite-functions approach from + pattern-based BYOC integrations. But we'd like to leave those composite functions outside of any `"Primitive"` + function so that the library calls could end up within larger TVM kernels. +- `FuseOps` is generally considered too inflexible, and we've sought a more flexible way to express target-dependent + fusion rules. + +So in the same way `DFPattern`s provide a small library of 'base' and 'combinator' pattern rules supporting a wide +variety of examples, we seek the same economy and flexibility from `PartitionRule`s. + +An obvious question is whether all partition rules should be expressed with `DFPattern`s, possibly by extending +the `DFPattern` library itself. Indeed, though it does not appear to be used in prod, the `DominatorPattern` is +an attempt to use `DFPattern`s to subsume the existing `FuseOps` machinery. We actually went down this path but +decided to back out: +- We'd need a new pattern combinator to associate predicates with sub-patterns. +- Since we are interested in searching over possibly overlapping candidate partitions we'd need the `DFPattern` + machinery to allow enumeration over all matching sub-expressions. That would require a rewrite of the + `DFPatternMatcher`. +- Some of the more subtle fusion rules are difficult to express as patterns. 
+- `DFPattern`s are widely used outside of just partitioning, so any change would need to ensure no efficiency + or cognitive overhead for those common cases. + +That pushed us to the present design, which builds on `DFPatterns`, but introduces new 'base' and 'combinator' +partition rules which can be combined to match the desired partition flavors: +- The 'base' rules produce candidates from the dataflow graph directly. Eg we have a base rule to produce + all sub-graphs matching a given `DFPattern`. +- The 'combinator' rules combine the candidates found by one or more sub-rules into a new set of + candidates. The sub-rule(s) can be 'base' or 'combinator' rules. We call the candidates produced by + a sub-rule 'sub-candidates'. Eg we have a combinator rule which wraps all sub-candidates in a + `"Composite"` function (when the overall expression is rewritten). + +The following illustrates some base and combinator rules on the earlier mini MNIST dataflow graph: + +![partition rules](assets/0062/partition_rules.png) + +The base rules are: + +- `DFPatternPartitionRule`: Given a `DFPattern` and expression predicate, produces a candidate for every sub-graph + matched by the pattern and predicate. Unlike the `PatternRewriter`, candidates are free to overlap. Mainly used + to bring BYOC patterns into the Collage framework. +- `OpPredicatePartitionRule`: Given an attribute name, produces a candidate for every call to a primitive Relay + operator where the operator i) has a predicate bound to that attribute which ii) returns true given the + call sub-expression. Generally this will result in a singleton sub-graph containing only the call, but it may also + pull in constant arguments to the call should they be required. Used to bring BYOC operator predicates into the + Collage framework. + TODO(mbs): Consider removing predicate based BYOC integrations once TensorRT has been transitioned to be + pattern based. 
+- `OpCallByKindPartitionRule`: Uses the `"TOpPattern"` attribute provided for every Relay operator to produce a + candidate for every call to a 'fusable Relay operator'. Used to select the operators which `FuseOps` will consider + parts of kernels. + +The combinator rules are: + +- `CompositePartitionRule`: Indicates all sub-candidates matched by the sub-rule should be wrapped by a `"Composite"` + function. The `"Composite"` name is taken from the rule name. Used to indicate Relay operators (or groups of Relay + operators) should be mapped to target-specific operators, both for BYOC and TVM external library integrations. +- `PrimitivePartitionRule`: Indicates all sub-candidates matched by the sub-rule should be wrapped by a `"Primitive"` + function, possibly with an additional `"Compiler"` attribute. Used to delineate a partition (or kernel). +- `UnionPartitionRule`: Simply unions all the sub-candidates from all sub-rules together. Used to combine + individual `DFPatternPartitionRules`. +- `CombinePartitionRule`: Given a sub-rule and a list of 'combiner' rules (see below), finds all possible ways of + combining the sub-candidates to yield even larger candidates. Note that the sub-candidates may also be directly + included in the results. The 'combiner' rules allow combining by `OpPatternKinds`, combining the arguments to + tuples which themselves are arguments to Relay operator calls, and so on. This rule is intended to mimic the + existing TVM `FuseOps` pass, though: i) all candidates are found rather than just the largest, ii) the starting + set of candidates can be provided by any other rule, and iii) we rely on `SubGraph` validity checking to weed out + infeasible candidates. +- `OnlyValidPartitionRule`: Given a `SubGraphConfig`, ignores candidates with 'invalid' sub-graphs. Used to limit the + maximum candidate depth, the number of independent outputs, and whether intermediate 'taps' are allowed. 
+- `HostPartitionRule`: Produces candidates for all Relay expressions which could be + 'left behind' for execution by the host (eg on the VM). This rule lets us move special case handling out of the + core search algorithm and into a simple rule. + +Here are some typical ways to combine `PartitionRules` for different partition flavors. (These combinations +may be generated during phase 1 by inspection of the `Target` and BYOC registration -- see 'Phase 1' below.) + +- Classic operator-predicate based BYOC with + `AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes (eg see `tensorrt.py`): + ``` + PrimitivePartitionRule + OnlyValidPartitionRule + CombinePartitionRule (with a join-anything combiner rule) + OpPredicatePartitionRule + ``` + TODO(mbs): Consider removing predicate based BYOC integrations once TensorRT has been transitioned to be + pattern based. + +- Classic pattern-based BYOC with `MergeComposite`/`AnnotateTarget`/`PartitionGraph` passes + (eg see `cutlass.py`): + ``` + PrimitivePartitionRule + OnlyValidPartitionRule + CombinePartitionRule (with a join-anything combiner rule) + UnionPartitionRule + CompositePartitionRule(label1) + DFPatternPartitionRule(pattern1) + : + CompositePartitionRule(labeln) + DFPatternPartitionRule(patternn) + ``` + + The `CompositePartitionRule`/`DFPatternPartitionRule` combination is repeated for each entry in the pattern table for + the BYOC toolchain name, eg: + ``` + CompositePartitionRule( + rule_name="cutlass.conv2d_bias_residual_multiply_relu" + sub_rule=DFPatternPartitionRule( + pattern=CallPatternNode(Op(nn.relu), + [AltPattern(CallPatternNode(Op(multiply), + [CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)), + [CallPatternNode(Op(nn.conv2d), [*, *]), *]), + *]) | + CallPatternNode(Op(multiply), + [*, + CallPatternNode(AltPattern(Op(add) | Op(nn.bias_add)), + [CallPatternNode(Op(nn.conv2d), [*, *]), *]) + ])) + ]))) + ``` + +- "Consider this library implementation for these sub-expressions", using 
`DFPatterns` to pick out which Relay operators + are supported (a new scheme): + ``` + OnlyValidPartitionRule + CombinePartitionRule (with default TVM combiner rules) + UnionPartitionRule + OpCallByKindPartitionRule + CompositePartitionRule(label1) + DFPatternPartitionRule(pattern1) + : + CompositePartitionRule(labeln) + DFPatternPartitionRule(patternn) + ``` + +- Classic TVM `FuseOps`: + ``` + PrimitivePartitionRule + OnlyValidPartitionRule + CombinePartitionRule (with default TVM combiner rules) + OpCallByKindPartitionRule + ``` + +- "Just fuse what I tell you to fuse", using `DFPatterns` to directly select candidates (a new scheme): + ``` + PrimitivePartitionRule + OnlyValidPartitionRule + UnionPartitionRule + DFPatternPartitionRule(pattern1) + : + DFPatternPartitionRule(patternn) + ``` + +### PartitionSpec + +A `PartitionSpec` pairs a `PartitionRule` with one or more `Target`s. + +### Phase 1 + +We build on the existing TVM support for heterogeneous devices and targets. The available `Targets` are extracted from +the compilation configuration (eg using the existing `CompilationConfig` helper class). Each target is inspected to +decide on how to construct a `PartitionRule`, which will guide Collage in the selection of candidate kernels to explore +for that target. (See Appendix G for the requirements which motivated this part of the design.) + +- If the `Target` has a `"partition_rule"` attribute, use that directly. This would allow users to directly control + partitioning/fusion for the targets they care about. +- If the `Target` has a `"compiler"` attribute (eg `"cutlass"`), and the global pattern table has an entry for that + attribute value, assume the `Target` denotes a pattern-based BYOC integration to explore. The `PartitionRule` + will import all the BYOC patterns and predicates automatically. +- As above, but if the global pattern table has no matching entry, assume the `Target` denotes a predicate-based BYOC integration + to explore (eg `"tensorrt"`). 
The `PartitionRule` will look for and evaluate predicates with the + `"target."` attribute on all Relay operators. +- Otherwise, assume the `Target` denotes a TVM-native target. The `PartitionRule` mimics `FuseOps`, but now generalized + to explore multiple candidates so as to leave room for possible BYOC candidates. + +Note that to make this approach work we need to allow for multiple `Target`s with the same `DLDeviceKind`. For the VM +simply switching the `target` argument from dictionary to list and removing some redundant Python preprocessing code was +all that was required to support this. + +The user can use `on_device` annotations to constrain sub-graphs to particular devices. When Collage is considering +candidate partitions, it should be sure to choose a candidate `Target` which 'refines' the `Target` for every +sub-expression discovered by the `PlanDevicesPass`. Given targets T and U we say 'T refines U' if T has +`"compiler"` and/or `"partition_rule"` attributes, U has no such attributes, and T and U otherwise agree on all other +fields. + +### Phase 2 + +Most of the hard work for this phase is carried by the `AllCandidates` implementations of the `PartitionRule`s. The main +driver simply needs to index all the found `CandidatePartition`s by their minimum 'inside' `PostDfsIndex` +for rapid retrieval during the shortest path search. + +### Phase 3 + +We find it most natural to use Dijkstra's algorithm to find the optimal partitioning. A `SearchState` is a node in the search +graph, and contains: + +- An `IndexSet` of the dataflow nodes already 'covered' by candidates on every path to this state. This is the + identifying key for the state. +- The predecessor `SearchState` in the best path to this state. +- The `Cost` of the best path to this state. This is the order for the Dijkstra priority queue. +- The `CandidatePartition` for the transition from the best predecessor to this state. + +The starting state has no covered nodes. The final state has all nodes covered.
The following is an example search +graph fragment for the mini MNIST example: + +![search graph](assets/0062/search_graph.png) + +When expanding a state we could choose any `CandidatePartition` collected from phase 2 provided it doesn't overlap with +the state's covered set. However, a search path applying candidates C then D is equivalent to one applying D then C, so +we only consider candidates which intersect the next yet-to-be-covered dataflow node. For each such candidate we use +the `CostEstimator` (with its assumed cache) to get the candidate's cost, build the successor state, and 'relax' the +successor state in the usual way. (See Appendix E for more details on `CostEstimator`.) + +The `HostPartitionRule` is used to allow some dataflow nodes to be 'left behind' for execution by the host. + +The result at this stage is an `Array`, which can be materialized and restored using the standard +TVM object graph machinery if desired. An example least-cost path for the mini MNIST example could be the following: + +![optimal placement](assets/0062/optimal_placement.png) + +### Phase 4 + +The overall Relay expression is partitioned over all the `CandidatePartition`s on the lowest cost path 'in parallel'. +Since all the candidates are expressed using `SubGraph`s w.r.t. the original dataflow graph, we must be careful not to +invalidate yet-to-be-partitioned candidates as we go. Working backwards in dataflow order avoids this problem. + +# Drawbacks + +- **Some BYOC boilerplate changes required**: TVM's current BYOC integration API only requires the 'lowering/codegen' + function to be registered to a well-known global function name. Everything else is up to the BYOC author. + - Collage requires pattern-based BYOC integrations to register their patterns in the global pattern table. Some BYOC + integrations use the table and many do not, but it's an easy fix.
+ - Collage requires the BYOC lowering function to yield a valid `runtime::Module` without requiring any additional + BYOC-specific passes to be run. Some BYOC integrations require the user to run separate passes to tune and/or + compile the partitioned functions; those passes need to be moved into the lowering function itself. + - Collage requires the BYOC integration to either correctly test for which operators are supported in the + pattern/operator predicate, or gracefully propagate failure rather than CHECK-fail if an unsupported operator is + included in a candidate kernel. Thus a BYOC integration will need to be 'robustified' to become 'Collage + compatible'. + + Overall we've tried to make as few changes as possible. Collage will happily follow along with any improvements to the + BYOC integration API (eg via the UMA project). + +- **Non-compositional BYOC toolchains**: BYOC partitioning functions often run global passes to get the Relay graph into + a state better aligned with the toolchain on the assumption they are the exclusive partitioning pass. Most obvious is + the choice of layout, and if two BYOC integrations have a different choice of layout then there's currently no way for + them to be used concurrently. All of those passes must either be i) pushed up to global configuration (which could be + explored by a search layer outside of TVM), ii) pushed into the BYOC lowering/codegen function (to prepare the + sub-graph for further compilation) or iii) moved into the standard Relay optimization passes run before + `CollagePartitioner`. + +- **Higher tuning cost**: Obviously Collage needs to estimate the latency of partitions. For TVM this can trigger + tuning of schedules for novel kernels, which can require O(thousands) of trials and take O(hours), so we'll be very + dependent on cached tuning logs to amortize this cost between models for the same target.
+ +- **Task extraction vs Tuning**: Traditionally TVM has had three phases: i) Task extraction (find the fused sub-graphs + to tune), ii) Tuning (find a good schedule for those sub-graphs), and iii) Compilation (re-compile the model, now + retrieving schedules for all the anticipated sub-graphs from the cache.) However the Collage 'v2' prototype collapses + all these phases. This lets us lazily explore the implied search graph (nodes = partially rewritten models, edges = + selection of sub-graph and toolchain as a candidate partition, cost = estimated sum of partition costs plus transition + penalties), and thus only pay the cost of measuring (and tuning) candidates which could possibly influence the final + partitioning. + +- **No non-local optimization**: Though Collage can explore the choice of sub-graph and toolchain, it cannot explore any + choices which require the arguments and/or result of the sub-graph to be rewritten, or the overall `IRModule` to be + changed. Thus Collage **cannot** be used to search over: + - Choice of layout for arguments/results (may require insertion of layout transforms), + - Choice of memory scope for arguments/results (may require insertion of device copies), + - Choice of device on which to host the kernel (ditto), + - Choice of quantization scheme. + + To support this efficiently we'd need to abandon the simple-minded but fast `SubGraph` representation we describe here + in favor of something like an EGraph representation, which seems like a very large change for TVM. + +- **Dependency management**: Currently BYOC integrations tend to assume they are the only non-TVM toolchain in use. So + it's possible two toolchains introduce runtime dependencies which can't be satisfied. Collage has no notion of + dependencies or incompatibilities and may attempt to mix candidate kernels we can't support in production. It's also possible + for two BYOC integrations to have incompatible runtimes.
+ +- **Additive cost assumption**: Collage as per this design assumes the cost of running candidate partitions is additive, + plus a small transition penalty. However cache effects can dominate measured latency, particularly for + 'lightweight' kernels. Thus there may be an **additive error** in the final result: + + > additive_error = measured_latency(collage_partitioning) - sum_{partition} (estimated_latency(partition) + penalty) + + The evolutionary search explored by the paper can help here since it uses measured end-to-end model latency as its + cost function, but we're deferring that to future work. + +- **Limited search space**: Naively exploring all sub-graphs is O(n!), so we need to constrain the search. The easiest + approach is simply to limit candidates to sub-graphs of just a few operators. This can mean significantly faster + candidates are not explored, yielding a partitioning with high **optimality loss**: + + > optimality_loss = measured_latency(collage_partitioning) - measured_latency(true_optimal_partitioning) + + Though the 'true' optimal partitioning may be infeasible to find, the user may easily discover a high + **apparent loss**, eg by comparing the Collage result with a traditional BYOC partitioning result: + + > apparent_loss = measured_latency(collage_partitioning) - measured_latency(users_own_partitioning) + +- **Fragile toolchains**: Some BYOC toolchains are intended to be stand-alone compilers in their own right, and have + been tuned against common models and include global flags to guide optimizations such as reducing precision. However + Collage will only feed these toolchains smaller sub-graphs, thus making the limited search space problem more severe. + +- **High variance in lightweight kernels**: Small kernels can have high variance, thus the choice of which toolchain to + use can be arbitrary.
We probably want to i) validate our variance estimator is accurate, ii) choose a percentile + slightly above 50% for the estimated candidate kernel latency, and iii) fall back to hard-coded priorities when the + measured variance is too high. + +- **Explainability**: It's easy to show the user the final partitioning and estimated times for each kernel, but harder + to show why that partitioning won out over all others during search. + +- **Does not subsume `partition_for_`**: We don't have any plan to deprecate the existing pattern of each + BYOC integration supplying a `partition_for_` function. If the user has a specific toolchain in mind then + making the partition explicit both compiles faster and can incorporate global optimization passes which + Collage cannot currently account for (eg enforcing a particular layout). + +# Prior art + +- The [Cascading Scheduler](https://github.com/apache/tvm-rfcs/blob/main/rfcs/0037-arm-ethosu-cascading-scheduler.md) + combines i) dynamic-programming to find an optimal grouping of TE sub-expressions, ii) an analytic model of cost to + guide the search, and iii) cascading scheduling of the TE sub-expressions so as to reduce memory high-watermark. By + contrast Collage i) also uses dynamic-programming, but to find an optimal grouping of Relay sub-expressions, ii) + uses (very much slower!) measurement to guide the search and iii) has no influence over how either TVM or BYOC + toolchains actually lower the sub-graphs given to them. +- The [Universal modular Accelerator Interface](https://github.com/apache/tvm-rfcs/pull/60) proposal adds a layer on top + of the existing and separate TVM BYOC, operator strategy, operator scheduling, target-specific passes and + target-specific code generation extension points. Collage currently relies only on the global pattern registry and + global `relay.ext.` function to integrate with BYOC integrations, but this is easy to rework should UMA + change the source of truth.
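
The shortest-path search described under 'Phase 3' can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype's implementation: dataflow nodes are plain integers in post-dfs order, each candidate is a hypothetical `(label, nodes, cost)` tuple standing in for a `CandidatePartition` whose cost has already been estimated, and a flat `transition_penalty` is charged once per chosen candidate. The name `find_optimal_partitioning` and the example labels and costs are invented for illustration.

```python
import heapq
from collections import defaultdict
from itertools import count


def find_optimal_partitioning(num_nodes, candidates, transition_penalty=0.0):
    """Return (total_cost, labels) for the cheapest disjoint cover of nodes
    0..num_nodes-1, or None if the candidates cannot cover every node.

    `candidates` is a list of hypothetical (label, node_indices, cost) tuples.
    """
    # Phase 2 stand-in: index candidates by their minimum node index.
    by_min_index = defaultdict(list)
    for label, nodes, cost in candidates:
        nodes = frozenset(nodes)
        by_min_index[min(nodes)].append((label, nodes, cost))

    all_nodes = frozenset(range(num_nodes))
    start = frozenset()
    best_cost = {start: 0.0}  # covered set -> cost of best known path
    predecessor = {}  # covered set -> (previous covered set, candidate label)
    tie = count()  # tie-breaker so the heap never compares frozensets
    queue = [(0.0, next(tie), start)]

    while queue:
        cost, _, covered = heapq.heappop(queue)
        if cost > best_cost.get(covered, float("inf")):
            continue  # stale entry: a cheaper path to this state was found
        if covered == all_nodes:
            # Walk predecessors back to the start to recover the chosen candidates.
            path, state = [], covered
            while state in predecessor:
                state, label = predecessor[state]
                path.append(label)
            return cost, path[::-1]
        # Only candidates containing the next uncovered node are considered.
        next_node = min(all_nodes - covered)
        for label, nodes, candidate_cost in by_min_index[next_node]:
            if nodes & covered:
                continue  # overlaps nodes that are already covered
            successor = covered | nodes
            new_cost = cost + candidate_cost + transition_penalty
            if new_cost < best_cost.get(successor, float("inf")):
                # 'Relax' the successor state.
                best_cost[successor] = new_cost
                predecessor[successor] = (covered, label)
                heapq.heappush(queue, (new_cost, next(tie), successor))
    return None


# Invented example: four dataflow nodes and three hypothetical toolchains.
example_candidates = [
    ("tvm:0", {0}, 3.0), ("tvm:1", {1}, 2.0), ("tvm:2", {2}, 2.0), ("tvm:3", {3}, 1.0),
    ("trt:0-1", {0, 1}, 3.5),
    ("trt:0-3", {0, 1, 2, 3}, 7.5),
    ("cutlass:2-3", {2, 3}, 2.0),
]
print(find_optimal_partitioning(4, example_candidates))
```

Because every node before the next uncovered one is already covered, a non-overlapping candidate can contain that node only if it is the candidate's minimum index, which is why the index built in Phase 2 is enough to enumerate successors.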
+ +# Appendix A: Differences between the paper's implementation and this proposal + +The results of the paper were derived in a [branch](https://github.com/cmu-catalyst/collage) from +[TVM](https://github.com/apache/tvm) at `461d06eb5cfc7954f1983779acd05c47cea269f1`. We ported/rebased that code onto +main, and refer to it as the +['v1' prototype implementation](https://github.com/mbs-octoml/mbs-tvm/tree/mbs-collage-port). + +The 'v1' prototype has nine main parts: + +1. A new + [backend](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/include/tvm/ir/expr.h#L208) + field on every Relay `Expr` to capture the pattern name and backend name chosen during search. Downstream compilation + must be forced to respect that choice. +2. An + [intercept](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/transforms/fuse_ops.cc#L1392) + in `fuse_ops.cc` which redirects to the main Collage fuser/searcher before TVM’s fusion rules kick in. +3. The main fuser/searcher + [implementation](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/optimizer/comp_graph_optimizer.py#L221) + (for the simpler DP algorithm). This implementation: +4. Uses both Relay `Pattern`s and its own path-based fusion algorithm to find candidate sub-graphs. +5. Uses the DP algorithm to find the best assignment of fused sub-graphs and backends to cover the whole Relay graph. +6. Applies the resulting assignment to the `IRModule` using the new `backend` field on every expression. +7. An evolutionary search algorithm may optionally run after the above, and will attempt to replace ‘op’ kernels + (which use a library) with ‘graph’ kernels (arbitrary sub-graph), but only if there’s a unique graph backend enabled. +8.
An intercept + ([here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/transforms/fuse_ops.cc#L1402) + and + [here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/optimizer/_optimizer.py#L48)) + in `fuse_ops.cc` to actually effect the fusion for BYOC backends depending on the new `backend` field. +9. An intercept + ([here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/src/relay/backend/te_compiler_cache.cc#L284) + and + [here](https://github.com/mbs-octoml/mbs-tvm/blob/52d8780e879a9115b8a93e505bcd3a6c2646c61f/python/collage/backend/collage_strategies.py#L18)) + in `te_compiler.cc` to take over the selection of `OpStrategy` based on the `backend` field. + +Note that the 'v1' prototype only supports `IRModules` with a single `"main"` whose body is in the ‘pure dataflow’ Relay +subset. Ie only calls, tuples, tuple projections, function variables and constants are supported. + +In comparison to the 'v1' prototype, this design: + +1. Shifts the search to occur at the very start of the compilation flow rather than just before `FuseOps` so that + Collage sees the same input model as would an existing `partition_for_` BYOC function. This change allows + us to use the existing BYOC patterns/predicates to guide the selection of candidate partitions instead of requiring + new patterns to be added to support the combination of models and BYOC toolchains. It also ensures the existing BYOC + lowering functions see partitions before any TVM-lowering specific passes have been applied, such as + `qnn::transform::Legalize` and `transform::CombineParallelConv2D`. +2. Builds on the existing support for heterogeneous `Target`s to represent the menu of available toolchains to use + during search. 
In particular, we want to allow users to blend `on_device` annotations (to express preferences for + which devices should execute which sub-graphs) with Collage (to find the best partitions respecting those device + preferences). +3. Uses the existing convention for `"Primitive"`, `"Composite"` and `"Compiler"` attributes on Relay `Function`s to + encode partitioning choices. +4. Does not treat library integrations (eg for `CuDnn`) differently from toolchain integrations (eg for `TensorRT`). See + Appendix B for a sketch of the issues. +5. Supports all of Relay. +6. Is implemented almost entirely in C++. + +However: + +6. We have only re-implemented the 'op-level' dynamic-programming based search strategy from the paper. Though the paper + reports encouraging results with the 'graph-level' evolutionary-search strategy we leave that to future work. + +# Appendix B: Easier Library Integration + +TVM has two very different ways to make external library implementations available for use as (or in) kernels: +The pattern-based BYOC approach and the TVM `te.extern` approach. + +The pattern-based approach allows library implementations to match with more than one Relay operator, such as for biased +convolution with an activation function. For example, for +[DNNL](https://oneapi-src.github.io/oneDNN/v1.3/index.html) the global pattern table is extended +in `python/tvm/relay/op/contrib/dnnl.py`, and the pattern labels encode the intended corresponding DNNL functions. The +user is responsible for partitioning using the usual `MergeComposite`/`AnnotateTarget`/`PartitionGraph` +sequence (surprisingly there's no `partition_for_dnnl` convenience function). The `relay.ext.dnnl` BYOC function +in `src/relay/backend/contrib/dnnl/codegen.cc` looks for calls to `"Composite"` functions in the overall `"Primitive"` +function, and dispatches based on the `"Composite"` label. 
C code is emitted to target the DNNL library, and the +standard C compiler helper is invoked to produce a `runtime::Module`. + +Note that it is not possible for a TVM-generated kernel to call a library function integrated this way. In effect every +library function must go into a library-specific kernel (though kernels may group calls to multiple library functions). + +The `te.extern` approach only allows library implementations which are 1:1 with Relay operators. However the library may +be used as part of a larger TVM-generated kernel, and the usual TVM tuning machinery may choose to use the library based +on overall kernel performance measured during TVM tuning. For example, `batch_matmul` can be implemented using +[CuBLAS](https://developer.nvidia.com/cublas) via the strategy `batch_matmul` in `python/tvm/contrib/cublas.py`, which +is made available to the operator's `OpStrategy` using `batch_matmul_strategy_cuda` in +`python/tvm/relay/op/strategy/cuda.py` when `cublas` appears in the `Target`'s `libs` attribute. That strategy simply +calls the `PackedFunc` registered as `tvm.contrib.cublas.batch_matmul` and implemented in +`src/runtime/contrib/cublas/cublas.cc` as part of the TVM runtime. + +The `te.extern` approach also supports integrating 'micro-kernels' which may be invoked as part of the TVM schedule for +some larger Relay operator. + +Collage as presented can work with either approach. For the pattern-based BYOC approach Collage doesn't need to know +what's going on under the BYOC integration hood, it only needs to see a `Target` with the appropriate +`compiler` attribute. For the `te.extern` approach Collage similarly doesn't need to know that the TVM partition may +result in a kernel whose schedule includes a call to the linked library provided the `Target` has the appropriate +`libs` attribute. + +However, we'd like to make library integration with Collage as seamless as possible since we expect it to be the common case.
+The requirements are roughly: + +- Support library functions which match sub-graphs as well as single Relay operators. +- Allow library calls from within TVM-generated kernels. +- Avoid the boilerplate needed for full BYOC integrations, but retain the familiar BYOC pattern-based mechanism. +- Express the choice of external library in the same way we express other partitioning choices. + +One possibility is: + +- Like the `te.extern` approach, libraries can be made available to the TVM runtime via registered `PackedFunc`s. +- Like the pattern-based BYOC approach, labelled patterns can be supplied which indicate how Relay operators could be + mapped to registered `PackedFunc`s. +- Like the BYOC custom lowering approach, a distinguished compiler name controls when the library is available and + causes lowering to go via a different path. +- But unlike the BYOC custom lowering approach, the rewrite to an external library call is made available in TE or TIR + form so that it can be incorporated into larger TVM kernels. + +We'll follow up on this separately, but mention it here since it motivates why we've tried to handle "Composite" +patterns fairly generally. + +# Appendix C: When to run Collage? + +There's a few cross-cutting issues when it comes to when Collage should run in the compilation flow: + +- The current TVM convention is for BYOC integrations to supply a `partition_for_` function which can be + called by the user on the original input model, ie before **any** Relay passes are run. +- Many `partition_for_` functions run their own 'preamble' passes before the standard + `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` passes. The preamble passes are sometimes + just generic Relay passes such as `FoldConstant`. But some partitioning functions impose specific global rewrites, + such as for layout. 
All the BYOC op patterns and op predicates are thus written expecting the Relay model in 'vanilla' + form with only those preamble passes applied. +- There's no reason to expect the preamble passes are compositional in any sensible way between different BYOC + integrations. +- Some BYOC integrations also supply additional passes which are expected to be run after partitioning and lowering, for + example to finish tuning or compilation. +- The standard Relay pass prefix includes many passes which are either target dependent (for example to + 'legalize' quantized version of ops depending on the intended target), or which prepare the model for Relay fusion and + subsequent lowering. These passes are all run before `FuseOps`. Those passes are not universally applicable to all + BYOC integrations, and BYOC patterns are not guaranteed to be invariant over them. +- Relay's default `FuseOps` is currently hard-coded to greedily find the largest possible kernels using fixed + `OpPatternKind` based rules. Those rules are intended to predict exactly what TVM's scheduling can support. There's + interest in bringing customization (eg limiting fusion patterns to directly match hardware supported primitives, + supporting custom 'horizontal' and 'vertical' fusion) and search (to reduce the strong coupling of fusion with + lowering) to `FuseOps`, which all looks very similar to customization and search Collage brings to partitioning. +- Finally(!), Collage should obviously explore candidate partitions which both TVM and the BYOC toolchains can do well + on. That encourages large partitions with fusion opportunities. But a naive search over all O(n!) possibilities is + also obviously not feasible. This means Collage should limit its search to candidates which more or less correspond to + the kernels each backend would choose using their own fusion rules. This in turn requires Collage's partitioning rules + to roughly match the backend fusion rules. 
+ +This puts us in a bit of a pickle since there's no obvious single point in the compilation flow for Collage: + +1. We could run before any Relay passes at all, just as `partition_for_` functions are run today. However + there's no opportunity to apply any BYOC preamble passes which may be needed before patterns are used. +2. We could run just after the BYOC preamble passes. However that's prematurely committing to a particular BYOC + integration, and there's no way to detect when two BYOC integrations have incompatible preambles. +3. We could run instead of `FuseOps` to collapse partitioning with fusion. However, by that stage too many TVM-specific + optimizations have been applied for the BYOC integrations to work. + +Our compromise is to i) run Collage at the very beginning of compilation (ie option 1), ii) require the user to manually +apply global passes which may assist particular BYOC integrations (such as to choose a particularly favorable layout), +and iii) leave `FuseOps` unchanged. (Note the first draft of the RFC instead chose option 3.) + +In more detail, here's a taxonomy of pass instances and how we can handle them in Collage: + +- **BYOC global** (eg `ConvertLayout`): These passes are currently run in the preamble of some of the + `partition_for_` functions to apply a global rewrite to improve efficiency for the intended target. + - Under Collage, these passes should be made available as a new `optimize_for_` function. The user + (or some top-level search outside of TVM) can apply this function in the same way they currently apply + `partition_for_`. + - Ideally this would also be a well-known UMA extension point. +- **BYOC partitioning** (`MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/`PartitionGraph` or a subset thereof): + These passes are currently run after the preamble of the `partition_for_` function to effect the desired + partitioning and composite function labelling.
+ - Under Collage, these passes are subsumed by `CollagePartitioner`. + - They will also be called automatically by the UMA 'partitioner'. +- **BYOC lowering** (eg `Function->runtime::Module` function registered as `relay.ext.`): These passes invoke + the BYOC-specific compilation toolchain. + - Under Collage, the standard `TELower` pass will continue to invoke these functions depending on partitioning + annotations. Collage will also need to support the other per-target compilation overrides. +- **BYOC post-lowering** (eg `tune_cutlass_kernels`): These follow-on passes are supplied by some BYOC integrations to + further prepare the `runtime::Module` after lowering. + - Under Collage, these passes need to be folded back into the generic BYOC lowering extension point. +- **Safe global** (eg `FoldConstant`): These passes are run within the standard Relay pass prefix, but may also be + included in the BYOC preamble. However these passes are universally applicable to all BYOC and TVM toolchains. + - Under Collage this 'safe' prefix of passes can be run before `CollagePartitioner`. If any BYOC predicates/patterns + are not invariant to these safe passes then we'll need to generalize them. Note that currently this pass set is + empty. +- **Target specific** (eg `qnn::transform::Legalize`): These passes are also within the standard Relay pass prefix. They + apply per-operator or other rewrites which may be target-dependent. Clearly the target must already be known. + Technically they should be run after `PlanDevices` to support heterogeneous execution, but this is not currently the case + (and a few are disabled in the heterogeneous case). + - Under Collage these passes are run after `PlanDevices` (which may use `on_device` annotations to enforce some + target constraints) and `CollagePartitioner` (which will choose targets for all partitions subject to any existing + constraints).
But they are only run on non-BYOC partitions, ie on everything other than `"Primitive"` + functions with a `"Compiler"` attribute. +- **Lowering specific** (eg `CanonicalizeOps`): These passes apply optimizations preparing for `FuseOps` and subsequent + TVM lowering. They are also within the standard Relay pass prefix. + - Under Collage, same as for the target specific passes above. + +# Appendix D: TODO items in the 'v2' prototype before proceeding + +- Implement the penalty for transitioning execution between partitions. +- Bring in `cudnn` and `cublas` to support measurement (ideally as Appendix B, but probably as heavyweight BYOC + integration pending better support). +- Support RFC10-style integrations in the partitioning rules / sub-graph realization. +- Implement TVM-tuning during Collage search. +- Cross-check measurements against some of the 'v1' models. +- Bring up on `GPT2` and other 'larger' models. +- Measure additive error. +- Measure apparent loss with `partition_for_tensorrt`. +- Explore `float16` performance mixing `CUTLASS` and `TensorRT`. +- Find model+target combination that shows compelling speedup from mixing w.r.t. all other options, including stand + alone `TensorRT`. +- If needed, implement 'lookahead' from the current search state to find the 'next' dataflow node(s) which have + candidates crossing multiple `PartitionSpec`s. That defines a sub-graph. There's no need to search over all possible + candidates within that sub-graph since almost certainly the maximal candidates will be best. Somehow prune the + candidates to implement that. +- If needed, build indexes in `CombinePartitionRule` to avoid O(n^2) iteration over candidates. +- Reconcile with the 'RFC10' style BYOC extension methods -- we should support them all but for simplicity have just + focussed on the traditional "Compiler" annotation. 
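
The 'measure additive error' and 'measure apparent loss' items above amount to simple arithmetic over the definitions given under Drawbacks. A minimal sketch, assuming latencies are plain floats in seconds and that the same transition penalty applies to every partition; the helper names and the numbers are hypothetical and are not part of TVM or the prototype.

```python
def additive_error(measured_model_latency, estimated_partition_latencies, transition_penalty):
    """measured_latency(collage_partitioning)
    - sum over partitions of (estimated_latency(partition) + penalty)."""
    estimated_total = sum(
        latency + transition_penalty for latency in estimated_partition_latencies
    )
    return measured_model_latency - estimated_total


def apparent_loss(collage_latency, users_latency):
    """measured_latency(collage_partitioning) - measured_latency(users_own_partitioning)."""
    return collage_latency - users_latency


# Invented measurements, in seconds.
error = additive_error(12.0, [4.0, 3.5, 2.5], transition_penalty=0.25)
loss = apparent_loss(12.0, 12.5)
print(error, loss)
```

A positive additive error means the model ran slower end-to-end than the sum of per-partition estimates predicted; a negative apparent loss means the Collage partitioning beat the user's own partitioning.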
+ +# Appendix E: Robust candidate kernel latency measurement + +Collage requires an implementation of a `CostEstimator`: + +``` +class CostEstimator { + public: + /*! + * \brief Returns the estimated cost (possibly after many minutes of tuning time) of + * running "main" in mod using target, which represents a possible partitioning of some overall + * Relay expression. + */ + virtual Cost Estimate(const IRModule& mod, const Target& target) const; +}; +``` + +The 'v2' prototype has implemented this with an in-memory cache and a small Python driver which defers to +TVM's `tvm.runtime.vm.VirtualMachine`'s `benchmark` helper. The following need to be designed and implemented: + +- The recent MetaSchedule work has provided `BuilderInput` (`include/tvm/meta_schedule/builder.h`), + `RunnerInput` (`include/tvm/meta_schedule/runner.h`) and `Database` (`include/tvm/meta_schedule/database.h`) + interfaces. The latter is for `TuningRecord`s of `Workload`s. It looks like these interfaces can support the + measurement of Collage `CandidatePartition`s with no or minor changes. +- Collage converts measured 50th %ile latencies to costs in seconds. We may need to consider taking a slightly higher + %ile to be more robust against variance on small kernels. We need to validate the estimated variance reflects true + variance. +- For TVM-native targets, we would like the `Estimate` call to perform any TVM tuning required for a novel candidate + kernel. +- Collage needs an estimate of the cost of transitioning between partitions (or kernels). Ideally that estimate would be + based on measurement rather than hard coded. + +# Appendix F: Robust BYOC integrations for targets of interest + +Overall any BYOC toolchain which could be supported by Collage needs to be brought to a high standard: + +- It should support the latest toolchain/library versions. +- It should support as much of Relay (both operators and dtypes) as feasible.
In particular, Collage will only find + interesting mixes when BYOC toolchains have overlapping operator and dtype support. +- It should correctly report which operators/patterns are supported. +- It should have good unit test coverage in CI. +- Dependencies should be documented and installation scripted (hopefully this is an easy consequence of the above). +- The translation scheme should give the BYOC toolchain the best chance to do well. In particular, if Collage reports + toolchain X 'is better' than toolchain Y for a candidate sub-graph we want to have confidence that's not just because + toolchain Y has been hobbled by a poor translation, API misuse, or other 'holding it wrong' issue. +- Where feasible, using `partition_for_` (ie using TVM but not Collage) should not be worse than using the + toolchain directly (ie not using TVM at all). + +Our current focus is on TensorRT, CUTLASS, CuDnn and CuBlas. + +# Appendix G: Visualization + +A [netron](https://netron.app/) style visualization for Relay which clearly shows the partitioning and cost for all the +kernels would be very valuable. The paper prototype produces such a visualization but we've lost that functionality in +the transition to 'v2'. + +# Appendix H: FAQ + +- **Are you deprecating `FuseOps`?** No. `FuseOps` will be run along with all the other Relay passes on the TVM + partitions (ie all Relay expressions not partitioned onto a BYOC backend). +- **Are you deprecating the BYOC `partition_for_` functions?** No. Collage does not yet have a way to handle + any global passes invoked before partitioning in those functions. Those functions are still the best approach for + users who cannot tolerate long search/tuning times. +- **Can I use Collage for optimizing layout? Device placement? Quantization strategy?** No. Collage only explores + partitionings, and cannot currently explore rewrites. 
Though Collage could allow sub-graphs to be rewritten as part of
+  a partitioning choice (eg to insert `device_copy` nodes on inputs and outputs), there's little utility to doing so
+  since Collage won't be able to measure the effect of those rewrites on the overall model latency after further
+  downstream passes (eg to collapse unnecessary `device_copy` nodes). These sorts of global optimization problems can be
+  tackled by additional analysis passes before `CollagePartitioner`.
+- **Won't this increase tuning time?** Yes. Collage will explore significantly more candidate partitions, and for the
+  TVM backend the resulting kernels will themselves require schedule tuning.
+- **Does this clash with the UMA proposal?** No. Though Collage takes over partitioning, it can still greatly benefit
+  from the UMA proposal to better organize all the BYOC extension points. Any changes made by the UMA project should be
+  easy to account for in Collage.
+- **Why replace the existing `MergeComposite`/`AnnotateTarget`/`MergeCompilerRegions`/
+  `PartitionGraph` passes?** Those passes don't allow us to represent the search space over all partitionings.
+- **Why not just build on `DFPattern`s instead of introducing the PartitionRule family?** We actually started in that
+  direction but found the complexity overwhelming. We believe it's best to keep `DFPattern`s focussed on the simple and
+  common case of deterministically matching specific sub-expressions.
+- **Why should partitioning have anything to do with fusion?** For efficiency Collage should only explore candidate
+  partitions which roughly match kernel boundaries. This means Collage's partitioning rules need to roughly predict the
+  fusion rules of each backend, TVM included.
+
+# Appendix I: Representing Collage 'backends'
+
+The paper introduced an explicit representation for 'backends'.
In this design we've chosen to merge this notion back
+into the existing `Target` machinery:
+
+- All the backends we know of depend on a family of `Target`s. For example, `TensorRT` obviously only applies to
+  CUDA targets, `DNNL` only applies to CPU targets, and so on.
+- So it seems most natural to just extend `TargetKind`s with the ability to specify a particular BYOC toolchain, and
+  allow the user to supply as many `Target`s as needed to cover all the BYOC toolchains they'd like to include in the
+  Collage search.
+- There are then two subtleties which are easy to handle:
+  - A `Target` which is specific about its BYOC toolchain should be considered a refinement of the same `Target`
+    without any such detail.
+  - The user may supply multiple `Target`s for the same `DLDeviceType`. There are a few places in device planning where
+    the `DLDeviceType` to `Target` mapping needs to choose the least-refined `Target`.
diff --git a/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png b/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png
new file mode 100644
index 00000000..4af6497e
Binary files /dev/null and b/rfcs/assets/0062/dataflow_graphs_and_sub_graphs.png differ
diff --git a/rfcs/assets/0062/optimal_placement.png b/rfcs/assets/0062/optimal_placement.png
new file mode 100644
index 00000000..fbacff26
Binary files /dev/null and b/rfcs/assets/0062/optimal_placement.png differ
diff --git a/rfcs/assets/0062/partition_rules.png b/rfcs/assets/0062/partition_rules.png
new file mode 100644
index 00000000..2efd15c0
Binary files /dev/null and b/rfcs/assets/0062/partition_rules.png differ
diff --git a/rfcs/assets/0062/search_graph.png b/rfcs/assets/0062/search_graph.png
new file mode 100644
index 00000000..50ac85c1
Binary files /dev/null and b/rfcs/assets/0062/search_graph.png differ