Commit
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String -> kExtern:Bool
- Fix host glitch when PlanDevices is run before CollagePartition
- Fix unit test
- Make load_static_library a first-class Python func
- Get CUTLASS going on the graph executor as well as the VM
- Include export_library in estimate_seconds
- Roll back DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it. This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from the 'cutlass' external codegen target.
- Revert false starts with CUTLASS handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of function-at-a-time.
- Save DSOLibraries by contents rather than by reference.
- Futzing with libraries
- Revert unnecessary CUTLASS changes
- Start unit test for DSOLibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- Cleanup before rebase
- Use the 'regular' target when building, not the external codegen target
- Tuned for -libs=cudnn
- Tune before Collage, not during
- Bring over target changes
- Fix GetSpecName
- Try again on Python target changes, this time leaving check_and_update_host_consist unchanged
- Revert Python target changes to try again less aggressively
- A few other cleanups
- Switch to the 'external codegen targets' style
- Whoops, run just_tvm after Collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since the simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining the final module
- Fix caching, for real this time.
- More nn.conv2d AutoTVM tuning records, but still not done with resnext50_32x4d.
- OutlineExternalFunction both when preparing to estimate cost and after the optimal partitioning is applied.
- Use fp16 in TensorRT only if the model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down a weird failure in conv2d compute.
- Switch TensorRT to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to the menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig (still needs to be pushed into the Python side)
- Save/load DSO libraries (needed for CUTLASS with separated run)
- Move models into a separate file
- gpt2_extract_16 and AutoTVM tuning log
- Handle missing tuning log files
- fp16 support in scalars and the TensorRT runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of the final optimal partitioning and the state of all candidates
- Fix handling of tuples and the InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so we can quickly compare specs
- BUG: Benchmark the partitioned rather than the original model!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only (was double-tuning every CUTLASS kernel).
- Make sure the cuDNN pattern table is registered
- Enable cuDNN, get rid of support for op-predicate based BYOC integrations
- Enable cuBLAS
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with the new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement the critical-nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so we can cache the downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- Rename tuning log
- Support for int64 literals
- Switch GPT2 to the plain model
- Fix library clobbering issue for CUTLASS
- Actually check in the 'built in' tuning log (covers MNIST & GPT2 only)
- Trying to debug GPT2
- Update TargetKind attribute name
- Working through GPT2 issues
- Check in tuning records for MNIST (with a hack to not retry failed winograd)
- AutoTVM tuning disabled if log file is empty (default)
- AutoTVM tuning during search working
- Tune during search (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to the meta_schedule interface)
- Add 'host' as a first-class partitioning spec (avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to the very start of the VM compiler flow (not changing legacy)
- Fix bugs etc. with the new SubGraph::Rewrite approach. Ready for updating the RFC to focus on partitioning instead of fusion.
- Working again after the partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make the nesting in "Primitive" an explicit transform
- Respect existing target constraints from the device planner
- Make 'compiler' and 'fusion_rule' attributes available on all target kinds
- Moved design to tvm-rfcs, apache/tvm-rfcs#62
- Incorporate comments
- Avoid repeated fusion
- Fix TRT type checking
- Better logs
- Pretty print primitive rules
- Fix TensorRT
- Multiple targets per spec
- Don't extract the candidate function until its cost is needed. Need to bring CombineByPrimitives back under control since the depth limit was lost.
- Cleaned up fusion rule names
- Added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower': even if a candidate fires, we may still need to consider leaving the node behind for the VM (especially for constants).
- Starting example
- Finished all the dd sections
- Documentation checkpoint
- Docs checkpoint
- More design
- Starting on dd
- Runs MNIST with TVM+CUTLASS+TRT
- CUTLASS function-at-a-time build
- Need to account for build_cutlass_kernels_vm
- Move CUTLASS tuning into the relay.ext.cutlass path to avoid a special case
- Add utils
- Don't fuse non-scalar constants for the TVM target.
- Stuck on CUDA memory failure in conv2d, suspect bug in main
- Where do the CUTLASS attrs come from?
- Running, roughly
- Pretty printing, signs of life
- Wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- Naive CombineByKindFusionRule, just to see what we're up against. Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- Preparing to mimic FuseOps
- Rework SubGraph to use IndexSet
- Rough cut at MaximalFusion
- Split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- Top-down iterative handling of sub-sub-graphs
- About to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- Partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add the no_taps check
- Dataflow rework, preparing for SubGraph::IsValid
- Explode into subdir
- MNIST with one fusion rule (which fires twice) working
- Switch to CandidateKernelIndex
- Confirm we can measure 'pre-annotated' primitive functions
- Checkpoint
- Stuff
- More sketching
- Dominator logging