** Collage v2 sketch **
- VLOG in vm runner
- lints
- Get test_pass_collage_partition.py going
- one more rollback
- fix relay.collage ffi prefix.
- zap all unnecessary changes
- test (and minor cleanup) of CandidatePartition::EstimateCost
- More partition rule tests
- tuple arg test
- Test ByKind CombinerRule
- Move the TOpPattern attributes from Python to C++ so visible to C++ unit tests.
- wibble
- wibble
- starting to add CombinerRule unit tests
- sync with mbs-collage-subgraph changes
- rebase
- sync
- Clarify dataflow_graph.expr() vs expr constraints
- Beef up test_sub_graph
- Polish
- False alarm, reverting unnecessary const fiddles
- Bad merge, still have bug with missing const.
- Fix rebase
- Prepare for rebase
- Move CaptureIndexInSpans to generic tvm.relay.transform
- Fix test_sub_graph.py unit tests
- Make PartitionSpecs 1:1 with Targets
- Fix tests
- Finish merging Matthew's changes
- First pass merging Matthew's changes
- finish fixing lints
- test_tensorrt.py runs
- some lint fixes while waiting
- test annotation fiddles, disable pytorch test
- fix constant handling
- update tests for new API
- Switch TensorRT BYOC integration to IRModule-at-a-time
- [bug] index out of range
- don't need InferTypeExpr
- revert unnecessary changes
- revert unnecessary changes
- fix accumulate bug
- sync with 11481
- Eta-expand tuple args in candidate partitions
  (so measurement does not need to worry about
  constructing tuple arguments)
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String ----> kExtern:Bool
- Host glitch if PlanDevices run before CollagePartition
- Fix unit test
- Make load_static_library first class python func
- Get CUTLASS going on graph executor as well as vm
- Include export_library in estimate_seconds
- Rollback DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it
  This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from 'cutlass' external codegen target.
- Revert false starts with cutlass handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of
  function at a time.
- Save DSOLibraries by contents rather than by reference.
- futzing with libraries
- revert unnecessary cutlass changes
- starting unit test for dsolibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Woops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after the optimal
  partitioning is applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig
  (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only
  (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually checkin 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- checkin tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search
  (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec
  (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach
  Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost
  Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower': even if a candidate fires we may still need to consider
  leaving the node behind for the VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against
  Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
mbs-octoml committed Jul 13, 2022
1 parent d3b608e commit 2218e79
Showing 22 changed files with 6,891 additions and 8 deletions.
22 changes: 19 additions & 3 deletions python/tvm/autotvm/task/dispatcher.py
@@ -58,6 +58,11 @@ class DispatchContext(object):
    def __init__(self):
        self._old_ctx = DispatchContext.current

    # TODO(mbs): Hack for Collage demo: Allow cache query
    # DO NOT SUBMIT
    def contains(self, target, workload):
        raise NotImplementedError()

    def query(self, target, workload):
        """
        Query the context to get the specific config for a template.
@@ -297,8 +302,10 @@ def load(self, records):
        counter = 0
        for inp, res in joint_records:
            counter += 1
            if res.error_no != 0:
                continue
            # TODO(mbs): Hack for Collage demo: Cache the error so don't re-tune
            # DO NOT SUBMIT
            # if res.error_no != 0:
            #     continue

            # use target keys in tvm target system as key to build best map
            for k in inp.target.keys:
@@ -320,7 +327,16 @@
                if np.mean(other_res.costs) > np.mean(res.costs):
                    best_by_model[key] = (inp, res)

        logger.debug("Finish loading %d records", counter)
        # TODO(mbs): Hack for Collage demo: Too verbose
        # DO NOT SUBMIT
        # logger.info("Finished loading %d records", counter)

    # TODO(mbs): Hack for Collage demo: Allow cache query
    # DO NOT SUBMIT
    def contains(self, target, workload):
        # logger.info(
        #     f"look for match with {target} and {workload} with {len(self._best_user_defined)} user-defined, {len(self.best_by_model)} model and {len(self.best_by_targetkey)} target entries")
        return self._query_inside(target, workload) is not None

    def _query_inside(self, target, workload):
        if target is None:
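
For context, a minimal sketch of how the new contains hook is intended to be queried before re-tuning (the log path and task variable are illustrative; is_already_tuned in collage_partitioner.py below is the real call site):

dispatch_context = tvm.autotvm.task.ApplyHistoryBest("autotvm.tuning.log")
already_tuned = dispatch_context.contains(task.target, task.workload)  # True if a record exists
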
1 change: 1 addition & 0 deletions python/tvm/relay/__init__.py
@@ -32,6 +32,7 @@

from . import transform
from . import analysis
from . import collage
from .build_module import build, create_executor, optimize
from .transform import build_config
from . import debug
18 changes: 18 additions & 0 deletions python/tvm/relay/collage/__init__.py
@@ -0,0 +1,18 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# pylint: disable=wildcard-import
from .collage_partitioner import *
21 changes: 21 additions & 0 deletions python/tvm/relay/collage/_ffi_api.py
@@ -0,0 +1,21 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""FFI APIs for the Collage partitioner."""
import tvm._ffi


tvm._ffi._init_api("relay.collage", __name__)
237 changes: 237 additions & 0 deletions python/tvm/relay/collage/collage_partitioner.py
@@ -0,0 +1,237 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Search for optimal partitionings over Relay models."""

import tvm
import numpy as np
from tvm._ffi.registry import register_func, register_object
from tvm.runtime import Object
import logging
import os
import shutil
import math
import tempfile

from . import _ffi_api

AUTOTVM_NUM_TRIALS = 2000
AUTOTVM_EARLY_STOPPING = 600
MEASURE_NUMBER = 20
MEASURE_REPEAT = 5
WARMUP_MIN_REPEAT_MS = 250
TIMEOUT = 10


@register_object("relay.collage.CostEstimator")
class CostEstimator(Object):
    """CostEstimator class"""

    def __init__(self):
        self.__init_handle_by_constructor__(_ffi_api.CostEstimator)


@register_object("relay.collage.MockEstimator")
class MockEstimator(Object):
    """MockEstimator class"""

    def __init__(self, target_costs):
        self.__init_handle_by_constructor__(_ffi_api.MockEstimator, target_costs)


def arg_for(type, device):
    """Returns a test argument of type on device"""
    assert isinstance(type, tvm.ir.TensorType)
    return tvm.nd.array(
        np.random.uniform(-1.0, 1.0, size=type.concrete_shape).astype(type.dtype), device=device
    )


def is_already_tuned(task, log_filename):
    """Returns True if we already have a tuning record for task in the tuning logs in log_filename."""
    if not os.path.exists(log_filename):
        return False

    dispatch_context = tvm.autotvm.task.ApplyHistoryBest(log_filename)
    return dispatch_context.contains(task.target, task.workload)


def extract_autotvm_tasks(mod, target):
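    """Returns the autotvm tasks extracted from mod for target."""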
    return tvm.autotvm.task.extract_from_program(mod, target=target, params=None)


def optional_tuning_records(log_filename):
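    """Returns a dispatch context for the records in log_filename, else a fallback context."""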
    if log_filename == "" or not os.path.exists(log_filename):
        return tvm.autotvm.task.FallbackContext()
    else:
        return tvm.autotvm.task.ApplyHistoryBest(log_filename)


def tune_autotvm_tasks(tasks, log_filename):
"""Appends to log_filename the best strategies for tasks"""
if len(tasks) == 0:
return

measure_option = tvm.autotvm.measure_option(
builder=tvm.autotvm.LocalBuilder(timeout=TIMEOUT),
runner=tvm.autotvm.LocalRunner(
number=MEASURE_NUMBER, repeat=MEASURE_REPEAT, timeout=TIMEOUT, min_repeat_ms=0
),
)

logging.info(
f"Using autotvm tuning for {len(tasks)} tasks with {AUTOTVM_NUM_TRIALS} trials, logging to {log_filename}"
)

# create tmp log file, starting with contents from existing log file
tmp_log_filename = log_filename + ".tmp"
if os.path.exists(tmp_log_filename):
os.remove(tmp_log_filename)
if os.path.exists(log_filename):
logging.info(f"Copying existing log {log_filename} to {tmp_log_filename}")
shutil.copy(log_filename, tmp_log_filename)

for i, task in enumerate(reversed(tasks)):
prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
logging.info(f"Considering task {task.name} {prefix}")
if is_already_tuned(task, tmp_log_filename):
logging.info(f"Re-using existing record for {task.name}")
continue

logging.info(f"Using autotvm to tune {task.name}")
tuner_obj = tvm.autotvm.tuner.XGBTuner(task, loss_type="rank")
if os.path.exists(tmp_log_filename):
tuner_obj.load_history(tvm.autotvm.record.load_from_file(tmp_log_filename))

# do tuning
n_trial = min(AUTOTVM_NUM_TRIALS, len(task.config_space))
tuner_obj.tune(
n_trial=n_trial,
early_stopping=AUTOTVM_EARLY_STOPPING,
measure_option=measure_option,
callbacks=[
tvm.autotvm.callback.progress_bar(n_trial, prefix=prefix),
tvm.autotvm.callback.log_to_file(tmp_log_filename),
],
)

# pick best records and copy back to main log file
tvm.autotvm.record.pick_best(tmp_log_filename, log_filename)
os.remove(tmp_log_filename)

logging.info("Done with autotvm tuning")


def vm_estimate_seconds(device, vm, func_name, args):
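    """Returns the benchmark result for func_name in vm on device, after a single warmup run."""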
    # Warmup
    vm.benchmark(
        device, repeat=1, number=1, min_repeat_ms=WARMUP_MIN_REPEAT_MS, func_name=func_name, **args
    )
    # For realz this time
    return vm.benchmark(
        device,
        repeat=MEASURE_REPEAT,
        number=MEASURE_NUMBER,
        min_repeat_ms=0,
        func_name=func_name,
        **args,
    )


@register_func("tvm.relay.collage.estimate_seconds")
def estimate_seconds(mod, target, needs_tvm_tuning):
"""Returns the mean execution time of "main" in mod on target with params. The module
may contain "Primitive" functions, possibly with "Compiler" attributes."""
device = tvm.device(target.kind.device_type)

try:
# Build the module.
logging.info("Compiling module to estimate")
exe = tvm.relay.vm.compile(mod, target)
except RuntimeError as e:
# A build failure indicates the partition is not supported.
# eg trying to build an nn.batch_norm on GPU, which has no schedule since we assume it
# is only ever used with a tuple projection which is rewritten away.
logging.info(f"Assigning module infinite cost since unable to build: {e}")
return math.inf

# Finalize compilation
tmp_dir = tempfile.mkdtemp()
code, lib = exe.save()
lib_path = os.path.join(tmp_dir, "library.so")
# TODO(mbs): Avoid nvcc dependency?
lib.export_library(lib_path, workspace_dir=tmp_dir, cc="nvcc")
lib = tvm.runtime.load_module(lib_path)
exe = tvm.runtime.vm.Executable.load_exec(code, lib)

# Benchmark the module.
vm = tvm.runtime.vm.VirtualMachine(exe, device)
func_name = "main"
main_args = {v.name_hint: arg_for(v.checked_type, device) for v in mod[func_name].params}
logging.info("Benchmarking module to estimate")
profile = vm_estimate_seconds(device, vm, func_name, main_args)
logging.info(f"profile: {profile}")
return profile.median # seconds


make_labelled_dfpattern_partition_rule = tvm._ffi.get_global_func(
    "relay.collage.make_labelled_dfpattern_partition_rule"
)
make_labelled_dfpattern_partition_rule_with_predicate = tvm._ffi.get_global_func(
    "relay.collage.make_labelled_dfpattern_partition_rule_with_predicate"
)
make_pattern_byoc_partition_rule = tvm._ffi.get_global_func(
    "relay.collage.make_pattern_byoc_partition_rule"
)


def make_labelled_dfpattern_partition_rule_wrapper(compiler, tuple):
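    """Returns a partition rule for a pattern table entry, with or without a predicate."""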
    if len(tuple) == 2:
        rule_name, dataflow_pattern = tuple
        return make_labelled_dfpattern_partition_rule(compiler, rule_name, dataflow_pattern)
    else:
        rule_name, dataflow_pattern, predicate = tuple
        return make_labelled_dfpattern_partition_rule_with_predicate(
            compiler, rule_name, dataflow_pattern, predicate
        )


@register_func("tvm.relay.collage.make_byoc_partition_rule")
def make_byoc_partition_rule(compiler):
"""Returns the PartitionRule for BYOC compiler"""
pattern_table = tvm.relay.op.contrib.get_pattern_table(compiler)
assert (
pattern_table is not None
), f"No pattern table entry was found for BYOC compiler {compiler}"
logging.info(
f"Converting {len(pattern_table)} rules for {compiler} for use in pattern style BYOC lowering/codegen"
)
sub_rules = [
make_labelled_dfpattern_partition_rule_wrapper(compiler, tuple) for tuple in pattern_table
]
return make_pattern_byoc_partition_rule(compiler, sub_rules)
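
# A hedged usage sketch: the "tensorrt" compiler name below is illustrative and assumes its
# pattern table has already been registered, e.g.
#   trt_rule = make_byoc_partition_rule("tensorrt")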


def autotvm_tune_module(mod, target, log_filename):
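    """Tunes any autotvm tasks extracted from mod for target, appending records to log_filename."""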
    if log_filename == "":
        logging.info("Not tuning with autotvm since disabled")
        return
    # Extract and tune any TVM kernels. BYOC partitions will have no tasks extracted.
    logging.info("Extracting tasks from overall module")
    tasks = extract_autotvm_tasks(mod, target)
    logging.info(f"Auto-tuning {len(tasks)} tasks from overall module")
    tune_autotvm_tasks(tasks, log_filename)
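
For orientation, a minimal sketch of how the helpers above are intended to be driven ahead of partitioning; the module, target and log path are illustrative:

mod = ...  # a Relay IRModule to be partitioned
target = tvm.target.Target("cuda")
log_filename = "collage_autotvm.tuning.log"  # illustrative path
autotvm_tune_module(mod, target, log_filename)  # tune any TVM-scheduled kernels first
with optional_tuning_records(log_filename):
    # estimate/build with the tuned records applied, e.g. via the CollagePartition pass below
    ...
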
23 changes: 23 additions & 0 deletions python/tvm/relay/transform/transform.py
@@ -1461,3 +1461,26 @@ def InlineCompilerFunctionsBoundTo(global_vars):
        The pass.
    """
    return _ffi_api.InlineCompilerFunctionsBoundTo(global_vars)


def CollagePartition(config, cost_estimator=None):
"""Partition the bodies of all functions according to the available targets so as to
minimize model latency. See https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md.

Parameters
----------
config : CompilationConfig
The available targets.
cost_estimator : CostEstimator, optional
The custom cost estimator to use for costing each candidate partition.

Returns
-------
ret : tvm.transform.Pass
The pass.

"""
if cost_estimator is None:
cost_estimator = relay.collage.CostEstimator()

return _ffi_api.CollagePartition(config, cost_estimator)
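
A hedged usage sketch of the new pass; how the CompilationConfig is constructed is outside this diff and shown only schematically:

config = ...  # a CompilationConfig describing the available targets
with tvm.transform.PassContext(opt_level=3):
    mod = tvm.relay.transform.CollagePartition(config)(mod)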