** Collage v2 sketch **
- VLOG in vm runner
- lints
- Get test_pass_collage_partition.py going
- one more rollback
- fix relay.collage ffi prefix.
- zap all unnecessary changes
- test (and minor cleanup) of CandidatePartition::EstimateCost
- More partition rule tests
- tuple arg test
- Test ByKind CombinerRule
- Move the TOpPattern attributes from Python to C++ so visible to C++ unit tests.
- wibble
- wibble
- starting to add CombinerRule unit tests
- sync with mbs-collage-subgraph changes
- rebase
- sync
- Clarify dataflow_graph.expr() vs expr constraints
- Beef up test_sub_graph
- Polish
- False alarm, reverting unnecessary const fiddles
- Bad merge, still have bug with missing const.
- Fix rebase
- Prepare for rebase
- Move CaptureIndexInSpans to generic tvm.relay.transform
- Fix test_sub_graph.py unit tests
- Make PartitionSpecs 1:1 with Targets
- Fix tests
- Finish merging Matthew's changes
- First pass merging Matthew's changes
- finish fixing lints
- test_tensorrt.py runs
- some lint fixes while waiting
- test annotation fiddles, disable pytorch test
- fix constant handling
- update tests for new API
- Switch TensorRT BYOC integration to IRModule-at-a-time
- [bug] index out of range
- don't need InferTypeExpr
- revert unnecessary changes
- revert unnecessary changes
- fix accumulate bug
- sync with 11481
- Eta-expand tuple args in candidate partitions
  (so measurement does not need to worry about
  constructing tuple arguments)
- Polish compiler_function_utils for splitting out
- Mark functions as extern.
- Get rid of relay.ext.cutlass
- kExternalSymbol:String ----> kExtern:Bool
- Host glitch if PlanDevices run before CollagePartition
- Fix unit test
- Make load_static_library first class python func
- Get CUTLASS going on graph executor as well as vm
- Include export_library in estimate_seconds
- Rollback DSOLibrary changes.
- Add StaticLibraryNode and switch CUTLASS to use it
  This avoids the crazy serialize/deserialize/load hackery, which I'll now remove.
- Get running again
- CUTLASS picks up all options from 'cutlass' external codegen target.
- Revert false starts with cutlass handling
- Get CUTLASS going with program-at-a-time tuning and compilation instead of
  function at a time.
- Save DSOLibraries by contents rather than by reference.
- futzing with libraries
- revert unnecessary cutlass changes
- starting unit test for dsolibrary save
- Prepare scalar changes for PR.
- Eager candidate cost measurement.
- More conv2d_cudnn.cuda training records.
- cleanup before rebase
- Use 'regular' target when build, not external codegen target
- Tuned for -libs=cudnn
- Tune before collage not during
- Bring over target changes
- Fix GetSpecName
- Try again on python target changes, this time leave check_and_update_host_consist unchanged
- Revert python target changes to try again less aggressively
- Few other cleanups
- Switch to 'external codegen targets' style
- Woops, run just_tvm after collage to pick up tuning logs
- Finish tuning for rtx3070
- Run them all!
- Update tuning logs
- Share global vars in the candidate function cache
- Finished tuning mobilenet, started on resnet50.
- Include model name in logs to make sure we don't get anything mixed up
- Drop -arch=sm_80
- Fix MaxCoalesce
- Attach external_symbol to lifted functions
- Add missing node registration, but leave VisitAttrs empty for now
- Make MaxCoalesce as aggressive as possible, since simple impl did not handle sharing.
- Finish tuning resnext50
- Improve coalescing
- Account for coalesced functions when outlining final module
- Fix caching, for real this time.
- More nn.conv2d autotvm tuning records, but still not done with resnext50_32_4d.
- OutlineExternalFunction both when preparing to estimate cost and after the optimal
  partitioning is applied.
- Use fp16 in TensorRT only if model's 'main_dtype' is float16.
- Fix CostEstimator caching issue
- More Target cleanup (while waiting for tuning runs)
- Better logging of candidates
- Support export to ONNX
- Fix merge
- Part-way through tuning for mobilenet.
- Add resnext50_32x4d
- Lift all "Compiler" functions before estimating to ensure no Relay passes are run on them
- Still trying
- Trying to track down weird failure in conv2d compute.
- Switch tensorrt to be fully pattern & composite function based
- Combiner rule for tuple projection
- Allow build to fail in estimate_seconds
- Add mobilenetv2 and resnet50v2 to menagerie
- Update CompilationConfig to handle target refinement
- Nuke remaining uses of TargetMap in favor of CompilationConfig
  (still needs to be pushed into python side)
- Save/Load dso libraries (needed for Cutlass with separated run)
- Move models into separate file
- gpt2_extract_16 and autotvm tuning log
- Handle missing tuning log files
- fp16 support in scalars and the tensorrt runtime.
- Wrap runner in nsys nvprof if requested
- Enforce strict compile/run time separation in preparation for profiling
- Better logging of final optimal partitioning and state of all candidates
- Fix handling of tuples and InlineComposites fixup pass.
- Fix TensorRT pattern bugs
- Pass max_max_depth via PassContext
- Better logging so can quickly compare specs
- BUG: Benchmark the partitioned rather than original model!!!
- Use median instead of mean
- Back to GPT2
- Make sure all function vars have a type
- Don't extract tasks if estimating BYOC-only
  (Was double-tuning every cutlass kernel).
- Make sure cudnn pattern table is registered
- Enable cudnn, get rid of support for op-predicate based BYOC integrations
- Enable cublas
- And yet another go at pruning unnecessary candidates.
- Another go at pruning unnecessary candidates
- Fix CompositePartitionRule use
- Fix a few bugs with new TensorRT pattern-based integration
- Rework RemoveSubCandidatesCombinerRule for soundness
- Better logging
- Bug fixes
- Implement critical nodes idea for avoiding obviously unnecessary candidates
- Promote DataflowGraph from alias to class so can cache downstream index set
- Quick check to avoid unioning candidates which would create a cycle
- Hoist out CandidatePartitionIndex and add rules to avoid small candidates subsumed by containing candidates
- GetFunction can legitimately return nullptr
- rename tuning log
- Support for int64 literals
- Switch GPT2 to plain model
- Fix library clobbering issue for cutlass
- actually checkin 'built in' tuning log (covers mnist & gpt2 only)
- trying to debug gpt2
- Update TargetKind attribute name
- working through gpt2 issues
- checkin tuning records for MNIST (with hack to not retry failed winograd)
- Autotvm tuning disabled if log file empty (default)
- Autotvm tuning during search working
- tune during search
  (but does not load tuned records after search!)
- About to add tuning to estimate_seconds
- Split out the combiner rules & make them FFI friendly
- Rework comments
- Estimate IRModule instead of Function (closer to meta_schedule iface)
- Add 'host' as first-class partitioning spec
  (Avoids special casing for the 'leave behind for the VM' case)
- Move CollagePartitioner to very start of VM compiler flow (not changing legacy)
- Fix bugs etc with new SubGraph::Rewrite approach
  Ready for updating RFC to focus on partitioning instead of fusion.
- Working again after partition<->fusion split.
- Add PrimitivePartitionRule
- Refactor SubGraph Extract/Rewrite
- Rename kernel->partition, fusion->partition
- Next: make nesting in "Primitive" an explicit transform
- respect existing target constraints from device planner
- make 'compiler' and 'fusion_rule' attributes avail on all target kinds
- moved design to tvm-rfcs, apache/tvm-rfcs#62
- incorporate comments
- avoid repeated fusion
- fix trt type checking
- better logs
- pretty print primitive rules
- fix tensorrt
- multiple targets per spec
- don't extract candidate function until need cost
  Need to bring CombineByPrimitives back under control since lost depth limit.
- cleaned up fusion rule names
- added 'fuse anything touching' for BYOC
- Finish dd example
- Add notion of 'MustLower': even if a candidate fires we may still need to consider
  leaving the node behind for the VM (especially for constants).
- starting example
- finished all the dd sections
- documentation checkpoint
- docs checkpoint
- more design
- starting on dd
- runs MNIST with TVM+CUTLASS+TRT
- cutlass function-at-a-time build
- need to account for build_cutlass_kernels_vm
- move cutlass tuning into relay.ext.cutlass path to avoid special case
- add utils
- don't fuse non-scalar constants for tvm target.
- stuck on cuda mem failure on conv2d, suspect bug in main
- where do the cutlass attrs come from?
- running, roughly
- pretty printing, signs of life
- wire things up again
- Switch SubGraph and CandidateKernel to TVM objects
- naive CombineByKindFusionRule, just to see what we're up against
  Will switch to Object/ObjectRef for SubGraph and CandidateKernel to avoid excess copying.
- preparing to mimic FuseOps
- rework SubGraph to use IndexSet
- rough cut at MaximalFusion
- split SubGraph and IndexSet in preparation for caching input/output/entry/exit sets in SubGraph.
- top-down iterative handling of sub-sub-graphs
- about to give up on one-pass extraction with 'sub-sub-graphs'
- Add notion of 'labels' to sub-graphs
- Rework FusionRules to be more compositional
- partway through reworking fusion rules, broken
- SubGraph::IsValid, but still need to add no_taps check
- dataflow rework, preparing for SubGraph::IsValid
- explode into subdir
- mnist with one fusion rule (which fires twice) working
- switch to CandidateKernelIndex
- Confirm can measure 'pre-annotated' primitive functions
- checkpoint
- stuff
- more sketching
- dominator logging
mbs-octoml committed Jul 13, 2022
1 parent d3b608e commit 2218e79
Showing 22 changed files with 6,891 additions and 8 deletions.
22 changes: 19 additions & 3 deletions python/tvm/autotvm/task/dispatcher.py
@@ -58,6 +58,11 @@ class DispatchContext(object):
    def __init__(self):
        self._old_ctx = DispatchContext.current

    # TODO(mbs): Hack for Collage demo: Allow cache query
    # DO NOT SUBMIT
    def contains(self, target, workload):
        raise NotImplementedError()

    def query(self, target, workload):
        """
        Query the context to get the specific config for a template.
@@ -297,8 +302,10 @@ def load(self, records):
        counter = 0
        for inp, res in joint_records:
            counter += 1
            if res.error_no != 0:
                continue
            # TODO(mbs): Hack for Collage demo: Cache the error so don't re-tune
            # DO NOT SUBMIT
            # if res.error_no != 0:
            #     continue

            # use target keys in tvm target system as key to build best map
            for k in inp.target.keys:
@@ -320,7 +327,16 @@
                if np.mean(other_res.costs) > np.mean(res.costs):
                    best_by_model[key] = (inp, res)

        logger.debug("Finish loading %d records", counter)
        # TODO(mbs): Hack for Collage demo: Too verbose
        # DO NOT SUBMIT
        # logger.info("Finished loading %d records", counter)

    # TODO(mbs): Hack for Collage demo: Allow cache query
    # DO NOT SUBMIT
    def contains(self, target, workload):
        # logger.info(
        #     f"look for match with {target} and {workload} with {len(self._best_user_defined)} user-defined, {len(self.best_by_model)} model and {len(self.best_by_targetkey)} target entries")
        return self._query_inside(target, workload) is not None

    def _query_inside(self, target, workload):
        if target is None:
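
For context, a minimal sketch of how the new contains hook is intended to be queried before re-tuning (the log path and task variable are illustrative; is_already_tuned in collage_partitioner.py below is the real call site):

dispatch_context = tvm.autotvm.task.ApplyHistoryBest("autotvm.tuning.log")
already_tuned = dispatch_context.contains(task.target, task.workload)  # True if a record exists
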
1 change: 1 addition & 0 deletions python/tvm/relay/__init__.py
@@ -32,6 +32,7 @@

from . import transform
from . import analysis
from . import collage
from .build_module import build, create_executor, optimize
from .transform import build_config
from . import debug
18 changes: 18 additions & 0 deletions python/tvm/relay/collage/__init__.py
@@ -0,0 +1,18 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# pylint: disable=wildcard-import
from .collage_partitioner import *
21 changes: 21 additions & 0 deletions python/tvm/relay/collage/_ffi_api.py
@@ -0,0 +1,21 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""FFI APIs for the Collage partitioner."""
import tvm._ffi


tvm._ffi._init_api("relay.collage", __name__)
237 changes: 237 additions & 0 deletions python/tvm/relay/collage/collage_partitioner.py
@@ -0,0 +1,237 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

"""Search for optimal partitionings over Relay models."""

import tvm
import numpy as np
from tvm._ffi.registry import register_func, register_object
from tvm.runtime import Object
import logging
import os
import shutil
import math
import tempfile

from . import _ffi_api

AUTOTVM_NUM_TRIALS = 2000
AUTOTVM_EARLY_STOPPING = 600
MEASURE_NUMBER = 20
MEASURE_REPEAT = 5
WARMUP_MIN_REPEAT_MS = 250
TIMEOUT = 10


@register_object("relay.collage.CostEstimator")
class CostEstimator(Object):
    """CostEstimator class"""

    def __init__(self):
        self.__init_handle_by_constructor__(_ffi_api.CostEstimator)


@register_object("relay.collage.MockEstimator")
class MockEstimator(Object):
    """MockEstimator class"""

    def __init__(self, target_costs):
        self.__init_handle_by_constructor__(_ffi_api.MockEstimator, target_costs)


def arg_for(type, device):
    """Returns a test argument of type on device"""
    assert isinstance(type, tvm.ir.TensorType)
    return tvm.nd.array(
        np.random.uniform(-1.0, 1.0, size=type.concrete_shape).astype(type.dtype), device=device
    )


def is_already_tuned(task, log_filename):
    """Returns True if we already have a tuning record for task in the tuning logs in log_filename."""
    if not os.path.exists(log_filename):
        return False

    dispatch_context = tvm.autotvm.task.ApplyHistoryBest(log_filename)
    return dispatch_context.contains(task.target, task.workload)


def extract_autotvm_tasks(mod, target):
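    """Returns the autotvm tasks extracted from mod for target."""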
    return tvm.autotvm.task.extract_from_program(mod, target=target, params=None)


def optional_tuning_records(log_filename):
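    """Returns a dispatch context for the records in log_filename, else a fallback context."""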
    if log_filename == "" or not os.path.exists(log_filename):
        return tvm.autotvm.task.FallbackContext()
    else:
        return tvm.autotvm.task.ApplyHistoryBest(log_filename)


def tune_autotvm_tasks(tasks, log_filename):
"""Appends to log_filename the best strategies for tasks"""
if len(tasks) == 0:
return

measure_option = tvm.autotvm.measure_option(
builder=tvm.autotvm.LocalBuilder(timeout=TIMEOUT),
runner=tvm.autotvm.LocalRunner(
number=MEASURE_NUMBER, repeat=MEASURE_REPEAT, timeout=TIMEOUT, min_repeat_ms=0
),
)

logging.info(
f"Using autotvm tuning for {len(tasks)} tasks with {AUTOTVM_NUM_TRIALS} trials, logging to {log_filename}"
)

# create tmp log file, starting with contents from existing log file
tmp_log_filename = log_filename + ".tmp"
if os.path.exists(tmp_log_filename):
os.remove(tmp_log_filename)
if os.path.exists(log_filename):
logging.info(f"Copying existing log {log_filename} to {tmp_log_filename}")
shutil.copy(log_filename, tmp_log_filename)

for i, task in enumerate(reversed(tasks)):
prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
logging.info(f"Considering task {task.name} {prefix}")
if is_already_tuned(task, tmp_log_filename):
logging.info(f"Re-using existing record for {task.name}")
continue

logging.info(f"Using autotvm to tune {task.name}")
tuner_obj = tvm.autotvm.tuner.XGBTuner(task, loss_type="rank")
if os.path.exists(tmp_log_filename):
tuner_obj.load_history(tvm.autotvm.record.load_from_file(tmp_log_filename))

# do tuning
n_trial = min(AUTOTVM_NUM_TRIALS, len(task.config_space))
tuner_obj.tune(
n_trial=n_trial,
early_stopping=AUTOTVM_EARLY_STOPPING,
measure_option=measure_option,
callbacks=[
tvm.autotvm.callback.progress_bar(n_trial, prefix=prefix),
tvm.autotvm.callback.log_to_file(tmp_log_filename),
],
)

# pick best records and copy back to main log file
tvm.autotvm.record.pick_best(tmp_log_filename, log_filename)
os.remove(tmp_log_filename)

logging.info("Done with autotvm tuning")


def vm_estimate_seconds(device, vm, func_name, args):
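    """Returns the benchmark result for func_name in vm on device, after a single warmup run."""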
    # Warmup
    vm.benchmark(
        device, repeat=1, number=1, min_repeat_ms=WARMUP_MIN_REPEAT_MS, func_name=func_name, **args
    )
    # For realz this time
    return vm.benchmark(
        device,
        repeat=MEASURE_REPEAT,
        number=MEASURE_NUMBER,
        min_repeat_ms=0,
        func_name=func_name,
        **args,
    )


@register_func("tvm.relay.collage.estimate_seconds")
def estimate_seconds(mod, target, needs_tvm_tuning):
"""Returns the mean execution time of "main" in mod on target with params. The module
may contain "Primitive" functions, possibly with "Compiler" attributes."""
device = tvm.device(target.kind.device_type)

try:
# Build the module.
logging.info("Compiling module to estimate")
exe = tvm.relay.vm.compile(mod, target)
except RuntimeError as e:
# A build failure indicates the partition is not supported.
# eg trying to build an nn.batch_norm on GPU, which has no schedule since we assume it
# is only ever used with a tuple projection which is rewritten away.
logging.info(f"Assigning module infinite cost since unable to build: {e}")
return math.inf

# Finalize compilation
tmp_dir = tempfile.mkdtemp()
code, lib = exe.save()
lib_path = os.path.join(tmp_dir, "library.so")
# TODO(mbs): Avoid nvcc dependency?
lib.export_library(lib_path, workspace_dir=tmp_dir, cc="nvcc")
lib = tvm.runtime.load_module(lib_path)
exe = tvm.runtime.vm.Executable.load_exec(code, lib)

# Benchmark the module.
vm = tvm.runtime.vm.VirtualMachine(exe, device)
func_name = "main"
main_args = {v.name_hint: arg_for(v.checked_type, device) for v in mod[func_name].params}
logging.info("Benchmarking module to estimate")
profile = vm_estimate_seconds(device, vm, func_name, main_args)
logging.info(f"profile: {profile}")
return profile.median # seconds


make_labelled_dfpattern_partition_rule = tvm._ffi.get_global_func(
    "relay.collage.make_labelled_dfpattern_partition_rule"
)
make_labelled_dfpattern_partition_rule_with_predicate = tvm._ffi.get_global_func(
    "relay.collage.make_labelled_dfpattern_partition_rule_with_predicate"
)
make_pattern_byoc_partition_rule = tvm._ffi.get_global_func(
    "relay.collage.make_pattern_byoc_partition_rule"
)


def make_labelled_dfpattern_partition_rule_wrapper(compiler, tuple):
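    """Returns a partition rule for a pattern table entry, with or without a predicate."""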
    if len(tuple) == 2:
        rule_name, dataflow_pattern = tuple
        return make_labelled_dfpattern_partition_rule(compiler, rule_name, dataflow_pattern)
    else:
        rule_name, dataflow_pattern, predicate = tuple
        return make_labelled_dfpattern_partition_rule_with_predicate(
            compiler, rule_name, dataflow_pattern, predicate
        )


@register_func("tvm.relay.collage.make_byoc_partition_rule")
def make_byoc_partition_rule(compiler):
"""Returns the PartitionRule for BYOC compiler"""
pattern_table = tvm.relay.op.contrib.get_pattern_table(compiler)
assert (
pattern_table is not None
), f"No pattern table entry was found for BYOC compiler {compiler}"
logging.info(
f"Converting {len(pattern_table)} rules for {compiler} for use in pattern style BYOC lowering/codegen"
)
sub_rules = [
make_labelled_dfpattern_partition_rule_wrapper(compiler, tuple) for tuple in pattern_table
]
return make_pattern_byoc_partition_rule(compiler, sub_rules)
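
# A hedged usage sketch: the "tensorrt" compiler name below is illustrative and assumes its
# pattern table has already been registered, e.g.
#   trt_rule = make_byoc_partition_rule("tensorrt")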


def autotvm_tune_module(mod, target, log_filename):
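    """Tunes any autotvm tasks extracted from mod for target, appending records to log_filename."""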
    if log_filename == "":
        logging.info("Not tuning with autotvm since disabled")
        return
    # Extract and tune any TVM kernels. BYOC partitions will have no tasks extracted.
    logging.info("Extracting tasks from overall module")
    tasks = extract_autotvm_tasks(mod, target)
    logging.info(f"Auto-tuning {len(tasks)} tasks from overall module")
    tune_autotvm_tasks(tasks, log_filename)
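
For orientation, a minimal sketch of how the helpers above are intended to be driven ahead of partitioning; the module, target and log path are illustrative:

mod = ...  # a Relay IRModule to be partitioned
target = tvm.target.Target("cuda")
log_filename = "collage_autotvm.tuning.log"  # illustrative path
autotvm_tune_module(mod, target, log_filename)  # tune any TVM-scheduled kernels first
with optional_tuning_records(log_filename):
    # estimate/build with the tuned records applied, e.g. via the CollagePartition pass below
    ...
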
23 changes: 23 additions & 0 deletions python/tvm/relay/transform/transform.py
@@ -1461,3 +1461,26 @@ def InlineCompilerFunctionsBoundTo(global_vars):
        The pass.
    """
    return _ffi_api.InlineCompilerFunctionsBoundTo(global_vars)


def CollagePartition(config, cost_estimator=None):
"""Partition the bodies of all functions according to the available targets so as to
minimize model latency. See https://github.com/apache/tvm-rfcs/blob/main/rfcs/0062-collage.md.

Parameters
----------
config : CompilationConfig
The available targets.
cost_estimator : CostEstimator, optional
The custom cost estimator to use for costing each candidate partition.

Returns
-------
ret : tvm.transform.Pass
The pass.

"""
if cost_estimator is None:
cost_estimator = relay.collage.CostEstimator()

return _ffi_api.CollagePartition(config, cost_estimator)
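
A hedged usage sketch of the new pass; how the CompilationConfig is constructed is outside this diff and shown only schematically:

config = ...  # a CompilationConfig describing the available targets
with tvm.transform.PassContext(opt_level=3):
    mod = tvm.relay.transform.CollagePartition(config)(mod)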