
[DISCUSS] Pattern Language #160

Closed · 2 of 6 tasks
ganler opened this issue Jun 11, 2022 · 11 comments

Comments

@ganler
Collaborator

ganler commented Jun 11, 2022

This thread discusses the design of Relax Pattern Language.

🎯 Progress and Questions

This section is pinned to show the current progress. Please skip this section if you are new here.

Design Updates/Patches

  1. [DISCUSS] Pattern Language #160 (comment)

Implementation Progress

  • Dataflow Pattern Lang: Core Matching Features #163 implements the core functionalities of pattern matching:
    • migrated syntactic/expression matching from Relay and extended it to support Relax IR;
    • implemented graph matching that allows users to easily add constraints across expressions (mainly use-def constraints);
    • implemented user-friendly syntax sugars: >> (only used by), ^ (used by), dup, fork_to.
    • tested with real-world examples from TensorRT and TASO.
  • Pattern rewriting: the concrete pattern rewriting will be implemented while @ganler explores Collage.
  • Fuzzy matching;
  • Enhancements:
    • Handle match shape.
    • Co-exist with non-DataflowBlock.

Open Questions

  • Are there any real-world patterns that are hard or impossible to cover with the current design?
  • The Relay-style pattern syntax sugar is conveniently short (e.g., is_var means VarPattern) but could be confusing to beginners, as is<T> usually suggests a function signature of bool(...). Can we adopt a better naming practice that keeps pattern definitions short yet easy to understand? For example, how about using dp.var() instead of is_var, where dp means dataflow_pattern?

Motivation

Many existing graph-level optimizations require looking up a specific (sub-)graph pattern (e.g., Conv->Bias->ReLU) and then rewriting the graph into a more efficient one. Potential usages include:

  1. Operator fusion;
  2. Expression simplification;
  3. Dispatching sub-graphs to 3rd-party backends (e.g., Collage).

Such usages are prevalent in practice.

Relay’s pattern language matches a relay.Expr against a series of patterns. As we are now developing a pattern language for Relax, we want to adopt Relax-specific features (e.g., RuntimeDepShape) and make defining topological dataflow relations easier (not (only) the Dominator pattern), so that writing graph-level passes scales well.

Goals:

  1. The Relax Pattern Language should make it easy to define real-world patterns (examples in the next section); its eventual goal is to benefit Relax’s graph-level optimization and the integration of Collage.
  2. Our target users include Python users without a compiler engineering background (usage should be as simple as torch.fx).

Design through Examples

To better understand the examples below, it is highly recommended to read a few examples in Relay’s pattern language.

Use Dependency

Compared with Dominator analysis, use-def chain information can be more powerful, fine-grained, and lightweight in describing graph patterns.

To fuse or dispatch Conv-ReLU-Add, one may write:

# Relay's Pattern Lang
is_op("add")(is_op("relu")(is_op("conv2d")), wildcard())

Despite the concise syntax, it is actually incorrect: what if the output of relu is consumed by operators other than add? To not violate the graph semantics, we must either give up the fusion opportunity or duplicate Conv-ReLU. This is a concern in XLA (see Page 20) but not for TVM, as TVM usually disallows replication (fuse_ops) by requiring a domination relation, i.e., add must post-dominate conv and relu (see also the 2nd code block of Relay’s Pattern Matcher).

Relying on dominators, however, can be both unnecessary and insufficient for representing such patterns correctly. Here is a counter-example suggested by @sunggg:

#     conv
#     / \
# relu   sigmoid
#     \ /
#     add
#
conv = is_op("conv2d")
relu = is_op("relu")(conv)
add = is_op("add")(relu, wildcard())

add.dominates(conv, relu)
# Now if we match the pattern above, it returns `True`.
# This is not what we want as `conv` is also **used** by `sigmoid`.

# Then we spent some time and figured out that there's one rule left:
relu.dominates(conv)
# Matching returns `False` now which is correct.

# It is correct, however, incomplete:
# "relu" does not have to be the left operand -- it can be right operand
# ... conv & relu
add = is_op("add")(relu, wildcard()) | is_op("add")(wildcard(), relu)
# ... repeat dominator stuff...
# Relay's pattern matcher solves this internally but still misses cases like `And`.

Things can be much easier if we describe patterns with use dependencies.

is_op("conv").oub(is_op("relu").oub(is_op("add")))
# or 
is_op("conv") >> is_op("relu") >> is_op("add") # See `Syntax Sugar` section
  1. oub stands for “Only Used By”: A.oub(B) means B is the only consumer of A.
  2. Since B is the only consumer of A, B (immediately) post-dominates A.
  3. The oub “syntax sugar” also makes things easier, as we often don’t care about the argument position but only the dependency relation.
  4. The use-def analysis is more affordable (trivially $O(|V|)$ for SSA and Relax) than dominator analysis (a naive two-sided backtrace is $O(|V|^2)$). A minimal sketch of the check is shown below.
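
As an illustration of how cheap the use-def check can be, here is a minimal sketch (the helper and the users map are hypothetical, not part of the proposed API) of verifying "A is only used by B" given a users map built in one linear pass over a dataflow block:

from typing import Dict, List

def only_used_by(users: Dict[str, List[str]], a: str, b: str) -> bool:
    # "A is only used by B": B is the single consumer of A, which also
    # implies B (immediately) post-dominates A -- no dominator tree needed.
    return users.get(a, []) == [b]

# The conv/relu/sigmoid/add counter-example from above:
users = {"conv": ["relu", "sigmoid"], "relu": ["add"], "sigmoid": ["add"]}
assert not only_used_by(users, "conv", "relu")  # conv is also used by sigmoid
assert only_used_by(users, "relu", "add")       # relu feeds only add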

One-One Match

  • Each Expr can only be matched by one pattern object.
c = is_op("conv2d")
b0 = c > is_op("bias")
b1 = c > is_op("bias")
# This means
#    conv2d
#    /    \
# bias0   bias1

# Cannot be
#    conv2d - bias0 (match b0 & b1)
# As bias0 can only be matched once.
  • Each pattern object can only match one Expr
# The example above cannot be
#   conv2d0    conv2d1
#      |          |
#     bias0     bias1
# As `c` can only match one `Expr`.

Unfortunately, this part is not made explicit in Relay (which can lead to bugs or undefined behavior). See #160 (comment). A sketch of the intended one-one semantics is given after the example below.

# conv   conv
#   \     /
#     add
# The parallel pattern above will match a diamond expression in Relay...
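
To make the intended one-one semantics concrete, here is a hedged sketch (class and method names are made up for illustration; this is not the proposed API) of how a matcher could enforce a bijection between pattern objects and matched expressions:

class MatchMemo:
    """Hypothetical memo enforcing a one-one binding between patterns and exprs."""

    def __init__(self):
        self.pat_to_expr = {}  # each pattern object matches at most one Expr
        self.expr_to_pat = {}  # each Expr is matched by at most one pattern object

    def bind(self, pattern, expr) -> bool:
        if pattern in self.pat_to_expr:
            # `c` cannot match both conv2d0 and conv2d1 (pattern matches one Expr).
            return self.pat_to_expr[pattern] is expr
        if expr in self.expr_to_pat:
            # bias0 cannot be matched by both b0 and b1 (Expr matched once).
            return self.expr_to_pat[expr] is pattern
        self.pat_to_expr[pattern] = expr
        self.expr_to_pat[expr] = pattern
        return True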

TVM OpFusion

The three fusible patterns in TVM (Figure Source).


is_out_fusible() >> is_injective()    # out fusible -> inj
is_injective()   >> is_injective()    # inj -> inj
is_injective()   >> is_reduce()       # inj -> reduce

If we implemented this in Relay, we would need to handle argument positions for the injective operators and add dominance relations.

TensorRT CBR


is_op("conv2d").with_attr("kernel_size", (1, 1)) \
>> is_op("bias") \
>> is_op("relu")

TensorRT Parallel CBR

Parallel conv2d with the same attributes can be merged in a big kernel.


cbr0 = is_op("conv2d") >> is_op("bias") >> is_op("relu")

# Quickly replicate another CBR pattern.
cbr1 = dup(cbr0)
cbr2 = dup(cbr0)

# Parallel forked from the same input.
fork_to(cbr0, cbr1, cbr2)  # = wildcard().fork_to(c0, c1, c2)

New syntax sugar [self].fork_to([vars, ...]): the vars share the same producer self, i.e., the vars are all (immediate) users of self.

Allowing General Parallel Patterns

The example above is actually impractical:

  • Fusible patterns should also include kernel sizes of 2x2, 3x3, etc.
  • To be trivially fusible, the three parallel conv2d must have the same attributes.

Thus we add one more line:

# cbr0, cbr1, cbr2 share equivalent attributes (kernel sizes, strides).
cbr0.attr_eq(cbr1).attr_eq(cbr2)
# = attr_eq(cbr0, cbr1, cbr2)
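
For illustration only, a minimal sketch of what an attr_eq check could compare, assuming each matched call exposes an attrs dict (the helper below is hypothetical, not the proposed API):

from types import SimpleNamespace

def attrs_equal(calls, keys=("kernel_size", "strides")) -> bool:
    # All matched conv2d calls must carry identical attributes so that the
    # parallel branches can be merged into one wide kernel.
    first = calls[0].attrs
    return all(c.attrs.get(k) == first.get(k) for c in calls[1:] for k in keys)

c0 = SimpleNamespace(attrs={"kernel_size": (3, 3), "strides": (1, 1)})
c1 = SimpleNamespace(attrs={"kernel_size": (3, 3), "strides": (1, 1)})
c2 = SimpleNamespace(attrs={"kernel_size": (1, 1), "strides": (1, 1)})
assert attrs_equal([c0, c1])
assert not attrs_equal([c0, c2])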

Self-attention Fusion

Reference: Fig 5a in NVIDIA Blog


qt = is_op("fc") >> is_op("transpose")
kt = dup(qt)
vt = dup(qt)

# Above patterns can already be fused to a big FC and transpose.
# But if you want to continue...

kq_mul = qt.uniq() * kt.uniq()
# =   kq_mul = qt * kt  &&  (qt >> kq_mul)  &&  (kt >> kq_mul)

softmax_out = kq_mul >> is_op("scale") >> is_op("softmax")
out = vt.uniq() * softmax_out.uniq()

New syntax sugar [self].uniq():

  • [self] is a value only used by one Expr | pattern object.
  • Whoever uses the variable first becomes its only user. If [self] is used by a second Expr | pattern object, an exception will be raised.

Fuzzy Matching

Suggested by @comaniac (#160 (comment)): sometimes we want to match patterns like gemm -- (cast -- ) add where cast may or may not be present (other examples include broadcast_to).

To support this feature generally, we should allow users to "skip" certain operators when doing pattern matching.

gemm_add = is_op("gemm") >> is_op("add")
gemm_add.match(..., fuzzy_ops=[op.cast, op.cast_like, op.broadcast_to, op.broadcast_to_like])
"""
For example:
Gemm -> [Cast -> Broadcast ->]  Add
will be regarded as:
Gemm -> Add
"""

CBR Rewriting

The rewriting style is similar to Relay’s, but allows multiple outputs to support parallel patterns (in Relay, the previous design and implementation make it hard to support parallel outputs; see this thread).

# Rewrite it to one CBR kernel
@R.function
def cbr(x: Tensor):
    with R.dataflow():
        lv0 = conv2d(x, w, kernel_size=(1, 1))
        lv1 = bias_add(lv0, bias)
        lv2 = relu(lv1)
        R.output(lv2)
    return lv2

class FuseCBR(DFPatternCallback):
    def __init__(self):
        super().__init__()
        self.c = is_op("conv2d").with_attr("kernel_size", (1, 1))
        self.b = is_op("bias")
        self.r = is_op("relu")
        self.pattern = self.c >> self.b >> self.r  # CBR pattern

    def callback(self, node_map: Dict[DFPattern, VarBinding]) -> List[Tuple[DFPattern, Expr]]:
        # Access everything through `node_map` (pattern -> corresponding VarBinding).
        # This pattern has exactly one output: the matched `relu`.
        input_args = node_map[self.c].args
        output_type = node_map[self.r].type_annotation
        return [
            (node_map[self.r], call_tir("cbr1x1", input_args, output_type))
            # All external uses of `relu` will be redirected to `cbr1x1`.
        ]

# Fuse it!
fused_cbr = FuseCBR()(cbr)

Syntax Sugar

The discussed ones include:

  1. [producer].oub([consumer]) -> [producer]
  2. [source].fork_to([parallel_heads, ...]) -> [source]
  3. [self].attr_eq([others, ...]) -> [self]
    1. attr_eq([exprs, ...]) -> [exprs, ...]
  4. [self].uniq() -> [self]

Pipe-like syntax sugar makes tree-like pattern definitions easier. It is inspired by Pipes (CppCon’20). The examples are both C++- and Python-compatible.

Sequential API: a special class to handle sequence patterns, working together with the Pipes API.

  • a > b: a is used by b; returns Seq(a, b);
  • a >> b: a is only used by b; returns PdomSeq(a, b);
  • Seq([vars, ...]) defines sequential dependency along vars;
  • PdomSeq([vars, ...]) (huh... better naming?) further requires each successor to immediately post-dominate its predecessor;
# Conv-Bias-Relu (CBR)
is_op("conv2d") >> is_op("bias") >> is_op("relu")
# Same as:
# PdomSeq(is_op("conv2d"), is_op("bias"), is_op("relu"))

# 3-way Parallel CBR
fork_to(
  attr_eq(
    is_op("conv2d") >> (is_op("bias") >> is_op("relu")),
    is_op("conv2d") >> (is_op("bias") >> is_op("relu")),
    is_op("conv2d") >> (is_op("bias") >> is_op("relu")),
  )
)

# Multi-head in self-attention
fork_to(
  attr_eq(
    is_op("fc") >> is_op("transpose"),
    is_op("fc") >> is_op("transpose"),
    is_op("fc") >> is_op("transpose"),
  )
)
# Or
fork_to(
  attr_eq(
    make(3, is_op("fc") >> is_op("transpose")), # Like make in Go.
  )
)
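
For reference, dup and make could plausibly be thin helpers over an existing pattern (the deep-copy implementation below is an assumption for illustration, not the actual design):

import copy

def dup(pattern):
    # Clone the pattern tree so the clone is a distinct pattern object,
    # which matters under the one-one match rule.
    return copy.deepcopy(pattern)

def make(n, pattern):
    # Go-style `make`: return n independent clones of the same pattern.
    return [dup(pattern) for _ in range(n)]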

Other syntax sugars follow Relay’s style (e.g., * means is_op("relax.multiply")).

@sunggg
Collaborator

sunggg commented Jun 13, 2022

Thank you for the great proposal, @ganler!
This is great. Especially, I appreciate your detailed examples and I like how we can easily write patterns to capture parallel components with your syntactic sugar.

A couple questions:

  • For the is_op("conv") >> is_op("relu") >> is_op("add") example, we may still be able to fuse the operators even though relu does not dominate add. For example, in the case below, there is no problem fusing Conv+Relu+Add because there is no cyclic dependency (which is what we want to avoid). In my current understanding, dominance analysis seems necessary to clarify the difference between this case and the diamond shapes. What do you think?
Conv     ...
 |        |
relu   softmax
  \     /
    Add                 
  • One-One Match seems interesting. Do you think this is a feature or a bug to fix?
  • It was a long time ago, so I cannot remember clearly, but I might have faced an issue differentiating those two cases with a Relay pattern. It might be worth checking whether the pattern language can differentiate the following cases when we want to.
Conv1   Conv2
 |        |
relu   softmax
  \     /
    Add      
   Conv
 /      \
relu   softmax
  \     /
    Add      

@ganler
Collaborator Author

ganler commented Jun 13, 2022

@sunggg Thanks for your feedback!

Regarding the is_op("conv") >> is_op("relu") >> is_op("add") example, it can match the one you presented:

Conv     ...
 |        |
relu   softmax
  \     /
    Add    

but won't match:

     conv
     / \
 relu   sigmoid
     \ /
     add

is_op("conv") >> is_op("relu") stands for "conv" is used and only used by "relu", which is true for the 1st case but false for the 2nd one ("conv" is not only used by "relu": it is also used by "sigmoid"). I agree that dominance can distinguish them -- use-def dependency can also tell (because A is only used by B -> B immediately post-dom A), and I think the concept of "only used by" is more intuitive and common in real-world patterns (of course, it is also more affordable).

@sunggg
Collaborator

sunggg commented Jun 13, 2022

@ganler Thank you for the clarification. I misunderstood its definition and I agree that yours can differentiate these two. Then, how about this case? Relu has two consumers now and thus Add is not the only consumer anymore? In this case, there is no cyclic dependency so we should be able to fuse Conv+Relu+Add.

      Conv   
       |        
      Relu   
     /   \     
Softmax  Add                 

@ganler
Collaborator Author

ganler commented Jun 13, 2022

@sunggg Correct! is_op("conv") >> is_op("relu") >> is_op("add") will filter out (i.e., not match) this sub-graph.

I am not sure if we want to fuse Conv+Relu+Add in this pattern. Because fusing it means we need to replicate the Conv-Relu chain to make sure Add can be fed:

          ConvInput
      /                \
FusedConvReluAdd      Conv
      |                 | 
                       Relu 
                        |   
                       Softmax

But it might not be beneficial as the theoretical FLOPs are increased. That's why I previously pointed out:

To not violate the graph semantic, we either yield the fusion chance or duplicate Conv-ReLU. It is a concern in XLA (See Page 20) but not for TVM as TVM usually disallows replication (fuse_ops) by requiring a domination relation, i.e., add must post-dominate conv and relu

If we want to allow this pattern, we write:
is_op("conv") >> is_op("relu") > is_op("add"), where A > B means A is consumed by B (B does not have to be the unique consumer).

I am also curious whether such patterns are realistic, because most patterns I have seen are either vertical (A -> B -> C) or horizontal (A -> {B, B, B}).

@ganler
Collaborator Author

ganler commented Jun 13, 2022

  • One-One Match seems interesting. Do you think this is the feature or a bug to fix?
  • It's a long time ago, ...

@sunggg Good question! To me, it is undefined behavior in Relay, which may or may not distinguish diamond from parallel expressions:

import tvm
from tvm import relay
from tvm.relay.dataflow_pattern import *

is_conv2d_0 = is_op('nn.conv2d')(is_var(), is_var())
is_conv2d_1 = is_op('nn.conv2d')(is_var(), is_var())
"""
 conv
/    \
\    /
 add
"""
diamond_pattern  = is_conv2d_0 + is_conv2d_0
"""
 conv    conv
    \    /
     add
"""
parallel_pattern = is_conv2d_0 + is_conv2d_1

conv2d_0 = relay.op.nn.conv2d(relay.var('input0'), relay.var('weight0'))
conv2d_1 = relay.op.nn.conv2d(relay.var('input1'), relay.var('weight1'))
out_parallel = conv2d_0 + conv2d_1
out_diamond = conv2d_0 + conv2d_0

print(out_diamond)
print(f'{diamond_pattern.match(out_diamond) = }, which should be True')
print(f'{diamond_pattern.match(out_parallel) = }, which should be False')
"""[Output]
%0 = nn.conv2d(%input0, %weight0, padding=[0, 0, 0, 0]);
add(%0, %0)
diamond_pattern.match(out_diamond) = 1, which should be True
diamond_pattern.match(out_parallel) = 0, which should be False
"""
print('==============================================================')
print(out_parallel)
print(f'{parallel_pattern.match(out_parallel) = }, which should be True')
print(f'{parallel_pattern.match(out_diamond) = }, which should be False')
"""[Output]
%0 = nn.conv2d(%input0, %weight0, padding=[0, 0, 0, 0]);
%1 = nn.conv2d(%input1, %weight1, padding=[0, 0, 0, 0]);
add(%0, %1)
parallel_pattern.match(out_parallel) = 1, which should be True
parallel_pattern.match(out_diamond) = 1, which should be False
"""

That's why I think it is important to make the semantics clear. And the semantics should be simple and natural: each pattern variable is unique (it matches exactly one expression).

@sunggg
Collaborator

sunggg commented Jun 13, 2022

@ganler I see. It seems like we are working from different assumptions about whether a fused kernel allows multiple outputs (potentially intermediate outputs) or not. If I remember correctly, such multiple outputs are possible, so we won't need such duplication. Maybe it is worth verifying this with the AutoTIR folks - cc. @junrushao1994 @LeshengJin @Hzfengsy @jinhongyii

Another context for pattern matching is BYOC graph partitioning. It allows multiple outputs for each region (see the top green box in Step 3; its region has two outputs) - link.

And thanks for verifying my old memory! :)
Yeap. Totally agree. This can be dangerous behavior since it silently returns the wrong output without raising any assertion. Hope we can clarify this with our semantic design.

@comaniac
Contributor

Thanks for the proposal. I'm wondering whether this pattern language supports the following cases. These are cases that the Relay pattern language doesn't support well and that have been bothering me for a while:

  1. Fuzzy pattern matching. For example, could we have a pattern that matches both gemm - add and cast - gemm - cast - add?
  2. Multiple outputs. For example, the pattern used to combine parallel GEMMs. I'm also interested in seeing whether we could fuse kernels with multiple (intermediate) outputs. As @sunggg pointed out, if the underlying TIR/MetaSchedule allows it, it would be great to have general multi-output support in the pattern language.

@ganler
Collaborator Author

ganler commented Jun 13, 2022

@sunggg Thanks for letting me know that one fused kernel may produce multiple outputs. I think something like is_op("conv") >> is_op("relu") > is_op("add") can support multi-output in your case. I will check more cases to see if the current design can always catch them!

@ganler
Collaborator Author

ganler commented Jun 13, 2022

@comaniac Thanks for the great suggestion!

Fuzzy pattern matching. For example, could we have a pattern that matches both gemm - add and cast - gemm - cast - add?

I think the fuzzy matching proposal is great! More generally, we should give users options to "skip" certain operators, e.g., cast, broadcast, etc. This should be easier to implement in the current design, as we split the implementation of node-wise and graph-wise constraints.

gemm_add = is_op("gemm") >> is_op("add")
gemm_add.match(..., fuzzy_ops=[op.cast, op.cast_like, op.broadcast_to, op.broadcast_to_like])

Multiple outputs....

Yes, I think the current pattern matching should allow matching expressions with one or more outputs. To fuse Conv-Relu-Add in a way that "Conv" is not a sharable output while "Relu" and "Add" are, we can write is_op("conv") >> is_op("relu") > is_op("add"). The >> means "Conv" is only used by "Relu"; the > means "Relu" is used by "Add" but can also be used by others.

      Conv   
       |        
      Relu   
     /   \     
Softmax  Add 

@ganler
Collaborator Author

ganler commented Jun 15, 2022

Thanks for the feedback from the Relax Open Meeting! After some further discussion with @YuchenJin, the following improvements are expected to be included:

  1. Exclude patterns: for example is_op("relu") > not(is_op("cast")) should match a relu → ? expression where ? is not “cast”.
  2. Add more helpful graph-level constraints:
    1. Allow external use (default): the matched non-leaf node is allowed to be used by nodes outside the matched subgraph.
    2. Must have external use: the matched non-leaf node must be used by nodes outside the matched subgraph.
    3. No external use: the matched non-leaf node must not be used by nodes outside the matched subgraph.
# [GRAPH A]
#    conv
#     |
#    bias
#    /  \
#  relu ...
#
# [GRAPH B]
#   ... -> conv -> bias -> conv -> ...

# Graph A is allowed as by default `bias` allows external use.
is_op("conv") >> is_op("bias") > is_op("relu")

# Graph A is NOT allowed as `bias` >> `relu` means "only used by".
is_op("conv") >> is_op("bias") >> is_op("relu")

# Graph A is allowed as `bias` must be used by external node.
# Graph B is rejected as `bias` must be used by external node.
is_op("conv") >> is_op("bias").extern_use(True) > is_op("relu")

# Graph A is rejected as `bias` must NOT be used by external node.
# Graph B is allowed as `bias` must NOT be used by external node.
is_op("conv") >> is_op("bias").extern_use(False) > is_op("relu")

# Or if we simply want to assume that non-leaf nodes in the
# graph have no external use.
pattern = ... # any pattern in the pattern graph.
pattern.add_graph_constraint(NonLeafExternUse(True))
# or
pattern.add_graph_constraint(NonLeafExternUse(False))
  3. Iterative dataflow rewriting: for example, if we have Reshape-Reshape-...-Reshape, how can we express the ... here? In fact, we can just implement a rewriting rule for Reshape-Reshape and then apply iterative dataflow rewriting to continuously rewrite the graph until a fixed point is reached ($L = R$ after $L \to R$).
# ORIGINAL: relu -> reshape -> reshape -> reshape -> relu
func = ...

merged = MergeReshape()(func) # one step by default
# ONESTEP: relu -> reshape -> reshape -> relu

merged = MergeReshape()(func, iterative=True)
# ITERATIVE: relu -> reshape -> relu
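
A minimal sketch of what the iterative mode could do internally, assuming the callback's __call__ returns a (possibly rewritten) function; tvm.ir.structural_equal is used here only as an illustrative fixed-point test:

from tvm.ir import structural_equal

def rewrite_until_fixed_point(rewriter, func, max_iters=100):
    # Reapply a one-step rewriter until the IR stops changing (L == R after L -> R),
    # e.g., collapsing Reshape-Reshape-...-Reshape pair by pair.
    for _ in range(max_iters):
        new_func = rewriter(func)  # one rewriting step
        if structural_equal(new_func, func):
            return new_func
        func = new_func
    return func
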
  4. Simplify the rewriting API by only using the node_map:
class FuseCBR(DFPatternCallback):
    def __init__(self):
        super().__init__()
        self.c = is_op("conv2d").with_attr("kernel_size", (1, 1))
        self.b = is_op("bias")
        self.r = is_op("relu")
        self.pattern = self.c >> self.b >> self.r  # CBR pattern

    def callback(self, node_map: Dict[DFPattern, VarBinding]) -> List[Tuple[DFPattern, Expr]]:
        # Access everything through `node_map` (pattern -> corresponding VarBinding).
        # This pattern has exactly one output: the matched `relu`.
        input_args = node_map[self.c].args
        output_type = node_map[self.r].type_annotation
        return [
            (node_map[self.r], call_tir("cbr1x1", input_args, output_type))
            # All external uses of `relu` will be redirected to `cbr1x1`.
        ]
  5. Rewriting checkers:
    1. All external uses must be satisfied after rewriting. For conv > relu where conv is used by external nodes, it is invalid to just return only the rewritten relu. A rough sketch of such a check is shown below.
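
A rough sketch of such a checker, under the assumption that we can enumerate the external users of each matched binding and the replacements returned by the callback (all names below are hypothetical):

def rewrite_is_valid(matched_vars, external_users, replacements) -> bool:
    # Every matched var that is consumed outside the matched region must either
    # be kept or be given a replacement; otherwise the rewrite would leave
    # dangling external uses (e.g., `conv` has outside users but only `relu`
    # was rewritten).
    for var in matched_vars:
        if external_users.get(var) and var not in replacements:
            return False
    return True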

@tqchen tqchen closed this as completed Jan 6, 2023
@tqchen
Contributor

tqchen commented Jan 6, 2023

Most are implemented thanks to @ganler
