[CUTLASS] Initial conv2d support #9595
Conversation
CutlassPrint(conv2d_decl, "size_t workspace_size = conv2d_op.get_workspace_size(arguments);\n"); | ||
// Allocate workspace memory | ||
CutlassPrint(conv2d_decl, | ||
"cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);\n"); |
There's a memory leak from allocating the workspace this way. @ZihengJiang
Does the destructor clean it up? https://github.com/NVIDIA/cutlass/blob/8f8a80cad57950be2d538ac0ead420740662318d/tools/util/include/cutlass/util/device_memory.h#L199
This is basically the same code as the GEMM op.
Overall LGTM
if epilogue == EpilogueFunctor.LinearCombination:
    op_entry["op"] = op
    op_entry["name"] = op.procedural_name()
    op_entry["runtime"] = 9999999
What is this for? Maybe also add a comment for clarification.
This corresponds to its counterpart in the GEMM generator, tvm/python/tvm/contrib/cutlass/gen_gemm.py (lines 109 to 127 at adf560e):
op_entry["op"] = op | |
op_entry["name"] = op.procedural_name() | |
op_entry["opdef"] = kernel_emitter.emit(op, batched=batched) | |
op_entry["opdef_bias"] = kernel_emitter.emit( | |
op_bias, no_beta_scaling=True, batched=batched | |
) | |
op_entry["opdef_bias_relu"] = kernel_emitter.emit( | |
op_bias_relu, no_beta_scaling=True, batched=batched | |
) | |
op_entry["opdef_bias_gelu"] = kernel_emitter.emit(op_bias_gelu, batched=batched) | |
op_entry["src"] = profiler_emitter.emit( | |
op.procedural_name(), | |
kernel_emitter.emit(op, batched=False), | |
DataTypeTag[element_a], | |
DataTypeTag[element_b], | |
DataTypeTag[element_c], | |
op.leading_dim(), | |
) | |
op_entry["runtime"] = 9999999 |
In addition to creating opdef, opdef_bias, etc., we also need to set op, name, runtime, etc. I tried to simplify that code, and this is what I came up with. I'll rewrite this code to make it easier to understand (by pulling the non-activation case, EpilogueFunctor.LinearCombination, out of the loop); see the sketch below.
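A rough sketch of that refactor, not the actual patch: the fields shared by every epilogue variant (op, name, runtime, and the plain LinearCombination kernel) are set once up front, and only the fused variants are handled in a loop. The helper name create_op_entry and the fused_ops mapping are hypothetical, and the emit() keyword arguments follow the GEMM snippet quoted above, so they may differ for conv2d.

```python
def create_op_entry(op, kernel_emitter, fused_ops):
    """Build the profiling-table entry for one base op plus its fused variants (sketch)."""
    op_entry = {
        # Fields shared by every epilogue variant.
        "op": op,
        "name": op.procedural_name(),
        "runtime": 9999999,  # Placeholder cost, overwritten after profiling.
        # The non-activation case (EpilogueFunctor.LinearCombination) handled outside the loop.
        "opdef": kernel_emitter.emit(op),
    }
    # Fused variants (e.g. bias, bias + relu) only add extra kernel definitions.
    # Whether no_beta_scaling applies to every variant is simplified here.
    for key, fused_op in fused_ops.items():
        op_entry["opdef_" + key] = kernel_emitter.emit(fused_op, no_beta_scaling=True)
    return op_entry
```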
It should be much clearer now.
61040bf to bfe22bf
@comaniac good to go?
LGTM. @Laurawly @ZihengJiang I guess we could merge this first and fix the memory leak in follow-up PRs, if it really is an issue?
Thanks @masahi. @Laurawly @ZihengJiang please feel free to continue the discussion on the potential issue here or on the forum.
* Add initial conv generator
* added conv2d pattern
* profile by gemm profiler
* remove conv2d profiler for now
* remove unused code
* add default
* minor fix, profiling working
* start codegen
* generated code compiled
* fixed layout initialization
* matched with autotvm tensorcore result
* test refactor
* minor cleanup
* remove iteration algo "Analytic"
* add test for dynamic batch conv2d
* pass dl tensor as output too
* support conv2d dynamic shape in codegen
* test working
* lint
* simplify codegen
* fix weird formatting
* typo fix
* check if cutlass is enabled in the test
* simplify gen_conv2d.py
Adds boilerplate for generating conv2d kernels. Dynamic shape is supported.
To keep the diff small, this first PR only adds the minimum code needed to demonstrate basic functionality. In particular, activation fusion is not implemented yet, and profiling and kernel selection are done by piggybacking on the existing GEMM profiler (see cutlass/gen_conv2d.py). The latter choice simplified the implementation, but as discussed in NVIDIA/cutlass#358, we probably want a dedicated profiler and kernel-selection logic for conv2d. These missing features will be added after this PR.

cc @comaniac @Laurawly @zhiics
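A minimal sketch of the selection step this implies, assuming the entry layout from the gen_gemm.py snippet quoted above: after the (GEMM-based) profiler fills in each entry's "runtime", the fastest kernel wins. The helper name select_best_kernel is hypothetical and not part of the PR.

```python
def select_best_kernel(op_entries):
    """Return the op_entry with the smallest profiled runtime (sketch)."""
    # Entries that were never profiled keep the 9999999 placeholder and
    # therefore lose to any kernel with a real measurement.
    return min(op_entries, key=lambda entry: entry["runtime"])
```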