[CUTLASS] Initial conv2d support #9595
Conversation
CutlassPrint(conv2d_decl, "size_t workspace_size = conv2d_op.get_workspace_size(arguments);\n"); | ||
// Allocate workspace memory | ||
CutlassPrint(conv2d_decl, | ||
"cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);\n"); |
There's a memory leak from allocating the workspace this way. @ZihengJiang
Does the destructor clean it up? https://github.com/NVIDIA/cutlass/blob/8f8a80cad57950be2d538ac0ead420740662318d/tools/util/include/cutlass/util/device_memory.h#L199
This is basically the same code as the GEMM op.
Overall LGTM
if epilogue == EpilogueFunctor.LinearCombination:
    op_entry["op"] = op
    op_entry["name"] = op.procedural_name()
    op_entry["runtime"] = 9999999
What is this for? Maybe also add a comment for clarification.
This corresponds to its counterpart in the GEMM generator, tvm/python/tvm/contrib/cutlass/gen_gemm.py (lines 109 to 127 at adf560e):
op_entry["op"] = op | |
op_entry["name"] = op.procedural_name() | |
op_entry["opdef"] = kernel_emitter.emit(op, batched=batched) | |
op_entry["opdef_bias"] = kernel_emitter.emit( | |
op_bias, no_beta_scaling=True, batched=batched | |
) | |
op_entry["opdef_bias_relu"] = kernel_emitter.emit( | |
op_bias_relu, no_beta_scaling=True, batched=batched | |
) | |
op_entry["opdef_bias_gelu"] = kernel_emitter.emit(op_bias_gelu, batched=batched) | |
op_entry["src"] = profiler_emitter.emit( | |
op.procedural_name(), | |
kernel_emitter.emit(op, batched=False), | |
DataTypeTag[element_a], | |
DataTypeTag[element_b], | |
DataTypeTag[element_c], | |
op.leading_dim(), | |
) | |
op_entry["runtime"] = 9999999 |
In addition to creating opdef, opdef_bias, etc., we also need to set op, name, runtime, etc. I tried to simplify that code, and this is what I came up with. I'll rewrite this code to make it easier to understand (by pulling the non-activation case, EpilogueFunctor.LinearCombination, out of the loop); see the sketch below.
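A rough sketch of that refactor, not the actual patch: the fields shared by every epilogue variant (op, name, runtime, and the plain LinearCombination kernel) are set once up front, and only the fused variants are handled in a loop. The helper name create_op_entry and the fused_ops mapping are hypothetical, and the emit() keyword arguments follow the GEMM snippet quoted above, so they may differ for conv2d.

```python
def create_op_entry(op, kernel_emitter, fused_ops):
    """Build the profiling-table entry for one base op plus its fused variants (sketch)."""
    op_entry = {
        # Fields shared by every epilogue variant.
        "op": op,
        "name": op.procedural_name(),
        "runtime": 9999999,  # Placeholder cost, overwritten after profiling.
        # The non-activation case (EpilogueFunctor.LinearCombination) handled outside the loop.
        "opdef": kernel_emitter.emit(op),
    }
    # Fused variants (e.g. bias, bias + relu) only add extra kernel definitions.
    # Whether no_beta_scaling applies to every variant is simplified here.
    for key, fused_op in fused_ops.items():
        op_entry["opdef_" + key] = kernel_emitter.emit(fused_op, no_beta_scaling=True)
    return op_entry
```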
It should be much clearer now.
61040bf to bfe22bf
@comaniac good to go?
LGTM. @Laurawly @ZihengJiang I guess we could merge this first and fix the memory leak in follow-up PRs, if it really is an issue?
Thanks @masahi. @Laurawly @ZihengJiang please feel free to continue the discussion on the potential issue here or on the forum.
* Add initial conv generator
* added conv2d pattern
* profile by gemm profiler
* remove conv2d profiler for now
* remove unused code
* add default
* minor fix, profiling working
* start codegen
* generated code compiled
* fixed layout initialization
* matched with autotvm tensorcore result
* test refactor
* minor cleanup
* remove iteration algo "Analytic"
* add test for dynamic batch conv2d
* pass dl tensor as output too
* support conv2d dynamic shape in codegen
* test working
* lint
* simplify codegen
* fix weird formatting
* typo fix
* check if cutlass is enabled in the test
* simplify gen_conv2d.py
Adds boilerplate for generating conv2d kernels. Dynamic shape is supported.
To keep the diff small, this first PR only adds the minimum code needed to demonstrate basic functionality. In particular, activation fusion is not implemented yet, and profiling and kernel selection are done by piggybacking on the existing GEMM profiler (see cutlass/gen_conv2d.py). The latter choice simplified the implementation, but as discussed in NVIDIA/cutlass#358, we probably want a dedicated profiler and kernel-selection logic for conv2d. These missing features will be added after this PR.

cc @comaniac @Laurawly @zhiics
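A minimal sketch of the selection step this implies, assuming the entry layout from the gen_gemm.py snippet quoted above: after the (GEMM-based) profiler fills in each entry's "runtime", the fastest kernel wins. The helper name select_best_kernel is hypothetical and not part of the PR.

```python
def select_best_kernel(op_entries):
    """Return the op_entry with the smallest profiled runtime (sketch)."""
    # Entries that were never profiled keep the 9999999 placeholder and
    # therefore lose to any kernel with a real measurement.
    return min(op_entries, key=lambda entry: entry["runtime"])
```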