[CUTLASS] Conv2d dgrad #10110
Conversation
LGTM. Thanks for the insightful investigation :)
This could also serve as a motivating scenario for Collage, which deals with backend placement (https://arxiv.org/pdf/2111.00655.pdf).
I'm having a conversation about strided dgrad performance compared to cuDNN. Will give more updates before merging.
HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that upgrading the CUDA version from 11.3 to 11.6 alone gives a 2x speedup on cutlass strided dgrad (unreal). Moreover, there was a critical bug in one of the parameters. Here are the updated results after these two fixes: cutlass is now winning in ALL but one case at batch size 256, and that one is still only a 0.96 vs 0.94 difference. Note that activation fusion is not enabled for dgrad yet, so I expect the cutlass perf to be much better in practice for DL training use cases.
Real-world training would require fp32 accumulation. In that case, the kernels become more compute-bound, and the better kernel has an even bigger advantage.
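For illustration, this is roughly what requesting fp32 accumulation looks like at the Relay level. A minimal sketch, assuming an NHWC/IHWO conv2d_transpose with made-up shapes; none of these values come from the benchmark:

```python
# Minimal sketch: fp16 inputs with fp32 accumulation requested via out_dtype.
# Shapes and the IHWO kernel layout are illustrative assumptions only.
from tvm import relay

data = relay.var("data", shape=(8, 14, 14, 256), dtype="float16")      # NHWC activations
weight = relay.var("weight", shape=(256, 3, 3, 256), dtype="float16")  # assumed IHWO layout
out = relay.nn.conv2d_transpose(
    data,
    weight,
    strides=(2, 2),
    padding=(1, 1),
    output_padding=(1, 1),
    channels=256,
    kernel_size=(3, 3),
    data_layout="NHWC",
    kernel_layout="IHWO",
    out_dtype="float32",  # accumulate (and emit) in fp32 even though inputs are fp16
)
func = relay.Function([data, weight], out)
```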
Merging. I'll follow up with wgrad + parallel split-k support.
* add conv2d transpose nhwc cudnn test
* support conv2d transpose nhwc direct offload to cudnn
* add cutlass dgrad support
* remove unused arg
* allow target none
* fix beta initialization condition
* disable dynamic dense fp16 test since it fails on cuda 11.6
Adds dgrad support. Wgrad is more complicated and I'm having weird accuracy issues, so it will come later.
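For context, a rough sketch of how a dgrad workload might be routed through the CUTLASS BYOC path is shown below. The helper names and signatures (`partition_for_cutlass`, `tune_cutlass_kernels`, `build_cutlass_kernels`) follow the existing CUTLASS integration but may differ between TVM versions, and the shapes are placeholders:

```python
# Rough sketch (not the benchmark script) of offloading an NHWC fp16 conv2d_transpose
# to CUTLASS through TVM's BYOC flow. Helper names/signatures are assumptions; check
# python/tvm/contrib/cutlass in your TVM version for the exact API.
import tvm
from tvm import relay
from tvm.relay.op.contrib.cutlass import partition_for_cutlass
from tvm.contrib.cutlass import tune_cutlass_kernels, build_cutlass_kernels

data = relay.var("data", shape=(8, 14, 14, 256), dtype="float16")
weight = relay.var("weight", shape=(256, 3, 3, 256), dtype="float16")  # assumed IHWO layout
out = relay.nn.conv2d_transpose(
    data, weight, strides=(2, 2), padding=(1, 1), output_padding=(1, 1),
    channels=256, kernel_size=(3, 3),
    data_layout="NHWC", kernel_layout="IHWO", out_dtype="float16",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))

mod = partition_for_cutlass(mod)                        # carve out CUTLASS-supported regions
mod, num_partitions = tune_cutlass_kernels(mod, sm=86)  # pick kernels for the target SM (RTX 3070 = sm_86)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda")
lib = build_cutlass_kernels(lib, sm=86)                 # compile and link the generated CUTLASS code
```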
UPDATE: See the latest result in #10110 (comment)
@comaniac @Laurawly @junrushao1994 @vinx13 @YuchenJin @hwu36 @manishucsd
Old results below, not relevant anymore
Linked below is a benchmark result against cuDNN on resnet50 workloads, with batch sizes 8 and 256. All numbers are in milliseconds, generated on an RTX 3070 by this script.
It's interesting to note that at batch size 8 cutlass is mostly faster, while at batch size 256 cuDNN is faster. Looking at the nvprof dump, it turns out that even when the e2e time, as reported by TVM's `time_evaluator`, shows cutlass being faster, cuDNN can be winning in the kernel-only time. For example, the first row of the batch 8 case shows `cutlass vs cudnn = 54 vs 109 usec`, but nvprof shows that more than half of the cuDNN e2e time is spent on overhead, either inside TVM during the cuDNN call or within cuDNN itself. Apparently, cutlass has much smaller overhead.
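For reference, the e2e numbers come from `time_evaluator`, which times the whole executor `run` call, so any overhead around the library call is included. A minimal, self-contained sketch (with a placeholder conv2d workload routed to cuDNN, not the actual benchmark script) might look like:

```python
# Illustrative e2e timing with TVM's time_evaluator. Only a profiler such as nvprof
# shows the kernel-only time discussed above; time_evaluator includes all overhead.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder NCHW fp32 workload; the real benchmark runs the resnet50 dgrad layers.
data = relay.var("data", shape=(8, 64, 56, 56), dtype="float32")
weight = relay.const(np.random.randn(64, 64, 3, 3).astype("float32"))
out = relay.nn.conv2d(data, weight, padding=(1, 1), channels=64, kernel_size=(3, 3))
mod = tvm.IRModule.from_expr(relay.Function([data], out))

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda -libs=cudnn")  # route conv2d to cuDNN

rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("data", np.random.randn(8, 64, 56, 56).astype("float32"))

timer = rt.module.time_evaluator("run", dev, number=20, repeat=3)
print("e2e mean (usec):", timer().mean * 1e6)
```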
cuDNN is mostly faster in the batch 256 case. This could be because the overhead is relatively small at that batch size. In particular, the difference is large for the stride = 2 cases. For example, the 5th row shows `cutlass vs cudnn = 4.18 vs 1.71 msec`, and the nvprof trace suggests that cuDNN's strided dgrad is significantly better than cutlass (?) @manishucsd
However, even at the larger batch size, cutlass is always winning on workloads with filter size 3. For example, here is the nvprof dump for the third row.