[VTA] Enable streamlined GEMM execution #4392
Conversation
Hey @liangfu,
Just to double check, is this a fix for an error, or an improvement?
The reason we had the pipelined adder is that it showed better overall timing performance (after P&R), thanks to register retiming.
Have you tried pushing both versions through the tools and verifying timing?
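For readers unfamiliar with the trade-off discussed above, here is a minimal Chisel sketch (assumed module and signal names, not the actual VTA source) contrasting a purely combinational adder with a pipelined adder whose sum is registered. The extra register shortens the combinational path, which is what lets P&R retiming improve fmax, at the cost of one cycle of latency.

```scala
import chisel3._

// Combinational adder: the sum is valid in the same cycle (zero-cycle latency).
class Adder(val width: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(width.W))
    val b = Input(SInt(width.W))
    val y = Output(SInt((width + 1).W))
  })
  io.y := io.a +& io.b // +& is Chisel's width-expanding add
}

// Pipelined adder: same computation, but the sum is registered, so it
// appears one cycle later with a shorter combinational path (better fmax).
class PipeAdder(val width: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(width.W))
    val b = Input(SInt(width.W))
    val y = Output(SInt((width + 1).W))
  })
  io.y := RegNext(io.a +& io.b)
}
```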
I second Luis' comments on reporting the latency/throughput tradeoffs; @liangfu thank you for the PR, do you mind pushing the old and new VTA designs through Intel or Xilinx P&R and reporting on fmax, area, and cycle count (perhaps on one of the …
This PR doesn't intend to reduce cycle count or bring any other performance improvement. My major intention is to bring successful evaluation of …
Here are the benchmarks from Intel's Timing Analyzer, with cycle count and result entry performed with …
However, it is recommended to use …
@vegaluisjose @tmoreau89 I've updated the timing results and added PipeAdder in the first layer of the adders. Please take another look.
Thank you Liangfu for the enhancements and for sharing insights on performance on Intel FPGAs! I left a couple nits, but it seems good to go.
@tmoreau89 All review comments have been addressed, please take another look. For the Chisel-based design, I think for now our target is to bring end-to-end support (with sufficient scalability) and reproduce what HLS is capable of. After that, it would be more meaningful to consider performance improvements (with correctness guaranteed), and deprecate the HLS-based design along the way.
Thanks @liangfu; I left one final comment, and the PR is good to go!
Thanks, LGTM!
* disable pipelined adder and enable streamlined gemm execution
* pipeline first layer of adder
* explain difference between pipeadder and adder
* add comment for explaining the hard-coded latency
This PR fixes an issue in the streamlined GEMM execution by disabling the pipelined adder, which consumes 4 cycles (in the case of LOG_BLOCK=4) in addition to the single-cycle fused multiplier-adder. This is much longer than the 4-stage streamlined design in the `TensorGemm` module, so instead of creating a routine to wait for the pipelined adder, this PR disables the pipelined adder and brings the accumulated results to the output instantly.

Previously, the SMT schedule for GEMM in `test_vta_insn.py` was successful simply because the streamlined GEMM execution doesn't accumulate on the row, so there is no dependency between stage cycles in the `TensorGemm` module.

In addition, this PR brings successful evaluation of `matrix_multiply.py`, `matrix_multiply_opt.py` and `convolution_opt.py` under the `tutorials` directory.

@vegaluisjose @tmoreau89 Please review.