[VTA] Enable streamlined GEMM execution #4392
Conversation
Hey @liangfu,
Just to double check, is this a fix for an error, or an improvement?
The reason we had the pipelined adder is that it showed better overall timing performance (after P&R), thanks to register retiming.
Have you tried pushing both versions through the tools and verifying timing?
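For readers unfamiliar with the trade-off discussed above, here is a minimal Chisel sketch (assumed module and signal names, not the actual VTA source) contrasting a purely combinational adder with a pipelined adder whose sum is registered. The extra register shortens the combinational path, which is what lets P&R retiming improve fmax, at the cost of one cycle of latency.

```scala
import chisel3._

// Combinational adder: the sum is valid in the same cycle (zero-cycle latency).
class Adder(val width: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(width.W))
    val b = Input(SInt(width.W))
    val y = Output(SInt((width + 1).W))
  })
  io.y := io.a +& io.b // +& is Chisel's width-expanding add
}

// Pipelined adder: same computation, but the sum is registered, so it
// appears one cycle later with a shorter combinational path (better fmax).
class PipeAdder(val width: Int) extends Module {
  val io = IO(new Bundle {
    val a = Input(SInt(width.W))
    val b = Input(SInt(width.W))
    val y = Output(SInt((width + 1).W))
  })
  io.y := RegNext(io.a +& io.b)
}
```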
I second Luis' comments on reporting the latency/throughput tradeoffs; @liangfu thank you for the PR, do you mind pushing the old and new VTA designs through Intel or Xilinx P&R and reporting on fmax, area, and cycle count (perhaps on one of the …
This PR doesn't intend to reduce cycle count or bring any other performance improvement. My major intention is to bring successful evaluation of …
Here are the benchmarks from Intel's Timing Analyzer, with cycle count and result entry performed with …
However, it is recommended to use …
@vegaluisjose @tmoreau89 I've updated the timing results and added PipeAdder in the first layer of the adders. Please take another look.
Thank you Liangfu for the enhancements and for sharing insights on performance on Intel FPGAs! I left a couple nits, but it seems good to go.
@tmoreau89 All review comments have been addressed, please take another look. For the Chisel-based design, I think for now our target is to bring end-to-end support (with sufficient scalability) and reproduce what HLS is capable of. After that, it would be more meaningful to consider performance improvements (with correctness guaranteed), and deprecate the HLS-based design along the way.
Thanks @liangfu; I left one final comment, and the PR is good to go!
Thanks, LGTM!
* disable pipelined adder and enable streamlined gemm execution
* pipeline first layer of adder
* explain difference between pipeadder and adder
* add comment for explaining the hard-coded latency
This PR fixes an issue in the streamlined GEMM execution by disabling the pipelined adder, which consumes 4 cycles (in the case of LOG_BLOCK=4) in addition to the single-cycle fused multiplier-adder. This is much longer than the 4-stage streamlined design in the `TensorGemm` module, so instead of creating a routine to wait for the pipelined adder, this PR disables the pipelined adder and brings the accumulated results to the output instantly.

Previously, the SMT schedule for GEMM in `test_vta_insn.py` was successful simply because the streamlined GEMM execution doesn't accumulate on the row, so there is no dependency between stage cycles in the `TensorGemm` module.

In addition, this PR brings successful evaluation of `matrix_multiply.py`, `matrix_multiply_opt.py` and `convolution_opt.py` under the `tutorials` directory.

@vegaluisjose @tmoreau89 Please review.