
perf: faster fp8->fp16 dequantization for pre sm_90 arch #439

Merged 8 commits into main from faster-f8-f16-dequant on Aug 11, 2024

Conversation

yzh119
Collaborator

@yzh119 yzh119 commented Aug 11, 2024

The hardware fp8->fp16 fast-conversion instruction is not available on sm_80 and sm_89, which makes #420 slow on these architectures.

This PR uses Marlin's fast fp8->fp16x4 conversion algorithm (copied from the vLLM project) to accelerate such cases.

Co-authored-by: Antoni Baum [email protected]
Co-authored-by: Cody Yu [email protected]
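The bias trick behind this kind of conversion can be illustrated with a scalar C++ sketch (a minimal model only; the real kernel vectorizes this across four packed fp8 values per 32-bit register, and the function names here are hypothetical, not FlashInfer's API):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode IEEE fp16 bits to float (reference helper for this sketch).
float half_bits_to_float(uint16_t h) {
  uint32_t sign = h >> 15;
  uint32_t exp = (h >> 10) & 0x1F;
  uint32_t man = h & 0x3FF;
  float val;
  if (exp == 0) {
    val = std::ldexp(static_cast<float>(man), -24);  // subnormal: man * 2^-24
  } else if (exp == 31) {
    val = man ? NAN : INFINITY;
  } else {
    val = std::ldexp(1.0f + man / 1024.0f, static_cast<int>(exp) - 15);
  }
  return sign ? -val : val;
}

// Scalar model of the shift-and-scale trick: move the fp8 (e4m3) byte
// into the high byte of a 16-bit word, shift the exponent/mantissa
// fields down one bit so they land in the fp16 field positions, then
// multiply by 2^8 to correct the exponent-bias difference (fp16 bias 15
// vs fp8 bias 7). e4m3 NaN encodings are not specially handled here.
float fp8_e4m3_to_float(uint8_t x) {
  uint16_t h = static_cast<uint16_t>(x) << 8;
  uint16_t sign = h & 0x8000;        // sign bit is already aligned
  uint16_t mag = (h & 0x7FFF) >> 1;  // align exponent + mantissa fields
  return half_bits_to_float(sign | mag) * 256.0f;
}
```

Because the realigned subnormals scale through the same 2^8 factor, fp8 denormals also decode correctly without a separate branch.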

@yzh119 yzh119 merged commit c93f647 into main Aug 11, 2024
@yzh119 yzh119 deleted the faster-f8-f16-dequant branch August 11, 2024 07:50
yzh119 added a commit that referenced this pull request Aug 13, 2024
🤖 I have created a release *beep* *boop*
---


## [0.1.5](v0.1.4...v0.1.5) (2024-08-13)


### Bug Fixes

* Fix PagedPrefill Python API and some typos
([#441](#441))
([3fff008](3fff008))
* Fix prefill kernels' LSE result for empty kv-cache
([#440](#440))
([6ac28f4](6ac28f4))

### Features

* Decouple float and int workspace buffer
([#442](#442))
([a7ee566](a7ee566))


### Performance Improvements

* Faster fp8->fp16 dequantization for pre sm_90 arch
([#439](#439))
([c93f647](c93f647))

### Acknowledgement

We thank the community for their contributions and feedback:
[@comaniac](https://github.com/comaniac),
[@hnyls2002](https://github.com/hnyls2002),
[@jianfei-wangg](https://github.com/jianfei-wangg),
[@Yard1](https://github.com/Yard1).


---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <[email protected]>
yzh119 added a commit that referenced this pull request Aug 13, 2024
Follow-up to #439: use `constexpr` in `if` conditions so that
`BIAS_OFFSET` won't exceed 32 at compile time.
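As a rough illustration of this `if constexpr` pattern (a generic sketch under assumed names, not FlashInfer's actual code; `apply_bias` is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// With a plain `if`, both branches of a template function are compiled
// for every instantiation, so a shift whose amount is only valid in one
// branch (e.g. a BIAS_OFFSET >= 32) can still trigger warnings or
// undefined behavior at compile time. `if constexpr` discards the
// untaken branch, so the out-of-range shift is never instantiated.
template <int BIAS_OFFSET>
uint32_t apply_bias(uint32_t x) {
  if constexpr (BIAS_OFFSET < 32) {
    return x << BIAS_OFFSET;  // only compiled when the shift is in range
  } else {
    return x << (BIAS_OFFSET % 32);  // fallback path for large offsets
  }
}
```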
zhyncs pushed a commit that referenced this pull request Aug 14, 2024