
perf: faster fp8->fp16 dequantization for pre sm_90 arch #439

Merged 8 commits into main from faster-f8-f16-dequant on Aug 11, 2024

Conversation

yzh119
Collaborator

@yzh119 yzh119 commented Aug 11, 2024

The hardware fp8->fp16 fast-conversion instruction is not available on sm_80 and sm_89, which makes #420 slow on these architectures.

This PR uses Marlin's fast fp8->fp16x4 conversion algorithm (copied from the vLLM project) to accelerate such cases.

Co-authored-by: Antoni Baum [email protected]
Co-authored-by: Cody Yu [email protected]
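The bias trick behind this kind of conversion can be illustrated with a scalar C++ sketch (a minimal model only; the real kernel vectorizes this across four packed fp8 values per 32-bit register, and the function names here are hypothetical, not FlashInfer's API):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode IEEE fp16 bits to float (reference helper for this sketch).
float half_bits_to_float(uint16_t h) {
  uint32_t sign = h >> 15;
  uint32_t exp = (h >> 10) & 0x1F;
  uint32_t man = h & 0x3FF;
  float val;
  if (exp == 0) {
    val = std::ldexp(static_cast<float>(man), -24);  // subnormal: man * 2^-24
  } else if (exp == 31) {
    val = man ? NAN : INFINITY;
  } else {
    val = std::ldexp(1.0f + man / 1024.0f, static_cast<int>(exp) - 15);
  }
  return sign ? -val : val;
}

// Scalar model of the shift-and-scale trick: move the fp8 (e4m3) byte
// into the high byte of a 16-bit word, shift the exponent/mantissa
// fields down one bit so they land in the fp16 field positions, then
// multiply by 2^8 to correct the exponent-bias difference (fp16 bias 15
// vs fp8 bias 7). e4m3 NaN encodings are not specially handled here.
float fp8_e4m3_to_float(uint8_t x) {
  uint16_t h = static_cast<uint16_t>(x) << 8;
  uint16_t sign = h & 0x8000;        // sign bit is already aligned
  uint16_t mag = (h & 0x7FFF) >> 1;  // align exponent + mantissa fields
  return half_bits_to_float(sign | mag) * 256.0f;
}
```

Because the realigned subnormals scale through the same 2^8 factor, fp8 denormals also decode correctly without a separate branch.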

@yzh119 yzh119 merged commit c93f647 into main Aug 11, 2024
@yzh119 yzh119 deleted the faster-f8-f16-dequant branch August 11, 2024 07:50
yzh119 added a commit that referenced this pull request Aug 13, 2024
🤖 I have created a release *beep* *boop*
---


## [0.1.5](v0.1.4...v0.1.5) (2024-08-13)


### Bug Fixes

* Fix PagedPrefill Python API and some typos
([#441](#441))
([3fff008](3fff008))
* Fix prefill kernels' LSE result for empty kv-cache
([#440](#440))
([6ac28f4](6ac28f4))

### Features

* Decouple float and int workspace buffer
([#442](#442))
([a7ee566](a7ee566))


### Performance Improvements

* Faster fp8->fp16 dequantization for pre sm_90 arch
([#439](#439))
([c93f647](c93f647))

### Acknowledgement

We thank the community for their contributions and feedback:
[@comaniac](https://github.com/comaniac),
[@hnyls2002](https://github.com/hnyls2002),
[@jianfei-wangg](https://github.com/jianfei-wangg),
[@Yard1](https://github.com/Yard1).


---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <[email protected]>
yzh119 added a commit that referenced this pull request Aug 13, 2024
Follow-up to #439: use `constexpr` in `if` conditions so that
`BIAS_OFFSET` won't exceed 32 at compile time.
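As a rough illustration of this `if constexpr` pattern (a generic sketch under assumed names, not FlashInfer's actual code; `apply_bias` is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// With a plain `if`, both branches of a template function are compiled
// for every instantiation, so a shift whose amount is only valid in one
// branch (e.g. a BIAS_OFFSET >= 32) can still trigger warnings or
// undefined behavior at compile time. `if constexpr` discards the
// untaken branch, so the out-of-range shift is never instantiated.
template <int BIAS_OFFSET>
uint32_t apply_bias(uint32_t x) {
  if constexpr (BIAS_OFFSET < 32) {
    return x << BIAS_OFFSET;  // only compiled when the shift is in range
  } else {
    return x << (BIAS_OFFSET % 32);  // fallback path for large offsets
  }
}
```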
zhyncs pushed a commit that referenced this pull request Aug 14, 2024