Run fp8 models on Ampere GPUs with Marlin Kernels #2503
Replies: 1 comment
- I just ran this and it's flawless on an A40: 3× higher throughput than vLLM at batch 64.
- I see that AWQ is supported (likely via Marlin?) and am wondering whether fp8 can also be done on Ampere (via dequantisation). It works in vLLM.
  fp8 works natively only on Lovelace and Hopper.
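Since Ampere tensor cores have no native fp8 path, the general idea behind such kernels is weight-only quantization: store the weights in fp8 (E4M3) and dequantize them back to higher precision before an ordinary matmul (Marlin-style kernels fuse this into the GEMM itself). The sketch below is only a numpy illustration of that concept, not the actual Marlin kernel; the function names, the per-tensor scaling choice, and the round-to-nearest grid search are all my own assumptions for clarity.

```python
import numpy as np

def e4m3_values():
    """All finite values representable in OCP FP8 E4M3 (bias 7, max 448)."""
    vals = {0.0}
    for e in range(16):          # 4 exponent bits
        for m in range(8):       # 3 mantissa bits
            if e == 15 and m == 7:
                continue         # this encoding is NaN in E4M3
            if e == 0:
                v = (m / 8.0) * 2.0 ** -6           # subnormals
            else:
                v = (1.0 + m / 8.0) * 2.0 ** (e - 7)
            vals.add(v)
            vals.add(-v)
    return np.array(sorted(vals))

def quantize_e4m3(w):
    """Per-tensor scaled round-to-nearest onto the E4M3 grid (illustrative)."""
    grid = e4m3_values()
    scale = np.abs(w).max() / 448.0   # map the largest weight to the E4M3 max
    x = w / scale
    idx = np.abs(x[..., None] - grid).argmin(axis=-1)
    return grid[idx], scale

def matmul_dequant(a, w_q, scale):
    """Dequantize the fp8 weights, then run a regular matmul --
    the only kind of GEMM Ampere hardware supports."""
    return a @ (w_q * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
a = rng.standard_normal((8, 64)).astype(np.float32)

w_q, scale = quantize_e4m3(w)
err = np.abs(w_q * scale - w).max()
```

With per-tensor scaling the worst-case round-off for normal E4M3 values is half a mantissa step, i.e. about 1/16 of the value, which is why fp8 checkpoints typically lose little accuracy even when executed through a dequantize-then-matmul path.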