Paged Attention #333
Comments
Does it have any benefit on cpu-only inference, given that host memory is already paged? |
@vikigenius could you please share your benchmarks of vllm vs llama.cpp on GPU? That will give us some insight into the potential speedup. |
@okpatil4u I don't have the benchmarks for llama.cpp. I primarily noticed the speedup between the PyTorch implementations with and without paged attention, and there is no reason to think an algorithmic change like that wouldn't translate across languages. We tested it on NVIDIA A100 GPUs and got a significant speedup. I will try to get the numbers soon, once we have access to them again. |
@okpatil4u got the numbers now. Not a rigorous benchmark, but it should still hold up since the gains are so significant. With a 40 GB A100 GPU: inference on a vicuna-13B model without paged attention produces 20 tokens/sec; inference on a vicuna-13B model with paged attention produces 190 tokens/sec. So the speedup is almost 10x. Obviously this is a bit skewed because our workload uses the same initial prompt prefix in a batch inference setting, so there might be good reuse of the KV cache, which is helped by Paged Attention. |
Wow, this is amazing. Thanks for posting. But are you sure vicuna 13b llama.cpp is benchmarking at 50 ms/token on an A40 GPU? I would expect it to be a bit faster. |
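For a rough sense of why sharing a prompt prefix matters, here is a back-of-the-envelope sketch in Rust of KV cache size. It assumes the standard LLaMA-13B shape (40 layers, hidden size 5120) and fp16 cache entries. The function names, and any batch size or prefix length you plug in, are illustrative assumptions, not values from the benchmark above.

```rust
/// Bytes of KV cache needed per token: one key vector and one value
/// vector per layer, each `hidden_size` elements of `bytes_per_elem` bytes.
fn kv_bytes_per_token(n_layers: u64, hidden_size: u64, bytes_per_elem: u64) -> u64 {
    2 * n_layers * hidden_size * bytes_per_elem
}

/// KV cache bytes spent on the prompt prefix when every sequence
/// in the batch keeps its own private copy of it.
fn prefix_bytes_unshared(batch: u64, prefix_tokens: u64, per_token: u64) -> u64 {
    batch * prefix_tokens * per_token
}

/// KV cache bytes spent on the prompt prefix when it is stored once
/// and shared by every sequence in the batch.
fn prefix_bytes_shared(prefix_tokens: u64, per_token: u64) -> u64 {
    prefix_tokens * per_token
}
```

Under these assumptions each token costs 819,200 bytes of cache. A hypothetical batch of 32 sequences with a 512-token common prefix would spend 12.5 GiB on private copies of the prefix, versus 400 MiB if the prefix is stored once.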
Well, as I mentioned before, we don't actually use llama.cpp at work on our A100s, so my benchmark numbers compare PyTorch implementations. It is possible that at this point llama.cpp itself is a bit better than the PyTorch implementation, which might explain the discrepancy. But given how big the gain is, I would expect that if you port Paged Attention to llama.cpp you should see similar gains there as well. |
The discussion here might be relevant: ggerganov/llama.cpp#1955, although it seems many people are misunderstanding how the paging works. It should be hugely beneficial for any batched inference workload, even on a single GPU. |
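Since the paging mechanism seems to cause confusion, here is a minimal Rust sketch of the core idea: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, allocating a new block only when the last one fills up. All names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are hypothetical and chosen for illustration. This is not vLLM's or GGML's actual API, and it tracks only block bookkeeping, not the tensors themselves.

```rust
/// Number of token slots in one KV cache block (an illustrative value).
const BLOCK_SIZE: usize = 16;

/// Hands out fixed-size physical blocks from a pool, like a page allocator.
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        // Reversed so that block 0 is handed out first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    /// Take one free block, or `None` if the pool is exhausted.
    fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Return a block to the pool.
    fn release(&mut self, block: usize) {
        self.free.push(block);
    }

    fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Per-sequence block table: logical block index -> physical block id.
struct Sequence {
    block_table: Vec<usize>,
    len: usize,
}

impl Sequence {
    fn new() -> Self {
        Self { block_table: Vec::new(), len: 0 }
    }

    /// Reserve a slot for one more token. A new physical block is
    /// allocated only when the sequence is empty or its last block is full.
    /// Returns the (physical block, offset) of the slot.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let offset = self.len % BLOCK_SIZE;
        if offset == 0 {
            self.block_table.push(alloc.alloc()?);
        }
        self.len += 1;
        Some((*self.block_table.last()?, offset))
    }

    /// Give every block of a finished sequence back to the pool.
    fn free(self, alloc: &mut BlockAllocator) {
        for block in self.block_table {
            alloc.release(block);
        }
    }
}
```

Because no sequence reserves its maximum length up front, the memory wasted per sequence is at most one partially filled block, which is what lets more sequences fit in a batch.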
Unfortunately, we are likely beholden to what upstream GGML supports, as this would be applied at that layer of the execution. This is something we could potentially implement with #312, but even then we'd need to work with I'll leave this issue open for now, but I don't think we'll see much movement here from our end, sorry :/ |
Hello, |
Hi, that would be nice to have! I'm not sure if we'll get around to it any time soon as it'll require updating our GGML version and setting up all of the required structures, but I'll see what can be done once we get there. |
Just found a recent blog https://vllm.ai/ and repo https://github.com/vllm-project/vllm that implements paged attention. Tested this out and it provides massive throughput and memory efficiency improvements.
Can we implement something like this? The paper isn't out yet, but shouldn't Rust be very good at this in theory, with its memory safety guarantees?
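As a rough illustration of how Rust's ownership model could fit here, the sketch below shares prompt blocks between sequences through reference counting and copies a block only when a sequence writes to one that is still shared. It is a toy under stated assumptions: `Block`, `Sequence`, and `BLOCK_SIZE` are hypothetical names, token ids stand in for the real key/value tensors, and nothing here reflects how vLLM or GGML is actually implemented.

```rust
use std::rc::Rc;

/// Token slots per block (an illustrative value).
const BLOCK_SIZE: usize = 4;

/// One cache block. A real cache would hold key/value tensors;
/// token ids stand in for them here.
#[derive(Clone, Debug, PartialEq)]
struct Block {
    tokens: Vec<u32>,
}

/// A sequence is a list of reference-counted blocks.
#[derive(Clone)]
struct Sequence {
    blocks: Vec<Rc<Block>>,
}

impl Sequence {
    /// Split a prompt into blocks of at most `BLOCK_SIZE` tokens.
    fn from_prompt(prompt: &[u32]) -> Self {
        let blocks = prompt
            .chunks(BLOCK_SIZE)
            .map(|chunk| Rc::new(Block { tokens: chunk.to_vec() }))
            .collect();
        Self { blocks }
    }

    /// Fork a sequence: only the `Rc` pointers are cloned,
    /// so every block is shared with the parent.
    fn fork(&self) -> Self {
        self.clone()
    }

    /// Append one token, starting a new block if the last one is full.
    fn push(&mut self, token: u32) {
        let needs_new_block = match self.blocks.last() {
            Some(last) => last.tokens.len() == BLOCK_SIZE,
            None => true,
        };
        if needs_new_block {
            self.blocks.push(Rc::new(Block { tokens: Vec::new() }));
        }
        // Copy-on-write: `Rc::make_mut` clones the block only if
        // another sequence still holds a reference to it.
        let last = self.blocks.last_mut().expect("at least one block");
        Rc::make_mut(last).tokens.push(token);
    }
}
```

Forking a six-token prompt into two sequences and appending a different token to each leaves the full first block shared, while each fork gets its own copy of the partially filled second block.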