Very promising! #22
evansumarosenberg started this conversation in General
-
Nice, so it works under WSL? Same speeds? I was able to push it to 19 tokens/sec on an Ada A6000 on RunPod, splitting the model to use my 4090 w/ the older A6000 using |
-
Fantastic work! I just started using exllama and the performance is very impressive. Here are some benchmarks from my initial testing today using the included benchmarking script (128 tokens generated from a 1920-token prompt); a sketch of the invocation follows the results below.
Model: TheBloke_guanaco-33B-GPTQ
[benchmark table not preserved]
Model: TheBloke_guanaco-65B-GPTQ
[benchmark table not preserved]
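The runs were along these lines (a reconstructed sketch, not the exact command: the model path is a placeholder, and the flags accepted by test_benchmark_inference.py may differ between versions, so check the repo's README):

```
# Sketch of a benchmark run: -d points at the quantized model directory,
# -p runs the performance test. Verify flags against the current README.
python test_benchmark_inference.py -d /path/to/TheBloke_guanaco-33B-GPTQ -p
```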
The 4090 was on my local machine with a Core i9-12900K on Windows 11 (WSL). The A100 and A40 benchmarks were run on an HPC cluster using a single compute node with 32 CPU cores. This is an order-of-magnitude increase in performance compared to using text-generation-webui with GPTQ-for-LLaMa: with the 33B model, I was previously getting 8-10 tokens/sec on the 4090 and 2.5-3 tokens/sec on the A100.
I saw in your updates that you are working on a web UI, which is fantastic. I am more interested in a web API, so I will probably go ahead and implement a quick and dirty server if you're not already close to finishing one. Happy to share the code for that if you're interested.
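A quick and dirty server could be as small as the Flask sketch below. To be clear about what's assumed: the /generate route and JSON field names are placeholders of my own, and generate() just stands in for however the exllama generator is actually invoked.

```python
# Minimal single-model inference server sketch using Flask.
# generate() is a placeholder for the actual exllama call path.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate(prompt: str, max_new_tokens: int) -> str:
    # Placeholder: load the model once at startup and call the
    # exllama generator here.
    raise NotImplementedError

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    body = request.get_json(force=True)
    prompt = body.get("prompt", "")
    max_new_tokens = int(body.get("max_new_tokens", 128))
    text = generate(prompt, max_new_tokens)
    return jsonify({"text": text})

if __name__ == "__main__":
    # Single-threaded on purpose: one model, one GPU, one request at a time.
    app.run(host="0.0.0.0", port=5000, threaded=False)
```

Something like `curl -s -X POST localhost:5000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello", "max_new_tokens": 64}'` would exercise it.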
By the way, I had to figure out a few extra steps to get things running in conda. The following worked for me in both WSL and Linux:
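A minimal sketch of such a setup, assuming a CUDA 11.8 PyTorch wheel and that nvcc is needed for the runtime extension build (adjust versions to your driver; these are not the verbatim original commands):

```
# Fresh environment for exllama
conda create -n exllama python=3.10 -y
conda activate exllama

# PyTorch wheel built against CUDA 11.8 (assumption; match your system)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# The CUDA extension is compiled at load time, so nvcc must be on PATH;
# pulling the toolkit into the env is one way to provide it
conda install -c nvidia cuda-toolkit -y

# Remaining Python dependencies from the repo
pip install -r requirements.txt
```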