Very promising! #22
evansumarosenberg started this conversation in General
-
Nice, so it works under WSL? Same speeds? I was able to push it to 19 tokens/sec on an Ada A6000 on RunPod, splitting the model to use my 4090 w/ the older A6000 using |
-
Fantastic work! I just started using exllama and the performance is very impressive. Here are some benchmarks from my initial testing today using the included benchmarking script (128 tokens generated from a 1920-token prompt); a sketch of the invocation follows the results below.
Model: TheBloke_guanaco-33B-GPTQ
[benchmark table not preserved]
Model: TheBloke_guanaco-65B-GPTQ
[benchmark table not preserved]
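The runs were along these lines (a reconstructed sketch, not the exact command: the model path is a placeholder, and the flags accepted by test_benchmark_inference.py may differ between versions, so check the repo's README):

```
# Sketch of a benchmark run: -d points at the quantized model directory,
# -p runs the performance test. Verify flags against the current README.
python test_benchmark_inference.py -d /path/to/TheBloke_guanaco-33B-GPTQ -p
```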
The 4090 was on my local machine with a Core i9-12900K on Windows 11 (WSL). The A100 and A40 benchmarks were run on an HPC cluster using a single compute node with 32 CPU cores. This is an order-of-magnitude increase in performance compared to using text-generation-webui with GPTQ-for-LLaMa: with the 33B model, I was previously getting 8-10 tokens/sec on the 4090 and 2.5-3 tokens/sec on the A100.
I saw in your updates that you are working on a web UI, which is fantastic. I am more interested in a web API, so I will probably go ahead and implement a quick and dirty server if you're not already close to finishing one. Happy to share the code for that if you're interested.
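A quick and dirty server could be as small as the Flask sketch below. To be clear about what's assumed: the /generate route and JSON field names are placeholders of my own, and generate() just stands in for however the exllama generator is actually invoked.

```python
# Minimal single-model inference server sketch using Flask.
# generate() is a placeholder for the actual exllama call path.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate(prompt: str, max_new_tokens: int) -> str:
    # Placeholder: load the model once at startup and call the
    # exllama generator here.
    raise NotImplementedError

@app.route("/generate", methods=["POST"])
def generate_endpoint():
    body = request.get_json(force=True)
    prompt = body.get("prompt", "")
    max_new_tokens = int(body.get("max_new_tokens", 128))
    text = generate(prompt, max_new_tokens)
    return jsonify({"text": text})

if __name__ == "__main__":
    # Single-threaded on purpose: one model, one GPU, one request at a time.
    app.run(host="0.0.0.0", port=5000, threaded=False)
```

Something like `curl -s -X POST localhost:5000/generate -H 'Content-Type: application/json' -d '{"prompt": "Hello", "max_new_tokens": 64}'` would exercise it.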
By the way, I had to figure out a few extra steps to get things running in conda. The following worked for me in both WSL and Linux:
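A minimal sketch of such a setup, assuming a CUDA 11.8 PyTorch wheel and that nvcc is needed for the runtime extension build (adjust versions to your driver; these are not the verbatim original commands):

```
# Fresh environment for exllama
conda create -n exllama python=3.10 -y
conda activate exllama

# PyTorch wheel built against CUDA 11.8 (assumption; match your system)
pip install torch --index-url https://download.pytorch.org/whl/cu118

# The CUDA extension is compiled at load time, so nvcc must be on PATH;
# pulling the toolkit into the env is one way to provide it
conda install -c nvidia cuda-toolkit -y

# Remaining Python dependencies from the repo
pip install -r requirements.txt
```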