Inference speed regression in running local models on v1.0.2 #4659
Comments
Performance on Metal devices is affected as well (new vs. previous version):
Confirmed the regression:
Also affects non-streaming chat/completions (total time, in seconds):
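For reference, a non-streaming request of this shape can be timed against both versions. This is only a sketch: it assumes the runtime's default HTTP port (8090) and a model registered as `llama3`, so adjust both to match your Spicepod.

```sh
# Time a single non-streaming chat completion against the local runtime.
# Port 8090 and the model name "llama3" are assumptions; adjust as needed.
time curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Explain CUDA streams in two sentences."}],
    "stream": false
  }'
```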
Also affects CPU-only inference:
Difference here: v1.0.1...v1.0.2
So the diff is likely from mistral.rs changes. This is the diff: spiceai/mistral.rs@51994b5...80b548a. Will binary-search the 85 commits to see where performance degrades (see the sketch below).
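A sketch of how that bisection can be driven, assuming a local checkout of the spiceai/mistral.rs fork and a hypothetical `bench.sh` script that rebuilds and exits non-zero whenever tokens/sec falls below an expected threshold:

```sh
# In a local checkout of spiceai/mistral.rs
git bisect start
git bisect bad 80b548a     # known-slow pin (v1.0.2)
git bisect good 51994b5    # known-fast pin (v1.0.1)
# bench.sh is a placeholder: rebuild, run a fixed prompt, and
# exit non-zero if tokens/sec drops below the v1.0.1 baseline.
git bisect run ./bench.sh
git bisect reset
```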
Issue is in mistral.rs, resolved in spiceai/mistral.rs#23.
In spiceai: #4665.
The above PR is now in trunk; resolved.
Describe the bug
After we released the v1.0.2 binaries, post-release testing found that the inference speed of local models dropped significantly. With v1.0.1 I was able to get 10 tok/s on the meta-llama/Llama-3.2-3B-Instruct Hugging Face model running on an NVIDIA RTX 3060 GPU. On v1.0.2 it dropped to 2 tok/s for the same model and GPU, a 5x speed decrease.

To Reproduce
Run the below Spicepod on a Windows machine with a GPU on v1.0.1 and note the tokens/sec. Run the same Spicepod with v1.0.2 and notice the inference speed has decreased.

Expected behavior
The inference speed is the same or better in subsequent versions.
Runtime Details
Spicepod
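The original Spicepod attachment is not reproduced here. A minimal sketch that loads the model from the bug description would look roughly like the following; the exact version string, `huggingface:` path prefix, and parameter names may differ between Spice releases, so treat it as an approximation.

```yaml
version: v1beta1
kind: Spicepod
name: llama-local-inference

models:
  # The local Hugging Face model from the bug description
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      # Llama 3.2 is a gated repo, so a Hugging Face token is usually required
      hf_token: ${ secrets:HF_TOKEN }
```

With the pod running under each version, tokens/sec can be compared via `spice chat` or the chat/completions request shown above.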
Screenshots
Here is a screenshot of the GPU usage when running the Llama model on 1.0.2. Notice that the GPU memory has increased, indicating the model has been loaded into memory. Also notice that the GPU 3D compute (which is where the CUDA operations happen) is not maxed out; in previous versions, this was maxed out during inferencing.

Running the non-CUDA models binary (i.e. spice install ai --cpu) performed even worse, which indicates the GPU was still being utilized, but not to its fullest extent.

Additional context
I suspect a regression in the mistral.rs library or our usage of it.

Additional logs:
- CUDA (fast): https://gist.github.com/sgrebnov/2dc75a4b34ee16f016f6d8ec4c2199dc
- CUDA (slow): https://gist.github.com/sgrebnov/abc9e0e6ace8788717f0ee0b2c7625db