
Inference speed regression when running local models on v1.0.2 #4659

Closed
phillipleblanc opened this issue Feb 4, 2025 · 9 comments


phillipleblanc commented Feb 4, 2025

Describe the bug

During post-release testing of the v1.0.2 binaries, we found that the inference speed of local models dropped significantly. With v1.0.1 I was able to get 10 tok/s from the meta-llama/Llama-3.2-3B-Instruct Hugging Face model running on an NVIDIA RTX 3060 GPU. On v1.0.2 it dropped to 2 tok/s for the same model and GPU, a 5x slowdown.

To Reproduce

Run the Spicepod below on a Windows machine with a GPU on v1.0.1 and note the tokens/sec. Run the same Spicepod on v1.0.2 and notice that the inference speed has decreased.

Expected behavior

The inference speed is the same or better in subsequent versions.

Runtime Details

Spicepod

version: v1
kind: Spicepod
name: llama

models:
  - from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    name: llama
    params:
      hf_token: ${secrets:HF_TOKEN}
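
To make the comparison repeatable, tokens/sec can be measured with a small script like the sketch below. The assumptions (not stated in the issue itself) are that the runtime exposes its OpenAI-compatible chat completions endpoint at http://localhost:8090/v1/chat/completions and that the model is named llama as in the Spicepod above; each streamed SSE content chunk is counted as roughly one token, which is accurate enough to show a 5x gap between versions.

```python
# Rough tokens/sec probe against the runtime's OpenAI-compatible endpoint.
# Port, path, and model name are assumptions; adjust to your setup.
import json
import time

import requests

URL = "http://localhost:8090/v1/chat/completions"  # assumed default HTTP port


def measure_tok_per_s(prompt: str) -> float:
    body = {
        "model": "llama",  # model name from the Spicepod above
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    chunks = 0
    first = None
    with requests.post(URL, json=body, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        start = time.perf_counter()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            delta = json.loads(payload)["choices"][0].get("delta", {})
            if delta.get("content"):
                if first is None:
                    first = time.perf_counter()
                chunks += 1  # treat each content chunk as ~1 token
    elapsed = time.perf_counter() - (first or start)
    return chunks / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    print(f"{measure_tok_per_s('Write a short paragraph about GPUs.'):.1f} tok/s")
```

Run it once against v1.0.1 and once against v1.0.2 with the same Spicepod; on the RTX 3060 described above the two numbers differ by roughly 5x.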

Screenshots

Here is a screenshot of the GPU usage when running the Llama model on v1.0.2. Notice that the GPU memory has increased, indicating the model has been loaded into GPU memory. Also notice that the GPU 3D compute (which is where the CUDA operations happen) is not maxed out. In previous versions, this was maxed out during inference.

[Screenshot: GPU usage while running the Llama model on v1.0.2]

Running the non-CUDA models binary (i.e. spice install ai --cpu) performed even worse, which indicates the GPU was still being utilized, just not to its fullest extent.
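
To put a number on the "not maxed out" observation, GPU utilization can be sampled from a second terminal while a chat request is running. This is a generic nvidia-smi polling sketch, nothing Spice-specific; note that nvidia-smi reports overall GPU utilization rather than the Task Manager "3D" graph specifically.

```python
# Sample GPU utilization and memory once per second for ~30 seconds.
# Run this in a second terminal while a chat completion is in flight.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]

for _ in range(30):
    out = subprocess.check_output(QUERY, text=True).strip()
    for gpu_line in out.splitlines():  # one line per GPU
        util, mem = (v.strip() for v in gpu_line.split(","))
        print(f"GPU util: {util:>3}%   memory used: {mem} MiB")
    time.sleep(1)
```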

Additional context

I suspect a regression in the mistral.rs library or our usage of it.

Additional logs:

  • CUDA (fast): https://gist.github.com/sgrebnov/2dc75a4b34ee16f016f6d8ec4c2199dc
  • CUDA (slow): https://gist.github.com/sgrebnov/abc9e0e6ace8788717f0ee0b2c7625db

phillipleblanc added this to the v1.0.2 milestone Feb 4, 2025

sgrebnov commented Feb 4, 2025

Performance on Metal devices is affected as well. Below are runs on the new version followed by the previous version:

Using model: hf_local_model
chat> test
It seems like there might be some confusion. "test" on its own is not very specific as a command or request. If you are looking to speak with a chatbot or need assistance with something, please provide more context or clarify your request so that I can offer the best possible support.

Time: 4.90s (first token 0.40s). Tokens: 119. Prompt: 59. Completion: 60 (13.33/s).

chat> ^C
spice chat
Using model: hf_local_model
chat> test
It seems like there might be some confusion. "test" on its own is not very specific as a command or request. If you are looking to speak with a chatbot or need assistance with something, please provide more context or clarify your request so that I can offer the best possible support.

Time: 2.86s (first token 0.22s). Tokens: 119. Prompt: 59. Completion: 60 (22.72/s).
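
For context on how the per-second figure is derived: it appears to be completion tokens divided by the generation time after the first token (an assumption about the CLI, but it reproduces both numbers above).

```python
def completion_tok_per_s(total_s: float, first_token_s: float, completion_tokens: int) -> float:
    # Assumed formula: completion tokens / time spent generating after the first token.
    return completion_tokens / (total_s - first_token_s)


print(completion_tok_per_s(4.90, 0.40, 60))  # ~13.3 tok/s (slower run above)
print(completion_tok_per_s(2.86, 0.22, 60))  # ~22.7 tok/s (faster run above)
```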


Jeadie commented Feb 4, 2025

Confirmed the regression is:

  • present on Metal
  • introduced between v1.0.1 and v1.0.2 (i.e. it is new in this release, not something missed in prior releases).


Jeadie commented Feb 4, 2025

Also affects non-streaming chat/completions. Total request time in seconds, three runs each:

Old: 1.240, 1.061, 1.118
New: 5.782, 4.824, 5.091
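
For anyone reproducing these numbers, they can be collected with a few timed non-streaming requests; the endpoint, port, and model name follow the same assumptions as the streaming sketch earlier in the thread.

```python
# Time three non-streaming chat completions end to end.
import time

import requests

URL = "http://localhost:8090/v1/chat/completions"  # assumed default HTTP port
BODY = {
    "model": "llama",  # model name from the Spicepod above
    "messages": [{"role": "user", "content": "test"}],
    "stream": False,
}

for i in range(3):
    start = time.perf_counter()
    requests.post(URL, json=BODY, timeout=300).raise_for_status()
    print(f"run {i + 1}: {time.perf_counter() - start:.3f}s")
```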


Jeadie commented Feb 4, 2025

Also affects CPU-only inference. Total request time in seconds, three runs each:

Old: 3.303, 2.758, 2.977
New: 7.456, 8.817, 7.860


Jeadie commented Feb 4, 2025

Difference here: v1.0.1...v1.0.2

  • Merge mistral upstream #4562 (as per the mistral.rs diff):
    - DeviceMapMetadata::dummy(),
    + DeviceMapSetting::Auto(AutoDeviceMapParams::default_text()),
  • This should not be the issue; IIRC ::dummy() does essentially the same thing.


Jeadie commented Feb 5, 2025

So the regression is likely from mistral.rs changes. This is the diff: spiceai/mistral.rs@51994b5...80b548a

Will binary-search the 85 commits to see where performance degrades.
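
One way to drive that binary search is git bisect run with a small timing script. This is only a sketch: BUILD_CMD and BENCH_CMD below are hypothetical placeholders for however the runtime is built against the checked-out mistral.rs commit and timed on a fixed prompt; the only part git bisect actually requires is the exit-code convention (0 = fast/good, non-zero = slow/bad, 125 = skip this commit).

```python
#!/usr/bin/env python3
# Test script for `git bisect run` over the mistral.rs commit range above.
import subprocess
import sys
import time

BUILD_CMD = ["cargo", "build", "--release"]    # placeholder: adjust to your build
BENCH_CMD = ["./target/release/bench_prompt"]  # hypothetical fixed-prompt benchmark
THRESHOLD_S = 3.0                              # "good" runs finish well under this

if subprocess.run(BUILD_CMD).returncode != 0:
    sys.exit(125)  # 125 tells git bisect to skip commits that do not build

start = time.perf_counter()
if subprocess.run(BENCH_CMD).returncode != 0:
    sys.exit(125)  # also skip commits where the benchmark itself fails
elapsed = time.perf_counter() - start

sys.exit(0 if elapsed < THRESHOLD_S else 1)  # 0 = good (fast), 1 = bad (slow)
```

Roughly: git bisect start <bad-commit> <good-commit>, then git bisect run ./bisect_speed.py, over the 51994b5...80b548a range.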


Jeadie commented Feb 5, 2025

The issue is in mistral.rs; resolved in spiceai/mistral.rs#23.


Jeadie commented Feb 5, 2025

In spiceai, the corresponding change is #4665.


Jeadie commented Feb 5, 2025

The above PR is now in trunk; resolved.

Jeadie closed this as completed Feb 5, 2025