Inference speed regression in running local models on v1.0.2 #4659
Comments
Performance on Metal devices is affected as well (new vs. previous version):
Confirmed the regression:
Also affects non-streaming chat/completions (total time, in seconds):
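For reference, a non-streaming request of this shape can be timed against both versions. This is only a sketch: it assumes the runtime's default HTTP port (8090) and a model registered as `llama3`, so adjust both to match your Spicepod.

```sh
# Time a single non-streaming chat completion against the local runtime.
# Port 8090 and the model name "llama3" are assumptions; adjust as needed.
time curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Explain CUDA streams in two sentences."}],
    "stream": false
  }'
```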
Also affects CPU-only inference:
Difference here: v1.0.1...v1.0.2
So the diff is likely from mistral.rs changes. This is the diff: spiceai/mistral.rs@51994b5...80b548a. Will binary-search the 85 commits to see where performance degrades (see the sketch below).
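A sketch of how that bisection can be driven, assuming a local checkout of the spiceai/mistral.rs fork and a hypothetical `bench.sh` script that rebuilds and exits non-zero whenever tokens/sec falls below an expected threshold:

```sh
# In a local checkout of spiceai/mistral.rs
git bisect start
git bisect bad 80b548a     # known-slow pin (v1.0.2)
git bisect good 51994b5    # known-fast pin (v1.0.1)
# bench.sh is a placeholder: rebuild, run a fixed prompt, and
# exit non-zero if tokens/sec drops below the v1.0.1 baseline.
git bisect run ./bench.sh
git bisect reset
```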
Issue is in mistral.rs, resolved in spiceai/mistral.rs#23.
In spiceai: #4665.
The above PR is now in trunk; resolved.
Describe the bug
After we released the v1.0.2 binaries, post-release testing found that the inference speed of local models dropped significantly. With v1.0.1 I was able to get 10 tok/s on the meta-llama/Llama-3.2-3B-Instruct Hugging Face model running on an NVIDIA RTX 3060 GPU. On v1.0.2 it dropped to 2 tok/s for the same model and GPU, a 5x speed decrease.

To Reproduce
Run the below Spicepod on a Windows machine with a GPU on v1.0.1 and note the tokens/sec. Run the same Spicepod with v1.0.2 and notice the inference speed has decreased.

Expected behavior
The inference speed is the same or better in subsequent versions.
Runtime Details
Spicepod
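The original Spicepod attachment is not reproduced here. A minimal sketch that loads the model from the bug description would look roughly like the following; the exact version string, `huggingface:` path prefix, and parameter names may differ between Spice releases, so treat it as an approximation.

```yaml
version: v1beta1
kind: Spicepod
name: llama-local-inference

models:
  # The local Hugging Face model from the bug description
  - name: llama3
    from: huggingface:huggingface.co/meta-llama/Llama-3.2-3B-Instruct
    params:
      # Llama 3.2 is a gated repo, so a Hugging Face token is usually required
      hf_token: ${ secrets:HF_TOKEN }
```

With the pod running under each version, tokens/sec can be compared via `spice chat` or the chat/completions request shown above.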
Screenshots
Here is a screenshot of the GPU usage when running the Llama model on 1.0.2. Notice that the GPU memory has increased, indicating the model has been loaded into memory. Also notice that the GPU 3D compute (which is where the CUDA operations happen) is not maxed out; in previous versions, this was maxed out during inferencing.

Running the non-CUDA models binary (i.e. spice install ai --cpu) performed even worse, which indicates the GPU was still being utilized, but not to its fullest extent.

Additional context
I suspect a regression in the mistral.rs library or our usage of it.

Additional logs:
- CUDA (fast): https://gist.github.com/sgrebnov/2dc75a4b34ee16f016f6d8ec4c2199dc
- CUDA (slow): https://gist.github.com/sgrebnov/abc9e0e6ace8788717f0ee0b2c7625db