[Feature] Initial support for multi-LoRA serving #1307

Merged (3 commits) on Sep 12, 2024

Conversation

@Ying1123 (Member) commented Sep 3, 2024

This PR adds initial support for multi-LoRA serving. Currently, it supports LoRA on the attention (qkvo) and MLP (gate, up, down) linear layers. It supports dynamic loading and offloading, but it does not yet support unified memory. The memory pool for LoRA adapters is pre-allocated, so please use a smaller --mem-frac when launching the server with a larger --max-loras-per-batch.

Current example usage:

# launch server
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora /home/ying/test_lora_1 /home/ying/test_lora_2 /home/ying/test_lora_3 /home/ying/test_lora_4 --disable-radix --disable-cuda-graph --max-loras-per-batch 4

# send requests
# lora_path[i] specifies the LoRA adapter used for text[i], so make sure the two lists have the same length
# use None for a base-model-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
import json
import requests

url = "http://127.0.0.1:30000"
json_data = {
    "text": ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5", "prompt 6", "prompt 7"],
    "sampling_params": {"max_new_tokens": 32},
    "lora_path": ["/home/ying/test_lora", "/home/ying/test_lora_1", "/home/ying/test_lora_2", "/home/ying/test_lora_3", "/home/ying/test_lora_4", "/home/ying/test_lora", "/home/ying/test_lora_1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(json.dumps(response.json()))
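As noted in the comments above, a None entry in lora_path routes that prompt to the base model. Below is a minimal sketch of such a mixed request, reusing the server address and adapter paths from the example above (the prompt strings are placeholders). Also note that since the adapter pool is pre-allocated, raising --max-loras-per-batch usually means lowering the static memory fraction (e.g. --mem-fraction-static 0.5, as used later in this thread).

import requests

# Mixed batch: the first prompt runs on the base model (None), the second on a LoRA adapter.
# Both lists must have the same length; the adapter path must be one loaded at launch.
mixed_request = {
    "text": ["base-model prompt", "adapter prompt"],
    "sampling_params": {"max_new_tokens": 32},
    "lora_path": [None, "/home/ying/test_lora"],
}
response = requests.post("http://127.0.0.1:30000/generate", json=mixed_request)
print(response.json())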

Further improvements can be expected in follow-up PRs.

References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving

@Ying1123 marked this pull request as draft on September 3, 2024 01:04
@Ying1123 mentioned this pull request on Sep 3, 2024 (29 tasks)
@Ying1123 changed the title from "Initial support for multi-LoRA serving" to "[Feature] Initial support for multi-LoRA serving" on Sep 3, 2024
@Ying1123 force-pushed the new_lora branch 21 times, most recently from 7092431 to ff5f51d, on September 11, 2024 07:33
@Ying1123 marked this pull request as ready for review on September 11, 2024 09:03
@binarycrayon (Contributor) commented:

Tested and looks great!

@Ying1123 force-pushed the new_lora branch 7 times, most recently from f639a8d to 59b1385, on September 12, 2024 20:26
@upskyy (Contributor) commented Oct 22, 2024

@Ying1123 @merrymercy
I also get the same error when serving the gemma2 model with multi-LoRA. Has the error been resolved?
I tested with the image lmsysorg/sglang:v0.3.2-cu121.

python3 -m sglang.launch_server --model-path /base_model --tokenizer-path /base_model --lora-paths /lora_model0 /lora_model1  --disable-radix --disable-cuda-graph --max-loras-per-batch 2 --mem-fraction-static 0.5 --random-seed 0 --enable-torch-compile

AttributeError: 'Gemma2ForCausalLM' object has no attribute 'get_module_name'

Related issues: #1416

Resolved PR: #2330 (comment)
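For context, the LoRA code path apparently calls a get_module_name hook on the model class, which Gemma2ForCausalLM lacked in v0.3.2; that is what the AttributeError above reports. A quick way to check whether your installed version has the hook before launching with --lora-paths is sketched below; the import path is an assumption about sglang's internal layout, not a documented API, and may differ between versions.

# Hypothetical check only: the import path is an assumption about sglang's
# internal source layout and may differ between versions.
from sglang.srt.models.gemma2 import Gemma2ForCausalLM

# The AttributeError above means this LoRA hook is missing on the model class.
if not hasattr(Gemma2ForCausalLM, "get_module_name"):
    print("Gemma2ForCausalLM has no get_module_name; multi-LoRA serving will fail for this model.")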

@xijiz commented Dec 18, 2024

@Ying1123 Does this feature support multimodal models such as Qwen2-VL? Currently vLLM does not support this.
