[Feature] Initial support for multi-LoRA serving #1307

Merged (3 commits) on Sep 12, 2024

Conversation

@Ying1123 (Member) commented Sep 3, 2024

This PR adds initial support for multi-LoRA serving. Currently, it supports LoRA on the attention (qkvo) and MLP (gate, up, down) linear layers. It supports dynamic loading and offloading, but it does not yet support unified memory. The memory pool for LoRA adapters is pre-allocated, so please use a smaller --mem-frac when launching the server with a larger --max-loras-per-batch.

Current example usage:

# launch server
python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora /home/ying/test_lora_1 /home/ying/test_lora_2 /home/ying/test_lora_3 /home/ying/test_lora_4 --disable-radix --disable-cuda-graph --max-loras-per-batch 4

# send requests
# lora_path[i] specifies the LoRA adapter used for text[i], so make sure the two lists have the same length
# use None for a base-model-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
import json
import requests

url = "http://127.0.0.1:30000"
json_data = {
    "text": ["prompt 1", "prompt 2", "prompt 3", "prompt 4", "prompt 5", "prompt 6", "prompt 7"],
    "sampling_params": {"max_new_tokens": 32},
    "lora_path": ["/home/ying/test_lora", "/home/ying/test_lora_1", "/home/ying/test_lora_2", "/home/ying/test_lora_3", "/home/ying/test_lora_4", "/home/ying/test_lora", "/home/ying/test_lora_1"],
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(json.dumps(response.json()))
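As noted in the comments above, a None entry in lora_path routes that prompt to the base model. Below is a minimal sketch of such a mixed request, reusing the server address and adapter paths from the example above (the prompt strings are placeholders). Also note that since the adapter pool is pre-allocated, raising --max-loras-per-batch usually means lowering the static memory fraction (e.g. --mem-fraction-static 0.5, as used later in this thread).

import requests

# Mixed batch: the first prompt runs on the base model (None), the second on a LoRA adapter.
# Both lists must have the same length; the adapter path must be one loaded at launch.
mixed_request = {
    "text": ["base-model prompt", "adapter prompt"],
    "sampling_params": {"max_new_tokens": 32},
    "lora_path": [None, "/home/ying/test_lora"],
}
response = requests.post("http://127.0.0.1:30000/generate", json=mixed_request)
print(response.json())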

Further improvements can be expected in follow-up PRs.

References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving

@Ying1123 marked this pull request as draft on September 3, 2024 01:04
@Ying1123 mentioned this pull request on Sep 3, 2024 (29 tasks)
@Ying1123 changed the title from "Initial support for multi-LoRA serving" to "[Feature] Initial support for multi-LoRA serving" on Sep 3, 2024
@Ying1123 force-pushed the new_lora branch 21 times, most recently from 7092431 to ff5f51d, on September 11, 2024 07:33
@Ying1123 marked this pull request as ready for review on September 11, 2024 09:03
@binarycrayon (Contributor) commented:

Tested and looks great!

@Ying1123 force-pushed the new_lora branch 7 times, most recently from f639a8d to 59b1385, on September 12, 2024 20:26
@upskyy (Contributor) commented Oct 22, 2024

@Ying1123 @merrymercy
I also get the same error when serving the gemma2 model with multi-LoRA. Has the error been resolved?
I tested with the image lmsysorg/sglang:v0.3.2-cu121.

python3 -m sglang.launch_server --model-path /base_model --tokenizer-path /base_model --lora-paths /lora_model0 /lora_model1  --disable-radix --disable-cuda-graph --max-loras-per-batch 2 --mem-fraction-static 0.5 --random-seed 0 --enable-torch-compile

AttributeError: 'Gemma2ForCausalLM' object has no attribute 'get_module_name'

Related issues: #1416

Resolved PR: #2330 (comment)
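For context, the LoRA code path apparently calls a get_module_name hook on the model class, which Gemma2ForCausalLM lacked in v0.3.2; that is what the AttributeError above reports. A quick way to check whether your installed version has the hook before launching with --lora-paths is sketched below; the import path is an assumption about sglang's internal layout, not a documented API, and may differ between versions.

# Hypothetical check only: the import path is an assumption about sglang's
# internal source layout and may differ between versions.
from sglang.srt.models.gemma2 import Gemma2ForCausalLM

# The AttributeError above means this LoRA hook is missing on the model class.
if not hasattr(Gemma2ForCausalLM, "get_module_name"):
    print("Gemma2ForCausalLM has no get_module_name; multi-LoRA serving will fail for this model.")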

@xijiz commented Dec 18, 2024

@Ying1123 Does this feature support multimodal models such as Qwen2-VL? Currently vLLM does not support this.
