[Feature] Initial support for multi-LoRA serving #1307
Conversation
Force-pushed 7092431 to ff5f51d
Tested and looks great!
Force-pushed f639a8d to 59b1385
@Ying1123 @merrymercy Launching with
python3 -m sglang.launch_server --model-path /base_model --tokenizer-path /base_model --lora-paths /lora_model0 /lora_model1 --disable-radix --disable-cuda-graph --max-loras-per-batch 2 --mem-fraction-static 0.5 --random-seed 0 --enable-torch-compile
fails with: AttributeError: 'Gemma2ForCausalLM' object has no attribute 'get_module_name'.
Related issue: #1416. Resolved by PR #2330 (comment).
@Ying1123 Does this feature support multimodal models such as Qwen2-VL? Currently vLLM does not support this.
This PR gives initial multi-LoRA serving support. Currently, it supports LoRA on the attention (qkvo) and MLP (gate, up, down) linear layers. It supports dynamic loading and offloading, but it does not support unified memory. The memory pool for LoRA adapters is pre-allocated, so please use a smaller --mem-frac to launch the server with a larger --max-loras-per-batch.
Current example usage:
You can expect the items below in the follow-up PRs.
References:
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Punica: Multi-Tenant LoRA Serving