Support LoRA adapter #289
Conversation
Signed-off-by: mymusise <[email protected]>
There is no module named 'vllm.model_executor.adapters' |
@FarziBuilder The code is not part of this repo; it is in a different fork. |
@Saiteja-Tallam-Infrrd so I need to git clone and pip install from that fork. Which fork has he written this code in? |
@Saiteja-Tallam-Infrrd what fork are you referring to? I pip installed the troph-team:support_peft fork on the support_peft branch and got the same error as @FarziBuilder when trying to run |
@efraisse I installed from the mentioned fork and I was able to use it .. |
@Saiteja-Tallam-Infrrd I think I made a mistake while cloning the repo. I was able to get it to work as well. |
Hey, I see that this only works for q/v LoRAs. However, most QLoRA fine-tunes use all of the q, k, v, o, up and down projection layers for the Llama architecture. Is there a way to get all of them to work? |
Do you have to pull down Llama2 commits and merge them together in the meantime to work with L2 models? |
It seems that it works only for single-GPU inference and does not support tensor parallelism. Could it be supported in the future, or is there a quick way to make it work with Ray? |
Thank you very much for your excellent work! It really helps. There is an error on my side:

```
File "/projectnb/pnn/test_2/IntuitLLMProject/lib/data_manager.py", line 135, in d_eval_g_data_loader
    lora.LoRAModel.from_pretrained(pipe.llm_engine.workers[0].model, g_saver_dir + '/adapter')
  File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
    cls.load_adapter(layers, config)
  File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
    new_model = VllmLoRA(
  File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 34, in __init__
    self.active_adapter = adapter_name
  File "/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754, in __setattr__
    super().__setattr__(name, value)
AttributeError: can't set attribute
```

My adapter config is:

```json
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "meta-llama/Llama-2-7b-hf",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj"
  ],
  "task_type": "CAUSAL_LM"
}
```

Thank you very much in advance! |
Please note that this problem can be solved by commenting out this line: `self.active_adapter = adapter_name`. And thank you very much again for your excellent work! @mymusise |
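For context on why commenting that line works: the AttributeError most likely comes from newer peft releases turning `active_adapter` into a read-only property on the LoRA layer classes, so a plain assignment in `VllmLoRA.__init__` collides with it. A self-contained illustration of the failure mode and workaround (this is not the fork's actual code):

```python
# Minimal illustration: if a parent class defines `active_adapter` as a
# read-only property, a plain `self.active_adapter = ...` in a subclass's
# __init__ raises "AttributeError: can't set attribute".
import torch.nn as nn


class Parent(nn.Module):
    @property
    def active_adapter(self):            # read-only, like recent peft layer classes
        return getattr(self, "_active_adapter", "default")


class Child(Parent):
    def __init__(self, adapter_name: str):
        super().__init__()
        # self.active_adapter = adapter_name   # AttributeError: can't set attribute
        self._active_adapter = adapter_name    # writing the backing field works


c = Child("my_adapter")
print(c.active_adapter)  # -> "my_adapter"
```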
git clone --branch support_peft https://github.com/troph-team/vllm.git |
Note that for anyone else watching this issue who missed the news, there's an active PR into vLLM to add most of the tricks from the S-LoRA paper, which is a very elegant way of serving up to thousands of LoRAs simultaneously! #1804 |
Wow, in what cases do we have to serve thousands of LoRAs? |
Great work, I'm waiting for this FEATURE, when will this PR be merged? |
mark |
mark. Is there any merged PR for LoRA which supports target modules including linear layers (o_proj, lm_head, etc.)? |
Traceback (most recent call last): Does anyone have the same error as me? |
Please check my solution: https://github.com/SuperBruceJia/vllm

```bash
git clone --branch support_peft https://github.com/SuperBruceJia/vllm.git
cd vllm
pip install -e . --user
```

Special Notice:

```python
target_modules=[
    "q_proj",
    "k_proj",
    "v_proj",
],
```

Please let me know if you have any questions! Best regards, Shuyue |
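Since adapters trained with other target modules are not covered by this fork, a quick sanity check on the saved adapter_config.json may help; the path below is a placeholder:

```python
# Hypothetical sanity check: verify the adapter was trained with only the
# q/k/v projection modules this fork patches; other modules (o_proj,
# up_proj, down_proj, lm_head, ...) are not handled by it.
import json

with open("YOUR_SAVED_ADAPTER_PATH/adapter_config.json") as f:
    adapter_config = json.load(f)

supported = {"q_proj", "k_proj", "v_proj"}
unsupported = set(adapter_config["target_modules"]) - supported
if unsupported:
    print(f"These target modules are not handled by this fork: {sorted(unsupported)}")
```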
@mymusise |
@SuperBruceJia, does your solution accommodate the ChatGLM2 model? If I intend to utilize ChatGLM2, which code should I modify? I presume I need to add a MODEL_LAYER_MAPPING in mapping.py, yet the layer names differ from those of Llama, and |
I think you could, but you need to have a LoRA adapter for the ChatGLM2 model. First of all, I suggest adding a LoRA adapter to your base ChatGLM2 model:

```python
from peft import LoraConfig, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)

model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm2-6b")
lora_config = LoraConfig(
    r=lora_r,                      # lora_r, lora_alpha, lora_dropout: your chosen hyperparameters
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)
model.add_adapter(lora_config, adapter_name="adapter")
model.enable_adapters()
```

After connecting (and maybe going through several rounds of training) the adapter, you need to save it to a folder in your local directory:

```python
trainer.train()                              # Train the adapter
trainer.model.save_pretrained(save_path)     # Only the adapter will be saved.
```

Afterwards, you can load the adapter with vLLM:

```python
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora

# Create an LLM.
llm = LLM(model="THUDM/chatglm2-6b", gpu_memory_utilization=0.85)

# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "save_path")

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

If you have any further questions, please let me know. Best regards, Shuyue |
Please take a look at the fine-tuning code for the LLaMA 2 (7B) model. Best regards, Shuyue |
@SuperBruceJia Hello, I can now load the LoRA adapter generated from fine-tuning the Llama-2-7b-chat-hf model. I've noticed that it performs quite consistently in generative tasks, but when it comes to detailed inference tasks, its output is unstable and the error rate is relatively high. If I don't load it through vLLM, I can consistently infer the correct content. Have you encountered this situation before? |
During inference using the fine-tuned pre-trained model, the model's generations were worse. However, the performance was pretty good when loading the original base model together with only the fine-tuned LoRA adapter, like this:

```python
llama_path = "YOUR_LLAMA_MODEL_PATH"       # The original pre-trained model is not fine-tuned
adapter_path = "YOUR_SAVED_ADAPTER_PATH"   # Only the LoRA adapter is fine-tuned

llm = LLM(model=llama_path, tensor_parallel_size=1, gpu_memory_utilization=0.85)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_path)
```

Please inform me if you have found a solution. Best regards, Shuyue |
@SuperBruceJia Thank you for your code. This way makes the response better, but it is still worse than the way without vLLM. |
Solved by #1804 |
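For anyone arriving here after the merge, a minimal sketch of the multi-LoRA API added by #1804; the model name and adapter path below are placeholders, and argument names reflect recent vLLM releases (check the current docs):

```python
# Sketch of the multi-LoRA API merged via #1804 (recent vLLM releases).
# The base model and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0)

outputs = llm.generate(
    ["The capital of France is"],
    sampling_params,
    # (adapter name, unique integer id, local path of the saved adapter)
    lora_request=LoRARequest("my_adapter", 1, "YOUR_SAVED_ADAPTER_PATH"),
)
print(outputs[0].outputs[0].text)
```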
Hi @SuperBruceJia, thank you for providing the LoRA support code. I tried to install from source, but received an error with pyproject. Do you have any idea how to fix this? |
Sorry, I haven't run into this issue yet. It seems to be related to the version of CUDA being used, as described here. |
@SuperBruceJia Thanks. May I ask which CUDA version is applicable? I tried CUDA 12.1, which is not working, as well as CUDA 11.8. |
Please open your terminal and check the CUDA version on your machine. |
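In case it helps, the CUDA version that PyTorch (and therefore a vLLM source build) is using can also be checked from Python; these are standard PyTorch attributes, nothing specific to this fork:

```python
# Standard PyTorch attributes for inspecting the CUDA setup; building vLLM
# from source generally needs a CUDA toolkit compatible with the one
# PyTorch was built against.
import torch

print(torch.version.cuda)         # CUDA version PyTorch was built with, e.g. "11.8" or "12.1"
print(torch.cuda.is_available())  # True if a compatible driver and GPU are visible
```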
hi guys,

We found that inference with vllm can greatly improve performance! But we need to use LoRA (peft) in inference. We also found that the community has a strong demand for LoRA. #182

After reading the model implementation of vllm, we found there are some differences from huggingface's transformers, so we cannot directly use peft to add LoRA with vllm.

So we added an extra module to add LoRA weights to qkv; an example of use is shown at the end of this description.

Currently it only supports LoRA models with ["q_proj", "v_proj"] target modules, like opt and llama. And it's not yet supported to use LoRA in the case of using tensor parallelism.
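The usage pattern of this fork, as shown in the comments above, is roughly the following; the base model name and adapter path are placeholders, and the `vllm.model_executor.adapters` module only exists on the support_peft branch:

```python
# Rough sketch of the fork's usage pattern (support_peft branch only);
# the base model and adapter path below are placeholders.
from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora

llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.85)

# Attach the LoRA weights (q_proj / v_proj) to the already-loaded model.
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "YOUR_SAVED_ADAPTER_PATH")

outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0))
print(outputs[0].outputs[0].text)
```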