Support LoRA adapter #289

Closed
mymusise wants to merge 4 commits

Conversation

@mymusise commented Jun 28, 2023

Hi guys,
We found that inference with vLLM can greatly improve performance, but we need to use LoRA (peft) at inference time.
We also found that the community has a strong demand for LoRA support. #182
After reading vLLM's model implementation, we found that it differs in places from Hugging Face transformers, so we cannot directly use peft to add LoRA on top of vLLM.

So we added an extra module to add LoRA weights to the qkv projection. The following is a usage example:

from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora

# Create an LLM.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.05)

# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "edbeeching/opt-125m-imdb-lora")

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0, top_k=-1)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Currently, this only supports LoRA models whose target modules are ["q_proj", "v_proj"], e.g. OPT and LLaMA.

Also, LoRA is not yet supported when tensor parallelism is used.
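For context, the LoRA update injected into the attention projections is conceptually just a low-rank residual added on top of the frozen linear weight. A minimal sketch of the math in PyTorch (illustrative names only, not this PR's actual code):

import torch

def lora_forward(x, W, lora_A, lora_B, lora_alpha, r):
    # Frozen base projection: y = x @ W^T
    base = x @ W.T
    # Low-rank residual: (alpha / r) * x @ A^T @ B^T, where A is (r, in) and B is (out, r)
    delta = (lora_alpha / r) * ((x @ lora_A.T) @ lora_B.T)
    return base + delta

# Toy shapes: 2 tokens, hidden size 8, LoRA rank 4
x = torch.randn(2, 8)
W = torch.randn(8, 8)        # frozen q_proj / v_proj weight
lora_A = torch.randn(4, 8)   # trained LoRA A matrix
lora_B = torch.zeros(8, 4)   # trained LoRA B matrix (zero-initialized before training)
print(lora_forward(x, W, lora_A, lora_B, lora_alpha=16, r=4).shape)  # torch.Size([2, 8])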

mymusise added 2 commits June 28, 2023 17:07
Signed-off-by: mymusise <[email protected]>
Signed-off-by: mymusise <[email protected]>
@mymusise mymusise changed the title Add lora support Add LoRA support Jun 28, 2023
@mymusise mymusise changed the title Add LoRA support Support LoRA adapter Jun 28, 2023
@FarziBuilder

There is no module named 'vllm.model_executor.adapters'

@Saiteja-Tallam-Infrrd commented Jul 14, 2023

@FarziBuilder The code is not part of this repo; it is in a different fork.

@FarziBuilder

@Saiteja-Tallam-Infrrd So I need to git clone and pip install from that fork. Which fork did he write this code in?

@efraisse commented Jul 25, 2023

@Saiteja-Tallam-Infrrd which fork are you referring to? I pip installed the troph-team:support_peft fork on the support_peft branch and got the same error as @FarziBuilder when running `from vllm.model_executor.adapters import lora`: No module named 'vllm.model_executor.adapters'.

@Saiteja-Tallam-Infrrd

@efraisse I installed from the mentioned fork and I was able to use it.

@efraisse

@Saiteja-Tallam-Infrrd I think I made a mistake while cloning the repo. I was able to get it to work as well.

@nivibilla commented Jul 28, 2023

Hey, I see that this only works for q/v LoRAs. However, most QLoRA fine-tunes target all of the k, q, v, o, up, and down projection layers of the LLaMA architecture. Is there a way to get all of them to work?
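(One workaround sometimes used when an adapter targets modules this patch does not handle is to merge the adapter into the base weights with peft and then serve the merged model with stock vLLM. A sketch under that assumption, with hypothetical paths:)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, attach the trained adapter, and fold the LoRA
# deltas into the base weights so no adapter support is needed at serving time.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
merged = PeftModel.from_pretrained(base, "path/to/qlora-adapter").merge_and_unload()

# Save the merged weights (plus tokenizer), then point vLLM at this directory:
#   llm = LLM(model="path/to/merged-model")
merged.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("path/to/merged-model")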

@admangan

Do you have to pull down the Llama 2 commits and merge them in yourself in the meantime to get this working with Llama 2 models?

@Rannichan

It seems that this works only for single-GPU inference and does not support tensor parallelism. Could that be supported in the future, or is there a quick way to make it work with Ray?

@SuperBruceJia commented Nov 21, 2023

edbeeching/opt-125m-imdb-lora

Thank you very much for your excellent work! It really helps.

There is an error on my side:

File "/projectnb/pnn/test_2/IntuitLLMProject/lib/data_manager.py", line 135, in d_eval_g_data_loader
  lora.LoRAModel.from_pretrained(pipe.llm_engine.workers[0].model, g_saver_dir + '/adapter')
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
  cls.load_adapter(layers, config)
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
  new_model = VllmLoRA(
File "/projectnb/pnn/test_2/vllm/vllm/model_executor/adapters/lora.py", line 34, in __init__
  self.active_adapter = adapter_name
File "/usr4/ec523/brucejia/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1754, in __setattr__
  super().__setattr__(name, value)
AttributeError: can't set attribute

My adapter_config.json file is as follows:

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "meta-llama/Llama-2-7b-hf",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "q_proj"
  ],
  "task_type": "CAUSAL_LM"
}

Thank you very much in advance!

@SuperBruceJia

(Quoting my previous comment above about the `AttributeError: can't set attribute` error.)

Please note that this problem can be solved by commenting out this line:

# self.active_adapter = adapter_name
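(This most likely fails because newer peft versions expose active_adapter as a read-only property on the tuner layer, so a plain attribute assignment is rejected. A minimal illustration of the failure mode, not the actual peft code:)

class TunerLayer:
    _active_adapter = "default"

    @property
    def active_adapter(self):        # read-only: no setter defined
        return self._active_adapter

layer = TunerLayer()
layer.active_adapter = "adapter"     # raises AttributeError: can't set attribute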

And thank you very much again for your excellent work! @mymusise

@SuperBruceJia

@Saiteja-Tallam-Infrrd what fork are you referring to? I pip installed the troph-team:support_peft fork on the support_peft branch and got the same error as @FarziBuilder when trying to run from vllm.model_executor.adapters import lora : There is no module named 'vllm.model_executor.adapters'

git clone --branch support_peft https://github.com/troph-team/vllm.git

@corbt commented Dec 16, 2023

Note for anyone else watching this issue who missed the news: there's an active PR to vLLM adding most of the tricks from the S-LoRA paper, which is a very elegant way of serving up to thousands of LoRAs simultaneously! #1804
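For readers arriving after that work landed, the multi-LoRA API that eventually shipped in vLLM is used roughly like this (a sketch based on the vLLM docs; exact arguments may vary by version, and the model/adapter paths are illustrative):

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True lets the engine load and swap adapters at request time
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0, max_tokens=32),
    # adapter name, unique integer id, and local path of the saved adapter
    lora_request=LoRARequest("my_adapter", 1, "path/to/adapter"),
)
print(outputs[0].outputs[0].text)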

@FarziBuilder commented Dec 16, 2023 via email

@SuperBruceJia

Note that for anyone else watching this issue who missed the news, there's an active PR in to vLLM to add most of the tricks from the S-LoRA paper, which is a very elegant way of serving up to thousands of LoRAs simultaneously! #1804

@corbt That sounds great! Thank you so much for the update!

@oushu1zhangxiangxuan1 (Contributor)

Great work, I'm waiting for this feature. When will this PR be merged?

@callanwu

mark

@allzero-kwon commented Dec 28, 2023

mark. Is there any merged PR for LoRA that supports target modules including linear layers (o_proj, lm_head, etc.)?

@echo669 commented Dec 28, 2023

Traceback (most recent call last):
File "../lora_inference.py", line 62, in
  lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, "....")
File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 62, in from_pretrained
  cls.load_adapter(layers, config)
File ".../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 70, in load_adapter
  new_model = VllmLoRA(
File "../python3.10/site-packages/vllm/model_executor/adapters/lora.py", line 27, in __init__
  ColumnParallelLinear.__init__(self, input_size, output_size, *args, **kwargs)
File ".../python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
  self.weight = Parameter(torch.empty(
File "../python3.10/site-packages/torch/nn/modules/module.py", line 1712, in __setattr__
  self.register_parameter(name, value)
File "../python3.10/site-packages/torch/nn/modules/module.py", line 577, in register_parameter
  elif hasattr(self, name) and name not in self._parameters:
File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
  weight = base_layer.weight
File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
  weight = base_layer.weight
File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 355, in weight
  weight = base_layer.weight
[Previous line repeated 984 more times]
File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 349, in weight
  base_layer = self.get_base_layer()
File "../python3.10/site-packages/peft/tuners/tuners_utils.py", line 338, in get_base_layer
  while hasattr(base_layer, "base_layer"):
File "../python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
  raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
RecursionError: maximum recursion depth exceeded while calling a Python object

Does anyone have the same error as me?

@SuperBruceJia

(Quoting @echo669's RecursionError traceback above.)

Please check my solution: https://github.com/SuperBruceJia/vllm

git clone --branch support_peft https://github.com/SuperBruceJia/vllm.git
cd vllm
pip install -e . --user

Special Notice:

1. Only supports target_modules=["q_proj", "k_proj", "v_proj"]
2. Only supports inference on a single GPU

Please let me know if you have any questions!

Best regards,

Shuyue
Dec. 30th, 2023

@Senna1960321

@mymusise
Thank you for your code. I can now load the LoRA adapter generated from fine-tuning the Llama-2-7b-chat-hf (bin) model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I consistently get the correct content. Have you encountered this situation before?

@xiaobo-Chen

@SuperBruceJia, does your solution support the ChatGLM2 model? If I want to use ChatGLM2, which code should I modify? I presume I need to add a MODEL_LAYER_MAPPING entry in mapping.py, yet the layer names differ from those of Llama, and it seems the code does not adapt to that structure. I greatly appreciate your assistance.

@SuperBruceJia commented Jan 17, 2024

@SuperBruceJia, does your solution accommodate the ChatGLM2 model? If I intend to utilize ChatGLM2, which codes should I modify? I presume I need to add a MODEL_LAYER_MAPPING in mapping.py, yet the layer names differ from those of Llama,and It seems that the code does not adapt to the structure .I greatly appreciate your assistance

I think you could, but you need a LoRA adapter for the ChatGLM2 model.

First, add a LoRA adapter to your base ChatGLM2 model:

from peft import LoraConfig, TaskType
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)

model = AutoModelForCausalLM.from_pretrained("THUDM/chatglm2-6b")

# Example LoRA hyperparameters (adjust to your needs)
lora_r = 64
lora_alpha = 16
lora_dropout = 0.1

lora_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    # Note: ChatGLM2's attention layers may use different module names than
    # LLaMA's q_proj/k_proj/v_proj; adjust target_modules accordingly.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
)
model.add_adapter(lora_config, adapter_name="adapter")
model.enable_adapters()

After attaching the adapter (and likely going through several rounds of training), you need to save it to a folder in your local directory.

trainer.train()  # Train the adapter
trainer.model.save_pretrained(save_path)  # Only the adapter will be saved.
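(After save_pretrained, the adapter directory typically contains only the adapter config and adapter weights; a quick sanity check, assuming the save_path from above:)

import os

# Expect something like ['adapter_config.json', 'adapter_model.safetensors'] (or adapter_model.bin)
print(os.listdir(save_path))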

Afterwards, you can load the base model + adapter using vLLM:

from vllm import LLM, SamplingParams
from vllm.model_executor.adapters import lora

# Create an LLM.
llm = LLM(model="THUDM/chatglm2-6b", gpu_memory_utilization=0.85)

# Add LoRA adapter
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, save_path)  # the adapter directory saved above

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0, top_k=-1)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

If you have any further questions, please let me know.

Best regards,

Shuyue
Jan. 16th, 2024

@SuperBruceJia

@SuperBruceJia, does your solution accommodate the ChatGLM2 model? If I intend to utilize ChatGLM2, which codes should I modify? I presume I need to add a MODEL_LAYER_MAPPING in mapping.py, yet the layer names differ from those of Llama,and It seems that the code does not adapt to the structure .I greatly appreciate your assistance

Please take a look at the fine-tuning code for the LLaMA 2 (7B) model.
Main execution file: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/main.py
Model loader: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/model_loader.py#L98-L108
Evaluation: https://github.com/SuperBruceJia/MetaMath-Fine-Tune-with-LoRA/blob/main/lib/evaluation.py#L129-L137

Best regards,

Shuyue
Jan. 16th, 2024

@Senna1960321

@SuperBruceJia Hello, I can now load the LoRA adapter generated from fine-tuning the Llama-2-7b-chat-hf (bin) model. I've noticed that it performs quite consistently on generative tasks, but on detailed inference tasks its outputs are unstable and the error rate is relatively high. If I don't load it through vLLM, I consistently get the correct content. Have you encountered this situation before?

@SuperBruceJia

@SuperBruceJia Hello, I can now load the LoRa generated from fine-tuning the LLama2-7b-chat-hf bin model. I've noticed that it performs quite consistently in generative tasks, but when it comes to detailed inference tasks, its output results are unstable, and the error rate is relatively high. If I don't load it through vllm, I can consistently infer the correct content. Have you encountered this situation before?

When I ran inference with the fine-tuned pre-trained model itself, the generations were worse. However, performance was pretty good with the setup of a fixed pre-trained model plus a trained LoRA adapter.

Like this:

llama_path = "YOUR_LLAMA_MODEL_PATH"  # The original pre-trained model is not fine-tuned
adapter_path = "YOUR_SAVED_ADAPTER_PATH"  # Only the LoRA adapter is fine-tuned
llm = LLM(model=llama_path, tensor_parallel_size=1, gpu_memory_utilization=0.85)
lora.LoRAModel.from_pretrained(llm.llm_engine.workers[0].model, adapter_path) 

Please inform me if you have found a solution.

Best regards,

Shuyue
Jan. 22nd, 2024

@Senna1960321

(Quoting @SuperBruceJia's reply above.)

@SuperBruceJia Thank you for your code. This approach makes the responses better, but they are still worse than without vLLM.

@zhuohan123 (Member)

Solved by #1804

@zhuohan123 zhuohan123 closed this Feb 16, 2024
@meiru-cam commented Mar 21, 2024

Hi @SuperBruceJia, thank you for providing the LoRA support code. I tried to install from source but got an error related to pyproject. Do you have any idea how to fix this?

[screenshot of the pyproject build error]

@SuperBruceJia

Hi @SuperBruceJia, Thank you for providing the lora support code. I tried to install from source. But received error with pyproject. Do you have any idea on how to fix this?


Sorry, I haven't run into this issue myself.

It seems that the issue is related to the version of CUDA being used, as described here.

@meiru-cam

@SuperBruceJia Thanks. May I ask which CUDA version is applicable? I tried CUDA 12.1, which did not work, as well as CUDA 11.8.

@SuperBruceJia

@SuperBruceJia Thanks. May I ask the CUDA version that is applicable? I tried CUDA12.1 which is not working, as well as CUDA11.8

Please open your terminal and check the CUDA version on your machine via the nvidia-smi command.
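(It can also help to check which CUDA toolkit your PyTorch build was compiled against, since nvidia-smi only reports the driver's CUDA version; a quick check:)

import torch

print(torch.version.cuda)         # CUDA toolkit version PyTorch was built with
print(torch.cuda.is_available())  # whether a usable GPU is visible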

yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
WAIT UNTIL UPSTREAM SYNC LANDS TO MERGE

SUMMARY:
* refactored lm-eval workflows to use a single script for generating a
baseline
* refactored lm-eval workflows to accept a config file so we can
parameterize for the different length runs
* added configuration for `remote-push` -> running `llama-3-8b` on 250
GSM prompts
* removed lm-eval-smoke such that we have one single pathway for running
lm-eval tests
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Sep 30, 2024
vllm-project#289)

Re-implements following PRs for current habana_main:
HabanaAI#102 (Removing div_i32
operations from each layer)
HabanaAI#115 (removing scatter for
reshape&cache in case of prompt)

Accuracy (GSM8K on Llama3.1-8B-Instruct):
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
| gsm8k_cot_llama | 3 | flexible-extract | 8 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 8 | exact_match | ↑ | 0.8400 | ± | 0.0101 |

I've benchmarked this change on Llama3.1-8B-Instruct and on average,
+2.50% throughput gain (+558.14 tok/s, ~21594 tok/s -> ~22152 tok/s) can
be observed across all prefill buckets on G2, with up to +4.40% (+956.79
tok/s, ~25031 -> ~25988 tok/s) throughput increase in compute-bound
scenarios.
billishyahao pushed a commit to billishyahao/vllm that referenced this pull request Dec 31, 2024
…upport all vllm args (vllm-project#289)

* Added --output-json parameter in the P3l script. Using arg_utils to support all vllm args

* Description