LoRA is incompatible with DeepSpeed ZeRO3 #24445
Comments
Hello, please refer to this doc for the correct way of using PEFT + DeepSpeed: https://huggingface.co/docs/peft/accelerate/deepspeed-zero3-offload
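For reference, the pattern in that doc amounts to building the PEFT model first and letting Accelerate/DeepSpeed partition it in `prepare()`. A minimal sketch of that pattern (my paraphrase, not code copied from the doc; the model name, data, and hyperparameters are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Run via `accelerate launch` with a DeepSpeed ZeRO-3 config chosen in
# `accelerate config`.
accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
vocab_size = model.config.vocab_size

# Wrap with LoRA *before* anything is handed to DeepSpeed.
model = get_peft_model(
    model,
    LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.0),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
ids = torch.randint(0, vocab_size, (8, 16))
loader = DataLoader(TensorDataset(ids), batch_size=2)

# With ZeRO-3, parameter partitioning happens inside prepare(), i.e. only
# after the LoRA wrapping is already complete.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```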
Thank you for your response! I note that this doc is based on …
The following steps work for me: …

A few important notes: …
Thanks! And I would imagine you launch with …

Yes, I launch it with …
@1ytic Very useful explanation! Could you offer an example of how to implement this quick workaround? Thanks.
@1ytic I am getting this error while running LoRA with ZeRO-3 DeepSpeed. Can you please explain this more clearly? [Traceback below deduplicated; the scraped log interleaved frames from multiple ranks, and the final exception message did not survive.]

```
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1923, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 141, in backward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1155, in all_gather_coalesced
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
```
Could you explain a bit more on …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello! I'm facing the same issue with … and eventually …

I'm using

```python
trainer = Trainer(
    ...,
    strategy=DeepSpeedStrategy(stage=3),
)
```
```python
class Module(LightningModule):
    def configure_model(self) -> None:
        deepspeed_config = self.trainer.strategy.config
        self.dschf = HfDeepSpeedConfig(deepspeed_config)
        model = AutoModelForCausalLM.from_pretrained(...)
        model = get_peft_model(
            model,
            LoraConfig(
                task_type=TaskType.CAUSAL_LM,
                inference_mode=False,
                target_modules=target_modules,
                r=48,
                lora_alpha=16,
                lora_dropout=0.0,
            ),
        )
```
Have you found a decent solution to this? I ran into the same situation using transformers.Trainer + LoRA + DeepSpeed to finetune a CausalLM. Since the model is partitioned before …
System Info
pytorch==2.0.0, transformers==4.28.0, peft==0.2.0
When I use LoRA to wrap the model in `__init__` and enable DeepSpeed ZeRO3, I get the following errors. It seems that DeepSpeed begins to partition parameters before `PeftModelForCausalLM` finishes its `__init__`, so it cannot get the attribute `base_model`. It is also notable that this error leads to infinite recursion: `PeftModel` catches the `AttributeError` raised when looking up `base_model`, but since the attribute does not exist, the fallback lookup raises the same `AttributeError` again and again (see the simplified sketch below).
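For context, the attribute forwarding in `peft` looks roughly like the following (a simplified paraphrase, not the exact source), which is why a missing `base_model` recurses instead of failing cleanly:

```python
import torch

# Simplified sketch of PeftModel's attribute forwarding (paraphrased from the
# behavior described above, not the exact peft source).
class PeftModel(torch.nn.Module):
    def __getattr__(self, name):
        try:
            # Defer to nn.Module's normal attribute lookup first.
            return super().__getattr__(name)
        except AttributeError:
            # Fall back to the wrapped model. If this runs before __init__
            # has assigned base_model, the access to self.base_model lands
            # back in __getattr__, recursing without bound.
            return getattr(self.base_model, name)
```

Who can help?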
@pacman100
Information
Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Environment:
pytorch==2.0.0, transformers==4.28.0, peft==0.2.0

Slurm launch command:

```bash
srun --gres=gpu:8 --ntasks=8 --ntasks-per-node=8 --cpus-per-task=8 python -u bug_unit_test.py --output_dir ./outputs/debug --deepspeed ./configs/default_offload_opt_param_zero3.json
```
DeepSpeed config to reproduce:
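The config block itself did not survive the scrape. Judging from the filename (`default_offload_opt_param_zero3.json`), it was presumably ZeRO-3 with optimizer and parameter offload; a representative example (an assumption modeled on the standard HF Trainer ZeRO-3 offload config, not the author's actual file) would look like:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "fp16": { "enabled": "auto" },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```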
Code to reproduce:
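The script was also lost in the scrape. Based on the description above (LoRA applied during `__init__` with ZeRO-3 enabled through the HF Trainer), a minimal sketch of what `bug_unit_test.py` plausibly looked like — the model checkpoint, dataset, and shapes are illustrative placeholders:

```python
# Hypothetical reconstruction of bug_unit_test.py; the original script was
# not captured.
import torch
from torch.utils.data import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, HfArgumentParser, Trainer, TrainingArguments


class DummyDataset(Dataset):
    # Tiny constant dataset, just enough to drive a training step.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        ids = torch.arange(16)
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids}


def main():
    # Parsing --deepspeed <zero3 config> here makes the later
    # from_pretrained() run under deepspeed.zero.Init, so parameters are
    # partitioned immediately.
    (training_args,) = HfArgumentParser(TrainingArguments).parse_args_into_dataclasses()
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    # Wrapping with LoRA at this point triggers the AttributeError /
    # infinite recursion on `base_model` described above.
    model = get_peft_model(
        model,
        LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.0),
    )
    Trainer(model=model, args=training_args, train_dataset=DummyDataset()).train()


if __name__ == "__main__":
    main()
```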
Expected behavior
I expect to wrap the model with LoRA during `__init__` successfully when I enable ZeRO3.