[BUG] {'use_reentrant': True} results in "Gradients will be None" #490

Open

RonanKMcGovern opened this issue Nov 20, 2024 · 0 comments

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

Error:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[20], line 1
----> 1 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2141, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2139         hf_hub_utils.enable_progress_bars()
   2140 else:
-> 2141     return inner_training_loop(
   2142         args=args,
   2143         resume_from_checkpoint=resume_from_checkpoint,
   2144         trial=trial,
   2145         ignore_keys_for_eval=ignore_keys_for_eval,
   2146     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2497, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2491 context = (
   2492     functools.partial(self.accelerator.no_sync, model=model)
   2493     if i != len(batch_samples) - 1
   2494     else contextlib.nullcontext
   2495 )
   2496 with context():
-> 2497     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
   2499 if (
   2500     args.logging_nan_inf_filter
   2501     and not is_torch_xla_available()
   2502     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2503 ):
   2504     # if loss is nan or inf simply add the average of previous logged losses
   2505     tr_loss = tr_loss + tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3659, in Trainer.training_step(***failed resolving arguments***)
   3657         scaled_loss.backward()
   3658 else:
-> 3659     self.accelerator.backward(loss, **kwargs)
   3660     # Finally we need to normalize the loss for reporting
   3661     if num_items_in_batch is None:

File /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:2241, in Accelerator.backward(self, loss, **kwargs)
   2239     self.lomo_backward(loss, learning_rate)
   2240 else:
-> 2241     loss.backward(**kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Reproduction:

import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor, TrainingArguments, Trainer
from qwen_vl_utils import process_vision_info
# del model

model_id = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16, #float16 for colab.
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # Rank (usually 8, 16, or 32 depending on model size and needs)
    lora_alpha=16,         # Scaling factor for the low-rank updates
    use_rslora=True,
    # target_modules="all-linear", # causes issues with Qwen
    target_modules=["q_proj","k_proj","v_proj","o_proj","up_proj","down_proj","gate_proj","fc1","fc2","qkv"],
    # modules_to_save=["lm_head","embed_tokens"],
    lora_dropout=0.1,      # Dropout for low-rank adapter layers
    bias="none",           # Bias in adapter layers: "none", "all", or "lora_only"
    task_type="CAUSAL_LM"  # Task type: "CAUSAL_LM", "SEQ_2_SEQ_LM", or "TOKEN_CLS"
)

from peft import get_peft_model
model=get_peft_model(model,lora_config)

# epochs, lr, schedule, run_name, data_collator, train_dataset and eval_dataset
# are defined earlier in the notebook and omitted here for brevity.
training_args = TrainingArguments(
    # max_steps=1,
    num_train_epochs=epochs,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    # warmup_steps=50, #comment in only if you have a lot more than 50 samples.
    learning_rate=lr,
    weight_decay=0.01,
    logging_steps=0.1,
    output_dir="fine-tuned-model",
    eval_strategy="steps",
    eval_steps=0.2,
    lr_scheduler_type=schedule,
    # save_strategy="steps",
    # save_steps=250,
    # save_total_limit=1,
    # fp16=True, #if using Colab, but then you need to use bitsandbytes quantization too.
    bf16=True,
    hub_model_id="Trelis/Qwen-2B-chess",
    remove_unused_columns=False,
    report_to="tensorboard",
    run_name=run_name,
    logging_dir=f"./logs/{run_name}",
    gradient_checkpointing=True, #should reduce VRAM requirements a lot
    gradient_checkpointing_kwargs={'use_reentrant': True}  # this setting triggers the error shown above
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)

trainer.train()

Expected Behavior

If I switch use_reentrant to False, training runs without error, but it is slow.
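
For reference, a possible workaround (not verified here) that keeps use_reentrant=True is to make the checkpointed inputs require gradients before training, e.g. via enable_input_require_grads(). The sketch below assumes the same model and lora_config as in the reproduction above; it is not a confirmed fix for this issue:

# Hedged sketch: keep use_reentrant=True while ensuring the checkpointed
# inputs carry requires_grad=True, so reentrant checkpointing can build a graph.
# Assumes model and lora_config are created as in the reproduction above.
from transformers import TrainingArguments
from peft import get_peft_model

model = get_peft_model(model, lora_config)

# With a frozen base model, the embedding outputs do not require grad, which is
# what triggers the "Gradients will be None" warning under reentrant
# checkpointing. enable_input_require_grads() registers a forward hook that
# flips requires_grad on the embedding outputs.
model.enable_input_require_grads()

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)

Alternatively, use_reentrant=False (as noted above) avoids the input-grad requirement, at the cost of the slower non-reentrant path observed here.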

Steps To Reproduce

See above.

Environment

- OS: Ubuntu 20.04
- Python: 3.10.12
- Transformers: 4.47.0.dev0
- Torch:  2.1.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response
