[BUG] {'use_reentrant': True} results in "Gradients will be None" #490

Open

RonanKMcGovern opened this issue Nov 20, 2024 · 0 comments

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

Error:

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[20], line 1
----> 1 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2141, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   2139         hf_hub_utils.enable_progress_bars()
   2140 else:
-> 2141     return inner_training_loop(
   2142         args=args,
   2143         resume_from_checkpoint=resume_from_checkpoint,
   2144         trial=trial,
   2145         ignore_keys_for_eval=ignore_keys_for_eval,
   2146     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2497, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2491 context = (
   2492     functools.partial(self.accelerator.no_sync, model=model)
   2493     if i != len(batch_samples) - 1
   2494     else contextlib.nullcontext
   2495 )
   2496 with context():
-> 2497     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
   2499 if (
   2500     args.logging_nan_inf_filter
   2501     and not is_torch_xla_available()
   2502     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2503 ):
   2504     # if loss is nan or inf simply add the average of previous logged losses
   2505     tr_loss = tr_loss + tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3659, in Trainer.training_step(***failed resolving arguments***)
   3657         scaled_loss.backward()
   3658 else:
-> 3659     self.accelerator.backward(loss, **kwargs)
   3660     # Finally we need to normalize the loss for reporting
   3661     if num_items_in_batch is None:

File /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:2241, in Accelerator.backward(self, loss, **kwargs)
   2239     self.lomo_backward(loss, learning_rate)
   2240 else:
-> 2241     loss.backward(**kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
    482 if has_torch_function_unary(self):
    483     return handle_torch_function(
    484         Tensor.backward,
    485         (self,),
   (...)
    490         inputs=inputs,
    491     )
--> 492 torch.autograd.backward(
    493     self, gradient, retain_graph, create_graph, inputs=inputs
    494 )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    246     retain_graph = create_graph
    248 # The reason we repeat the same comment below is that
    249 # some Python versions print out the first line of a multi-line function
    250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    252     tensors,
    253     grad_tensors_,
    254     retain_graph,
    255     create_graph,
    256     inputs,
    257     allow_unreachable=True,
    258     accumulate_grad=True,
    259 )

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Reproduction:

import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor, TrainingArguments, Trainer
from qwen_vl_utils import process_vision_info
# del model

model_id = "Qwen/Qwen2-VL-2B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16, #float16 for colab.
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_id)

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # Rank (usually 8, 16, or 32 depending on model size and needs)
    lora_alpha=16,         # Scaling factor for the low-rank updates
    use_rslora=True,
    # target_modules="all-linear", # causes issues with Qwen
    target_modules=["q_proj","k_proj","v_proj","o_proj","up_proj","down_proj","gate_proj","fc1","fc2","qkv"],
    # modules_to_save=["lm_head","embed_tokens"],
    lora_dropout=0.1,      # Dropout for low-rank adapter layers
    bias="none",           # Bias in adapter layers: "none", "all", or "lora_only"
    task_type="CAUSAL_LM"  # Task type: "CAUSAL_LM", "SEQ_2_SEQ_LM", or "TOKEN_CLS"
)

from peft import get_peft_model
model=get_peft_model(model,lora_config)

# epochs, lr, schedule, run_name, data_collator, train_dataset and eval_dataset
# are defined earlier in the notebook and omitted here for brevity.
training_args = TrainingArguments(
    # max_steps=1,
    num_train_epochs=epochs,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    # warmup_steps=50, #comment in only if you have a lot more than 50 samples.
    learning_rate=lr,
    weight_decay=0.01,
    logging_steps=0.1,
    output_dir="fine-tuned-model",
    eval_strategy="steps",
    eval_steps=0.2,
    lr_scheduler_type=schedule,
    # save_strategy="steps",
    # save_steps=250,
    # save_total_limit=1,
    # fp16=True, #if using Colab, but then you need to use bitsandbytes quantization too.
    bf16=True,
    hub_model_id="Trelis/Qwen-2B-chess",
    remove_unused_columns=False,
    report_to="tensorboard",
    run_name=run_name,
    logging_dir=f"./logs/{run_name}",
    gradient_checkpointing=True, #should reduce VRAM requirements a lot
    gradient_checkpointing_kwargs={'use_reentrant': True}  # this setting triggers the error shown above
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)

trainer.train()

Expected Behavior

If I switch use_reentrant to False, training runs without error, but it is slow.
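
For reference, a possible workaround (not verified here) that keeps use_reentrant=True is to make the checkpointed inputs require gradients before training, e.g. via enable_input_require_grads(). The sketch below assumes the same model and lora_config as in the reproduction above; it is not a confirmed fix for this issue:

# Hedged sketch: keep use_reentrant=True while ensuring the checkpointed
# inputs carry requires_grad=True, so reentrant checkpointing can build a graph.
# Assumes model and lora_config are created as in the reproduction above.
from transformers import TrainingArguments
from peft import get_peft_model

model = get_peft_model(model, lora_config)

# With a frozen base model, the embedding outputs do not require grad, which is
# what triggers the "Gradients will be None" warning under reentrant
# checkpointing. enable_input_require_grads() registers a forward hook that
# flips requires_grad on the embedding outputs.
model.enable_input_require_grads()

training_args = TrainingArguments(
    output_dir="fine-tuned-model",
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},
)

Alternatively, use_reentrant=False (as noted above) avoids the input-grad requirement, at the cost of the slower non-reentrant path observed here.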

Steps To Reproduce

See above.

Environment

- OS: Ubuntu 20.04
- Python: 3.10.12
- Transformers: 4.47.0.dev0
- Torch:  2.1.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response
