Is there an existing issue / discussion for this?
I have searched the existing issues / discussions
Is there an existing answer for this in the FAQ?
I have searched the FAQ
Current Behavior
Error:
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[20], line 1
----> 1 trainer.train()
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2141, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
2139 hf_hub_utils.enable_progress_bars()
2140 else:
-> 2141 return inner_training_loop(
2142 args=args,
2143 resume_from_checkpoint=resume_from_checkpoint,
2144 trial=trial,
2145 ignore_keys_for_eval=ignore_keys_for_eval,
2146 )
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2497, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2491 context = (
2492 functools.partial(self.accelerator.no_sync, model=model)
2493 if i != len(batch_samples) - 1
2494 else contextlib.nullcontext
2495 )
2496 with context():
-> 2497 tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
2499 if (
2500 args.logging_nan_inf_filter
2501 and not is_torch_xla_available()
2502 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2503 ):
2504 # if loss is nan or inf simply add the average of previous logged losses
2505 tr_loss = tr_loss + tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3659, in Trainer.training_step(***failed resolving arguments***)
3657 scaled_loss.backward()
3658 else:
-> 3659 self.accelerator.backward(loss, **kwargs)
3660 # Finally we need to normalize the loss for reporting
3661 if num_items_in_batch is None:
File /usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:2241, in Accelerator.backward(self, loss, **kwargs)
2239 self.lomo_backward(loss, learning_rate)
2240 else:
-> 2241 loss.backward(**kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/_tensor.py:492, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
482 if has_torch_function_unary(self):
483 return handle_torch_function(
484 Tensor.backward,
485 (self,),
(...)
490 inputs=inputs,
491 )
--> 492 torch.autograd.backward(
493 self, gradient, retain_graph, create_graph, inputs=inputs
494 )
File /usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:251, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
246 retain_graph = create_graph
248 # The reason we repeat the same comment below is that
249 # some Python versions print out the first line of a multi-line function
250 # calls in the traceback and some print out the last line
--> 251 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
252 tensors,
253 grad_tensors_,
254 retain_graph,
255 create_graph,
256 inputs,
257 allow_unreachable=True,
258 accumulate_grad=True,
259 )
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
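For context, my reading of the trace (not a confirmed diagnosis): with reentrant checkpointing, autograd only sees the tensors that are passed into the checkpointed block, not the trainable LoRA parameters inside it. Since the frozen base model's embedding output does not require grad, the checkpointed output gets no grad_fn and backward() fails. A minimal sketch of that mechanism, independent of the Qwen2-VL code:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)   # has trainable params (stand-in for the LoRA adapters)
x = torch.randn(1, 4)           # does not require grad (stand-in for the frozen embedding output)

out = checkpoint(layer, x, use_reentrant=True)
# UserWarning: None of the inputs have requires_grad=True. Gradients will be None
out.sum().backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```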
Reproduction:
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor, TrainingArguments, Trainer
from qwen_vl_utils import process_vision_info
# del model
model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
"Qwen/Qwen2-VL-2B-Instruct",
torch_dtype=torch.bfloat16, #float16 for colab.
attn_implementation="flash_attention_2",
device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
from peft import LoraConfig
lora_config = LoraConfig(
r=32, # Rank (usually 8, 16, or 32 depending on model size and needs)
lora_alpha=16, # Scaling factor for the low-rank updates
use_rslora=True,
# target_modules="all-linear", # causes issues with Qwen
target_modules=["q_proj","k_proj","v_proj","o_proj","up_proj","down_proj","gate_proj","fc1","fc2","qkv"],
# modules_to_save=["lm_head","embed_tokens"],
lora_dropout=0.1, # Dropout for low-rank adapter layers
bias="none", # Bias in adapter layers: "none", "all", or "lora_only"
task_type="CAUSAL_LM" # Task type: "CAUSAL_LM", "SEQ_2_SEQ_LM", or "TOKEN_CLS"
)
from peft import get_peft_model
model = get_peft_model(model, lora_config)
# epochs, lr, schedule, run_name, data_collator, train_dataset and eval_dataset are defined earlier in the notebook
training_args = TrainingArguments(
# max_steps=1,
num_train_epochs=epochs,
per_device_train_batch_size=1,
per_device_eval_batch_size=1,
gradient_accumulation_steps=1,
# warmup_steps=50, #comment in only if you have a lot more than 50 samples.
learning_rate=lr,
weight_decay=0.01,
logging_steps=0.1,
output_dir="fine-tuned-model",
eval_strategy="steps",
eval_steps=0.2,
lr_scheduler_type=schedule,
# save_strategy="steps",
# save_steps=250,
# save_total_limit=1,
# fp16=True, #if using Colab, but then you need to use bitsandbytes quantization too.
bf16=True,
hub_model_id="Trelis/Qwen-2B-chess",
remove_unused_columns=False,
report_to="tensorboard",
run_name=run_name,
logging_dir=f"./logs/{run_name}",
gradient_checkpointing=True, #should reduce VRAM requirements a lot
gradient_checkpointing_kwargs={'use_reentrant':True}
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=eval_dataset, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)
trainer.train()
Expected Behavior
If I switch use_reentrant to False, training runs without error, but it is slow.
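For reference, sketches of the two mitigations I'm aware of, applied to the script above (the enable_input_require_grads() route is my assumption of the usual PEFT + gradient-checkpointing fix, not something I have verified for Qwen2-VL):

```python
# Mitigation A (assumed, not verified here): keep use_reentrant=True but make the
# frozen embedding outputs require grad, so the reentrant-checkpointed blocks get a
# grad_fn. Call on the base model before get_peft_model(...):
model.enable_input_require_grads()

# Mitigation B (runs for me, but slowly): non-reentrant checkpointing.
# In TrainingArguments, replace the gradient_checkpointing_kwargs line with:
#     gradient_checkpointing_kwargs={"use_reentrant": False},
```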
Steps To Reproduce
See above.
Environment
Anything else?
No response