Describe the Question

When I use peft to train bloom, everything is OK. If I turn off use_peft with the following script, the loss drops to zero (for both llama and bloom):

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 pretraining.py \
    --model_type bloom \
    --model_name_or_path /home/bmb/models/bigscience/bloom-560m \
    --train_file_dir ../data/pretrain \
    --validation_file_dir ../data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --bf16 True \
    --tf32 True \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --logging_strategy steps \
    --logging_steps 1 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

Training log:

{'loss': 1.9695, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1e-05, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.999370638369377e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.997482711915926e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.994336695915041e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.989933382359423e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.984273879759713e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.977359612865424e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.969192322306271e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.959774064153977e-06, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 9.949107209404664e-06, 'epoch': 0.04}
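Not part of the original report, but a small debugging aid for catching this collapse on the first bad step: a Trainer callback that stops training as soon as a logged loss of 0.0 appears. The class name StopOnZeroLoss and the assumption that the Trainer built in pretraining.py accepts extra callbacks are mine.

from transformers import TrainerCallback

class StopOnZeroLoss(TrainerCallback):
    """Stop training as soon as the logged training loss collapses to 0.0."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and logs.get("loss") == 0.0:
            print(f"Loss hit 0.0 at global step {state.global_step}; stopping early.")
            control.should_training_stop = True
        return control

# Hypothetical usage inside the training script:
# trainer = Trainer(..., callbacks=[StopOnZeroLoss()])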
I use an RTX 4090. If I use float16, I get the error "ValueError: Attempting to unscale FP16 gradients.", so I changed the parameters to --bf16 True and --tf32 True.
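For context (this is my own note, not from the thread): the unscale error is what PyTorch's AMP GradScaler raises when the gradients it is asked to unscale are themselves stored in float16, which happens if the weights are loaded in float16 (e.g. --torch_dtype float16) while fp16 mixed precision is enabled on top. A minimal, self-contained sketch of that failure mode, independent of pretraining.py:

import torch

# Toy model whose master weights are float16, mimicking a model loaded with
# torch_dtype=float16 and then trained with fp16 mixed precision.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
with torch.cuda.amp.autocast():
    loss = model(x).float().pow(2).mean()

scaler.scale(loss).backward()   # gradients are float16 because the parameters are
scaler.unscale_(optimizer)      # raises: ValueError: Attempting to unscale FP16 gradients.

Keeping the master weights in float32 and letting autocast handle the half-precision compute avoids this particular error.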
fp16 and bf16 may give a loss of 0; please set float32, or you can try setting batch_size=1. Refer to lm-sys/FastChat#1339.
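A quick way to sanity-check the float32 suggestion before relaunching a full run (my own sketch, not the repository's code; the single forward pass below is only an assumption about how to verify that a float32 load yields a finite, non-zero loss):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/bmb/models/bigscience/bloom-560m"  # same path as in the command above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32).cuda()

# One causal-LM step: labels == input_ids, so the model returns its own loss.
batch = tokenizer("test sentence for a single forward pass", return_tensors="pt").to("cuda")
loss = model(**batch, labels=batch["input_ids"]).loss
print(loss.item())  # expected: a finite positive value, not 0.0 or nan

In the launch command this would presumably correspond to dropping --bf16 True and --tf32 True and changing --torch_dtype float16 to float32, if the script accepts that value.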