FlanT5 training and zero tensors #1339

Open
GenVr opened this issue May 19, 2023 · 13 comments
@GenVr

GenVr commented May 19, 2023

Hi, I'm training a FlanT5 model. The training completes successfully, but when I run a simple inference, the output is a tensor of zeros, so the prediction is empty.

Example:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False)
model = T5ForConditionalGeneration.from_pretrained(path, low_cpu_mem_usage=True, torch_dtype=torch.float16).cuda()

tokenized_text = tokenizer(query, return_tensors="pt")

source_ids = tokenized_text["input_ids"].to(device, dtype=torch.long)

generated_ids = model.generate(input_ids=source_ids)

Output:

tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       device='cuda:0')

I tried several training runs, both with flan-t5-xl and flan-t5-large, on both my personal dataset and the dummy.json dataset.

This is my training configuration:

!python3 -m torch.distributed.run --nproc_per_node=6 fastchat/train/train_flant5.py \
    --model_name_or_path google/flan-t5-xl \
    --data_path playground/data/dummy.json \
    --fp16 True \
    --output_dir ./output \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 99999 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp_transformer_layer_cls_to_wrap T5Block \
    --tf32 False \
    --fsdp "full_shard auto_wrap" \
    --model_max_length 256 \
    --gradient_checkpointing True \
    --preprocessed_path ./preprocessed_data/processed.json 

Any idea what's going on? Thank you.

@merrymercy
Member

cc @DachengLi1

@DachengLi1
Collaborator

@GenVr This is likely because PyTorch FSDP saves the T5 model incorrectly (if you print out the loaded model weights, the encoder or decoder embedding is likely all zeros, which causes the final predictions to be all 0). Can you try using our postprocessing function? There is another issue solving the same problem. Let me know if it works!
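
A minimal sketch of that check, assuming a standard Hugging Face T5 checkpoint at path (the attribute names below are the stock transformers T5 ones, not anything FastChat-specific):

import torch
from transformers import T5ForConditionalGeneration

# Load the checkpoint that FSDP saved (path as in the inference snippet above).
model = T5ForConditionalGeneration.from_pretrained(path)

# If the save went wrong, these embedding tables may be entirely zero.
print("shared embedding all zero: ", torch.all(model.shared.weight == 0).item())
print("encoder embedding all zero:", torch.all(model.encoder.embed_tokens.weight == 0).item())
print("decoder embedding all zero:", torch.all(model.decoder.embed_tokens.weight == 0).item())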

@GenVr
Author

GenVr commented May 22, 2023

@DachengLi1 Thanks, I trained on GPUs with more memory and used the postprocessing function after training. I can now load the model correctly. However, I have another problem: during training, both the loss and the learning rate are zero:

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.04}                              
...                         
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.32}                              
...                                                   
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.67}                              
...                        
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 1.02} 
...

It seems the network isn't learning anything. My configuration is in the initial post. I trained on both the dummy.json dataset and my personal one, with the same results. Do you have any idea about it? Thanks.

@DachengLi1
Collaborator

@GenVr I met a similar issue where the learning rate stays 0 on a small dataset. This is caused by some integer flooring behavior in Hugging Face transformers. Can you try warmup_ratio=0 (or omit this argument) and let me know what happens?
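
In terms of the command in the initial post, that means either deleting the --warmup_ratio 0.03 line entirely or replacing it with:

    --warmup_ratio 0 \

with the other flags left unchanged.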

@GenVr
Author

GenVr commented May 22, 2023

@DachengLi1 Thanks. I tried removing --warmup_ratio 0.03 and got this:

...                          
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.07}
...                                                                            
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.31}
...                                                   
{'loss': 0.0, 'learning_rate': 2e-05, 'epoch': 0.52}
...

Now the learning rate is nonzero, but the loss is always zero.
With a batch size of 1, the loss is sometimes nonzero.
I also tried changing the learning rate to 1e3, but after the first epoch the situation remains the same.

@DachengLi1
Collaborator

@GenVr Nice to hear that! Let's keep bs=1 for now; I will look into whether bs>1 can cause other problems (I haven't really tested bs>1 because of GPU memory limits). Can you try bs=1 on your own dataset? The dummy dataset is composed of very simple questions (if you look into it, a lot of them are very similar), so you probably want to see whether this still happens on a more complex dataset.

@DachengLi1
Collaborator

BTW, remember to change --preprocessed_path, otherwise it will read the previously cached data from that file.
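
For example, relative to the command in the initial post, point the flag at a fresh file (processed_v2.json is a hypothetical name) so the data is re-tokenized instead of being loaded from the old cache:

    --preprocessed_path ./preprocessed_data/processed_v2.json \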

@GenVr
Author

GenVr commented May 23, 2023

@DachengLi1 Thanks. I tried batch sizes both equal to 1 and greater than 1 with my personal dataset. The loss is always zero and the network seems to fail to train (the outputs look untrained). Maybe I could try a big public .json dataset to see what happens?

@DachengLi1
Collaborator

@GenVr Interesting... I haven't seen this before. Could you print an input/target tensor before it goes into the trainer to see what the contents are? Is the data being processed in the wrong way?
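
A minimal sketch of such a check, assuming the training script builds a PyTorch-style dataset of input_ids/labels dicts (train_dataset, tokenizer, and the field names are assumptions about that script, not its actual API):

# Hypothetical inspection snippet: print one preprocessed example before training.
sample = train_dataset[0]
input_ids = [int(t) for t in sample["input_ids"]]
labels = [int(t) for t in sample["labels"]]

print("input_ids:", input_ids)
print("labels:   ", labels)
print("decoded input: ", tokenizer.decode(input_ids, skip_special_tokens=True))
# In Hugging Face seq2seq training, label positions set to -100 are ignored by the loss,
# so a target that is (almost) all -100 contributes (almost) nothing to it.
print("decoded target:", tokenizer.decode([t for t in labels if t != -100], skip_special_tokens=True))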

@emnlpanon

Same problem (zero loss from start to finish), with both dummy.json and my own dataset.

@richagadgil

richagadgil commented Jul 10, 2023

Was this resolved? Same problem with a zero loss.

@leng-yue

same problem

@jxmorris12

same problem
