FlanT5 training and zero tensors #1339
cc @DachengLi1
@GenVr This is likely because PyTorch FSDP saves the T5 model incorrectly (if you print out the loaded model weights, the encoder or decoder embedding is likely all zeros, which causes the final predictions to be all zeros). Can you try using our postprocessing function? There is another open issue about the same problem. Let me know if it works!
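A minimal sketch of how one might check a saved checkpoint for this symptom (this is not FastChat's actual postprocessing function, and the checkpoint path is hypothetical):

```python
# Minimal check for zeroed-out embeddings in a saved T5 checkpoint.
# Assumes a standard Hugging Face checkpoint directory; the path is hypothetical.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("path/to/your/checkpoint")

# T5 ties the encoder/decoder token embeddings to model.shared; if FSDP saved
# the weights incorrectly, this tensor is all zeros and every prediction
# decodes to padding.
print("shared embedding all zeros:", torch.all(model.shared.weight == 0).item())
print("lm_head all zeros:", torch.all(model.lm_head.weight == 0).item())
```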
@DachengLi1 Thanks, I trained on GPUs with more memory and used the function after training, and I can now load the model correctly. But I have another problem: during training, the loss and the learning rate are both zero:
It seems the network can't learn anything. My configuration is in the initial post. I trained both on the dummy.json dataset and on a personal one, with the same results. Do you have any idea what's going on? Thanks.
@GenVr I ran into a similar issue where the learning rate is 0 on a small dataset. This is caused by some integer flooring behavior in the Hugging Face transformers library. Can you try warmup_ratio=0 (or omit this argument) and let me know what happens?
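A minimal sketch of one way this can happen (an assumed mechanism; the exact rounding and step counting depend on the transformers version): on a tiny dataset the total number of optimizer steps floors to just one or two, so the warmup span covers essentially the whole run, and linear warmup starts from a 0.0 multiplier.

```python
# Toy illustration: with very few optimizer steps, the whole run sits inside
# warmup and the learning rate stays at (or very near) zero.
import math
import torch
from transformers import get_linear_schedule_with_warmup

opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=2e-5)

total_steps = 1                               # tiny dataset + gradient accumulation
warmup_steps = math.ceil(0.03 * total_steps)  # rounds to 1, i.e. the entire run
sched = get_linear_schedule_with_warmup(opt, warmup_steps, total_steps)

for step in range(total_steps):
    print(step, sched.get_last_lr())          # [0.0]: the only step runs at lr 0
    opt.step()
    sched.step()
```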
@DachengLi1 Thanks. I tried removing --warmup_ratio 0.03 and got this:
Now I have the LR, but the loss is always zero.
@GenVr Nice to hear that! Let's keep bs=1 for now; I will look into whether bs>1 can cause other problems (I haven't really tested bs>1 because of the GPU memory limit). Can you try bs=1 on your own dataset? The dummy dataset is composed of very simple questions (if you look into it, a lot of them are very similar), so you probably want to see whether this still happens on a more complex dataset.
BTW, remember to change the preprocessed path, otherwise it will read from the previously preprocessed file.
@DachengLi1 Thanks. I tried both bs=1 and bs>1 with my personal dataset. The loss is always zero and the network fails to train (the outputs look untrained). Maybe I could try a big public .json dataset to see what happens?
@GenVr Interesting... I haven't seen this before. Could you print an input/target tensor before it goes into the trainer to see what the contents are? Maybe the data is processed in the wrong way?
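A minimal sketch of that kind of inspection (a hypothetical helper, not part of FastChat; it assumes the usual seq2seq fields "input_ids" and "labels", with -100 marking ignored label positions):

```python
# Dump one processed training example before it reaches the trainer, to see
# whether the tokenized input and labels actually contain what you expect.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

def dump_example(example):
    """Print the raw token ids and the decoded text for a single example."""
    input_ids = [int(t) for t in example["input_ids"]]
    labels = [int(t) for t in example["labels"]]
    print("input_ids:", input_ids)
    print("labels:   ", labels)
    print("decoded input: ", tokenizer.decode(input_ids, skip_special_tokens=True))
    # Positions set to -100 are ignored by the loss; replace them with the pad
    # token id so the target side can be decoded for inspection.
    cleaned = [t if t != -100 else tokenizer.pad_token_id for t in labels]
    print("decoded target:", tokenizer.decode(cleaned, skip_special_tokens=True))

# e.g. dump_example(train_dataset[0]) right before calling trainer.train()
```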
Same problem (0 loss from start to finish) with both dummy.json and my own dataset.
Was this resolved? Same problem with a 0 loss.
same problem |
Hi, I'm training a FlanT5 network. The training completes successfully, but when I try to run a simple inference, I get a tensor of zeros, so the prediction is empty.
Example:
Output:
I ran several trainings, with both flan-t5-xl and flan-t5-large, on both my personal dataset and the dummy.json dataset.
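A minimal sketch of the kind of inference call described above (the checkpoint path and prompt are hypothetical, not the original example):

```python
# Load the fine-tuned checkpoint and run a single generation to inspect the
# raw output ids. Path and prompt are placeholders.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

checkpoint = "path/to/finetuned-flan-t5"      # hypothetical
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

# With the broken checkpoint the generated ids come back as all zeros (T5's pad
# token id), so the decoded prediction is an empty string.
print(output_ids)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```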
This is my training configuration:
Any idea what's going on? Thank you.