NaN during llama3 finetuning #427
Comments
Are you training on |
Hi @danielhanchen, Thank you for your response. I'm unsure about the inner workings of get_peft_model in Unsloth, but assuming it functions similarly to other PEFT methods, it should freeze the base model, including the embedding matrix, correct? Consequently, I believe my scripts are only training the LoRA parameters. I attempted to use Unsloth's fix_untrained_tokens, but it didn't work out for me. Additionally, I noticed that Unsloth's blog mentions the llama-3-8b base model, whereas I'm using the llama-3-8b-instruct model. The instruct model's reserved tokens shouldn't cause any issues, since (unlike in the base model) they are fine-tuned, right? |
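A quick sanity check for this - a minimal sketch, assuming `model` is the object returned by Unsloth's `get_peft_model` - is to count which parameters actually have `requires_grad` set:

```python
# Sketch: confirm that only the LoRA adapter weights are trainable.
# Assumes `model` is what FastLanguageModel.get_peft_model(...) returned.
trainable, frozen = 0, 0
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable += param.numel()
        print("trainable:", name)   # expect only names containing "lora_"
    else:
        frozen += param.numel()
print(f"trainable params: {trainable:,} | frozen params: {frozen:,}")
```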
@mano3-1 what does the traceback say if you run
|
Hi @lapp0,
|
I'm running into issues with back-propagation in unsloth as well, albeit I'm using a custom loss function and Mistral instead of llama-3. It works fine with plain PEFT + BnB, but with unsloth I get: `RuntimeError: Function 'LoRA_MLPBackward' returned nan values in its 0th output.`
I'd be interested in the cause of your issue, perhaps it is the same as mine. If I figure anything out with mine I'll let you know. |
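For context, error messages of that form come from PyTorch's anomaly detection; a minimal sketch of enabling it (assuming `model` and a prepared `batch` already exist) looks like:

```python
import torch

# Sketch: PyTorch's anomaly detection is what produces errors like
# "Function 'LoRA_MLPBackward' returned nan values in its 0th output";
# it also prints the forward-pass traceback of the offending op.
# Assumes `model` and a prepared `batch` (with labels) already exist.
torch.autograd.set_detect_anomaly(True)

loss = model(**batch).loss
loss.backward()   # raises at the first backward node that yields NaN
```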
Hi @lapp0 |
I'm not sure. The backward step where it fails for you is in a different layer of the model than mine, but the only thing our scripts have in common is unsloth. How about some debug details?
|
Here is the pip freeze: Here is the full training script: link. This is how I trigger the training script: You may set hf_token to the string "None" if you are loading unsloth models, I guess. |
requirements.txt isn't the same as pip freeze. |
Oh no, sorry guys - I will take a look |
Thanks @danielhanchen Here is my reproduction script as well, run on a 4090 with cuda 12.1. @mano3-1 has a standard SFT script so his is probably worth looking at first.
|
Hi @lapp0, |
Sorry about my confusion, @mano3-1. I reviewed and compared our installed packages. Nothing noteworthy in the shared dependencies, other than that the issue is perhaps related to the use of xformers. I'll experiment with this later.
|
Thanks for the code repro - will test this out - sorry about the issue again! |
Also facing the same issue while using Colab and the standard notebook in the unsloth folder. Thought I'd add that. |
hey, |
Sorry guys, just started debugging this.
For Colab / Kaggle, things should be fine after a restart. @DementedWeasel1971 When you said the Colab notebook we provided broke, could you point to exactly which one? Thanks. @mano3-1 Extremely weird actually - I reran Colab with Instruct and it seems fine - would you be able to run just the conversational notebook for Llama-3 here: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing @lapp0 I'm currently running your PPO example here: https://colab.research.google.com/drive/1fgJv0eKlRKexOl2RqcxoiZ-HhGrdNWQW?usp=sharing (will wait for it to complete) |
Thanks so much for looking into it! Unfortunately I'm still getting
Please let me know if there are any other debug details that would help. Also FYI, to speed up debugging you can set
Edit: I pushed a bad commit to my branch; I reverted the broken change. Should be good to try again with the head of |
Hi @lapp0, I can see paged adam in your script; perhaps change it to adamw_8bit and try again. |
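As a sketch of that change (the other argument values here are placeholders, not the ones from the failing run), the optimizer is just a string in the trainer arguments:

```python
from transformers import TrainingArguments

# Sketch: swap the paged optimizer for plain 8-bit AdamW to rule out a
# paged-optimizer issue. All other values here are placeholders.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    optim="adamw_8bit",   # instead of "paged_adamw_8bit" / "paged_adamw_32bit"
)
```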
@mano3-1 I just tried using a non-paged optimizer as you suggested, but unfortunately it didn't resolve the issue. @danielhanchen this is a brand new trainer adapted from huggingface/trl#1540 based on https://arxiv.org/pdf/2403.17031 It hasn't ever been run successfully with unsloth before, but it runs with peft + BnB. Shouldn't the forward and backward pass be identical to peft + BnB, or are there some steps where precision loss occurs? @mano3-1 @danielhanchen it's interesting that mano isn't getting nan, but you are. Perhaps there is something different between our environments? Here's mine for context:
|
@mano3-1 Wait, if it works for you - weird, it might be a paged optimizer issue. @lapp0 Hmm, very weird indeed - yes, I only edit the forward and backward passes, but I'm assuming the wrapping mechanisms are causing issues. If I had to guess, the Cross Entropy Loss part is the culprit, since I manually shift the labels and append stuff. I also turned off |
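For reference, the unfused version of that shifted cross-entropy looks roughly like the sketch below; this is the standard formulation, not Unsloth's actual fused implementation:

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, ignore_index=-100):
    # Sketch of the standard causal-LM loss: position t predicts token t+1,
    # and positions labelled -100 (padding / masked prompt) are ignored.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```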
@danielhanchen one other thing that isn't tried / tested by the Unsloth community is interleaving training and generation, which this script does. I have a feeling that is a possible culprit. I'll experiment with training only on pre-generated samples when I get a chance. Also I don't think the
For NaN gradient checks, I'm already running with
Do you know a good way to inject hooks which apply more extensive and detailed nan checks? |
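One possible approach, as a rough sketch (unverified against Unsloth's custom autograd functions, which may bypass some module-level hooks), is to register forward and full-backward hooks on every submodule and check for NaNs:

```python
import torch

def add_nan_checks(model):
    # Sketch: attach hooks to every submodule so a NaN in any forward output
    # or incoming backward gradient is reported with the module's name.
    def make_hook(name, kind):
        def hook(module, inputs, outputs):
            tensors = outputs if isinstance(outputs, (tuple, list)) else (outputs,)
            for t in tensors:
                if torch.is_tensor(t) and torch.isnan(t).any():
                    raise RuntimeError(f"NaN detected in {kind} of {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name, "forward output"))
        module.register_full_backward_hook(make_hook(name, "backward grad_output"))
```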
Edit: I was mistaken about the source of the problem. However I did discover that if my |
@lapp0 Apologies for the delay! OK, weird - so it might be something related to batching. Do you know if generation also uses |
@danielhanchen I'm pretty confident that the issue relates to padding now. The error doesn't occur with batch size
I'm wondering whether Unsloth includes logits that the attention mask excludes in the backward pass. I'll do some more experimentation. |
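For anyone testing this locally, a minimal sketch of the usual safeguard - masking padded positions out of the labels so they cannot reach the loss, assuming `model`, `input_ids`, and `attention_mask` are already defined - is:

```python
# Sketch: make sure padded positions can never contribute to the loss,
# regardless of what the logits at those positions contain.
# Assumes `model`, `input_ids`, and `attention_mask` already exist.
labels = input_ids.clone()
labels[attention_mask == 0] = -100   # -100 is ignored by the cross-entropy loss

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
outputs.loss.backward()
```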
I found the issue and created a reproduction script! #533 |
Thanks for the investigation - I'll take a look! |
Is anyone else having this error? |
Hi,
I'm currently fine-tuning llama3-instruct-8b on a custom dataset using unsloth's FastLanguageModel, with Hugging Face's SFTTrainer to train the model. Surprisingly, the gradient norm and evaluation loss become NaN after a few steps. I've seen a blog from unsloth mentioning that NaNs may appear due to a bug, but it also says the bug has now been fixed by Hugging Face and unsloth (here, under the Llama-3 Quirks section). So I not only updated unsloth and Hugging Face but also added the "pad_token" mentioned in the blog. Despite these attempts, the NaN problem still persists. Is there something else that I'm missing? Can someone help me out with this?
Here's the code snippet of how I'm loading the model:
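A minimal sketch of this kind of loading setup (the model name, argument values, and pad-token choice below are illustrative assumptions, not the exact ones from the run):

```python
from unsloth import FastLanguageModel

# Sketch of a typical Unsloth loading setup; the model name, max_seq_length,
# LoRA rank/targets, and pad-token choice below are illustrative assumptions.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detects bfloat16 / float16
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)

# Pad-token fix along the lines of the Unsloth blog: pad with a reserved token
# rather than EOS so padded positions cannot produce degenerate losses.
tokenizer.pad_token = "<|reserved_special_token_250|>"   # assumed choice
model.config.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
```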
Following is the training code:
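And a sketch of the corresponding SFTTrainer setup (the `dataset`, its "text" field, and all hyperparameters are assumptions, not the exact failing configuration):

```python
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# Sketch of the SFT setup described above; `dataset` and its "text" field,
# as well as all hyperparameters, are illustrative assumptions.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        logging_steps=1,
        optim="adamw_8bit",
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
    ),
)
trainer.train()
```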