Possible issue when resuming from checkpoint with fp8 #246
Comments
Can you share your training commands?
I believe this was the command used on resume: I'm also not sure about the comment that it works in non-fp8.
Sorry for the delay on my end. Some further clarifications:
If you could help us with minimally reproducible snippets to debug the issues further, that would be very much appreciated. Thanks!
Sorry for the confusion. I've run a couple more tests, and despite my original claim, it seems to be an issue with the learning rate; I should have realized that earlier. Edit: I realize this doesn't match the original post, where the lr seems to continue. I will investigate some more on my own and reopen if I have a clear repro case.
True. We did a couple of recent full fine-tuning runs and obtained decent results: Perhaps you could give those settings a try?
System Info
I noticed that my LoRA was degrading significantly after resuming training. Checking the logs before and after resuming shows this:
Note the loss difference. The LR schedule looks as expected. I now see that the epoch counts differ when resuming: 666 vs 1000.
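To see why a different epoch count could matter on resume, here is a minimal sketch. It assumes a linear decay schedule, which is only an assumption since the scheduler isn't stated in this issue, and the function names, base LR, and step counts are all hypothetical. It only shows that the same global step maps to a different LR once the total epoch count changes.

```python
def linear_decay_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Linearly decay from base_lr at step 0 down to 0 at total_steps.

    Assumed schedule for illustration only; the trainer in this issue
    may use a different one.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the LR never goes negative or above base_lr.
    clamped = min(max(step, 0), total_steps)
    return base_lr * (1.0 - clamped / total_steps)


def lr_at_same_step(base_lr: float, step: int, steps_per_epoch: int,
                    epochs_a: int, epochs_b: int) -> tuple[float, float]:
    """Return the LR at one global step under two different epoch totals."""
    lr_a = linear_decay_lr(base_lr, step, steps_per_epoch * epochs_a)
    lr_b = linear_decay_lr(base_lr, step, steps_per_epoch * epochs_b)
    return lr_a, lr_b


if __name__ == "__main__":
    # Hypothetical values: base LR 1e-4, one step per epoch, resume at step 500.
    lr_1000, lr_666 = lr_at_same_step(1e-4, 500, 1, 1000, 666)
    print(f"LR with 1000 total epochs: {lr_1000:.3e}")
    print(f"LR with  666 total epochs: {lr_666:.3e}")
```

If the two values differ for your actual scheduler and settings, the resumed run is not on the same LR curve as the original one, regardless of fp8.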
I confirmed this with another LoRA trained in fp8. When resuming without fp8, the loss continues as in the previous session; when resuming with fp8, it's wildly off.
I'm using the UI, so technically the issue could be there, but I've had no problems resuming before, and no problems with fp8 on its own.
If resuming from checkpoint with fp8 has been tested before, I'll assume it's some issue on my part.
In case it's a config issue, here's the one I used:
Information
Reproduction
Expected behavior
Resuming from a checkpoint with fp8 should behave the same as resuming without fp8.
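One rough way to turn this expectation into a check is to compare the average loss just before the checkpoint with the average loss just after resuming. The sketch below assumes you have already extracted per-step loss values from your own logs. The helper names, window size, tolerance, and sample numbers are hypothetical and are not taken from this issue.

```python
from statistics import mean


def loss_gap_on_resume(pre_losses: list[float], post_losses: list[float],
                       window: int = 3) -> float:
    """Mean of the first `window` post-resume losses minus the mean of the
    last `window` pre-checkpoint losses. Positive means the loss jumped up."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(pre_losses) < window or len(post_losses) < window:
        raise ValueError("not enough loss values for the chosen window")
    return mean(post_losses[:window]) - mean(pre_losses[-window:])


def resumed_cleanly(pre_losses: list[float], post_losses: list[float],
                    window: int = 3, tolerance: float = 0.05) -> bool:
    """True if the loss gap across the resume boundary is within tolerance."""
    return abs(loss_gap_on_resume(pre_losses, post_losses, window)) <= tolerance


if __name__ == "__main__":
    # Made-up loss values purely for illustration.
    before = [0.5, 0.4, 0.3, 0.3, 0.3]
    after_ok = [0.3, 0.3, 0.3]
    after_bad = [0.9, 0.9, 0.9]
    print(resumed_cleanly(before, after_ok))
    print(resumed_cleanly(before, after_bad))
```

Running the same check on an fp8 resume and a non-fp8 resume of the same checkpoint would give a concrete number to attach to a repro case.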