Adam-mini can't save a checkpoint when used with FSDP in the Hugging Face Trainer #5
Comments
Hi, I'm from the same team as @hahuyhoang411. We have been able to mitigate the issue by turning this option (use_orig_params) to False. Why is that the case? I noticed some speed loss after turning it off; is there any way to enable it again without causing the issue?
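For readers following along, here is a minimal sketch of where this flag lives. use_orig_params is a standard argument of torch's FSDP wrapper; the Trainer-side fsdp_config key name shown is an assumption and may differ across transformers versions.

```python
# A minimal sketch for context, not the exact setup from this thread.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def wrap_without_orig_params(model: nn.Module) -> FSDP:
    # With use_orig_params=False, FSDP hands the optimizer flattened
    # FlatParameters instead of the original per-layer parameters.
    return FSDP(model, use_orig_params=False)


# Illustrative Trainer-side config (the key name is an assumption; some
# transformers versions use an "fsdp_"-prefixed key instead):
fsdp_config = {"use_orig_params": False}
```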
Thanks @tikikun for the update. We are still working on this issue of saving checkpoints, but it is good to know that you have worked out a solution. I wonder what you mean by "speed loss". Did you observe:

Case 1: a slowdown in loss vs. iteration, or
Case 2: a slowdown in wall-clock time per iteration?

I personally guess you mean Case 2, because the "use_orig_params" option does not seem to affect the optimizer trajectory. For Case 2, some possibilities come to mind:

Possibility 1: "use_orig_params = False" slows down some other operations that are not related to the optimizer.
Possibility 2: The current implementation of Adam-mini involves several "view(-1)"-type operations, which may cause the slowdown. Note that these "view(-1)" calls are not activated when using "use_orig_params = True", because everything is already flattened before training.

We are still trying to fix the saving issue. Please feel free to share more of your findings; it would help a lot.
@zyushun Yes, it is Case 2: each iteration takes more time. Thank you for the swift response. The results are very good; it is truly as good as or better than AdamW with much less VRAM (much better than some previous optimizers we tried, like Lion). It's really amazing work. We will update if we have any other issues or info; we are using this internally at the moment.
@tikikun @hahuyhoang411 @zyushun Thanks for sharing. Have you tried this combination of Trainer and DeepSpeed?
For DeepSpeed it's working; only FSDP fails.
yes: same results with zero_3 = True | False |
Hi @tikikun @hahuyhoang411 @han508, thanks a lot for the valuable discussion on this checkpoint-saving issue! We tried several fixes, but unfortunately, so far the only effective way is to set "use_orig_params = False", as suggested by @tikikun and @hahuyhoang411. To sum up, so far we have:

trainer + DeepSpeed = can save/load ckpt
trainer + FSDP + "use_orig_params = False" = can save/load ckpt
trainer + FSDP + "use_orig_params = True" (the default) = cannot save ckpt

We have updated this info in the readme.md. We will keep updating here if we find other approaches to fix this checkpoint-saving issue. Thanks a lot for all the great suggestions and discussions!
The use of
Can you provide an example with the Trainer? I tried to override the create_optimizer method, but it failed.
Thanks for the great suggestion! Here is an example of create_optimizer. We have also included this example in the readme.md.
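A minimal sketch of such an override is shown below, assuming the adam_mini package exposes an Adam_mini class; the import path and constructor argument names are illustrative and should be matched to the version you have installed.

```python
# A minimal sketch of overriding create_optimizer so the Hugging Face Trainer
# uses Adam-mini instead of its default AdamW.
from transformers import Trainer
from adam_mini import Adam_mini  # import path is an assumption for illustration


class AdamMiniTrainer(Trainer):
    def create_optimizer(self):
        # Build the optimizer only once, mirroring the base Trainer behavior.
        if self.optimizer is None:
            self.optimizer = Adam_mini(
                self.model,                  # some versions take named_parameters instead
                lr=self.args.learning_rate,
                weight_decay=self.args.weight_decay,
                beta1=self.args.adam_beta1,  # argument names here are assumptions
                beta2=self.args.adam_beta2,
                epsilon=self.args.adam_epsilon,
            )
        return self.optimizer
```

You would then instantiate AdamMiniTrainer wherever you would normally construct Trainer, with the same arguments.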
@chcoliang
@han508 Hi, I am one of the authors of Adam-mini. According to our calculation, Adam-mini should save about 4 GB per GPU card in your case. I suspect that something is wrong and the optimizer actually being used is still AdamW. Could you provide the DeepSpeed log to us? Or could you tell us how to reproduce your results? An example of a DeepSpeed log is given below:
Thanks for your reply.
Application details:
@han508 Hi. According to the log line "Loading extension module cpu_adam...", I think you are using cpu_adam instead of Adam-mini. From my understanding, when the "optimizer" entry in the DeepSpeed config is None, DeepSpeed will use trainer.create_optimizer() to generate the optimizer. So some "optimizer" content may still be present in your config. Please double-check the config file, as well as any default configuration that is not in the config file. Furthermore, the current version of Adam-mini does not support cpu-offload for DeepSpeed according to our experiments. We do not recommend using cpu-offload with Adam-mini in DeepSpeed.
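Concretely, the places to check in the DeepSpeed config look roughly like this (a sketch with illustrative values; the Hugging Face Trainer accepts either a dict like this or a JSON file with the same keys):

```python
# A sketch of what to check in the DeepSpeed config when using Adam-mini.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Avoid this block: cpu-offload is not supported with Adam-mini in
        # DeepSpeed at the time of this thread.
        # "offload_optimizer": {"device": "cpu"},
    },
    # Do NOT define an "optimizer" section; if one is present, DeepSpeed builds
    # that optimizer (e.g. cpu_adam/AdamW) instead of using the one returned by
    # trainer.create_optimizer(), so Adam-mini would never be used.
    # "optimizer": {"type": "AdamW", "params": {...}},
    "train_micro_batch_size_per_gpu": "auto",
}
```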
@chcoliang
Sure! It is vitally important to support cpu-offload. As we mentioned in the readme.md, the current version of Adam-mini supports cpu-offload in FSDP, but not in DeepSpeed (due to some unexpected error). We are working on it and hopefully it will be done soon.
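For reference, a minimal sketch of the FSDP-side cpu-offload setup that the working combination above refers to; these are standard torch FSDP options rather than anything Adam-mini-specific, and the function name is just illustrative.

```python
# A minimal sketch of enabling FSDP cpu-offload with standard torch options.
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP


def wrap_with_cpu_offload(model: nn.Module) -> FSDP:
    # offload_params=True keeps parameters (and gradients) on CPU, moving them
    # to the GPU only when needed for computation.
    return FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```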
I may not have caught up on all of the context, but I wanted to mention something about lines 53 to 56 and line 212 of f98a1cd.
Hi @awgu! Thanks a lot for mentioning this to us! This is a great catch. It seems we need another way to save checkpoints for FSDP. We will work on it and will update here as soon as we make any progress.
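As an illustration of one possible route (a sketch only, not necessarily the fix eventually adopted), PyTorch's FSDP state-dict API can gather a full, unsharded optimizer state dict on rank 0 for saving:

```python
# A sketch of saving an FSDP-wrapped model and optimizer via torch's
# full-state-dict API; function and variable names are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    StateDictType,
    FullStateDictConfig,
    FullOptimStateDictConfig,
)


def save_full_checkpoint(model: FSDP, optimizer, path: str) -> None:
    # Gather full (unsharded) state dicts, offloaded to CPU, on rank 0 only.
    with FSDP.state_dict_type(
        model,
        StateDictType.FULL_STATE_DICT,
        FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
        FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
    ):
        model_state = model.state_dict()
        optim_state = FSDP.optim_state_dict(model, optimizer)
    if dist.get_rank() == 0:
        torch.save({"model": model_state, "optimizer": optim_state}, path)
```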
Is there any recent progress on DeepSpeed cpu-offload? I am looking forward to it.
Hi, it's me again. The training is working great but when it comes to saving the checkpoint, I got this bug. Any ideas?