only main process should call _save on deepspeed zero3 #25959
Conversation
src/transformers/trainer.py (outdated)

```python
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)
# remove the dummy state_dict
remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
self.model_wrapped.save_checkpoint(output_dir)
```
This does not feel like the right solution, as `should_save` checks whether we are the main process, and we should only ever be saving once. You can have `self.model_wrapped.save_checkpoint` under that `self.args.should_save`, I believe, but I'm not sure this is the right "fix". Definitely cc @pacman100 here.
`self.model_wrapped.save_checkpoint` needs to be called on each process, because each process only holds part of the model weights under DeepSpeed ZeRO-3.
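As a minimal sketch of that split (not the exact Trainer code; `state_dict` is assumed to be the gathered weights available to the main process):

```python
# Sketch only: gate the consolidated save to the main process, but let every
# rank join the DeepSpeed checkpoint, since under ZeRO-3 each rank holds only
# its own shard of the partitioned weights.
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)   # main process writes once
self.model_wrapped.save_checkpoint(output_dir)      # every rank saves its shard
```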
Got it, thanks for the clarification!
Hello, thank you for the fix! However, it needs a correction, as explained in the comment below.
src/transformers/trainer.py (outdated)

```python
if self.args.should_save:
    self._save(output_dir, state_dict=state_dict)
# remove the dummy state_dict
remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
```
This would remove the legitimate model checkpoint when `stage3_gather_16bit_weights_on_model_save=True`. `remove_dummy_checkpoint` should only be called in the exception block.
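A hedged sketch of the control flow being requested (the gather step and exception type are assumptions, not the exact Trainer code): the success path writes the legitimate checkpoint and must not touch it; only the fallback path writes a dummy `state_dict` that `remove_dummy_checkpoint` cleans up.

```python
try:
    # assumed gather step: consolidates the full 16-bit weights when
    # stage3_gather_16bit_weights_on_model_save=True
    state_dict = self.accelerator.get_state_dict(self.model_wrapped)
    if self.args.should_save:
        self._save(output_dir, state_dict=state_dict)  # legitimate checkpoint
except ValueError:
    if self.args.should_save:
        # dummy save keeps the non-weight files (config, tokenizer, ...)
        self._save(output_dir, state_dict={})
    remove_dummy_checkpoint(self.args.should_save, output_dir, [WEIGHTS_NAME, SAFE_WEIGHTS_NAME])
    self.model_wrapped.save_checkpoint(output_dir)  # each rank saves its shard
```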
Thanks for your explanation; it has been fixed.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@pacman100 anything I need to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for iterating, LGTM!
@zjjMaiMai One of the hub tests is failing, complaining that the `base_model` is empty when pushing to the hub. Could you try running this test locally to see whether it's a result of the changes in this PR?
@zjjMaiMai Could you try and rebase on main? This should resolve the failing tests.
All green! @amyeroberts |
only main process should call _save on deepspeed zero3 (#25959)

Background

`trainer._save` is called on all processes after #25817, which raises a `FileExistsError` when saving the model.

What does this PR do?

This PR fixes it: `trainer._save` is now called on the main process only.
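For illustration only (this is not the actual `_save` internals): when every rank runs the same save logic, any step that creates a file or directory exclusively raises `FileExistsError` on all but the first rank to reach it. A minimal sketch:

```python
import os

def save_like_every_rank(output_dir: str) -> None:
    # Hypothetical stand-in for an exclusive-create step inside a save path.
    # If N processes race through this, N - 1 of them hit FileExistsError;
    # gating the call on args.should_save (main process only) avoids the race.
    os.makedirs(output_dir)  # exist_ok defaults to False
```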