[Feat] DeepSpeed single file saving #6900
Conversation
Overall LGTM! Small nits.
Codecov Report
@@            Coverage Diff            @@
##           master    #6900     +/-   ##
==========================================
- Coverage      92%      87%       -5%
==========================================
  Files         194      194
  Lines       12346    12355        +9
==========================================
- Hits        11327    10688      -639
- Misses       1019     1667      +648
Hi @SeanNaren, this particular test is failing.
I'm also noticing the container is failing when DeepSpeed is added to the extras. I might move to having DeepSpeed installed within the GPU container instead of the extras file.
@SeanNaren totally separate from this PR, but have you seen checkpoint saving become a bottleneck at all with deepspeed? I think there's room to optimize the saving logic inside the lightning checkpoint callback, since we are potentially re-running the broadcast/state dict modification 2 times here: https://github.com/PyTorchLightning/pytorch-lightning/blob/fe0d08899eba94d275ff42253f495d9e70d86f89/pytorch_lightning/callbacks/model_checkpoint.py#L278-L286
If multiple of these modes are used, then it would definitely be an expensive operation. Currently I've only tested with
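For reference, the optimization suggested above could look roughly like the following sketch. This is not the actual ModelCheckpoint code; the callables and names here are hypothetical and only illustrate the "gather the state dict once, write it to every target path" pattern.

```python
# Hedged sketch only: perform the expensive gather/broadcast once and reuse the
# resulting checkpoint dict for every file that needs writing. These helpers are
# hypothetical, not pytorch_lightning's actual ModelCheckpoint internals.
from typing import Callable, Dict, Optional


def save_monitor_and_last(
    dump_checkpoint: Callable[[], Dict],
    write_checkpoint: Callable[[Dict, str], None],
    monitor_path: str,
    last_path: Optional[str] = None,
) -> None:
    checkpoint = dump_checkpoint()               # expensive gather/broadcast runs once
    write_checkpoint(checkpoint, monitor_path)   # monitored / top-k checkpoint
    if last_path is not None:
        write_checkpoint(checkpoint, last_path)  # "last" checkpoint reuses the same dict
```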
What does this PR do?
DeepSpeed has released a new version that supports gathering all partitioned weights to a single process and saving them from there. As discussed in #6691, it's probably best for now to prioritise single-file saving, since it is the easiest to deploy. As a result, I've made this the default behaviour when using ZeRO Stage 3.
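A minimal usage sketch of the new default, assuming the `DeepSpeedPlugin(stage=3)` interface available around this PR; exact plugin/Trainer arguments may differ between Lightning versions, and `MyLightningModule` is a placeholder for your own module.

```python
# A minimal sketch, assuming the DeepSpeedPlugin(stage=3) interface; exact
# argument names may differ between Lightning versions.
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin

model = MyLightningModule()  # hypothetical LightningModule defined elsewhere
trainer = pl.Trainer(
    gpus=2,
    precision=16,
    plugins=DeepSpeedPlugin(stage=3),
)
trainer.fit(model)

# With this change, ZeRO Stage 3 partitioned weights are gathered to a single
# process by default, so this produces one consolidated checkpoint file.
trainer.save_checkpoint("single_file.ckpt")
```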
In addition, if a user uses the `configure_sharded_model` hook, it is necessary to call this in the `load_from_checkpoint` function, as the test indicates via the model hook. This could potentially be done for the user, but for now I'll make it clear in the docs (in a separate PR) that this needs to be handled manually.

cc @ananthsub @tchaton
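To illustrate the manual step described above, here is a hedged sketch (not the PR's actual test) of a module whose layers are created in `configure_sharded_model` and rebuilt on load so the consolidated state dict can be applied; all names are illustrative.

```python
# Hedged sketch: layers are created inside configure_sharded_model (so ZeRO
# Stage 3 can partition them as they are instantiated), and the hook is invoked
# again on load so the layers exist before the saved state dict is applied.
import torch
import torch.nn as nn
import pytorch_lightning as pl


class ShardedBoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self._built = False

    def configure_sharded_model(self) -> None:
        if not self._built:
            self.block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
            self._built = True

    def on_load_checkpoint(self, checkpoint) -> None:
        # Manually rebuild the sharded layers so load_from_checkpoint can
        # populate them; per this PR, Lightning does not yet do this for you.
        self.configure_sharded_model()

    def forward(self, x):
        return self.block(x)

    def training_step(self, batch, batch_idx):
        return self.forward(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Restoring the consolidated single-file checkpoint produced with ZeRO Stage 3:
model = ShardedBoringModel.load_from_checkpoint("single_file.ckpt")
```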
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃