[zero3] how to get the model reconstructed for saving? #872
Comments
@stas00, do you have repro steps?

Here we go: *(the repro commands, their output, and the results did not survive the page capture)*
Some variation of this will eventually be part of the DeepSpeed API, but if you need it sooner, here is how you get the consolidated fp16 weights:
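The snippet itself did not survive the page capture. As a framework-free sketch of the underlying idea (all names here, like `partition` and `consolidate`, are hypothetical and not DeepSpeed API): each rank holds a flat, padded shard of every parameter, and consolidation on rank 0 amounts to concatenating the shards, trimming the alignment padding, and restoring the original shape.

```python
# Schematic reconstruction of a ZeRO-3-style partitioned parameter.
# Plain Python lists stand in for tensors to keep the sketch self-contained.
from math import ceil

def partition(flat, world_size):
    """Split a flat list of weights into world_size equal, zero-padded shards."""
    shard_len = ceil(len(flat) / world_size)
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(world_size)]

def consolidate(shards, numel, shape):
    """Concatenate all shards, drop the padding, reshape to 2-D rows."""
    flat = [w for shard in shards for w in shard][:numel]
    rows, cols = shape
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# A 2x3 "weight matrix" partitioned across 4 ranks and reconstructed:
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = [w for row in weight for w in row]
shards = partition(flat, world_size=4)
restored = consolidate(shards, numel=6, shape=(2, 3))
print(restored == weight)  # → True
```

In real ZeRO-3 the shards live on different GPUs, so the concatenation step is a collective gather rather than a list comprehension, but the trim-and-reshape bookkeeping is the same.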
This will be mostly resolved by:
Under zero2, `self.model.state_dict()` returns an fp16 version of the model. Under zero3 it returns some placeholder instead, with all the weights being just `tensor([1.])`, so how can we get the trained model out of deepspeed?

This is related to #800, but there, under zero2, we at least had the fp16 version save-able; now there is no way at all. This is a total lock-in, unless I'm missing some API that was added with zero3.

Ideally it'd be awesome if it could reconstruct the model directly on the disk, since that would ensure there is enough memory to do so.
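The "reconstruct directly on the disk" idea can be sketched without any framework: instead of materializing the whole consolidated state_dict in memory, gather and write one parameter at a time, so peak memory is bounded by the largest single tensor. `gather_full_param` is a hypothetical hook for the per-parameter gather, and `pickle` stands in for `torch.save`.

```python
# Streaming consolidation sketch: one full parameter in memory at a time.
import io
import pickle

def stream_consolidated(param_names, gather_full_param, fh):
    """Gather each parameter in turn and append it to an open file handle."""
    for name in param_names:
        full = gather_full_param(name)   # only one full tensor alive here
        pickle.dump((name, full), fh)    # write it out, then `full` is freed

def load_streamed(fh):
    """Read the appended (name, tensor) records back into a state dict."""
    state = {}
    while True:
        try:
            name, tensor = pickle.load(fh)
        except EOFError:
            return state
        state[name] = tensor

params = {"w1": [1.0, 2.0], "b1": [0.1]}
buf = io.BytesIO()                       # stands in for a file on disk
stream_consolidated(list(params), params.__getitem__, buf)
buf.seek(0)
restored_state = load_streamed(buf)
print(restored_state == params)  # → True
```

The trade-off is that the resulting file is a sequence of records rather than a single checkpoint object, so a small loader like `load_streamed` is needed on the way back in.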
To summarize the different requests so far, users have 3 different needs *(items 1 and 2 did not survive the page capture)*:

3. A `deepspeed.consolidate_weights()` call in the rank0 process which would give users the full non-partitioned weights back (perhaps with a bool arg choosing the fp16 or fp32 version). Users could then save the model as they do with any other pytorch tools. This would only be practical for small-ish models. The key here is that, while this would be somewhat costly, users would be able to reuse their code almost without change whether they train with deepspeed or some other way. I think this must happen on cpu, since it's unlikely gpus will have the memory for it. It would probably have to return a copy of the model with the consolidated weights, so that the user can continue training the original model. So probably something along the lines of Save ZeRO3 (partitioned) fp16 weights #882, but with the partitioning removed as well. The result of this call would give users the equivalent of what they have under zero2 at this moment (if it's fp16).

I think all 3 would be more or less the same code, just used differently: (3) uses the existing deepspeed engine and doesn't need to access the filesystem, while (1) and (2) need no engine and rely exclusively on the filesystem.
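The usage pattern being asked for can be sketched as follows. Note that `consolidate_weights()` is the *proposed* API name from this issue, not an existing DeepSpeed call, and `FakeEngine` is a toy stand-in for a DeepSpeed engine; in real code the save callback would be `torch.save`.

```python
# Hypothetical consolidate-then-save flow: every rank may participate in
# the gather, but only rank 0 materializes and writes the full model.

class FakeEngine:
    """Toy stand-in for a DeepSpeed engine exposing the proposed API."""
    def consolidate_weights(self, fp32=False):
        # A real implementation would gather all ZeRO-3 partitions here;
        # this sketch just returns a tiny state_dict.
        return {"linear.weight": [[0.5, -0.5]], "dtype": "fp32" if fp32 else "fp16"}

def save_consolidated(engine, rank, save_fn):
    """Consolidate and save on rank 0 only; other ranks return None."""
    if rank != 0:
        return None
    state_dict = engine.consolidate_weights()
    save_fn(state_dict)
    return state_dict

saved = []
result = save_consolidated(FakeEngine(), rank=0, save_fn=saved.append)
print(result is not None and saved[0] == result)  # → True
print(save_consolidated(FakeEngine(), rank=1, save_fn=saved.append))  # → None
```

This mirrors the zero2 situation described above: after consolidation, the user is back to an ordinary pytorch state_dict and can use their existing save/load code unchanged.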
Thank you!
@jeffra, @tjruwase