[DeepSpeed] ZeRO stage 3 integration: getting started and issues #11044
Superseded by #11464
Hi @stas00, thank you for working on this amazing library. I looked into the DeepSpeed documentation for optimizers at https://deepspeed.readthedocs.io/en/latest/optimizers.html and there are a bunch of optimizers, but it sounds like we are not able to use optimizers other than DeepSpeed's own. So what is the workaround for this? Thanks!
To use your own optimizer, drop the `optimizer` section from the DeepSpeed config file and the Trainer will pass its own optimizer to DeepSpeed. I guess I could expand on this in the docs.
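For illustration only, a rough sketch of what such a config might look like with the `optimizer` block dropped. The exact keys depend on your DeepSpeed version, and `zero_allow_untested_optimizer` is DeepSpeed's flag for allowing optimizers it hasn't vetted; the point is simply that there is no `optimizer` section, so the Trainer creates its own optimizer and hands it to DeepSpeed:

```json
{
  "zero_allow_untested_optimizer": true,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    }
  }
}
```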
Thanks for your response, @stas00. I tried the way you mentioned (i.e., dropping the "optimizer" part from the config file), but it seems that ZeRO Offload is only able to work with DeepSpeed optimizers. The exact traceback is given below:
Update: after removing that verification check from DeepSpeed, training runs fine for me with a non-DeepSpeed optimizer.
wrt removing the verification, are you sure it's actually doing the right thing? Not failing doesn't necessarily mean it's working correctly.
@stas00 It's my intuition that the error says: if you want to use optimizer(s) other than the DeepSpeed default ones, ZeRO Offload can't be used.
Let's ask the DeepSpeed devs: deepspeedai/DeepSpeed#1194. Meanwhile, if it works for you, that's great! Thank you for doing the experiment.
Why would you want ZeRO-3?
In a few words, ZeRO-2 was very limited scalability-wise: if `model.half()` couldn't fit onto a single GPU, adding more GPUs wouldn't have helped, so with a 24GB GPU you couldn't train a model larger than about 5B params.

Since with ZeRO-3 the model weights are partitioned across multiple GPUs and can additionally be offloaded to CPU, the upper limit on model size has increased by about 2 orders of magnitude. That is, ZeRO-3 allows you to scale to huge models with trillions of parameters, assuming you have enough GPUs and general RAM to support this. ZeRO-3 can benefit a lot from general RAM if you have it; if not, that's OK too. ZeRO-3 combines all your GPUs' memory and general RAM into one vast pool of memory.
If you don't have many GPUs, but just a single one with a lot of general RAM, ZeRO-3 will allow you to fit larger models.
Of course, if you run in an environment like the free Google Colab, while you can run DeepSpeed there, you get so little general RAM that it's very hard to make something out of nothing. In some sessions you get only 12GB of RAM, which is impossible to work with - you want at least 24GB instances. Setting it up might be tricky too, please see this notebook for an example:
https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb
Getting started
Install the latest deepspeed version:
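For example, via the standard pip route (add a version pin or prebuild the CPU/CUDA ops if your setup needs it):

```bash
pip install deepspeed
```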
You will want to be on the transformers master branch if you want to run a quick test:
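Something along these lines - a sketch only, assuming a 2-GPU machine and the translation example script; the script path and arguments below are illustrative and may differ in your checkout:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
pip install -e .
# plus the example's own requirements (datasets, sentencepiece, sacrebleu)

# illustrative invocation; assumes the ZeRO-3 config file shown below
deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \
    --model_name_or_path t5-small \
    --source_prefix "translate English to Romanian: " \
    --source_lang en --target_lang ro \
    --dataset_name wmt16 --dataset_config_name ro-en \
    --do_train --max_train_samples 64 \
    --per_device_train_batch_size 4 \
    --fp16 \
    --output_dir /tmp/zero3_test --overwrite_output_dir \
    --deepspeed ds_config_zero3.json
```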
You will find a very detailed configuration here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
Your new config file will look like this:
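Roughly along these lines - a sketch rather than the exact file, since key names have shifted a bit across DeepSpeed versions, and you would normally keep whatever fp16/optimizer/scheduler sections you already had for ZeRO-2:

```json
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```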
So if you were already using ZeRO-2, it's only the `zero_optimization` section that has changed.

One of the biggest nuances of ZeRO-3 is that the model weights aren't inside `model.state_dict`, as they are spread out across multiple GPUs. The Trainer has been modified to support this, but you will notice slow model saving, as it has to consolidate the weights from all the GPUs. I'm planning more performance improvements in future PRs, but for now let's focus on making things work.
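The Trainer handles this consolidation for you, but for intuition, here is a rough sketch (not the Trainer's actual code) of how one might gather a full copy of the weights on a single rank using DeepSpeed's `GatheredParameters` context manager. It assumes the model has already been set up by DeepSpeed under ZeRO-3 and that `torch.distributed` is initialized:

```python
import torch
import deepspeed

def consolidated_state_dict(model):
    # Under ZeRO-3 each rank holds only a shard of every parameter, so
    # model.state_dict() on its own does not contain the full weights.
    # GatheredParameters temporarily reassembles the full parameters for
    # the duration of the context; they are re-partitioned again on exit.
    with deepspeed.zero.GatheredParameters(list(model.parameters()), modifier_rank=None):
        if torch.distributed.get_rank() == 0:
            # Copy to CPU so the snapshot survives the re-partitioning on exit.
            return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    return None
```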
Issues / Questions

If you have any general questions, or if something is unclear or missing in the docs, please don't hesitate to ask in this thread. But for any bugs or problems please open a new Issue and tag me there. You don't need to tag anybody else. Thank you!