[DeepSpeed] ZeRO-Infinity integration: getting started and issues #11464
Comments
Hi @stas00, is it normal for zero3 training to take a while to get started? I haven't put in any time investigating yet, but I updated transformers and deepspeed to the latest masters just to see if I could get them working. My simple training script (derived from the summarization example) works fine with deepspeed and the default zero2 config, but when I run the same script with the default zero3 config, training begins but hangs with the progress bar at step 0. I let it run for about half an hour before I killed the process. The quick zero3 test in your post above seems to run fine, however. Is there some initial zero3 overhead I just need to be more patient with, or do I possibly have some deeper problem?
Something is wrong then. Deepspeed takes a bit longer to start than normal as it pre-allocates some memory, and extra so the first time if it needs to compile some cuda extensions, but once started it should work at the normal speed. Hanging on zero3 could indicate that you're on multi-gpu and running some code that blocks on trying to sync with other gpus. Anything involving forward calls must be performed on all gpus participating in the process. If one of them is skipped, all other gpus will block waiting for that gpu, for example if some of your code performs a forward call on only some of the gpus. Could you please open a separate issue and help me to reproduce the problem, and then we can look at it together. To help diagnose, you can add this anywhere to your code, and it'll dump backtraces for all threads every 20 secs, so you will be able to see where it's hanging.
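The snippet itself didn't survive in the text above, but a minimal sketch matching the described behavior (dump backtraces of all threads every 20 seconds), using Python's standard `faulthandler` module, could look like this:

```python
import faulthandler

# Schedule a traceback dump of every thread to stderr every 20 seconds,
# repeating until cancelled. If the process hangs, the periodic dumps
# show exactly where each thread is stuck.
faulthandler.dump_traceback_later(20, repeat=True)

# ... the training code that hangs goes here ...

# Stop the periodic dumps once you are done debugging.
faulthandler.cancel_dump_traceback_later()
```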
Hello! I was trying out the command pasted above, but replacing the zero_optimization part from tests/deepspeed/ds_config_zero3.json with the configuration from the NVMe offload example (see link above). The error I get is:
Thank you for trying this new feature. This looks like a potential bug in Deepspeed; I asked @tjruwase to have a look. Maybe it's worthwhile to file an issue at https://github.com/microsoft/DeepSpeed/issues if you have a few minutes, as this is definitely not an integration issue. If you do, please paste the full config you were using. Thank you, @thies1006
@thies1006, thanks for reporting this issue. As @stas00 suggested, could you please report this as a deepspeed issue? It would be great if you included the exact ds_config.json in the issue report. Thanks so much!
Just now this issue appeared, which I guess is exactly the same case. Sorry for not posting the exact config right away. Thank you very much! Edit: lowering "sub_group_size" from 1e14 to 1e3 solved the issue (however, another one came up; filed another issue at Deepspeed).
@thies1006, there is now a PR for the assert:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@stas00 I am not sure if this is the right forum to ask. Feel free to direct me somewhere else.
Indeed, but you have to do it before you called …. You can still add/remove params after ….
@stas00 Thank you for your prompt response. So before …?
I don't think this example can work, since deepspeed installs special attributes into the tensor, which would be copied and point to the wrong place. You'd have to create a normal torch param and copy the data from the other param, but perhaps you can simply ask the deepspeed devs to add a new util that will do the right thing for you. But let's stop this discussion here, as it is off-topic for this thread and not really related to ….
DeepSpeed ZeRO-Infinity HF Integration is now available in the master branch of `transformers`. Here is a quick getting started / what's new post.

ZeRO-Infinity extends ZeRO-3 by complementing CPU Offload with NVMe Offload, enabling the training of even bigger models. It also adds various other optimizations and improvements.
Getting started
Install the latest `deepspeed` version, and be on a `transformers` master branch if you want to run a quick test:
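The exact commands weren't preserved above; a hedged sketch of a typical setup (package name and repository URL assumed to be the standard ones):

```shell
# Install the latest released deepspeed
pip install --upgrade deepspeed

# Install transformers from the master branch
pip install git+https://github.com/huggingface/transformers
```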
You will find a very detailed documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
Your new config file will look like this (for ZeRO-3 as an example):
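For illustration, a ZeRO-3 config file has roughly this shape, with `auto` values where the Trainer fills in the correct/recommended settings. This is a sketch, not the authoritative file; consult the linked documentation for the exact keys:

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```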
If you want to experiment with NVMe offload, please see: https://huggingface.co/transformers/master/main_classes/trainer.html#nvme-support
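As a sketch of what the NVMe variant changes: the offload devices move from `cpu` to `nvme`, each with a path to a fast local drive. The `nvme_path` below is an example mount point, not a required value; see the linked doc for the authoritative config:

```json
"zero_optimization": {
  "stage": 3,
  "offload_optimizer": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
  },
  "offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
  }
}
```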
Deepspeed currently runs only fp16-mixed precision
While deepspeed devs are working on the fp32 mode, at this moment only fp16-amp-like train/eval is available. So if your model struggles under fp16/amp it will have the same struggles under deepspeed.
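To make the fp16-only point concrete, here is a tiny PyTorch illustration (not DeepSpeed's actual code) of what casting a whole model with `model.half()` does, in contrast to AMP's dynamic per-op casting:

```python
import torch

# A tiny stand-in model; DeepSpeed applies the same kind of cast to
# the real model via model.half().
model = torch.nn.Linear(4, 2)
print(model.weight.dtype)  # torch.float32

# .half() casts every parameter and buffer to fp16 in place. Unlike
# AMP, which keeps fp32 weights and casts per-op where safe, nothing
# remains in fp32 afterwards.
model.half()
print(model.weight.dtype)  # torch.float16
```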
Moreover, because deepspeed does `model.half()`, forcing all weights to fp16, some models might not be ready for this (under AMP things are switched dynamically to fp16 only where needed). If you run into this, please post a new issue and we will try to find a solution/workaround for those special cases.

Must use the latest `transformers` master

If you get deepspeed errors like it doesn't know what the `auto` value is, you aren't on the latest `transformers` master branch: `git pull` if you already have a clone, and if you installed it, update your install.

For those who already use DeepSpeed HF integration
As the integration part is evolving, it has gone through a major revamp and various improvements.

There are 2 important changes that you need to be aware of if you're already using the DeepSpeed integration in `transformers`:

After this release, only config params that are set to `auto` will get automatically overridden/set to the correct/recommended values; everything else is left as is. This is to avoid the previously confusing behavior of never being quite sure what gets overridden and what doesn't, despite the logger telling you what it did override. The new behavior is completely unambiguous.

See examples:
Full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#shared-configuration
If you are using massive models and aren't using example scripts, make sure to read:
Full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#constructing-massive-models
Everything else should work as before or better.
The docs were revamped a lot too - if you find anything unclear or lacking please let me know.
If you encounter any problems, please post an Issue and tag @stas00 in it.

Thank you!