zero3 hangs in inference #860
Comments
@stas00 how many gpus are you running this on? If it's more than one, can you please try running it on one GPU and see if it works. If it does, I am suspicious about two issues here:
2 gpus, and it works fine with a single gpu. I'll report back on your suggestions shortly. Thank you.
Very possible. I tried mbart and it didn't have this problem, albeit once it failed in inference with this:
The failure seems to be intermittent, so it's possible there is some race condition.
You're spot on with this suggestion: one gpu seems to want to run an additional forward for a specific number of samples, but not for others. I will investigate and report back.
I understood the problem. During inference we have an option to generate predictions which can then be scored against BLEU, etc. Different sequences may take a different number of forward passes to complete this task. So when one gpu finishes generating its predictions quicker than the others - say it decides, using some criteria, that it's done at a length of 10 tokens, whereas the others aren't done - it stops calling forward while the other gpus still depend on it participating, and they hang. Every gpu has to keep running forward until all of them have finished. To ensure that this is so, I hacked the code to complete the while loop till it hit `max_len`. I am not at all sure this hack will be acceptable as:
But I will totally understand if you don't have any brilliant ideas on how to overcome this hurdle, and we will find some way around it.
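To make the hack concrete, here is a minimal sketch of the kind of change described above, assuming a typical greedy-decoding loop; the function name, arguments, and the `.logits` output format are placeholders, not the actual HF Transformers `generate()` code:

```python
import torch

def greedy_generate_fixed_steps(model, input_ids, max_length, eos_token_id):
    # The hack: never break out of the loop early. Every rank keeps calling
    # forward until max_length, so all ZeRO-3 processes run the same number
    # of forward passes even if their own sequences finished long ago.
    finished = torch.zeros(input_ids.size(0), dtype=torch.bool, device=input_ids.device)
    while input_ids.size(1) < max_length:
        logits = model(input_ids).logits              # forward on every step, on every rank
        next_tokens = logits[:, -1, :].argmax(dim=-1)
        # keep already-finished sequences padded with eos so shapes stay consistent
        next_tokens = torch.where(
            finished, torch.full_like(next_tokens, eos_token_id), next_tokens
        )
        finished |= next_tokens.eq(eos_token_id)
        input_ids = torch.cat([input_ids, next_tokens.unsqueeze(-1)], dim=-1)
    return input_ids
```

The obvious cost is the wasted forward passes after every sequence on a rank has finished, which is what the rest of the thread discusses.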
Thank you @stas00 for digging into this. I am glad you were able to get to the core of the problem.
This makes sense. This is pretty much what I was expecting as well. Since ZeRO-3 is a single program multiple data (SPMD) approach to parallelism with coordinated data movement, all processes must be running the same program - in this case, the forward pass of the model on each process - for it to work correctly.
I agree that the hack is limiting, but I have a slightly different view on the "designed to work with any model" part. It seems that the code is actually designed to work only with single-GPU models, and is limited in that sense. As long as the model is single GPU, it will work, but it will not work with any multi-GPU model, regardless of whether it is ZeRO-3 or model parallel (tensor slicing) or pipeline parallel, since each of them requires some form of special treatment that is inherent in the parallelism itself. For example, model parallelism would require the data loader to give the same sample to all GPUs, and pipeline parallelism would require the data loader to give samples only to the first-stage GPU. A potential solution here could be to extend the code to support multi-GPU inference by allowing for adaptable variations based on the type of parallelism being used.
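As a toy illustration of that constraint (plain `torch.distributed` with invented step counts, not DeepSpeed code): if one rank stops issuing collectives while another still does, the second rank blocks, which is exactly what an uneven number of ZeRO-3 forward passes amounts to, since each forward gathers the partitioned parameters.

```python
import datetime
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size,
        timeout=datetime.timedelta(seconds=10),
    )
    steps = 3 if rank == 0 else 5        # rank 0 "finishes generating" earlier
    x = torch.ones(1)
    for _ in range(steps):
        dist.all_reduce(x)               # stands in for the collectives inside forward
    # rank 1's 4th all_reduce never completes: rank 0 has already left the loop,
    # so rank 1 times out here (or hangs forever without the timeout).
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```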
This, I think, can be mitigated to the point that the waste in resources is minimal. Two potential solutions:
I totally hear you, @samyam, that we need to adapt this code to support parallelization. I obviously wanted to hear first if we can avoid that ;) Thank you for your detailed reply! I totally agree! Wrt item 2, I think the main concern is for situations where `max_len` happens to be much bigger than needed for the whole batch, and then it's no longer about doing a few extra forward passes on a few gpus, but about running extra forward passes on all gpus. To try to explain better, let's say our `max_len` is 100 and all gpus completed their criteria for a predicted output in 50 tokens; now we are going to run unnecessarily for 50 more tokens on each gpu! Is there a way to signal that all gpus have reached their criteria and synchronize so they don't continue running forward? That is, there is a need for a new condition that, if it returns true, lets the loop be exited on all gpus in a synchronous way. Implementation-wise I'm thinking that will take 2 APIs.
With such an API in place, if all gpus finished their work in fewer steps than `max_len`, they could all exit the loop early instead of running to `max_len`.
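Purely to make the idea concrete, here is a hypothetical sketch of such a pair of calls; the names `report_local_done` and `all_ranks_done` are invented for illustration and are not existing DeepSpeed or Transformers APIs:

```python
import torch
import torch.distributed as dist

def report_local_done(done: bool, device: torch.device) -> torch.Tensor:
    # API 1: each rank reports whether its own stopping criterion is satisfied.
    return torch.tensor(0.0 if done else 1.0, device=device)

def all_ranks_done(local_flag: torch.Tensor) -> bool:
    # API 2: a synchronized check; returns True on every rank only once
    # every rank has reported that it is done.
    dist.all_reduce(local_flag, op=dist.ReduceOp.SUM)
    return local_flag.item() == 0.0
```

With something like this, the generation loop could `break` on all ranks at the same step instead of always running to `max_len`.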
Of course :) I thought a bit about whether this would be possible, but could not figure out a solution without a code change on the client side, and eventually realized that the core of the issue is support for different types of parallelism.
This makes sense. I was just pointing out that the likelihood of at least one of the samples generating all 100 tokens increases proportionally with the batch size, and ZeRO-3 should allow for much larger batch sizes. But since it's ok to make changes to the generation code pipeline, I completely agree with doing something smarter like the solution you have below.
This makes sense. We could maybe simplify by doing a single all_reduce, where gpus that are done will use a tensor with 0.0 and those that are not done will use 1.0. If the result of the all_reduce is 0.0 then everyone can stop; otherwise the gpus that are done will do a fake forward.
I think you can extend this even further, by checking for the early termination condition not at the end of a batch but at the end of the entire epoch, if you can store the results of the prediction for all the samples and process them at the very end. This would result in even fewer wasted forwards. I might not have understood your pipeline fully, though, so take this suggestion with a grain of salt.
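A minimal sketch of that single-all_reduce scheme, assuming `torch.distributed` is already initialized; `model`, `batch`, and `dummy_batch` are placeholders (the thread below settles on reusing the last valid input as the dummy):

```python
import torch
import torch.distributed as dist

def synchronized_step(model, batch, dummy_batch, i_am_done, device):
    # Done ranks contribute 0.0, ranks with real work contribute 1.0.
    still_working = torch.tensor(0.0 if i_am_done else 1.0, device=device)
    dist.all_reduce(still_working, op=dist.ReduceOp.SUM)

    if still_working.item() == 0.0:
        return None                      # everyone is done: all ranks stop together

    with torch.no_grad():
        if i_am_done:
            model(**dummy_batch)         # fake forward, only to keep the collectives aligned
            return None
        return model(**batch)            # real forward
```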
Thank you for the recipe, @samyam! I think it's safest to use the last valid input for the fake forward, to ensure that all gpus keep running an identical forward. I don't think in our particular setup we could do the synchronization on the epoch-level loop. I will save that for later; for now we really want to make the training fast and efficient, and inference possible. We want inference so that we can quickly eval the outcome of training.
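A sketch of that "last valid input" detail, with `model` and `eval_dataloader` as placeholders and `peers_still_working()` standing in for the all_reduce check sketched above:

```python
import torch

def eval_with_fake_forwards(model, eval_dataloader, peers_still_working):
    last_valid_batch = None
    with torch.no_grad():
        for batch in eval_dataloader:
            last_valid_batch = batch             # remember the most recent real input
            model(**batch)                       # real forward
        # This rank has no more real work; replay the last valid input so its
        # forwards stay shape-compatible while slower ranks finish theirs.
        while last_valid_batch is not None and peers_still_working():
            model(**last_valid_batch)            # fake forward
```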
Closing this for now, as I have a working solution. Thanks again, @samyam
This is a really old ticket and a ton of things have changed since then in both DeepSpeed and HF Transformers, so I'd recommend opening a new ticket explaining the problem, including the exact versions of the packages you're using and how the issue can be reproduced.
Hi @lavine-lmu, I am facing the same issue when trying to use zero3 to fine-tune m2m100. It is fine when I use zero2 to fine-tune m2m100. It is also fine when I use zero3 to fine-tune t5 with the same data. Did you manage to solve this error? I tried running the following command, as you also mentioned in huggingface/transformers#15570:

`deepspeed examples/pytorch/translation/run_translation.py`

It works with `ds_config_zero2.json` but it does not work when I use `ds_config_zero3.json`. Can you help please? :)
@evros-chris, as I suggested above, this is a really old Issue and any further comments are very likely irrelevant since the code base has changed since then. Please open a new issue with the full traceback of the error you experience, your environment, and any other details that can be used to reproduce the problem. Please avoid reporting just "it doesn't work" when filing bugs, since we have no idea what that means. You may tag me on the issue and I'd be happy to take a look. Thank you.
Thank you @stas00! I have opened a new issue and tagged you here:
So training works with zero3, and then I do inference calling `deepspeed.forward()`, and while it works on a very small sample, with a just slightly bigger sample it hangs with 100% gpu utilization:

The trace is from `faulthandler`, so please read in reverse.

I'm not sure if you have inference tests - maybe this can be reproduced with just `model.eval()`?

Config:

Thanks.
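For context, a rough repro-style sketch of the setup being described, using a toy model instead of the real seq2seq model and its generation loop; the config file name is assumed here, and the script would be launched with the `deepspeed` launcher:

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)                      # toy stand-in for the real model
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config_zero3.json",                 # assumed name for the ZeRO-3 config
)
engine.eval()                                      # inference after ZeRO-3 training
with torch.no_grad():
    sample = torch.randn(16, 8).to(engine.device)  # the "slightly bigger sample"
    out = engine(sample)                           # forward through the DeepSpeed engine
```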