Trouble with the backward pass in ZeRO 3 #846
Here's the full stacktrace
cc'ing my colleagues @sdtblck @ShivanshuPurohit
@StellaAthena, thank you for checking out Z3 Offload and pointing out this issue. I may have a potential fix and am working on it now. Will keep you posted.
@StellaAthena it will take us a few days to merge the fix in. In the meantime, can you try the branch with the fix: https://github.com/microsoft/DeepSpeed/tree/fix-misaligned-grad
@samyam I pulled that and it runs! Now time to figure out if my model is actually running faster...
Actually, it looks like the loss isn't going down, and those FLOPS/s/GPU numbers look extremely suspicious.
@StellaAthena that's strange... Did you just turn on Stage 3 in the config file, or did you also register the necessary external parameters? If you have not registered the external parameters yet, you can find instructions on which parameters need to be registered here: https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-3-offload If you have already done this, can you please point me to the commit that contains the changes you made for Z3? Then I can take a look and see if anything else is missing.
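For readers following along, here is a minimal sketch of what registering an external parameter looks like under ZeRO-3, based on the `deepspeed.zero.register_external_parameter` API that the tutorial describes. The module and layer names are hypothetical, not taken from this thread:

```python
import torch
import deepspeed


class TiedTransformer(torch.nn.Module):
    """Hypothetical parent module whose output projection reuses the
    embedding weight, i.e. a parameter owned by a different submodule."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden_size)
        self.body = torch.nn.Linear(hidden_size, hidden_size)
        # The embedding weight is accessed in this module's forward pass,
        # outside the submodule that owns it, so ZeRO-3 must be told to
        # gather it here as well.
        deepspeed.zero.register_external_parameter(self, self.embed.weight)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.body(self.embed(token_ids))
        # Tied output projection onto the shared embedding weight.
        return torch.nn.functional.linear(hidden, self.embed.weight)
```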
The commit history is a bit of a mess, but we made a consolidated squash commit here that shows the difference between our main branch and the ZeRO-3 integration. The code generally follows your Megatron repo, though one major difference is that we've reformulated and consolidated how we handle arguments. I'm using the config file. @salanki helps us with the hardware and network topology side of things.
@StellaAthena the model size you can run will depend on how much CPU memory you have with Offload. Generally, a 10B parameter model will take about 200 GB of CPU memory with offload. If you can give some more details on the system you are running (exact number of GPUs per node, number of nodes, exact amount of CPU memory per node), I can give you an estimate of the max model size you should be able to run with Z3 Offload.

Regarding your port of Z3, I think the issue might be that you are initializing some of the embedding parameters outside the class where the parameters were created. To do it correctly, you need to first gather those parameters before initializing them, as shown here: https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/Megatron-LM-v1.1.5-ZeRO3/megatron/model/language_model.py#L133. This was the very last step of our tutorial (https://www.deepspeed.ai/tutorials/zero/#training-trillion-scale-models-with-zero-3-offload), so it's easy to miss. From a cursory look at your code base, the places where you need to make this change are here:

If there are other places where you access parameters outside of the module where they were created, then you need to do the Gather there as well, unless the access happens in the forward pass; in that case it is handled by register_external_parameters. Please let us know if this fixes your issue.
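As a sketch of the gather-before-initialize pattern described above: `deepspeed.zero.GatheredParameters` is the real API used in the linked Megatron example, while the helper function and the particular init method below are hypothetical:

```python
import torch
import deepspeed


def init_shared_embedding(embedding: torch.nn.Embedding) -> None:
    """Hypothetical helper: initialize a weight from outside the module
    that created it while running under ZeRO-3.

    Under stage 3 the weight is partitioned across ranks, so it must be
    gathered before the in-place init and re-partitioned afterwards.
    modifier_rank=0 means only rank 0 writes to the gathered tensor; the
    modified values are propagated to the other ranks when the context
    manager exits.
    """
    with deepspeed.zero.GatheredParameters(embedding.weight, modifier_rank=0):
        torch.nn.init.normal_(embedding.weight, mean=0.0, std=0.02)
```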
Just to be clear, it means every time
Do I understand it correctly? Because
@ShivanshuPurohit your understanding is correct.

@StellaAthena I cloned your branch from here and was able to repro your results with the small-zero3.yml config. I do see that the loss does not drop, but I also noticed that the loss does not drop regardless of whether ZeRO-3 is enabled or ZeRO is disabled completely using stage 0. I also tried turning off DeepSpeed entirely, and the loss still doesn't drop even without DeepSpeed. So the issue does not seem to be related to DeepSpeed or ZeRO Stage 3.

I did notice that when you use DeepSpeed with the pipeline parallelism model using config/small.yml, the loss drops, but when I set pipeline_parallelism=0, the loss stays the same regardless of whether DeepSpeed is enabled or disabled. May I suggest trying out the Megatron example we created to work with ZeRO-3 from here.

Adding @ShadenSmith for visibility and any feedback he may have regarding the Megatron with DeepSpeed pipeline parallelism example. @StellaAthena I am assuming here that your current code is based off of this example?
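For context, here is a hypothetical minimal setup of the kind being A/B tested above. The exact config used in the repro is not shown in the thread; the point is that flipping zero_optimization.stage between 3 and 0, with everything else fixed, isolates whether ZeRO itself is involved:

```python
import torch
import deepspeed

# Hypothetical toy model standing in for the real network; only the
# config toggle matters for the A/B test described above.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        # For the comparison run, set "stage": 0 and drop the offload
        # entries below (they only apply to stage 3).
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

# Launched with the deepspeed launcher, e.g. `deepspeed train.py`.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```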
Very interesting. I ran I made a code diff between your example code and the code on our
Okay, so there are two potential failure points:
@ShivanshuPurohit can you take a look at this?
Can confirm. For some reason setting pipeline parallelism to 0 (currently on
Look at that beautiful learning curve! The problem was on our end: we were handling non-pipeline models incorrectly. Once we got that fixed, ZeRO-3 ran straight away. The model is still not as efficient as I had hoped (6.1e12 FLOPS/s/GPU), but this is with extremely unoptimized settings. Time to do benchmarking!
@StellaAthena this is great to hear! Please do keep us posted on the benchmarking results. :)
Please reopen if there are further issues.
I have a custom Megatron model and a corresponding custom DeepSpeed. I believe that I have incorporated your recent update correctly, but when I try to train a ZeRO 3 model I get the error:
RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0.
When I turn off CPU Adam, I instead get this error:
RuntimeError: start (0) + length (174763) exceeds dimension size (174761)
I notice in both cases the shape of a tensor seems to be off by 2, but I have no idea what's causing this. My code is overall extremely similar to yours, though as I note at deepspeedai/DeepSpeedExamples#92 I cannot get your code to run either (though for different reasons).
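Given the name of the fix branch above (fix-misaligned-grad) and that both errors are off by two, the symptom is consistent with a partition-alignment problem: a flattened tensor padded so it splits evenly across ranks on one code path, but used unpadded on another. The following is only an illustration of that error class under that assumption, not DeepSpeed's actual partitioning code:

```python
import torch


def pad_and_partition(flat: torch.Tensor, world_size: int):
    """Illustrative only: pad a flattened tensor so it divides evenly
    into one shard per rank, returning the shards and the padded length."""
    remainder = flat.numel() % world_size
    pad = 0 if remainder == 0 else world_size - remainder
    padded = torch.cat([flat, flat.new_zeros(pad)])
    return list(padded.chunk(world_size)), padded.numel()


flat_grad = torch.randn(10)                                       # length not divisible by world_size
shards, padded_len = pad_and_partition(flat_grad, world_size=4)   # padded length is 12

# If one code path uses the padded length against the unpadded tensor,
# you get exactly this class of error:
try:
    flat_grad.narrow(0, 0, padded_len)
except RuntimeError as err:
    print(err)  # start (0) + length (12) exceeds dimension size (10)
```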