Error building extension 'fused_adam' with DeepSpeed==0.3.13 #885
Here is the config file that I'm using for DeepSpeed -
When I use
I created a colab notebook; it took quite a lot of trial and error to figure out the right versions of everything needed to make DeepSpeed compile. You can see the exact setup in the notebook. I will try to keep it up to date as colab changes its setup, so ping me (or file an issue) if it stops working and needs a tune-up. I documented the critical components for successfully building deepspeed here.
Hi @stas00, I ran your notebook in colab and it gave this error -
Here is the full notebook with outputs. None of the critical components that you mentioned for building deepspeed apply in my case, as I'm using the system-wide cuda version while installing torch. Also, I don't have multiple cuda versions on my system, and I'm using
This is odd, since I have just re-run my notebook on the free version of colab and it didn't have any problems. As you may have noticed, you made progress: you managed to build the deepspeed extensions using this notebook, but then something killed the process immediately after it built the extension. So you now have a correct combination of packages. Try to re-run that last cell, since the extension is now built and cached (that is, if you're in the same session; if not, start a new one and re-run the cell a second time if it dies the first time). In theory everybody gets mostly the same environment, but perhaps that's not so. Could you monitor that your disk space and RAM are not at 100%? Perhaps the watchdog kills the process when resources are exhausted. I'm curious what happens if you run the training cell the 2nd time.
You're absolutely correct, @stas00. RAM is reaching 100% and the process is getting killed. I ran it a 2nd time by changing
Also, I'm wondering how this setup is related to my issue. Are you suggesting I upgrade my system-wide cuda to 11?
Glad you figured it out! We don't know if you get the same environment as I do. Actually it looks pretty random. I just tried 2 different notebooks and in the deepspeed one it gave me 25GB RAM and in another one only 12GB! Run a cell with:
I will add this to the notebook with a note, so others will know.
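The exact monitoring cell isn't preserved in this thread. A stdlib-only sketch (hypothetical, not the author's cell) that reports total and available RAM on a Linux VM such as colab's could look like this:

```python
import os

# Query physical memory via POSIX sysconf (Linux-only keys).
page_size = os.sysconf("SC_PAGE_SIZE")
total_gb = os.sysconf("SC_PHYS_PAGES") * page_size / 1024**3
avail_gb = os.sysconf("SC_AVPHYS_PAGES") * page_size / 1024**3
print(f"RAM: {avail_gb:.1f} GiB free of {total_gb:.1f} GiB")
```

If available RAM approaches zero while the extension is compiling, the watchdog-kill theory above would be confirmed.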
Not at all. You just need to have the same cuda version as the one your pytorch was built with, so install a pytorch build that matches your system-wide cuda, and make sure you've followed https://huggingface.co/transformers/main_classes/trainer.html#installation-notes
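To make the "same cuda as your pytorch was built with" check concrete: the torch-side version comes from `torch.version.cuda`, and the system-side version from `nvcc --version` or the `/usr/local/cuda-X.Y` directory name. A small hypothetical helper (only the major.minor parts need to match) might be:

```python
def cuda_versions_match(torch_cuda: str, system_cuda: str) -> bool:
    """Compare CUDA versions on major.minor only, e.g. '10.1' vs '10.1.243'.

    torch_cuda would come from torch.version.cuda; system_cuda from
    `nvcc --version` or the /usr/local/cuda-X.Y directory name.
    """
    def major_minor(v: str):
        return tuple(int(x) for x in v.split(".")[:2])
    return major_minor(torch_cuda) == major_minor(system_cuda)

print(cuda_versions_match("10.1", "10.1.243"))  # True
print(cuda_versions_match("10.1", "11.1"))      # False
```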
You can probably close this issue now.
Hi @stas00, previously my paths were -
and now they are modified as -
I have my cuda-10.1 in /usr/local/, so the above paths are correct.
I also tried this from a similar issue, but it didn't work.
Oh, so it's not colab that you're trying to get it to work on. OK!
Can you build it after downgrading to torch-1.7.1+cu101? This is just to validate that deepspeed master is not at fault, since you had it working with 0.3.10, but as you can see you changed the pytorch version as well. Where did you find torch-1.8.0+cu101? I see only 10.2 or 11.1 at https://pytorch.org/get-started/locally/. Alternatively, if your nvidia driver supports it, move to 11.1; you also get a better cudnn along with the newer drivers/cuda, which you'd want to upgrade too then. I use cuda-11.1 at the moment and it works well. If you want to save the hassle of upgrading to 11.1 and keep 10.1, I'd pre-build from source, since that makes it easier to identify any problems. See the details here: My build script is:
You just need to adjust the arch list to match your hardware, and maybe -j to match how many parallel make jobs you'd like to run. It does a develop install. You can remove
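The script itself isn't reproduced in this thread. For context, the "arch list" being adjusted is the `TORCH_CUDA_ARCH_LIST` environment variable, which is a semicolon-separated list of CUDA compute capabilities. A hypothetical helper showing the expected format:

```python
def make_arch_list(capabilities) -> str:
    """Format compute capabilities, e.g. [(6, 1), (8, 6)], into the
    semicolon-separated string TORCH_CUDA_ARCH_LIST expects: '6.1;8.6'."""
    return ";".join(f"{major}.{minor}" for major, minor in sorted(set(capabilities)))

# e.g. building for a GTX 1080 Ti (6.1) and an RTX 3090 (8.6):
print(make_arch_list([(6, 1), (8, 6)]))  # 6.1;8.6
```

Restricting the list to just your card's architecture also shortens the compile considerably compared to building for every arch.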
Hi @stas00 ,
I tried this and am still facing the same issue.
I downloaded
If this downloads and installs modules from external sources, that could be a problem: my VM doesn't have open internet access and has to go through my company's firewall. If that (downloading from external sources) is the case, I may not be able to pre-build from source.
I want this to be the last option, as it's not in my control and I'd have to contact another team to upgrade. Meanwhile, I tried prebuilding in colab with different combinations; all of those worked fine, and you can find the detailed outputs here
OK, so since you prebuilt from source on colab (thank you for sharing the outcomes), you now know what's involved. It'll install dependencies just like when you don't pre-build from source. So if you are able to do
Here is yet another approach to consider. Build a binary wheel on whatever normal machine has a similar cuda setup:
adjust
Now you have
Now you can install it on your VM, and you don't need to build anything at run time; you just do:
I presume you will already have the other dependencies installed, since you already did that for
I wonder if DeepSpeed should document this approach on their advanced install page.
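The exact commands were lost from the comment above. A hedged reconstruction of the wheel workflow, assuming a DeepSpeed checkout, the generic setuptools `bdist_wheel` target, and DeepSpeed's `DS_BUILD_*`/`TORCH_CUDA_ARCH_LIST` build knobs (the wheel filename is illustrative):

```shell
# On the build machine (similar CUDA setup to the target VM):
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
# Pre-compile all ops for your GPU arch into the wheel;
# adjust TORCH_CUDA_ARCH_LIST to your hardware.
TORCH_CUDA_ARCH_LIST="6.1" DS_BUILD_OPS=1 python setup.py bdist_wheel
# The wheel lands in dist/, e.g. dist/deepspeed-<version>-<tags>.whl

# On the target VM (nothing gets compiled at run time):
pip install deepspeed-*.whl
```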
Thanks a lot, @stas00, it finally worked. As colab has python-3.7, I replicated what you said on my AWS EC2 instance, where I had many CUDA versions, including 10.1 and 11.1. Reiterating the steps I followed, so they can help someone with similar issues -
Verify torch versions with
Check whether compatible ops were installed with
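The verification commands were stripped from the steps above. Plausible reconstructions (hedged: these are the standard torch/DeepSpeed diagnostics, not necessarily the exact commands used) would be:

```shell
# Show the installed torch version and the CUDA it was compiled against:
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# DeepSpeed ships a diagnostic CLI that reports which ops are
# compatible with / installed for the current environment:
ds_report
```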
Awesome! Thank you for the report, @saichandrapandraju. Except you don't need step 4; step 5 is all you need after you've cloned the repo. Step 4 is for when you want to install it locally, and is similar to steps 5+6, but you don't get a wheel to take to another machine. I had the same issue with fairscale on several setups: no matter what I tried, it wouldn't build at runtime, but prebuilding into a wheel and installing that worked.
BTW, I do recommend you use an explicit
Yes. In my case both the build and target machines are the same, so I didn't use
But yeah, it's always better to mention it explicitly. For reference, I used
I'm not very sure, but I thought my device architecture is
Not sure whether this is the correct way to check. Maybe @stas00 can confirm.
That's the correct way:
You can also find the full list of all archs at https://developer.nvidia.com/cuda-gpus
Incidentally, I have just added all this information to the docs; hopefully it will be merged in the next few days:
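In torch, the architecture check being discussed returns a (major, minor) tuple from `torch.cuda.get_device_capability()`. A small torch-free sketch (the family names are my addition, based on NVIDIA's published compute capabilities) turning that tuple into the arch-list entry:

```python
# Compute-capability major version -> architecture family (partial table).
FAMILIES = {6: "Pascal", 7: "Volta/Turing", 8: "Ampere"}

def describe_capability(major: int, minor: int) -> str:
    """E.g. (7, 5) -> '7.5 (Volta/Turing)'. The '7.5' part is what goes
    into TORCH_CUDA_ARCH_LIST; in torch the tuple would come from
    torch.cuda.get_device_capability()."""
    family = FAMILIES.get(major, "unknown")
    return f"{major}.{minor} ({family})"

print(describe_capability(7, 5))  # 7.5 (Volta/Turing)
```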
Also added a deepspeed PR with various docs, including how to build the binary wheel: #909
Hi,
I upgraded DeepSpeed to 0.3.13 and Torch to 1.8.0, and while using DeepSpeed with HF (HuggingFace) I'm getting the below error - RuntimeError: Error building extension 'fused_adam' - and here is the stacktrace -
Versions that I'm using are -
transformers==4.4.2
DeepSpeed==0.3.13
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
But I was able to run DeepSpeed-0.3.10 with HuggingFace-4.3.2 and Torch-1.7.1+cu101 without any issue. Please suggest how to proceed further.