Error building extension 'cpu_adam' (Hugging Face integration) #1127
Update: Following the steps taken in https://seo-explorer.io/blog/configuring-centos-7-to-finetune-eleutherai-gpt-neo-2-7b-with-torch-and-deepspeed/ I managed to get the `cpu_adam` extension to build. However, as mentioned in that issue, I too experienced a significant (2x) decrease in training speed. Does anyone have an idea of why this might be happening?
Hi again, another update: After coming back to the issue after some time, I realized that when running the `nvcc` command, the gcc compiler being used was not the one I installed system-wide, but rather the one that came with Anaconda, which was pointed to by a softlink.

I managed to get the extension to build by re-pointing the softlink used by `nvcc`. This solved my issue, though there are two things worth noting:

- Re-pointing the softlink to the system-wide gcc compiler did not work; it threw all kinds of strange errors.
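In case it's useful, here is a rough way to check which `gcc` and `nvcc` your environment actually resolves before rebuilding DeepSpeed (a sketch only; the paths and versions it prints depend entirely on your conda/system setup):

```python
# Quick check of which compiler binaries the current environment resolves,
# run from the same environment used to build DeepSpeed. In a setup like the
# one described above, "gcc" may resolve to Anaconda's bundled compiler
# rather than the system-wide one.
import shutil
import subprocess

for tool in ("gcc", "nvcc"):
    path = shutil.which(tool)
    print(f"{tool} -> {path}")
    if path:
        out = subprocess.run([path, "--version"], capture_output=True, text=True)
        print(out.stdout.splitlines()[0] if out.stdout else out.stderr)
```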
Hope this helps someone out there! :D
Hi @Gabriel-Macias - thank you for sharing your results and confirming that you were able to get this to work. I'm going to close this issue as it is stale, and with changes to python/nvcc/cuda/etc., it's less likely that someone would hit the same issue; closing as resolved.
Hi guys,

I'm trying to use the `cpu_offload` feature of the `DeepSpeed` integration with `HuggingFace`'s `Trainer` on a single GPU on AWS SageMaker (`ml.p2.xlarge` instance). However, I've been struggling for quite some time to get it to work properly. Here are the current versions I'm using:

I've taken a look at similar issues (#889, #694, #885) but haven't had any success so far. So far I have tried:

- pre-building the DeepSpeed ops when installing (with `DS_BUILD_OPS=1` and `DS_BUILD_CPU_ADAM=1`)

Here is a simplified version of the code I'm running on SageMaker:
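In outline, the setup looks like the sketch below; the model checkpoint, dataset, and hyperparameters are placeholders rather than the exact values from this report, and it assumes a `ds_config.json` file like the one discussed next:

```python
# Minimal sketch of an HF Trainer run with DeepSpeed on a single GPU.
# Model name, dataset, and hyperparameters are placeholders.
# Typically launched as: deepspeed --num_gpus=1 train.py
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/gpt-neo-1.3B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
train_ds = train_ds.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,
    deepspeed="ds_config.json",  # path to the DeepSpeed JSON config (sketched below)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```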
My json config file:
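In outline again, a ZeRO stage-2 configuration with CPU offload of the optimizer state looks roughly like the following; it is shown here as a Python dict purely for readability, and the values are illustrative rather than the exact contents of the original file:

```python
# Illustrative ZeRO stage-2 settings with optimizer CPU offload; the same
# keys and values go into the ds_config.json referenced above. Note that
# "cpu_offload" is the older DeepSpeed spelling; newer releases use
# "offload_optimizer": {"device": "cpu"} instead.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        "allgather_bucket_size": 2e8,
        "reduce_bucket_size": 2e8,
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 3e-5, "betas": [0.9, 0.999], "eps": 1e-8},
    },
    "gradient_accumulation_steps": 8,
    "train_micro_batch_size_per_gpu": 1,
}
```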
The output of running `ds_report`:

This is the stack trace I get if I try to run the code without pre-building:

And finally, the stack trace I get if I try to pre-build while installing with `DS_BUILD_CPU_ADAM=1 pip install deepspeed`:

Any input on how to solve this issue would be very much appreciated!
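For anyone debugging this outside of a full training job: assuming a DeepSpeed version that exposes its op builders under `deepspeed.ops.op_builder` (older releases may lay this out differently), a quick standalone check looks roughly like this:

```python
# Rough check of whether the cpu_adam op is already prebuilt or will be
# JIT-compiled the first time it is used.
import deepspeed
from deepspeed.ops.op_builder import CPUAdamBuilder

builder = CPUAdamBuilder()
print("DeepSpeed version:", deepspeed.__version__)
print("cpu_adam compatible with this environment:", builder.is_compatible())
# builder.load() triggers the JIT build and surfaces the same compiler/nvcc
# errors as the stack traces above, without running a full training job.
```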