[BUG] SageMaker p3.16xlarge failure on running HuggingFace tutorial: FAILED: multi_tensor_adam.cuda.o
#1435
Comments
Full stack-trace:
Hey @franckjay, could you please tell us more about your environment? The title says you are using SageMaker, but from your reproduction steps it seems you are starting an EC2 instance and running training on it, which is not the same as running training on SageMaker. Can you please share which AMI you used?
@philschmid, I apologize for the confusion. Yes, we are running this in SageMaker.
Could you share your
Ah okay, so you are not using SageMaker Training jobs; you are using a SageMaker notebook instance and then executing the DeepSpeed command in it? Or which service are you using?
Exactly!
When creating your notebook instance, did you choose the AL1- or AL2-based image? It is possible that the notebook instance uses old/out-of-date dependencies, e.g. for gcc.
@philschmid, thanks for helping with this issue. @franckjay, similar to @philschmid's suspicions, I also think you are using old compiler tools. I noticed the following in your log. Can you please try newer versions of the compiler tools?
@philschmid, yes, I am on AL1. Will AL2 solve this issue?
@franckjay, it appears AL2 has gcc 7.3, which should be new enough to compile our kernels.
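For anyone hitting this, a quick way to sanity-check the toolchain before retrying is sketched below (a suggestion, not taken from this thread; it assumes DeepSpeed is already installed so that ds_report is on the PATH):

gcc --version          # DeepSpeed's JIT-built CUDA ops need a reasonably modern g++ with C++14 support
nvcc --version         # CUDA toolkit used to JIT-compile the multi_tensor_adam kernel
ds_report              # DeepSpeed's own summary of which ops can be built on this machine
python -c "import torch; print(torch.version.cuda)"   # should line up with the nvcc toolkit version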
@philschmid, @jeffra, and @tjruwase, thank you for the help! Spinning up an instance with AL2 worked perfectly. You are all wizards of the highest order.
I have the same bug; how can I solve the problem?
Describe the bug
I am trying to reproduce the HuggingFace + DeepSpeed training example from https://huggingface.co/transformers/main_classes/deepspeed.html on a SageMaker p3.16xlarge instance (8 Tesla V100s). However, we cannot seem to fix the
FAILED: fused_adam_frontend.o
and
FAILED: multi_tensor_adam.cuda.o
errors. It may also be related to our gcc version (?):

We have tried to install deepspeed from:
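One way to make this failure easier to diagnose (a suggestion on my part, not something reported in the issue) is to pre-build the failing op at install time via DeepSpeed's documented DS_BUILD_* flags, so the compiler error appears during pip install rather than at JIT time:

DS_BUILD_FUSED_ADAM=1 pip install deepspeed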
Unfortunately, for security reasons we do not have root access on this instance, so we cannot directly upgrade the CUDA/gcc versions, which is what appeared to resolve related issues such as #694 and microsoft/DeepSpeedExamples#85.
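If root access really is unavailable, one possible user-space workaround (an untested assumption, not something from this thread) is to install a newer compiler into a conda environment; the conda-forge compiler packages export CC/CXX on activation, which the PyTorch C++ extension JIT build can pick up:

conda create -n ds-build -c conda-forge python=3.8 gxx_linux-64 -y    # hypothetical environment name
conda activate ds-build
echo $CXX && $CXX --version    # confirm the newer g++ is the one being picked up before re-running deepspeed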
To Reproduce
Steps to reproduce the behavior:
translation.py
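For reference, a minimal sketch of how such a run is typically launched (assuming the HF run_translation.py example script and a local ds_config.json written per the linked documentation; the exact script and arguments used in this issue are not shown):

deepspeed run_translation.py \
    --deepspeed ds_config.json \
    --model_name_or_path t5-small \
    --source_lang en --target_lang ro \
    --dataset_name wmt16 --dataset_config_name ro-en \
    --do_train --output_dir /tmp/ds_output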
Expected behavior
A very fast training time. Using
python -m torch.distributed.launch
without DeepSpeed runs as expected in our environment.

ds_report output
System info (please complete the following information):
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
-> (7, 0)

Launcher context