Docker Multi-GPU DDP training hang on destroy_process_group() with wandb option 3 #5160
@Yoon5 thanks for the bug report! If you uninstall wandb before training, does the issue still occur? |
@AyushExel I did some testing and it seems like wandb may be causing issues with DDP. I train all of my DDP models already logged in, but if not logged in and presented with the 1, 2, 3 options query, training may crash as above, or, if training completes, the process group is not destroyed and the system hangs. The steps I used to reproduce this on a 2-GPU training are below. Can you try to reproduce on your end?
# Pull image
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
# Train 3 epochs COCO128 with DDP
python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3
The hang looks like this, and seems to occur with wandb installed and enabled: EDIT1: Summary is here:
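(For reference, removing wandb inside the container before training would presumably look like the line below; the exact command was not preserved in the archived comment, so treat it as an assumption.)
# Assumed example: uninstall wandb inside the running container before training
pip uninstall -y wandb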
|
Thank you I will try))))) |
I tried the lines above (# Pull image, # Train 3 epochs COCO128 with DDP, python -m torch.distributed.launch --nproc_per_node 1 --master_port 2 train.py --data coco128.yaml --epochs 3). I do not have wandb; I did not install it in my environment. And I got this after running:
~/Desktop/yolov5-master$ t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all $t
=============
|
@Yoon5 before you do anything you need to update your NVIDIA drivers, as your error message states: |
@AyushExel I'm manually pushing a new |
Thank you |
@glenn-jocher I'm testing this now |
@glenn-jocher The problem doesn't occur for me. I'm running on 2 T4 GPUs and the program exited fine. I've tried this 2 times. Full trace:
|
@AyushExel oh interesting. Can you try again and enter option 3 when prompted? |
@AyushExel also I just noticed in your output your training is only using 1 GPU. When you use multiple devices they will be listed together. Ah sorry, I see my command to reproduce above was incorrect. This is the correct 2-gpu training command:
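(The corrected command itself is missing from the archived comment; a minimal sketch, assuming the standard torch.distributed.launch invocation for 2 GPUs with an arbitrary free port, would be:)
# Assumed 2-GPU DDP training command (reconstruction, not the original text)
python -m torch.distributed.launch --nproc_per_node 2 --master_port 1234 train.py --data coco128.yaml --epochs 3 --device 0,1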
|
@glenn-jocher thanks. I tried it again. It's not getting stuck. Here's the traceback:
|
@glenn-jocher Ok I was able to reproduce. It occurs on manually choosing option 3. I think I know the source of the problem. I'll push a fix |
@glenn-jocher ok I found the root cause of the problem. The import checks are happening in loggers/__init__.py which makes the checks at wandb_utils.py redundant. I've moved the checks to __init__.py now. The PR should fix the problem. Also, it'd be nice to catch these problems early on during CI checks but it's mostly limited because there's no backdoor to stop/resume runs during tests. I think setting up a revamped testing suite to test DDP, integrations etc. might be worth it! |
@Yoon5 good news 😃! Your original issue may now be fixed ✅ in PR #5163 by @AyushExel. To receive this update:
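(The list of update steps was truncated in the archive; the usual YOLOv5 update paths, sketched here as assumptions rather than a quote of the original comment, are:)
# Git: update an existing clone, or re-clone fresh
git pull            # from inside the yolov5/ directory
git clone https://github.com/ultralytics/yolov5
# Docker: pull the latest image
sudo docker pull ultralytics/yolov5:latest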
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀! |
I have the same problem using Docker. However I run the container using the env variable WANDB_API_KEY, which does not require wandb login. Using this method, the training hangs at the end |
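(For context, passing the key non-interactively presumably looks like the line below; the exact docker run flags are an assumption, not quoted from the reporter.)
# Assumed example: supply the W&B API key via an env variable so no interactive `wandb login` is needed
sudo docker run -it --ipc=host --gpus all -e WANDB_API_KEY=<your-key> ultralytics/yolov5:latest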
@Zegorax @AyushExel this issue should be resolved in #5163, so please ensure you are using the very latest code or Docker image. To pull the latest Docker image use sudo docker pull ultralytics/yolov5:latest |
@glenn-jocher I'm using the latest version of the code, the problem is still present even with fix #5163 |
@Zegorax are you reaching the end of training? I tried to reproduce this in the latest version of the repo and I'm getting this error:
This seems unrelated to W&B as I've disabled it |
@AyushExel EDIT: this comes back to our general lack of DDP CI. It's an open issue; we still don't have a solution for this. |
@glenn-jocher @AyushExel It's the exact same problem as the original issue. The training finishes, but the process hangs and never returns |
@Zegorax you're probably on an older version of the repo, because the latest version has another bug which won't let the training start. Try running: |
@glenn-jocher sure. Let me know once the issue is fixed and I'll try to confirm if the wandb issue still exists |
As I said earlier, no I'm not. I'm using the latest YOLOv5 on master branch. |
@AyushExel
Please wait 15 min for Docker Autobuild to complete and deploy this latest merge, then update your Docker image with sudo docker pull ultralytics/yolov5:latest
|
@Zegorax @glenn-jocher I just tested using the latest master branch; I can run DDP with wandb disabled without any hang.
|
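(For reference, two standard ways to disable wandb for a run; these are general wandb options and were not spelled out in the comment above.)
# Disable wandb globally via its CLI before training
wandb disabled
# ...or disable it for a single run via an environment variable
WANDB_MODE=disabled python -m torch.distributed.launch --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1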
@AyushExel Can you check again using Docker and the env variable I've mentioned earlier? |
@Zegorax just tested the latest docker image
|
@AyushExel Can you try to reproduce it using a zero-interaction method (DEBIAN_FRONTEND=noninteractive) and by using only predefined options when launching the script? |
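(A minimal sketch of such a zero-interaction run, assuming the image's default working directory contains train.py; the exact flags are assumptions, not quoted from this thread.)
# Assumed fully non-interactive run: no prompts from apt or wandb
sudo docker run --ipc=host --gpus all \
    -e DEBIAN_FRONTEND=noninteractive \
    -e WANDB_API_KEY=<your-key> \
    ultralytics/yolov5:latest \
    python -m torch.distributed.launch --nproc_per_node 2 train.py --data coco128.yaml --epochs 3 --device 0,1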
@Zegorax I can't repro. Will you please paste your output? |
@AyushExel The training happens normally. Only at the end, the process never returns and I have to ctrl-c manually (Therefore, the Jenkins job runs forever)
|
@Zegorax that's very strange. On disabling wandb, you should not see wandb termlogs |
@AyushExel Should I create a new issue? Because I need to have WandB enabled |
@Zegorax oh okay.. I thought we were just talking about wandb disabled. I'll check with wandb enabled |
@Zegorax it worked with wandb enabled
What version of wandb client are you using? Please try to update it using |
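(The upgrade command was truncated above; assuming pip, the standard form is:)
# Upgrade the wandb client to the latest release
pip install -U wandb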
@AyushExel I'm also using the latest version of W&B. My system is based on a Jenkins job, so everything is always re-installed at each run, and using the latest version of all repos |
@AyushExel Can you try to repro using a non-interactive environment? By setting WANDB_API_KEY=your-key, for example. |
@AyushExel Have you been able to reproduce the problem? |
@Zegorax yes I ran this in a non-interactive docker environment and the process finished successfully.
|
I'm also seeing this behavior; I thought it was because I'm training on 2x A100. |
@Davidnet you should be able to train DDP 8x A100 successfully in Docker. Can you verify your error is reproducible with the latest Docker image and provide @AyushExel steps to reproduce please? Thanks! |
@Davidnet yes, please. I'm curious to reproduce this so I can get someone to look into this asap. Please verify with wandb enabled and disabled. If the error is caused by wandb, it should only occur when wandb is enabled. Fixing all DDP problems is a very high priority for us. |
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Access additional YOLOv5 🚀 resources:
Access additional Ultralytics ⚡ resources:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐! |
Hello, when I try to train using multi-GPU based on the Docker image, I get the error below. I use Ubuntu 18.04 and Python 3.8.