YOLOv5 issues with torch==1.12 on Multi-GPU systems #8395
Comments
👋 Hello @glenn-jocher, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution. If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you. If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available. For business inquiries or professional support requests please visit https://ultralytics.com or email [email protected].

Requirements
Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
I wonder how a bug this big made it into the release.
@glenn-jocher I cannot confirm this. Tested on GPU servers with 2, 4, and 8 GPUs (RTX 2080 Ti/A6000).
@AyushExel it's probably not a torch bug but instead related to our specific implementation for selecting devices in select_device(), which relies on defining CUDA_VISIBLE_DEVICES in the workspace before torch reads it (see lines 52 to 86 at commit 8983324).
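A minimal sketch of that pattern (not the actual select_device() implementation; the helper name and flow are illustrative only): the environment variable is written inside the running process, after torch has already been imported, and torch is expected to honor it on the first CUDA query.

import os
import torch

def pick_device(device=""):  # illustrative stand-in for select_device(), e.g. device="0" or "0,1"
    if device:
        # mask out every other GPU before any CUDA call is made
        os.environ["CUDA_VISIBLE_DEVICES"] = device
    # torch<=1.11 honors the mask set above; per this thread, torch==1.12.0
    # reads the variable earlier, so device_count() can still report all GPUs
    n = torch.cuda.device_count()
    return torch.device("cuda:0" if n else "cpu")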
If I run the reproduce command above in the Docker image:
@UnglvKitDe you tested a single-GPU training command above?
@glenn-jocher Not exactly, but in the end the same. I tested it on coco128 (torch 1.12 and CUDA 11.6).
@UnglvKitDe oh, strange. This is basically the same command I used, but I used device 7. If I try --device 0 I also get the bug though. Well ok, I'll experiment some more. In my experiments 1.11 works correctly but 1.12 does not (master branch in Docker) with CUDA 11.3.
It appears that this problem occurred because of a change in the timing of reading environment variables in PyTorch 1.12. My GPU environment is here:

CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

The following code worked correctly in version 1.11:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

In version 1.11, the output of this code is below:

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB

But version 1.12's output is below:

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

As additional information, the environment variable change worked correctly when done before importing torch:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

Output is below:

Using GPU is CUDA:1
CUDA:0 NVIDIA GeForce RTX 3090, 24268.3125MB
@mjun0812 interesting, thanks for the info! Unfortunately we can't specify the environment variables before loading torch. Does re-loading torch after defining the environment variables have any effect? i.e.:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")
@glenn-jocher Thank you for your reply! Your suggested code:

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and

import os
import torch

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

del torch
import torch

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

and

import os
import torch
import importlib

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

importlib.reload(torch)

# print using GPU Info
print(f"Using GPU is CUDA:{os.environ['CUDA_VISIBLE_DEVICES']}")
for i in range(torch.cuda.device_count()):
    info = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} {info.name}, {info.total_memory / 1024 ** 2}MB")

The outputs of all of the above were the same:

Using GPU is CUDA:1
CUDA:0 NVIDIA RTX A6000, 48685.3125MB
CUDA:1 NVIDIA GeForce RTX 3090, 24268.3125MB

This issue could not be resolved...
@mjun0812 too bad. Yes, let me know if you find a solution!
@glenn-jocher I raised this issue in the PyTorch repository, and it has been fixed in the latest master branch. Therefore, it may be necessary to modify requirements.txt lines 12 to 13 at commit fdc9d91:

+ torch>=1.7.0,!=1.12.0  # https://github.com/ultralytics/yolov5/issues/8395
+ torchvision>=0.8.1,!=0.13.0  # https://github.com/ultralytics/yolov5/issues/8395

I am ready to make a pull request for the above fixes.
@mjun0812 got it, thanks for the update! I see your PR, will take a look there.
@AyushExel torch 1.12 issue resolved in upcoming torch 1.12.1, so our fix is to simply exclude 1.12 in requirements.txt in #8497. Problem solved :)
How did you guys fix the PyTorch error? Should I install 1.12.1?
@DaliaMahdy unfortunately 1.12.1 is not out currently; the latest stable is 1.12.0. You can install a nightly build to resolve the issue, or simply use 1.11.0.
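A hedged sketch of the pin-to-1.11.0 option (the torchvision 0.12.0 pairing is an assumption based on the usual torch/torchvision version matching; nightly installs use PyTorch's nightly index and are omitted here):

pip install torch==1.11.0 torchvision==0.12.0  # pin the last unaffected stable release
pip install "torch>=1.7.0,!=1.12.0" "torchvision>=0.8.1,!=0.13.0"  # or mirror the requirements.txt exclusion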
Search before asking
YOLOv5 Component
Training, Multi-GPU
Bug
All GPUs are utilized by torch 1.12 with current YOLOv5 master when a single-GPU command is run, i.e.:
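An illustrative command of that shape, based on the details mentioned elsewhere in this thread (coco128, --device); the exact flags and values are assumptions, not the original report's command:

python train.py --data coco128.yaml --weights yolov5s.pt --epochs 3 --device 0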
Error does not occur with torch==1.11
@AyushExel FYI
Environment
Docker image
Minimal Reproducible Example
Additional
Temp workaround is to use torch 1.11
Are you willing to submit a PR?