using the model with four GPUs #4
I am trying to run the model with 4 GPUs and get the following error:

However, the model works when I use one GPU. Any comments?

Comments
@abidmalikwaterloo the code does not support multi-GPU yet unfortunately. I only have a single-GPU machine so I have not been able to debug this issue. If you come up with a solution please advise me, or submit a pull request. Also, see ultralytics/yolov3#21 for details. https://github.com/ultralytics/yolov3 is the base repo that this repo was built off of, so when the issue gets fixed there I can port the solution here.
@glenn-jocher Thanks. Working on it.
@glenn-jocher did you try to distribute the model using an MPI library? I am thinking of using Horovod for this. Do you think the problem we have now is due to torch's internal distributed model, and that it would not be an issue once we use an MPI framework?
I was able to parallelize the model using Horovod and ran it on 3 nodes. However, the data is not being distributed: it should be divided among the three nodes, which would reduce the number of iterations per epoch, but this is not happening. One of the Horovod examples (ResNet-50 with ImageNet) distributes the data like this:

```python
kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}

train_dataset = \
    datasets.ImageFolder(args.train_dir,
                         transform=transforms.Compose([
                             transforms.RandomResizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))

# Horovod: use DistributedSampler to partition data among workers. Manually specify
# `num_replicas=hvd.size()` and `rank=hvd.rank()`.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)
```

I am doing the following:

```python
# Get dataloader
dataloader = ListDataset(train_path, batch_size=opt.batch_size, img_size=opt.img_size,
                         targets_path=targets_path)

# For Horovod
kwargs = {'num_workers': 1, 'pin_memory': True} if cuda else {}
train_sampler = torch.utils.data.distributed.DistributedSampler(
    dataloader, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    dataloader, batch_size=opt.batch_size, sampler=train_sampler, **kwargs)
```

Do you think this makes sense?
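For context, in the Horovod examples the DistributedSampler is only one piece: the script also initializes Horovod, pins each process to one GPU, wraps the optimizer so gradients are averaged across workers, and broadcasts the initial state from rank 0. A minimal sketch of those surrounding steps, assuming a generic model and training loop (`build_model`, `num_epochs`, and the loss handling are placeholders, not this repo's actual code):

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU / node slot
torch.cuda.set_device(hvd.local_rank())      # pin this process to its local GPU

model = build_model().cuda()                 # placeholder for the repo's model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * hvd.size())

# Horovod: average gradients across workers, then sync initial state from rank 0.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(num_epochs):              # num_epochs: placeholder
    train_sampler.set_epoch(epoch)           # reshuffle each worker's shard per epoch
    for imgs, targets in train_loader:
        optimizer.zero_grad()
        loss = model(imgs.cuda(), targets)   # assumes the model returns a loss given targets
        loss.backward()
        optimizer.step()
```

If every node still runs the full number of iterations per epoch, it is worth checking that `hvd.size()` really reports 3 at runtime and that `len(train_loader)` shrinks to roughly `len(dataset) / hvd.size()`, since the sampler only partitions the data when the world size is detected correctly.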
I have not tried those packages. There should be a way to use multi-GPU natively within PyTorch, but I have not had access to multi-GPU machines to debug with.
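For reference, the simplest native route is `torch.nn.DataParallel`, which replicates the model on every visible GPU and splits each batch across them without changing the dataloader. A rough sketch under those assumptions (`build_model` and `compute_loss` are placeholders, not this repo's functions):

```python
import torch
import torch.nn as nn

model = build_model()                         # placeholder model constructor
if torch.cuda.device_count() > 1:
    # Each forward pass scatters the batch across GPUs, runs the replicas in
    # parallel, and gathers the outputs back on the default device.
    model = nn.DataParallel(model)
model = model.cuda()

for imgs, targets in dataloader:              # ordinary single-process DataLoader
    outputs = model(imgs.cuda())
    loss = compute_loss(outputs, targets)     # placeholder loss function
    loss.backward()
```

`torch.nn.parallel.DistributedDataParallel` (one process per GPU) generally scales better, but it needs the sampler and launcher machinery discussed above.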
@glenn-jocher Any progress? I have some ideas and would like to work on it, but I'd like to know where the effort stands so I don't spend time on work you have already done.
Sorry, I still have the same single-GPU machine here, so I simply can't debug multi-GPU currently. If you come up with a solution, let me know! Thanks.
@abidmalikwaterloo this issue is resolved in our main YOLOv3 repository. Be advised that the https://github.com/ultralytics/xview-yolov3 repository is no longer under active development; we recommend you use https://github.com/ultralytics/yolov3 instead.