
using the model with four GPUs #4

Closed
abidmalikwaterloo opened this issue Nov 7, 2018 · 8 comments

Labels: bug (Something isn't working), help wanted (Extra attention is needed)


abidmalikwaterloo commented Nov 7, 2018

I am trying to run the model with 4 GPUs and get the following error:

222 layers, 6.26582e+07 parameters, 6.26582e+07 gradients
     Epoch     Batch         x         y         w         h      conf       cls     total         P         R       nGT        TP        FP        FN      time
/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Traceback (most recent call last):
  File "train.py", line 209, in <module>
    main(opt)
  File "train.py", line 131, in main
    weight=class_weights, epoch=epoch)
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfshome01/u/amalik/OGA/Yolov3/xview-yolov3/models.py", line 231, in forward
    x, *losses = module[0](x, targets, requestPrecision, weight, epoch)
  File "/sdcc/u/amalik/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfshome01/u/amalik/OGA/Yolov3/xview-yolov3/models.py", line 155, in forward
    requestPrecision)
  File "/gpfshome01/u/amalik/OGA/Yolov3/xview-yolov3/utils/utils.py", line 195, in build_targets
    inter_area = torch.min(box1, box2).prod(2)
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.FloatTensor for argument #2 'other'

However, the model works when I use one GPU. Any comments?
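
(For reference, the failure is a device mismatch inside build_targets: under nn.DataParallel each model replica runs on its own GPU, but the traceback shows the second argument to torch.min is still a CPU tensor. Below is a minimal sketch of the kind of fix, assuming box1 is the tensor already on the replica's GPU and box2 is the one built on the CPU; it is a hypothetical patch, not the repository's actual change:)

# utils/utils.py, inside build_targets() -- sketch only
box2 = box2.to(box1.device)                 # move box2 onto the same GPU as box1
inter_area = torch.min(box1, box2).prod(2)  # now both operands live on one device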

@glenn-jocher (Member)

@abidmalikwaterloo the code does not support multi-GPU yet, unfortunately. I only have a single-GPU machine, so I have not been able to debug this issue. If you come up with a solution, please advise me or submit a pull request.

Also, see ultralytics/yolov3#21 for details. https://github.com/ultralytics/yolov3 is the base repo that this repo was built off of, so when the issue gets fixed there I can port the solution here.

glenn-jocher added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Nov 7, 2018
glenn-jocher self-assigned this on Nov 7, 2018
@abidmalikwaterloo (Author)

@glenn-jocher Thanks. Working on it.

@abidmalikwaterloo (Author)

@glenn-jocher Did you try distributing the model using an MPI library? I am thinking of using Horovod for this. Do you think the problem we have now is due to torch's internal distributed model, and that it would not appear when we use an MPI framework?
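
(For context, the standard Horovod + PyTorch setup from Horovod's own examples looks roughly like the sketch below; it runs one process per GPU on top of MPI. The build_model() call is a placeholder, not a function in this repo:)

import torch
import horovod.torch as hvd

hvd.init()                               # launched via horovodrun/mpirun, one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin each process to its own GPU

model = build_model().cuda()             # placeholder for the usual model construction
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3 * hvd.size())

# Average gradients across workers with allreduce, and make every worker
# start from identical weights and optimizer state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)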

@abidmalikwaterloo (Author) commented Nov 11, 2018

I was able to parallelize the model using Horovod, and it ran on 3 nodes. However, the data is not being distributed: it should be divided among the three nodes, which would reduce the number of iterations per epoch, but this is not happening.

In one of the Horovod examples, ResNet-50 on ImageNet, the following is used to distribute the data:

kwargs = {'num_workers': 4, 'pin_memory': True} if args.cuda else {}
train_dataset = \
    datasets.ImageFolder(args.train_dir,
                         transform=transforms.Compose([
                             transforms.RandomResizedCrop(224),
                             transforms.RandomHorizontalFlip(),
                             transforms.ToTensor(),
                             transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                  std=[0.229, 0.224, 0.225])
                         ]))
# Horovod: use DistributedSampler to partition data among workers. Manually specify
# `num_replicas=hvd.size()` and `rank=hvd.rank()`.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler, **kwargs)

I am doing the following:

# Get dataloader
dataloader = ListDataset(train_path, batch_size=opt.batch_size, img_size=opt.img_size, targets_path=targets_path)

# For Horovod
kwargs = {'num_workers': 1, 'pin_memory': True} if cuda else {}
train_sampler = torch.utils.data.distributed.DistributedSampler(dataloader, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(dataloader, batch_size=opt.batch_size, sampler=train_sampler, **kwargs)

Do you think this makes sense?
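
(One way to check whether the sampler is actually partitioning the data -- a hypothetical snippet reusing the names from the code above, and assuming the ListDataset object defines __len__:)

import math

# Each rank should only see its own shard. If len(train_loader) is the same as
# on a single node, the sampler is not partitioning the data -- for example
# because the dataset object does its own internal batching and ignores the sampler.
expected_per_rank = math.ceil(len(dataloader) / hvd.size())
print('rank %d/%d: sampler yields %d samples (expected ~%d), %d batches per epoch'
      % (hvd.rank(), hvd.size(), len(train_sampler), expected_per_rank, len(train_loader)))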

@glenn-jocher (Member)

I have not tried using those packages. There should be a way to use multi-GPU natively within PyTorch; however, I have not had access to multi-GPU machines to debug this.
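
(For reference, the native route being referred to is torch.nn.DataParallel for single-process multi-GPU, or DistributedDataParallel for multi-process training. A minimal sketch follows, assuming the Darknet(cfg, img_size) constructor from models.py; the real forward() here also takes precision/weight/epoch arguments, which are omitted. The key pitfall, as the traceback above shows, is that every tensor the replicas touch must already be a CUDA tensor:)

import torch
import torch.nn as nn

device = torch.device('cuda:0')
model = Darknet(opt.cfg, opt.img_size).to(device)   # constructor assumed from models.py
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)                  # splits each batch across all visible GPUs

for imgs, targets in train_loader:
    imgs = imgs.to(device)
    targets = targets.to(device)    # targets must be on the GPU too, otherwise ops such as
                                    # torch.min(box1, box2) in build_targets mix CPU and GPU tensors
    loss = model(imgs, targets)     # DataParallel gathers one loss per replica
    loss.mean().backward()          # reduce across replicas before backprop
    # (optimizer step / zero_grad omitted in this sketch)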

@abidmalikwaterloo (Author)

@glenn-jocher Any progress? I have some ideas and would like to work on this, but I would like to know where the effort stands so I do not spend time on things you have already done.

@glenn-jocher (Member)

Sorry, I still have the same single-GPU machine here, so I simply can't debug multi-GPU currently. If you come up with a solution, let me know! Thanks.

@glenn-jocher (Member) commented Mar 20, 2019

@abidmalikwaterloo this issue is resolved in our main YOLOv3 repository:
https://github.com/ultralytics/yolov3

Be advised that the https://github.com/ultralytics/xview-yolov3 repository is no longer under active development. We recommend you use https://github.com/ultralytics/yolov3 instead.
