This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Update balanced_positive_negative_sampler.py #93

Merged
merged 1 commit into from
Nov 1, 2018

Conversation

Jiayuan-Gu

Avoid an unnecessary and unexpected copy from GPU to CPU.

I am trying to implement a multi-threaded data-parallel module instead of the distributed modules. I have modified the data structure and have almost finished it. I found a fatal bug: when the CPU version of torch.randperm is used from multiple threads, it raises a segmentation fault during the forward pass around a certain iteration.

I heard this repo was created for a benchmark performance contest, so I wonder whether you care about flexibility as well as speed. Distributed training has to load the whole dataset once per process. I guess that is why you do not use fixed proposals like Detectron does, since proposal files need a lot of memory, which would be hard for your implementation to support.
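A minimal sketch of the kind of change (the helper name and arguments here are illustrative, not the exact code in balanced_positive_negative_sampler.py):

```python
import torch

def subsample_positive(positive, num_pos):
    """Pick num_pos random positive indices, illustrating the sampler's subsampling step.

    Before: the permutation was generated on the CPU and then used to index a
    CUDA tensor, forcing a copy between devices:
        perm = torch.randperm(positive.numel())[:num_pos]
    After: generate the permutation directly on the tensor's device, so no
    cross-device copy happens and the CPU randperm code path is never hit.
    """
    perm = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
    return positive[perm]
```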

use cuda version torch.randperm to avoid copy from gpu to cpu and a fatal bug in multi-thread cpu version
@facebook-github-bot added the CLA Signed label Nov 1, 2018
@fmassa
Contributor

fmassa commented Nov 1, 2018

Hi,

Thanks for the PR!

The reason why I didn't use a device-specific randperm was that it was not available when I first wrote it.

About using DataParallel instead of DistributedDataParallel, I believe it might end up being a bit harder to implement, and also slower (at least if we use native DataParallel, which is bottlenecked by Python).

It might also be worth noting that for small tensors (< 30000 elements), randperm on CUDA offloads the computation to the CPU, see https://github.com/pytorch/pytorch/blob/b818d31a3e8bbe8fd13c520de145d84b0641cf70/aten/src/ATen/native/cuda/TensorFactories.cu#L65

But thanks for the PR, it looks good to me!
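A rough micro-benchmark sketch to see the difference yourself (assumes a CUDA build; the 30000-element threshold is internal to the PyTorch version linked above and may change):

```python
import time
import torch

def avg_seconds(fn, iters=100):
    # Synchronize around the timed region so asynchronous CUDA launches
    # don't make the GPU path look artificially fast.
    fn()  # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    for n in (1_000, 100_000):
        old = avg_seconds(lambda: torch.randperm(n).to("cuda"))      # CPU randperm + copy
        new = avg_seconds(lambda: torch.randperm(n, device="cuda"))  # on-device randperm
        print(f"n={n}: cpu+copy {old * 1e6:.1f} us, on-device {new * 1e6:.1f} us")
```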

Contributor

@fmassa fmassa left a comment


Thanks!

@fmassa fmassa merged commit b418212 into facebookresearch:master Nov 1, 2018
@Jiayuan-Gu Jiayuan-Gu deleted the Jiayuan-Gu-patch-1 branch November 2, 2018 18:55
@Jiayuan-Gu
Author

Jiayuan-Gu commented Nov 2, 2018

The fatal bug I encountered is just like pytorch/pytorch#1868.
Something is wrong with THRandom_random() when it is called from multiple threads. I think it is fatal and should be fixed with a higher priority.

If you need to reproduce the error, you can use my branch https://github.com/Jiayuan-Gu/maskrcnn-benchmark/tree/dev. With the quick_schedules/res50_faster_rcnn_fpn config you should hit the error before about 2000 iterations; with the full-schedule res50_faster_rcnn_fpn config it raises the error near iteration 580.

However, using the CUDA version of randperm avoids this bug.
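For reference, a stripped-down reproduction sketch of the failure mode (hypothetical, not the exact training setup; on PyTorch builds from before the fix in pytorch/pytorch#13832, concurrent CPU randperm calls like these could crash):

```python
import threading
import torch

def worker():
    # Each thread hammers the CPU randperm. Before pytorch/pytorch#13832,
    # THRandom_random was called without locking the THGenerator mutex, so
    # concurrent calls could corrupt the shared RNG state and segfault.
    for _ in range(10_000):
        torch.randperm(512)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```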

@fmassa
Contributor

fmassa commented Nov 2, 2018

@Jiayuan-Gu Hi, thanks for the link, it is very interesting!
I have a few questions:

  • Did you benchmark the runtime speed compared to using DistributedDataParallel?
  • How would you need to adapt the code to use multiple machines for training?

@Jiayuan-Gu
Author

Jiayuan-Gu commented Nov 3, 2018

I tested it on 2 1080Ti GPUs with a total batch size of 4, and got the same accuracy training res50FPN for 1 epoch on coco2017train.

  1. Speed goes from 0.45 to 0.48 s/iter, and data time from 0.005 to 0.01 s/batch.
  2. I think most people, including me, do not have multiple machines to run the code on, and distributed programming is not straightforward.

What is your memory usage, given that you have to load the annotations 8 times if you use 8 GPUs?

@fmassa
Contributor

fmassa commented Nov 3, 2018

Thanks for those numbers!

I don't have the exact numbers right now, but since we load images on the fly and can control the number of data-loading workers per process, we can control the amount of memory that is used.

I understand that it is easier to have a single proposals file per dataset. But given that the current PyTorch data-loading model loads images / data on the fly using multiple workers, I'd say it might be a better fit to split the single file containing the proposals into multiple files, one per image, as sketched below. This way, increasing the number of processes doesn't significantly increase the amount of memory used, and we can amortize the loading cost by using more data-loading processes.

Thoughts?
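For example, a hypothetical per-image layout could look like this (names such as PerImageProposalDataset and the proposals/<image_id>.pt layout are illustrative, not part of this repo):

```python
import os
import torch
from torch.utils.data import Dataset

class PerImageProposalDataset(Dataset):
    """Sketch of per-image proposal files instead of one large proposals file.

    Each image gets its own small file, e.g. proposals/<image_id>.pt, so every
    training process only loads the proposals for the images it actually sees,
    and memory stays flat as the number of processes grows.
    """

    def __init__(self, image_ids, proposal_dir):
        self.image_ids = image_ids
        self.proposal_dir = proposal_dir

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        path = os.path.join(self.proposal_dir, "{}.pt".format(image_id))
        proposals = torch.load(path)  # loaded on the fly by the data-loading workers
        return image_id, proposals
```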

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Nov 12, 2018
Summary:
When we added `randperm_cpu` and `THTensor_(randperm)` we forgot to lock the `THGenerator` mutex before calling `THRandom_random`, which causes segfault error mentioned in facebookresearch/maskrcnn-benchmark#93 (comment). This PR fixes the bug.

Closes #1868.
Pull Request resolved: #13832

Differential Revision: D13025453

Pulled By: yf225

fbshipit-source-id: 6e363a35c72b4862412eaea6516a154126634c9d
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 12, 2018 (same commit message as above).
nprasad2021 pushed a commit to nprasad2021/maskrcnn-benchmark that referenced this pull request Jan 29, 2019
use cuda version torch.randperm to avoid copy from gpu to cpu and a fatal bug in multi-thread cpu version