Update balanced_positive_negative_sampler.py #93
Conversation
Use the CUDA version of torch.randperm to avoid copying from GPU to CPU and a fatal bug in the multi-threaded CPU version
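For context, a minimal runnable sketch of the kind of change being proposed (the names `positive` and `num_pos` are illustrative, not the exact code in balanced_positive_negative_sampler.py): passing `device=` to torch.randperm keeps the permutation on the GPU instead of building it on the CPU and copying it over.

```python
import torch

# Illustrative tensors; in the sampler, `positive` would be the indices of
# positive matches and `num_pos` the number of positives to sample.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
positive = torch.arange(128, device=device)
num_pos = 32

# Before: the permutation is built on the CPU, so the selected indices have
# to be copied to the GPU before they can index `positive`.
perm_cpu = torch.randperm(positive.numel())[:num_pos].to(device)

# After: the permutation is created directly on the tensor's device, which
# avoids the CPU round trip and the multi-threaded CPU randperm code path.
perm_gpu = torch.randperm(positive.numel(), device=positive.device)[:num_pos]

sampled_before = positive[perm_cpu]
sampled_after = positive[perm_gpu]
```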
Hi, thanks for the PR! The reason why I didn't use a device-specific randperm was that it was not available when I first wrote this. About using DataParallel instead of DistributedDataParallel: I believe it might end up being a bit harder to implement, and also slower (at least if we use native DataParallel, which is bottlenecked by Python). It might also be worth noting that for small tensors (< 30000 elements), randperm on CUDA offloads the computation to the CPU, see https://github.com/pytorch/pytorch/blob/b818d31a3e8bbe8fd13c520de145d84b0641cf70/aten/src/ATen/native/cuda/TensorFactories.cu#L65. But thanks for the PR, it looks good to me!
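As a rough, hypothetical way to see that in practice (not from the original thread), a micro-benchmark sketch comparing CPU and CUDA randperm; for small n the CUDA call may internally fall back to a CPU computation, so the two timings can end up close:

```python
import time
import torch

def time_randperm(n, device, iters=200):
    torch.randperm(n, device=device)  # warm-up (CUDA context, allocator)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        torch.randperm(n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / iters

if torch.cuda.is_available():
    for n in (1000, 30000, 1000000):
        print(n, time_randperm(n, "cpu"), time_randperm(n, "cuda"))
```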
Thanks!
The fatal bug I encountered is just like pytorch/pytorch#1868. If you need to reproduce the error, you can use my branch https://github.com/Jiayuan-Gu/maskrcnn-benchmark/tree/dev. With quick_schedules/res50_faster_rcnn_fpn, you should hit the error before roughly iteration 2000. For the full res50_faster_rcnn_fpn schedule, it raises the error near iteration 580. However, using the CUDA wrapper of randperm avoids this bug.
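For reference, a standalone sketch of the kind of race being described: several Python threads calling the CPU torch.randperm concurrently. Whether it actually segfaults depends on the PyTorch build, since the missing THGenerator mutex was later fixed.

```python
import threading
import torch

def worker():
    # Before the THGenerator mutex fix, concurrent CPU randperm calls could
    # race inside the shared random number generator and crash the process.
    for _ in range(100000):
        torch.randperm(1000)

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```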
@Jiayuan-Gu Hi, thanks for the link, it is very interesting!
I tested it on 2 1080Ti GPUs with a total batch size of 4. Same accuracy when training res50-FPN for 1 epoch on coco2017train.
What is your memory usage? You have to load the annotations 8 times if you use 8 GPUs.
Thanks for those numbers! I don't have the exact numbers at hand, but since we load images on the fly, and we can control the number of data loading threads per process, we can control the amount of memory that is used. I understand that it is easier to have a single file with the proposals for a single dataset. But given that the current PyTorch data loading model is to load images / data on the fly using multiple threads, I'd say it might be better suited to split the single file containing the proposals into multiple files, one per image. This way, increasing the number of processes doesn't significantly increase the amount of memory used, and we can amortize the loading cost by using more data loading processes. Thoughts?
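A hypothetical sketch of that per-image layout (the class name PerImageProposalDataset and the `<image_id>.pt` file convention are made up for illustration): each data-loading worker reads only the proposal file of the image it is processing, so memory no longer scales with the number of processes.

```python
import os
import torch
from torch.utils.data import Dataset

class PerImageProposalDataset(Dataset):
    """Hypothetical layout: one proposal file per image, e.g. proposals/123.pt,
    loaded lazily in __getitem__ instead of one big file held by every process."""

    def __init__(self, image_ids, proposal_dir):
        self.image_ids = image_ids
        self.proposal_dir = proposal_dir

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        image_id = self.image_ids[idx]
        # Only the proposals for this image are read, so memory usage stays
        # roughly constant as the number of processes / GPUs grows, and the
        # loading cost is amortized across data-loading workers.
        proposals = torch.load(os.path.join(self.proposal_dir, "%s.pt" % image_id))
        return image_id, proposals
```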
Summary: When we added `randperm_cpu` and `THTensor_(randperm)` we forgot to lock the `THGenerator` mutex before calling `THRandom_random`, which causes segfault error mentioned in facebookresearch/maskrcnn-benchmark#93 (comment). This PR fixes the bug. Closes pytorch/pytorch#1868. Pull Request resolved: pytorch/pytorch#13832 Differential Revision: D13025453 Pulled By: yf225 fbshipit-source-id: 6e363a35c72b4862412eaea6516a154126634c9d
Use the CUDA version of torch.randperm to avoid copying from GPU to CPU and a fatal bug in the multi-threaded CPU version
Avoid an unnecessary and unexpected copy from GPU to CPU.
I am trying to implement a multi-threaded data parallel module instead of the distributed modules. I have modified the data structure and have almost finished it. I found a fatal bug: when you use the CPU version of torch.randperm with multiple threads, it raises a Segmentation Fault during the forward pass around a certain iteration.
I heard this repo was created for a benchmark performance contest, so I wonder whether you care about flexibility apart from speed. Distributed learning has to load the whole dataset multiple times. I guess that is why you do not use fixed proposals like Detectron: proposal files need a lot of memory, which your implementation would find hard to support.