Increasing memory consumption when training Retina Net #884
Comments
I solved it by adding a check in maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py (lines 77 to 85 in 55796a0).
When an image has >= 500 gt boxes it causes OOM (about 4.7 GB of memory is needed to compute iou_rotate), so I cap the number of gt boxes at 300.
Thanks a lot! How did you find this method?
But after adding this, it only costs about 2-4 GB of GPU memory, whereas before it cost 10+ GB. Is that normal? By the way, I use ResNeXt-101 with FPN.
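A minimal sketch of the gt-box cap described in the first reply above, assuming the targets are maskrcnn-benchmark BoxList objects; the threshold of 300 is just the value mentioned there, and the function name is hypothetical:

```python
import torch

MAX_GT_BOXES = 300  # cap mentioned above; tune to your GPU memory

def cap_gt_boxes(targets, max_boxes=MAX_GT_BOXES):
    """Keep at most `max_boxes` ground-truth boxes per image so that the
    pairwise IoU matrix built during target assignment stays small."""
    capped = []
    for target in targets:  # each target is assumed to be a BoxList
        if len(target) > max_boxes:
            # random subset to avoid biasing toward annotation order
            keep = torch.randperm(len(target), device=target.bbox.device)[:max_boxes]
            target = target[keep]  # BoxList indexing also slices its extra fields
        capped.append(target)
    return capped
```

Calling something like this on the targets right before they are passed to the model reproduces the workaround; note that it discards supervision for very crowded images, so it trades some accuracy for stability.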
❓ Questions and Help
@fmassa
@chengyangfu
Hi,
Thanks for reading my issue!
When I am training RetinaNet, the memory consumption keeps increasing until OOM.
I have read several issues related to OOM and the possible causes you suggested in them.
With the solutions you provided and some help from the Internet, I made the following "improvements" targeting those potential causes:
- Following the torch.jit.script suggestion, I replaced the relevant function with a torch.jit.script version.
- I call torch.cuda.empty_cache() after an OOM happens. Also, when an input causes OOM right after I run torch.cuda.empty_cache(), I skip it, since it has a large number of gt bboxes, and I modified the training loop accordingly.
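Roughly, the skip-on-OOM behavior described above comes down to something like the sketch below; the function names and loop structure are assumptions for illustration, not the actual code from this repository:

```python
import torch

def run_iteration(model, images, targets, optimizer):
    """One forward/backward/step; any CUDA OOM propagates to the caller."""
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

def run_iteration_or_skip(model, images, targets, optimizer):
    """Retry once after emptying the CUDA cache; skip the input if it still OOMs."""
    for _ in range(2):
        try:
            run_iteration(model, images, targets, optimizer)
            return True
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            # release the caching allocator's unused blocks and try once more
            torch.cuda.empty_cache()
    # it OOMs even right after empty_cache(), typically an image with many
    # gt bboxes, so give up on this input and move on
    return False
```

One caveat with this pattern is that tensors referenced by the exception traceback can stay alive inside the except block, which is one reason empty_cache() sometimes appears not to free anything.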
With these "improvement", memory consumption still increases in training. At first, the memory consumption reported by pytorch is 4 or 5G. Only inputs whose gt bboxes more than 800 cause OOM, and they are skipped after OOM. Gradually, the memory occupation increase. There always is a 2.7G leap after hundreds of iterations. Then, inputs who has 260+ gt bboxes can cause OOM. Thousands of iterations later, memory occupation showed by
nvidia-smi
approach the max memory of my GPU, that is 12 G. The minimum of the numbers of gt bboxes of the images causing OOM can be 100+. At last, OOM happens at every iteration, even with the images who has only 1 gt bboxes. The training cannot continue any more. I have to kill the program and restart it at the last checkpoint. At the first iterations after restart, the memory consumption is 4 or 5G, as little as the one at the start of training. Instead of rapidly increasing to the large memory occupation when I killed it, the memory occupation gradually increases just like I start the training from 0 iteration. It makes me feel like the training doesn't need that much memory at all.For the phenomenon I describe above, I believe the increasing memory occupation cannot be simply explained by some attributes of certain inputs because the number of "problem images" keep increasing with the number of epochs increasing, that is to say an input is OK at earlier iterations but causes OOM later. It is more like a memory leak. The "improvement" I have made allow me to run the program with larger input for longer time before the program cannot keep running. But it does not solve the fundamental problems, the increasing memory occupation. I have to restart the program every 10k iterations.
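To narrow down where the growth comes from, one option (not part of the original setup) is to log the allocator statistics every few hundred iterations next to what nvidia-smi reports; a possible helper, assuming a reasonably recent PyTorch (older versions name memory_reserved as memory_cached):

```python
import torch

def log_gpu_memory(iteration, device=0):
    """Print PyTorch allocator stats; nvidia-smi roughly corresponds to 'reserved'."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib   # tensors currently alive
    reserved = torch.cuda.memory_reserved(device) / gib     # cached by the allocator
    peak = torch.cuda.max_memory_allocated(device) / gib    # high-water mark so far
    print(f"iter {iteration}: allocated {allocated:.2f} GiB, "
          f"reserved {reserved:.2f} GiB, peak {peak:.2f} GiB")
```

If allocated keeps climbing while the inputs stay the same size, some tensors are being kept alive across iterations (for example losses accumulated for logging without .item()); if only reserved grows, it is more likely fragmentation in the caching allocator than a true leak.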
With this issue, I respectfully ask for your help.
Thank you very much!