Large Memory Differences with DP vs. DDP accelerator #8826
Labels
- distributed: Generic distributed-related topic
- help wanted: Open to be worked on
- question: Further information is requested
- strategy: dp (removed in pl): DataParallel
- waiting on author: Waiting on user action, correction, or update
- won't fix: This will not be worked on
🐛 Bug
I am running a training loop with a Transformer model in PyTorch Lightning and trying to use DDP as the accelerator. I run into CUDA OOM errors due to the large memory requirement of the multihead attention module; however, I do not run into this issue when using DP as the accelerator. Tracking GPU memory usage shows that DP runs through a batch using 25 GB of memory, whereas DDP needs more than 45 GB.
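As background on why the multihead attention module is so memory-hungry, here is a back-of-the-envelope sketch of the attention score matrices alone, which grow quadratically with sequence length. The function name and the example numbers are illustrative assumptions, not taken from the original script:

```python
def attention_memory_gib(batch, heads, seq_len, bytes_per_el=4):
    """Estimate the memory (GiB) of the attention score matrices alone:
    one (seq_len x seq_len) float matrix per head per batch element.

    Illustrative helper only; ignores activations, gradients, and the
    model weights themselves.
    """
    return batch * heads * seq_len ** 2 * bytes_per_el / 1024 ** 3

# Example (hypothetical sizes): batch=8, heads=16, seq_len=4096, fp32
# -> 8 * 16 * 4096^2 * 4 bytes = 8.0 GiB just for the score matrices.
print(attention_memory_gib(8, 16, 4096))
```

This is why even a modest change in the effective per-GPU batch size can push a Transformer over the memory limit of a 45 GB card.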
To Reproduce

Console Output:

Save this script as memory_error.py and run python memory_error.py on any machine with 2+ GPUs, each GPU having more than 40 GB of memory. The GPU model that I am using is the NVIDIA A40, which has roughly 45 GB of memory.

Expected behavior
Both dp and ddp should use similar amounts of memory to run this training loop, yet ddp uses significantly more memory.
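One common source of such a gap, which may or may not be the cause here, is that DP scatters each incoming batch across the GPUs, while DDP runs the full DataLoader batch in every process. A minimal sketch of this difference, with a hypothetical helper name:

```python
def per_gpu_batch(batch_size, num_gpus, strategy):
    """Effective batch size seen by each GPU (illustrative sketch).

    DP splits one batch of `batch_size` across `num_gpus` devices;
    DDP gives every process the full `batch_size` from its own loader.
    """
    if strategy == "dp":
        return batch_size // num_gpus
    if strategy == "ddp":
        return batch_size
    raise ValueError(f"unknown strategy: {strategy}")

# With a hypothetical batch_size=32 on 2 GPUs:
print(per_gpu_batch(32, 2, "dp"))   # each GPU processes 16 samples
print(per_gpu_batch(32, 2, "ddp"))  # each GPU processes all 32 samples
```

If the same `batch_size` is passed to the DataLoader in both runs, each GPU under DDP processes roughly `num_gpus` times more samples per step than under DP, which alone could explain the 25 GB vs. >45 GB figures above.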
Environment

- How you installed PyTorch (conda, pip, source): conda
- Output of torch.__config__.show():

Additional context