checkpoint saving stuck when using multiple GPUs #6495
Comments
It is possible that this PR will solve it: #6410
This is a huge pain for me at the moment too. The model checkpointing routine gets stuck at the end of epoch X, where X appears random (e.g., 5, 10, or whatever). Best,
I just saw the issue and the other PR, and it's just a guess. I can't say for sure that it fixes this problem, also because the code sample you provided does not run directly. One thing I noticed, however, is this suspicious list of arguments:

```python
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_path,
    save_top_k=save_top_k,
    verbose=True,
    monitor="val_loss",
    save_last=False,
    period=args["check_val_every_n_epoch"],
    save_weights_only=args["save_weights_only"],
)
```

This doesn't look right; the checkpoint callback does not accept one of these arguments. I suggest these steps:
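For comparison, here is a pared-down sketch that uses only arguments ModelCheckpoint is generally documented to accept around the Lightning 1.2 releases discussed in this thread; the directory path and top-k value are placeholders, not values from the reporter's script.

```python
# Minimal sketch of a ModelCheckpoint using only commonly documented arguments
# (Lightning ~1.2 API assumed); "checkpoints/" and save_top_k=1 are placeholders.
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    monitor="val_loss",   # must match a metric logged via self.log(...)
    save_top_k=1,
    mode="min",
    verbose=True,
)
```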
I tried running my experiments from

EDIT: I can, for instance, provide a minimum working example tailored to my use case. Note that I mainly just tweaked the official ImageNet PL example.
Sorry for the standard answer, but a minimal working example (ready to run) would be best, because then we can start debugging directly with minimal guesswork. I understand that, given the conference deadline, this is probably too much work (I am also submitting to ICCV next week myself).
I will see what I can do. My only concern for reproducibility is that the issue seems quite random so far (not all runs are impacted, and it happens at a random epoch). I'm not sure I would be able to reproduce it with a simple BoringModel, for example. But yes, I could give it a try!
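A hedged skeleton one could start a reproduction attempt from is sketched below. All class names (RandomDataset, MinimalModel) are made up for illustration, the data is synthetic, and the Trainer/ModelCheckpoint arguments assume the PyTorch Lightning ~1.2 API used in this thread; it is not claimed to reproduce the hang.

```python
# Hypothetical minimal skeleton for a ready-to-run reproduction attempt.
# Names are invented; Trainer/ModelCheckpoint arguments assume Lightning ~1.2.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    """Synthetic data so the script has no external dependencies."""

    def __init__(self, size=64, length=256):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class MinimalModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        # The monitored quantity must be logged so ModelCheckpoint can see it.
        self.log("val_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = MinimalModel()
    checkpoint_callback = ModelCheckpoint(monitor="val_loss", save_top_k=1, verbose=True)
    trainer = pl.Trainer(
        gpus=4,
        accelerator="ddp",  # same setting as `--distributed_backend ddp` in the report
        max_epochs=20,
        callbacks=[checkpoint_callback],
    )
    trainer.fit(
        model,
        DataLoader(RandomDataset(), batch_size=8),
        DataLoader(RandomDataset(), batch_size=8),
    )
```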
@awaelchli Thanks. I can remove that argument. Thank you once again.
@awaelchli @inzouzouwetrust Hi, I just want to give an update: when I switch to the "dp" backend, everything is OK. Hope this helps you identify the problem. Thanks.
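For reference, a sketch of what that backend switch looks like in code, assuming the ~1.2 Trainer API and the same entry point used in this issue (newer Lightning releases use `strategy=` instead):

```python
# Sketch only: swapping the distributed backend from DDP to DP in the Trainer.
# Argument names assume PyTorch Lightning ~1.2.
import pytorch_lightning as pl

trainer_ddp = pl.Trainer(gpus=4, accelerator="ddp")  # setup reported to hang at checkpoint save
trainer_dp = pl.Trainer(gpus=4, accelerator="dp")    # setup reported to work in this thread
```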
Dear @sun-peach,

Any chance you could provide a reproducible script? Would you mind trying the following:

Best,
@tchaton I see this issue with PyTorch 1.7 and PL 1.2.4, which already includes the fixes that are supposed to address DDP hanging.
Dear @azouaoui-cv, @yukw777, would it be possible for you to share a reproducible script for us to work on? Best,
@tchaton I no longer have access to a multi-GPU machine. I'd like to try reproducing it once I get hold of one. :/ @azouaoui-cv did the issue ever go away?
I've had exactly the same problem and found that it occurs specifically when, using DDP on multiple GPUs, checkpoint saving is based on monitoring a quantity. @sun-peach, do the hang-ups go away if you remove the monitor-based saving and simply checkpoint every epoch?
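A hedged sketch of that suggestion, assuming the ~1.2 ModelCheckpoint API; the `dirpath` value is a placeholder:

```python
# Sketch: checkpoint every epoch without monitoring a metric, to test whether
# the DDP hang is tied to monitor-based saving. Lightning ~1.2 API assumed.
from pytorch_lightning.callbacks import ModelCheckpoint

# Monitor-based saving (the configuration reported to hang under DDP):
monitored_ckpt = ModelCheckpoint(dirpath="checkpoints/", monitor="val_loss", save_top_k=1)

# Monitor-free saving: with monitor=None and save_top_k=-1, a checkpoint is
# kept for every epoch instead of tracking a "best" metric value.
every_epoch_ckpt = ModelCheckpoint(dirpath="checkpoints/", monitor=None, save_top_k=-1)
```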
Closing for now; please reopen with a reproducible script.
🐛 Bug
When I use multiple GPUs, the model-saving step gets stuck, while it works perfectly when I use only one GPU.
Please reproduce using the BoringModel
To Reproduce
My checkpoint_callback is:
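(This is the ModelCheckpoint configuration quoted back in the discussion above; `ckpt_path`, `save_top_k`, and `args` are defined elsewhere in the reporter's script.)

```python
# ModelCheckpoint configuration as quoted earlier in this thread.
checkpoint_callback = ModelCheckpoint(
    dirpath=ckpt_path,
    save_top_k=save_top_k,
    verbose=True,
    monitor="val_loss",
    save_last=False,
    period=args["check_val_every_n_epoch"],
    save_weights_only=args["save_weights_only"],
)
```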
I use the command

```bash
python main.py --gpus 4 --distributed_backend ddp
```

for multi-GPU running, while I use

```bash
python main.py --gpus 1
```

for single-GPU running. I did not change anything else.

Expected behavior
The model is supposed to be saved smoothly; however, it gets stuck at the checkpoint-saving step. GPU utilization goes to 100% and never changes. Please see the figures below:
![image](https://user-images.githubusercontent.com/14579257/110971754-1cc01d80-8310-11eb-8698-83307af32ddc.png)
GPU utilization stays at 100% forever
Saving is stuck at epoch 0
![image](https://user-images.githubusercontent.com/14579257/110971942-52fd9d00-8310-11eb-809f-b1bcdce28e0d.png)
Environment

- How you installed PyTorch (conda, pip, source): pip