Remove DataParallel container in SS-VAE model #3227

Merged (2 commits) on Jun 8, 2023

martinrohbeck
Contributor

This PR removes the DataParallel container, because it seems to cause issues.

  1. The issues only occur with CUDA enabled, because otherwise DataParallel is not used. When running python ss_vae_M2.py --cuda, memory is allocated on more than one GPU, but nothing seems to happen.
    However, after dropping --cuda, i.e. running python ss_vae_M2.py, everything works fine. The code also works with CUDA_VISIBLE_DEVICES=1 python ss_vae_M2.py --cuda, so the multi-GPU training is what causes the trouble.

  2. In addition, DistributedDataParallel is recommended over DataParallel, see here.

  3. I think the lines can be dropped, since MNIST is no longer a dataset that needs multi-GPU training ;).
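
For anyone who wants to reproduce the single-GPU check from point 1 programmatically, here is a minimal sketch using only the standard library. The helper names `single_gpu_env` and `launch_on_single_gpu` are hypothetical and not part of this PR or of Pyro; the sketch assumes that ss_vae_M2.py is in the current working directory and that GPU index 1 exists on the machine.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA.

    Hypothetical helper for illustration; not part of the Pyro example.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA reads this variable at initialization; only listed devices are visible.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_on_single_gpu(script="ss_vae_M2.py", gpu_index=1):
    """Run the example script with --cuda, restricted to a single GPU.

    Assumes `script` exists in the current working directory.
    """
    cmd = [sys.executable, script, "--cuda"]
    return subprocess.run(cmd, env=single_gpu_env(gpu_index), check=True)
```

Calling `launch_on_single_gpu()` should be equivalent to the CUDA_VISIBLE_DEVICES=1 python ss_vae_M2.py --cuda command above. The variable is set only in the child process environment, so the calling process is left unchanged.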

I installed Pyro from the latest dev branch (v1.8.5) and PyTorch v2.0.1.
The PR also contains minor housekeeping.

Member

@fritzo fritzo left a comment

Thanks for cleaning up!

@fritzo fritzo merged commit 727aff7 into pyro-ppl:dev Jun 8, 2023
@martinrohbeck martinrohbeck deleted the fix-parallelisation-ss-vae branch June 9, 2023 06:21