
Add support for devices flag to Trainer #8440

Merged
22 commits merged into master on Jul 20, 2021

Conversation

@kaushikb11 (Contributor) commented Jul 16, 2021

What does this PR do?

Part of #6090

trainer = Trainer(accelerator='cpu', devices=3)
trainer = Trainer(accelerator='gpu', devices=4)
trainer = Trainer(accelerator='tpu', devices=8)
trainer = Trainer(accelerator='ipu', devices=8)
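
As noted further down in the thread, the flag also works with automatic accelerator selection; an illustrative call (an assumption based on the discussion, not part of the PR description):

from pytorch_lightning import Trainer

# Let Lightning pick whichever accelerator is available, then use 2 of its devices.
trainer = Trainer(accelerator='auto', devices=2)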

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)
  • Did you list all the breaking changes introduced by this pull request?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Jul 16, 2021

Codecov Report

Merging #8440 (25456e0) into master (8d0df6f) will decrease coverage by 4%.
The diff coverage is 77%.

@@           Coverage Diff           @@
##           master   #8440    +/-   ##
=======================================
- Coverage      92%     88%    -4%     
=======================================
  Files         217     217            
  Lines       14258   14328    +70     
=======================================
- Hits        13161   12625   -536     
- Misses       1097    1703   +606     

@kaushikb11 kaushikb11 changed the title Support devices flag to Trainer Add support for devices flag to Trainer Jul 16, 2021
@kaushikb11 kaushikb11 self-assigned this Jul 16, 2021
@kaushikb11 kaushikb11 added the feature (Is an improvement or enhancement) label Jul 16, 2021
@kaushikb11 kaushikb11 marked this pull request as ready for review July 16, 2021 07:18
@SeanNaren (Contributor) left a comment

Nice and clear, thanks @kaushikb11 for taking this up!

@awaelchli awaelchli added the design (Includes a design discussion) label Jul 16, 2021
pep8speaks commented Jul 16, 2021

Hello @kaushikb11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-20 03:51:49 UTC

@Borda (Member) commented Jul 16, 2021

For me it runs on GPU 0.

I would expect a single machine with GPUs 1 and 3.

@kaushikb11 (Contributor, Author) commented Jul 16, 2021

@kaushikb11 What is the expected behavior for something like this:

python pl_examples/basic_examples/simple_image_classifier.py --trainer.devices "1,3" --trainer.accelerator ddp

For me it runs on GPU 0.

@awaelchli Yup, the devices flag is only considered when the accelerator flag is passed as well, so the Trainer knows which accelerator to map the devices to (it works with auto as well). Tbh, I think it's desirable behavior. Wdyt? We could set accelerator to auto when devices is passed, but that could cause issues when it replaces num_processes.

Yup, it would be ideal to raise a warning if the devices flag is not being considered.
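
For illustration, a minimal sketch of such a warning (the helper name and the message are hypothetical, not the actual accelerator_connector code):

import warnings

def _warn_devices_ignored(accelerator, devices):
    # Mirror the behavior described above: `devices` only takes effect when an
    # `accelerator` flag is also passed, so warn the user when it will be ignored.
    if devices is not None and accelerator is None:
        warnings.warn(
            f"The flag `devices={devices}` will be ignored because no `accelerator` flag was set."
        )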

@carmocca (Contributor) commented Jul 16, 2021

@awaelchli Yup, the devices flag is only considered when the accelerator flag is passed as well, so the Trainer knows which accelerator to map the devices to (it works with auto as well). Tbh, I think it's desirable behavior. Wdyt? We could set accelerator to auto when devices is passed, but that could cause issues when it replaces num_processes.

So I expect it to work like this

GPUs available | devices | accelerator | Expected
---------------|---------|-------------|--------------------------------------------------------------------------
Yes            | Not set | Not set     | Runs on 1 GPU
Yes            | 2       | Not set     | Runs on 2 GPUs. DDP
Yes            | "2,3"   | Not set     | Runs on GPUs 2 and 3. DDP
No             | Not set | Not set     | Runs on CPU
No             | 2       | Not set     | Runs on CPU (2 processes). DDP CPU
No             | "2,3"   | Not set     | Either runs on CPU (2 processes) with a warning OR a misconfiguration error

Can you point out what is not correct and what would be the issues?

@awaelchli (Contributor)

@kaushikb11 I definitely think we should error when devices is passed in but none of the options accelerator="cpu"/"tpu"/"gpu" are specified. We are getting into trouble otherwise with plugins, because they can also be specified via the accelerator argument, and something like

python pl_examples/basic_examples/simple_image_classifier.py --trainer.accelerator ddp_cpu --trainer.devices 2

not launching 2 CPU processes is just too confusing IMO, even with a warning. Given the scope of this PR, I vote for an error.

@awaelchli (Contributor)

@carmocca your table suggests that we should automatically select the GPU if it is available, but this is neither happening in Lightning today nor can we allow it, because we don't know if there are other processes already running on a GPU.

@tchaton (Contributor) left a comment

Overall, looks good.

@kaushikb11 (Contributor, Author)

@awaelchli @carmocca A few updates here (see the sketch after this list):

  • I have added a two-way mapping for devices; it can be accessed via trainer.devices.
  • It will raise an error if the user passes devices but not accelerator="auto"|"tpu"|"gpu"|....
  • It will raise a warning that the devices flag will be ignored if the user passes both gpus and devices.
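
Put together, a rough usage sketch of the behavior listed above (the exception type and the exact messages are assumptions; run on a CPU-only machine for portability):

from pytorch_lightning import Trainer
from pytorch_lightning.utilities.exceptions import MisconfigurationException

trainer = Trainer(accelerator="cpu", devices=3)
print(trainer.devices)  # 3 -- the two-way mapping exposes the flag back on the Trainer

try:
    Trainer(devices=3)  # no accelerator flag -> rejected, per the second bullet above
except MisconfigurationException as err:
    print(err)

# Passing both the old and the new flag triggers the warning from the third bullet,
# e.g. on a GPU machine: Trainer(accelerator="gpu", gpus=2, devices=4)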

@mergify mergify bot added the ready (PRs ready to be merged) and has conflicts labels Jul 19, 2021
@awaelchli (Contributor)

There is one more warning popping up, but we probably can't take care of it so easily now:

python pl_examples/basic_examples/simple_image_classifier.py  --trainer.accelerator gpu  --trainer.devices 2

/home/adrian/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/accelerator_connector.py:731: 
UserWarning: You requested multiple GPUs but did not specify a backend, e.g. 
`Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.

@mergify mergify bot removed the has conflicts label Jul 19, 2021
@awaelchli (Contributor) left a comment

CAUTIOUS accept ;-)

I have tested the ddp plugins for gpu and cpu; some error handling is incomplete, as I mentioned before.

One more is the following:

--trainer.accelerator cpu --trainer.devices "1,3"

For the equivalent

--trainer.accelerator cpu --trainer.num_processes "1,3"

we previously had the correct error reporting, but the "devices" argument does not.
Not sure how important it is to handle that case.

@kaushikb11 (Contributor, Author)

CAUTIOUS accept ;-)

Not planning to merge it if you are not happy with it.

One more is the following:

--trainer.accelerator cpu --trainer.devices "1,3"

For the equivalent

--trainer.accelerator cpu --trainer.num_processes "1,3"

we previously had the correct error reporting, but the "devices" argument does not.
Not sure how important it is to handle that case.

What error are you talking about?

There is one more warning popping up, but we probably can't take care of it so easily now:

UserWarning: You requested multiple GPUs but did not specify a backend, e.g.
Trainer(accelerator="dp"|"ddp"|"ddp2"). Setting accelerator="ddp_spawn" for you.

Right.

@awaelchli (Contributor)

@kaushikb11 all good, it's a nice addition, love it <3

What I meant was: previously, num_processes accepted nothing other than an int, and the type hint / CLI parsing reflected that. However, now that devices is replacing it and accepts a list, a string, etc., we have to put error handling in place so that accelerator="cpu" combined with a non-integer devices is treated as invalid user input, because the type hint alone is not enough anymore.
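
A minimal sketch of the extra validation being described (the helper name and the error message are illustrative, not the actual Lightning code):

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def _num_cpu_processes(devices):
    # `devices` now accepts an int, a str, or a list, but only a plain process
    # count makes sense together with `accelerator="cpu"`.
    if isinstance(devices, int):
        return devices
    if isinstance(devices, str) and devices.strip().isdigit():
        return int(devices)
    raise MisconfigurationException(
        f"`devices={devices!r}` is not valid with `accelerator='cpu'`; "
        "pass an integer number of processes instead."
    )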

@kaushikb11 kaushikb11 enabled auto-merge (squash) July 20, 2021 03:52
@kaushikb11 kaushikb11 merged commit 556879e into Lightning-AI:master Jul 20, 2021
@kaushikb11 (Contributor, Author)

I have added a warning for devices not being an integer when accelerator="cpu".

@Borda (Member) left a comment

@kaushikb11 what about the Trainer argument accelerator? Here you test it for CPU/GPU/TPU, but the docs still say:

accelerator: Previously known as distributed_backend (dp, ddp, ddp2, etc...).

So if accelerator is a breaking change with a different usage, how can the user pass the distributed type?

Labels: design (Includes a design discussion), feature (Is an improvement or enhancement), ready (PRs ready to be merged)