Add support for devices flag to Trainer #8440
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #8440     +/-  ##
========================================
- Coverage      92%      88%      -4%
========================================
  Files         217      217
  Lines       14258    14328     +70
========================================
- Hits        13161    12625    -536
- Misses       1097     1703    +606
Nice and clear, thanks @kaushikb11 for taking this up!
Co-authored-by: Adrian Wälchli <[email protected]>
Hello @kaushikb11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-07-20 03:51:49 UTC
for more information, see https://pre-commit.ci
I would expect a single machine with GPUs 1 and 3.
@awaelchli Yup, it would be ideal to raise a warning if
So I expect it to work like this
Can you point out what is not correct and what would be the issues?
@kaushikb11 I definitely think we should error when
Not launching 2 CPU processes is just too confusing IMO, even with a warning. Given the scope of this PR, I vote for an error.
@carmocca your table suggests that we should automatically select the GPU if it is available, but this neither happens in Lightning today nor can we allow it, because we don't know whether other processes are already running on a GPU.
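The device-selection semantics discussed above can be illustrated with a small self-contained sketch. Note this is a hypothetical helper, not Lightning's actual implementation: a comma-separated string or list selects exactly those device indices (e.g. a single machine using GPUs 1 and 3), an integer selects the first N indices, and nothing is ever auto-selected on the user's behalf:

```python
from typing import List, Optional, Union


def parse_devices(devices: Union[int, str, List[int], None]) -> Optional[List[int]]:
    """Hypothetical parser mirroring the behavior expected in this thread.

    - None          -> no devices requested; never auto-grab a GPU
    - int n         -> the first n device indices, e.g. 2 -> [0, 1]
    - "1,3" / [1,3] -> exactly those device indices
    """
    if devices is None:
        return None
    if isinstance(devices, int):
        return list(range(devices))
    if isinstance(devices, str):
        return [int(idx) for idx in devices.split(",") if idx.strip()]
    return list(devices)


print(parse_devices("1,3"))  # -> [1, 3]
print(parse_devices(2))      # -> [0, 1]
```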
Overall, looks good.
@awaelchli @carmocca Few updates here.
There is one more warning popping up, but we probably can't take care of it so easily now:
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Adrian Wälchli <[email protected]>
CAUTIOUS accept ;-)
I have tested the DDP plugins for GPU and CPU; some error handling is incomplete, as I mentioned before.
One more is the following:
--trainer.accelerator cpu --trainer.devices "1,3"
The equivalent
--trainer.accelerator cpu --trainer.num_processes "1,3"
previously had correct error reporting, but the "devices" argument does not.
I'm not sure how important it is to handle that case.
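The missing check could look something like the following sketch. This is a hypothetical validation helper (the function name and messages are illustrative, not Lightning's code), assuming a CPU accelerator should reject a comma-separated device index list the same way num_processes used to:

```python
def validate_devices(accelerator: str, devices) -> int:
    """Hypothetical check: on CPU, `devices` must be a process count,
    not a list of device indices like "1,3"."""
    if accelerator == "cpu":
        if isinstance(devices, str) and "," in devices:
            raise ValueError(
                f"`devices={devices!r}` is not valid with `accelerator='cpu'`; "
                "expected a number of processes, e.g. `devices=2`."
            )
        return int(devices)
    # For GPU, a comma-separated string selects specific device indices.
    if isinstance(devices, str):
        return len([i for i in devices.split(",") if i.strip()])
    return int(devices)


print(validate_devices("gpu", "1,3"))  # -> 2 selected devices
```

Calling `validate_devices("cpu", "1,3")` would raise the error the reviewer expected, restoring parity with the old num_processes behavior.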
I'm not planning to merge it if you are not happy with it.
What error are you talking about?
Right.
@kaushikb11 all good, it's a nice addition, love it <3 What I meant was, previously
I have added a warning for
@kaushikb11 how about the Trainer argument accelerator? Here you test it for CPU/GPU/TPU, but the docs still say:
accelerator: Previously known as distributed_backend (dp, ddp, ddp2, etc...).
So if accelerator is breaking and has a different usage, how can the user pass the distributed type?
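One way the two usages could coexist, sketched purely as a hypothetical (this is not Lightning's code; the sets below are illustrative), is to classify the value before dispatching, so both device types and legacy distributed backends keep working:

```python
# Hypothetical value sets for illustration only.
_DEVICE_TYPES = {"cpu", "gpu", "tpu", "auto"}
_DISTRIBUTED_BACKENDS = {"dp", "ddp", "ddp2", "ddp_spawn", "horovod"}


def classify_accelerator(value: str) -> str:
    """Hypothetical: decide whether an `accelerator` value names a device
    type (new usage) or a distributed backend (legacy distributed_backend)."""
    if value in _DEVICE_TYPES:
        return "device_type"
    if value in _DISTRIBUTED_BACKENDS:
        return "distributed_backend"
    raise ValueError(f"Unknown accelerator value: {value!r}")


print(classify_accelerator("gpu"))  # -> device_type
print(classify_accelerator("ddp"))  # -> distributed_backend
```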
What does this PR do?
Part of #6090
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃