
Add support for devices flag to Trainer #8440

Merged
22 commits merged into master on Jul 20, 2021

Conversation

@kaushikb11 (Contributor) commented Jul 16, 2021

What does this PR do?

Part of #6090

trainer = Trainer(accelerator='cpu', devices=3)
trainer = Trainer(accelerator='gpu', devices=4)
trainer = Trainer(accelerator='tpu', devices=8)
trainer = Trainer(accelerator='ipu', devices=8)
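
As noted further down in the thread, the flag also works with automatic accelerator selection; an illustrative call (an assumption based on the discussion, not part of the PR description):

from pytorch_lightning import Trainer

# Let Lightning pick whichever accelerator is available, then use 2 of its devices.
trainer = Trainer(accelerator='auto', devices=2)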

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)
  • Did you list all the breaking changes introduced by this pull request?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Jul 16, 2021

Codecov Report

Merging #8440 (25456e0) into master (8d0df6f) will decrease coverage by 4%.
The diff coverage is 77%.

@@           Coverage Diff           @@
##           master   #8440    +/-   ##
=======================================
- Coverage      92%     88%    -4%     
=======================================
  Files         217     217            
  Lines       14258   14328    +70     
=======================================
- Hits        13161   12625   -536     
- Misses       1097    1703   +606     

@kaushikb11 kaushikb11 changed the title Support devices flag to Trainer Add support for devices flag to Trainer Jul 16, 2021
@kaushikb11 kaushikb11 self-assigned this Jul 16, 2021
@kaushikb11 kaushikb11 added the feature (Is an improvement or enhancement) label Jul 16, 2021
@kaushikb11 kaushikb11 marked this pull request as ready for review July 16, 2021 07:18
@SeanNaren (Contributor) left a comment

Nice and clear, thanks @kaushikb11 for taking this up!

@awaelchli awaelchli added the design (Includes a design discussion) label Jul 16, 2021
pep8speaks commented Jul 16, 2021

Hello @kaushikb11! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-20 03:51:49 UTC

@Borda (Member) commented Jul 16, 2021

For me it runs on GPU 0.

I would expect a single machine with GPUs 1 and 3.

@kaushikb11 (Contributor, Author) commented Jul 16, 2021

@kaushikb11 What is the expected behavior for something like this:

python pl_examples/basic_examples/simple_image_classifier.py --trainer.devices "1,3" --trainer.accelerator ddp

For me it runs on GPU 0.

@awaelchli Yup, the devices flag is only considered when the accelerator flag is passed as well, so the Trainer knows which accelerator to map the devices to (it works with auto as well). Tbh, I think it's desirable behavior. Wdyt? We could set accelerator to auto when devices is passed, but that could cause issues when it replaces num_processes.

Yup, it would be ideal to raise a warning if the devices flag is not being considered.
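
For illustration, a minimal sketch of such a warning (the helper name and the message are hypothetical, not the actual accelerator_connector code):

import warnings

def _warn_devices_ignored(accelerator, devices):
    # Mirror the behavior described above: `devices` only takes effect when an
    # `accelerator` flag is also passed, so warn the user when it will be ignored.
    if devices is not None and accelerator is None:
        warnings.warn(
            f"The flag `devices={devices}` will be ignored because no `accelerator` flag was set."
        )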

@carmocca (Contributor) commented Jul 16, 2021

@awaelchli Yup, the devices flag is only considered when the accelerator flag is passed as well, so the Trainer knows which accelerator to map the devices to (it works with auto as well). Tbh, I think it's desirable behavior. Wdyt? We could set accelerator to auto when devices is passed, but that could cause issues when it replaces num_processes.

So I expect it to work like this

GPUs available | devices | accelerator | Expected
---------------|---------|-------------|--------------------------------------------------------------------------
Yes            | Not set | Not set     | Runs on 1 GPU
Yes            | 2       | Not set     | Runs on 2 GPUs. DDP
Yes            | "2,3"   | Not set     | Runs on GPUs 2 and 3. DDP
No             | Not set | Not set     | Runs on CPU
No             | 2       | Not set     | Runs on CPU (2 processes). DDP CPU
No             | "2,3"   | Not set     | Either runs on CPU (2 processes) with a warning OR a misconfiguration error

Can you point out what is not correct and what would be the issues?

@awaelchli (Contributor)

@kaushikb11 I definitely think we should error when devices is passed in but none of the options accelerator="cpu"/"tpu"/"gpu" are specified. We are getting into trouble otherwise with plugins, because they can also be specified via the accelerator argument, and something like

python pl_examples/basic_examples/simple_image_classifier.py --trainer.accelerator ddp_cpu --trainer.devices 2

not launching 2 CPU processes is just too confusing IMO, even with a warning. Given the scope of this PR, I vote for an error.

@awaelchli (Contributor)

@carmocca your table suggests that we should automatically select the GPU if it is available, but this is neither happening in Lightning today nor can we allow it, because we don't know if there are other processes already running on a GPU.

@tchaton (Contributor) left a comment

Overall, looks good.

@kaushikb11 (Contributor, Author)

@awaelchli @carmocca A few updates here (see the sketch after this list):

  • I have added a two-way mapping for devices; it can be accessed via trainer.devices.
  • It will raise an error if the user passes devices but not accelerator="auto"|"tpu"|"gpu"|....
  • It will raise a warning that the devices flag will be ignored if the user passes both gpus and devices.
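
Put together, a rough usage sketch of the behavior listed above (the exception type and the exact messages are assumptions; run on a CPU-only machine for portability):

from pytorch_lightning import Trainer
from pytorch_lightning.utilities.exceptions import MisconfigurationException

trainer = Trainer(accelerator="cpu", devices=3)
print(trainer.devices)  # 3 -- the two-way mapping exposes the flag back on the Trainer

try:
    Trainer(devices=3)  # no accelerator flag -> rejected, per the second bullet above
except MisconfigurationException as err:
    print(err)

# Passing both the old and the new flag triggers the warning from the third bullet,
# e.g. on a GPU machine: Trainer(accelerator="gpu", gpus=2, devices=4)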

@mergify mergify bot added the ready (PRs ready to be merged) and has conflicts labels Jul 19, 2021
@awaelchli (Contributor)

There is one more warning popping up, but we probably can't take care of it so easily now:

python pl_examples/basic_examples/simple_image_classifier.py  --trainer.accelerator gpu  --trainer.devices 2

/home/adrian/repositories/pytorch-lightning/pytorch_lightning/trainer/connectors/accelerator_connector.py:731: 
UserWarning: You requested multiple GPUs but did not specify a backend, e.g. 
`Trainer(accelerator="dp"|"ddp"|"ddp2")`. Setting `accelerator="ddp_spawn"` for you.

@mergify mergify bot removed the has conflicts label Jul 19, 2021
@awaelchli (Contributor) left a comment

CAUTIOUS accept ;-)

I have tested the ddp plugins for gpu and cpu; some error handling is incomplete, as I mentioned before.

One more is the following:

--trainer.accelerator cpu --trainer.devices "1,3"

For the equivalent

--trainer.accelerator cpu --trainer.num_processes "1,3"

we previously had the correct error reporting, but the "devices" argument does not.
Not sure how important it is to handle that case.

@kaushikb11 (Contributor, Author)

CAUTIOUS accept ;-)

Not planning to merge it if you are not happy with it.

One more is the following:

--trainer.accelerator cpu --trainer.devices "1,3"

For the equivalent

--trainer.accelerator cpu --trainer.num_processes "1,3"

we previously had the correct error reporting, but the "devices" argument does not.
Not sure how important it is to handle that case.

What error are you talking about?

There is one more warning popping up, but we probably can't take care of it so easily now:

UserWarning: You requested multiple GPUs but did not specify a backend, e.g.
Trainer(accelerator="dp"|"ddp"|"ddp2"). Setting accelerator="ddp_spawn" for you.

Right.

@awaelchli (Contributor)

@kaushikb11 all good, it's a nice addition, love it <3

What I meant was: previously, num_processes accepted nothing other than an int, and the type hint / CLI parsing reflected that. However, now that devices is replacing it and accepts a list, a string, etc., we have to put error handling in place so that accelerator="cpu" combined with a non-integer devices is treated as invalid user input, because the type hint alone is not enough anymore.
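
A minimal sketch of the extra validation being described (the helper name and the error message are illustrative, not the actual Lightning code):

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def _num_cpu_processes(devices):
    # `devices` now accepts an int, a str, or a list, but only a plain process
    # count makes sense together with `accelerator="cpu"`.
    if isinstance(devices, int):
        return devices
    if isinstance(devices, str) and devices.strip().isdigit():
        return int(devices)
    raise MisconfigurationException(
        f"`devices={devices!r}` is not valid with `accelerator='cpu'`; "
        "pass an integer number of processes instead."
    )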

@kaushikb11 kaushikb11 enabled auto-merge (squash) July 20, 2021 03:52
@kaushikb11 kaushikb11 merged commit 556879e into Lightning-AI:master Jul 20, 2021
@kaushikb11 (Contributor, Author)

I have added a warning for devices not being an integer when accelerator="cpu".

@Borda (Member) left a comment

@kaushikb11 what about the Trainer argument accelerator? Here you test it for CPU/GPU/TPU, but the docs still say:

accelerator: Previously known as distributed_backend (dp, ddp, ddp2, etc...).

So if accelerator is a breaking change with a different usage, how can the user pass the distributed type?

Labels: design (Includes a design discussion), feature (Is an improvement or enhancement), ready (PRs ready to be merged)