fix Neptune logger creating multiple experiments when gpus > 1 #3256
Conversation
Kindly asking @jakubczakon for review. Can we delay the creation of the experiment, or does it have to happen in `__init__`?
Hi @psinger, could you please paste a snippet that reproduces this issue? From what I remember, the idea was to have NeptuneLogger initiated before any forking happens. Then, when the logger is pickled, the experiment that was created in
@pitercl I am creating the logger before initializing the Trainer.
Thanks @psinger. I just checked and you're right - it was working as I described up to PL 0.7.6, but from 0.8.1 onward something changed and the results are as you say. I'll need some time to understand what has changed and how to approach this.
@pitercl @psinger I can explain. `distributed_backend="ddp"` is special in that it launches your script multiple times in new subprocesses. This means `__init__` is called several times, but the way we designed loggers is that `logger.experiment` only returns the true experiment object on rank 0. On all other ranks, it returns a dummy object.
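The pattern described above can be sketched as a lazily created experiment that is only instantiated on first access, with a dummy on non-zero ranks. Everything below (`LazyLogger`, `RealExperiment`, the `LOCAL_RANK` lookup) is illustrative, not the actual NeptuneLogger implementation:

```python
import os


class DummyExperiment:
    """No-op stand-in used on non-zero ranks so logging calls are harmless."""
    def log_metric(self, name, value):
        pass


class RealExperiment:
    """Hypothetical stand-in for the object a logging backend would create."""
    def __init__(self):
        self.logged = []

    def log_metric(self, name, value):
        self.logged.append((name, value))


class LazyLogger:
    """Sketch of the lazy-initialization pattern: nothing is created in
    __init__; the experiment is created on first access of .experiment,
    and only rank zero gets the real one."""
    def __init__(self):
        self._experiment = None

    @property
    def experiment(self):
        if self._experiment is None:
            # In a real ddp run, the rank would come from the environment
            # set by the process launcher.
            rank = int(os.environ.get("LOCAL_RANK", 0))
            self._experiment = RealExperiment() if rank == 0 else DummyExperiment()
        return self._experiment


logger = LazyLogger()
assert logger._experiment is None  # nothing created at construction time
exp = logger.experiment            # first access triggers creation
assert logger.experiment is exp    # later accesses reuse the same object
```

Because construction is deferred, pickling the logger before `.experiment` is ever touched (as ddp does when spawning subprocesses) carries no live experiment object into the child processes.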
Hi! @awaelchli Thanks for the explanation - it helped a lot in understanding what's going on. @psinger Your idea for the fix looks good to me 👍 As for the tests that stopped passing, I had 2 goals with them.
So, I'd propose something along the lines of:

```python
from unittest.mock import patch

from pytorch_lightning.loggers import NeptuneLogger


@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_online(neptune):
    logger = NeptuneLogger(api_key='test', project_name='project')
    experiment = logger.experiment  # force the actual creation of an experiment object
    assert experiment == neptune.Session.with_default_backend().get_project().create_experiment()
    assert logger.name == experiment.name
    assert logger.version == experiment.id


@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_offline(neptune):
    logger = NeptuneLogger(offline_mode=True)
    experiment = logger.experiment  # force the actual creation of an experiment object
    neptune.Session.assert_called_once_with(backend=neptune.OfflineBackend())
    assert experiment == neptune.Session().get_project().create_experiment()
```
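The equality assertions in these tests lean on a property of `unittest.mock`: a `MagicMock` memoizes its children, so repeating the exact same attribute/call chain yields the very same object. A small self-contained illustration:

```python
from unittest.mock import MagicMock

# With the neptune module patched by a MagicMock, each step in the chain
# returns a memoized child mock, so running the identical chain twice
# produces the same object both times.
neptune = MagicMock()
a = neptune.Session.with_default_backend().get_project().create_experiment()
b = neptune.Session.with_default_backend().get_project().create_experiment()
assert a is b
```

This is why the test can compare `logger.experiment` against a freshly written-out mock chain and expect equality.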
Force-pushed from 0dd7d32 to a1d3044.
This pull request is now in conflict... :(
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.
This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.
@psinger can we finish this one?
From my perspective, yes.
@psinger Also, I don't understand your fix. It seems you always set the experiment to None? Doesn't this remove the experiment even for rank zero?
@Parskatt I cannot give you a status on this. My fix solves the issue though.
```python
# It's important to check if the internal variable _experiment was initialized in __init__.
# Calling logger.experiment would cause a side-effect of initializing _experiment,
# if it wasn't already initialized.
assert logger._experiment is None
_ = logger.experiment
assert logger._experiment == created_experiment
assert logger.name == created_experiment.name
assert logger.version == created_experiment.id
```
@psinger I rebased the branch and updated the tests so that they pass with the change you made in neptune.
When doing that, I saw this comment in the test. I'm not sure what this is about. I see no evidence that we are forced to initialize the neptune experiment at init. How do you see it?
Let's finalize this PR, it has waited long enough :)
Codecov Report

```
@@        Coverage Diff        @@
##        master    #3256  +/- ##
=================================
  Coverage    93%      93%
=================================
  Files       135      135
  Lines     10005    10005
=================================
+ Hits       9339     9340     +1
+ Misses      666      665     -1
```
* DP device fix
* potential fix
* fix merge
* update tests

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Potential fix to #3255