
fix Neptune logger creating multiple experiments when gpus > 1 #3256

Merged (6 commits) on Jan 23, 2021

Conversation

@psinger (Contributor) commented Aug 29, 2020

Potential fix to #3255

@mergify mergify bot requested a review from a team August 29, 2020 11:43
@awaelchli awaelchli changed the title Neptune logger fix Neptune logger creating multiple experiments when gpus > 1 Aug 29, 2020
@awaelchli (Contributor)

Kindly asking @jakubczakon for review. Can we delay the creation of the experiment, or does it have to happen in __init__?

@pitercl (Contributor) commented Aug 31, 2020

Hi @psinger, could you please paste a snippet that reproduces this issue? From what I remember, the idea was to have NeptuneLogger initiated before any forking happens. Then, when the logger is pickled, the experiment that was created in __init__ can be reused in all children processes.

@psinger (Contributor, Author) commented Aug 31, 2020

@pitercl I am creating the logger before initializing the Trainer.

logger = NeptuneLogger()
trainer = Trainer(logger=logger, distributed_backend="ddp")
model = Model()
trainer.fit(model)

@pitercl (Contributor) commented Aug 31, 2020

Thanks @psinger. I just checked and you're right: it was working as I described up to PL 0.7.6, but from 0.8.1 something changed and the results are as you say. I'll need some time to understand what changed and how to approach this.

@awaelchli (Contributor) commented Aug 31, 2020

@pitercl @psinger I can explain. distributed_backend = "ddp" is special in that it launches your script multiple times in new subprocesses. This means __init__ is called several times, but the way we designed loggers, logger.experiment only returns the true experiment object on rank == 0. On all other ranks, it returns a dummy object.
So what we have to make sure is that we don't create any files in __init__, because that code runs on all ranks.
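(For illustration, the pattern described above, lazy creation plus a rank-zero guard, could be sketched as follows. LazyLogger, DummyExperiment, and the LOCAL_RANK lookup are hypothetical names for this sketch, not the actual PyTorch Lightning implementation:)

```python
import os


class DummyExperiment:
    """Stand-in returned on non-zero ranks; silently absorbs any method call."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None


class LazyLogger:
    def __init__(self):
        # Nothing is created here, so when DDP re-runs __init__ in each
        # subprocess, no extra experiments (or files) appear.
        self._experiment = None

    @property
    def experiment(self):
        # Hypothetical rank lookup; real code would use the trainer's rank.
        if int(os.environ.get("LOCAL_RANK", "0")) != 0:
            return DummyExperiment()
        if self._experiment is None:
            # Placeholder for the real neptune create_experiment() call.
            self._experiment = object()
        return self._experiment
```

Only the rank-zero process ever reaches the creation branch, and repeated accesses reuse the one cached experiment.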

@edenlightning edenlightning linked an issue Sep 1, 2020 that may be closed by this pull request
@pitercl (Contributor) commented Sep 3, 2020

Hi!

@awaelchli Thanks for the explanation - it helped a lot in understanding what's going on.

@psinger Your idea for the fix looks good to me 👍

As for the tests that stopped passing, I had two goals with them:

  1. I wanted to check if online/offline modes do what they're supposed to
  2. Make sure that the experiment was created in __init__. This was important in the earlier version of PL, where the pickled logger was passed to child processes and used there (I didn't want to have a new experiment in each child process). I don't think it matters now, after the @rank_zero_experiment change.

So, I'd propose something along the lines of:

@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_online(neptune):
    logger = NeptuneLogger(api_key='test', project_name='project')

    experiment = logger.experiment  # force the actual creation of an experiment object

    assert experiment == neptune.Session.with_default_backend().get_project().create_experiment()
    assert logger.name == experiment.name
    assert logger.version == experiment.id


@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_offline(neptune):
    logger = NeptuneLogger(offline_mode=True)

    experiment = logger.experiment  # force the actual creation of an experiment object

    neptune.Session.assert_called_once_with(backend=neptune.OfflineBackend())
    assert experiment == neptune.Session().get_project().create_experiment()

@mergify[bot] commented Oct 11, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 1.0.x milestone Oct 20, 2020
@stale[bot] commented Nov 3, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 3, 2020
@stale[bot] commented Nov 8, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 8, 2020
@Borda (Member) commented Nov 8, 2020

@psinger can we finish this one?

@psinger (Contributor, Author) commented Nov 8, 2020

From my perspective, yes.

@Parskatt

@psinger
Status on this? It doesn't seem to be on master.

Also, I don't understand your fix: it seems you always set the experiment to None. Doesn't this remove the experiment even for rank zero?
Why not just put a rank_zero_only decorator on _create_or_get_experiment?

@psinger (Contributor, Author) commented Jan 12, 2021

@Parskatt I cannot give you a status on this. My fix solves the issue though.

@awaelchli awaelchli added this to the 1.1.x milestone Jan 13, 2021
@awaelchli awaelchli added the logger Related to the Loggers label Jan 13, 2021
@awaelchli awaelchli self-assigned this Jan 13, 2021
Comment on lines 29 to 36
# It's important to check if the internal variable _experiment was initialized in __init__.
# Calling logger.experiment would cause a side-effect of initializing _experiment,
# if it wasn't already initialized.
assert logger._experiment is None
_ = logger.experiment
assert logger._experiment == created_experiment
assert logger.name == created_experiment.name
assert logger.version == created_experiment.id

@psinger I rebased the branch and updated the tests so that they pass with the change you made in neptune.
When doing that, I saw this comment in the test. I'm not sure what this is about. I see no evidence that we are forced to initialize the neptune experiment at init. How do you see it?

@awaelchli awaelchli added the "ready" (PRs ready to be merged) label and removed the "has conflicts" label Jan 21, 2021
@awaelchli (Contributor)

Let's finalize this PR, it has waited long enough :)

@codecov[bot] commented Jan 21, 2021

Codecov Report

Merging #3256 (ed89104) into master (6ab5417) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #3256   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         135     135           
  Lines       10005   10005           
======================================
+ Hits         9339    9340    +1     
+ Misses        666     665    -1     

@mergify mergify bot requested a review from a team January 21, 2021 16:14
@mergify mergify bot requested a review from a team January 21, 2021 16:51
@SeanNaren SeanNaren merged commit 052bc00 into Lightning-AI:master Jan 23, 2021
tchaton pushed a commit that referenced this pull request Feb 3, 2021
* DP device fix

* potential fix

* fix merge

* update tests

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Borda pushed a commit that referenced this pull request Feb 4, 2021
tchaton pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Labels
bug (Something isn't working) · logger (Related to the Loggers) · ready (PRs ready to be merged)

Successfully merging this pull request may close these issues.

NeptuneLogger creates multiple experiments in DDP mode
7 participants