
fix Neptune logger creating multiple experiments when gpus > 1 #3256

Merged (6 commits) on Jan 23, 2021

Conversation

@psinger (Contributor) commented Aug 29, 2020

Potential fix to #3255

@mergify mergify bot requested a review from a team August 29, 2020 11:43
@awaelchli awaelchli changed the title Neptune logger fix Neptune logger creating multiple experiments when gpus > 1 Aug 29, 2020
@awaelchli (Contributor)

Kindly asking @jakubczakon for review. Can we delay the creation of the experiment, or does it have to happen in __init__?

@pitercl (Contributor) commented Aug 31, 2020

Hi @psinger, could you please paste a snippet that reproduces this issue? From what I remember, the idea was to have NeptuneLogger initiated before any forking happens. Then, when the logger is pickled, the experiment that was created in __init__ can be reused in all children processes.

@psinger (Contributor, Author) commented Aug 31, 2020

@pitercl I am creating the logger before initializing the Trainer.

logger = NeptuneLogger()
trainer = Trainer(logger=logger, distributed_backend="ddp")
model = Model()
trainer.fit(model)

@pitercl (Contributor) commented Aug 31, 2020

Thanks @psinger. I just checked and you're right: it was working as I described up to PL 0.7.6, but from 0.8.1 something changed and the results are as you say. I'll need some time to understand what changed and how to approach this.

@awaelchli (Contributor) commented Aug 31, 2020

@pitercl @psinger I can explain. distributed_backend = "ddp" is special in that it launches your script multiple times in new subprocesses. This means __init__ is called several times, but the way we designed loggers, logger.experiment only returns the true experiment object on rank == 0. On all other ranks, it returns a dummy object.
So what we have to make sure is that we don't create any files in __init__, because that code runs on all ranks.
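(For illustration, the pattern described above, lazy creation plus a rank-zero guard, could be sketched as follows. LazyLogger, DummyExperiment, and the LOCAL_RANK lookup are hypothetical names for this sketch, not the actual PyTorch Lightning implementation:)

```python
import os


class DummyExperiment:
    """Stand-in returned on non-zero ranks; silently absorbs any method call."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None


class LazyLogger:
    def __init__(self):
        # Nothing is created here, so when DDP re-runs __init__ in each
        # subprocess, no extra experiments (or files) appear.
        self._experiment = None

    @property
    def experiment(self):
        # Hypothetical rank lookup; real code would use the trainer's rank.
        if int(os.environ.get("LOCAL_RANK", "0")) != 0:
            return DummyExperiment()
        if self._experiment is None:
            # Placeholder for the real neptune create_experiment() call.
            self._experiment = object()
        return self._experiment
```

Only the rank-zero process ever reaches the creation branch, and repeated accesses reuse the one cached experiment.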

@edenlightning edenlightning linked an issue Sep 1, 2020 that may be closed by this pull request
@pitercl (Contributor) commented Sep 3, 2020

Hi!

@awaelchli Thanks for the explanation - it helped a lot in understanding what's going on.

@psinger Your idea for the fix looks good to me 👍

As for the tests that stopped passing, I had two goals with them:

  1. I wanted to check if online/offline modes do what they're supposed to
  2. Make sure that the experiment was created in __init__. This was important in the earlier version of PL, where the pickled logger was passed to child processes and used there (I didn't want to have a new experiment in each child process). I don't think it matters now, after the @rank_zero_experiment change.

So, I'd propose something along the lines of:

@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_online(neptune):
    logger = NeptuneLogger(api_key='test', project_name='project')

    experiment = logger.experiment  # force the actual creation of an experiment object

    assert experiment == neptune.Session.with_default_backend().get_project().create_experiment()
    assert logger.name == experiment.name
    assert logger.version == experiment.id


@patch('pytorch_lightning.loggers.neptune.neptune')
def test_neptune_offline(neptune):
    logger = NeptuneLogger(offline_mode=True)

    experiment = logger.experiment  # force the actual creation of an experiment object

    neptune.Session.assert_called_once_with(backend=neptune.OfflineBackend())
    assert experiment == neptune.Session().get_project().create_experiment()

@mergify[bot] commented Oct 11, 2020

This pull request is now in conflict... :(

@Borda Borda added this to the 1.0.x milestone Oct 20, 2020
@stale[bot] commented Nov 3, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Nov 3, 2020
@stale[bot] commented Nov 8, 2020

This pull request is going to be closed. Please feel free to reopen it or create a new one from the current master.

@stale stale bot closed this Nov 8, 2020
@Borda (Member) commented Nov 8, 2020

@psinger can we finish this one?

@psinger (Contributor, Author) commented Nov 8, 2020

From my perspective, yes.

@Parskatt

@psinger
Status on this? It doesn't seem to be on master.

Also, I don't understand your fix: it seems you always set the experiment to None. Doesn't this remove the experiment even for rank zero?
Why not just put a rank_zero_only decorator on _create_or_get_experiment?

@psinger (Contributor, Author) commented Jan 12, 2021

@Parskatt I cannot give you a status on this. My fix solves the issue though.

@awaelchli awaelchli added this to the 1.1.x milestone Jan 13, 2021
@awaelchli awaelchli added the logger Related to the Loggers label Jan 13, 2021
@awaelchli awaelchli self-assigned this Jan 13, 2021
Comment on lines 29 to 36
# It's important to check if the internal variable _experiment was initialized in __init__.
# Calling logger.experiment would cause a side-effect of initializing _experiment,
# if it wasn't already initialized.
assert logger._experiment is None
_ = logger.experiment
assert logger._experiment == created_experiment
assert logger.name == created_experiment.name
assert logger.version == created_experiment.id

@psinger I rebased the branch and updated the tests so that they pass with the change you made in neptune.
When doing that, I saw this comment in the test. I'm not sure what this is about. I see no evidence that we are forced to initialize the neptune experiment at init. How do you see it?

@awaelchli awaelchli added the "ready" (PRs ready to be merged) label and removed the "has conflicts" label Jan 21, 2021
@awaelchli (Contributor)

Let's finalize this PR, it has waited long enough :)

@codecov[bot] commented Jan 21, 2021

Codecov Report

Merging #3256 (ed89104) into master (6ab5417) will increase coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #3256   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         135     135           
  Lines       10005   10005           
======================================
+ Hits         9339    9340    +1     
+ Misses        666     665    -1     

@mergify mergify bot requested a review from a team January 21, 2021 16:14
@mergify mergify bot requested a review from a team January 21, 2021 16:51
@SeanNaren SeanNaren merged commit 052bc00 into Lightning-AI:master Jan 23, 2021
tchaton pushed a commit that referenced this pull request Feb 3, 2021
* DP device fix

* potential fix

* fix merge

* update tests

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Borda pushed a commit that referenced this pull request Feb 4, 2021
tchaton pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
Labels
bug (Something isn't working) · logger (Related to the Loggers) · ready (PRs ready to be merged)

Successfully merging this pull request may close these issues.

NeptuneLogger creates multiple experiments in DDP mode
7 participants