
Tutorial DataModule leads to Validation data leakage in Multiprocessing #18638

Closed
profPlum opened this issue Sep 26, 2023 · 13 comments · Fixed by #18846
Labels
docs (Documentation related) · lightningdatamodule (pl.LightningDataModule)

Comments

@profPlum

profPlum commented Sep 26, 2023

📚 Documentation

The docs recommend doing splits in the setup() method, and the example given performs a random split in setup().

Perhaps random splits in setup() shouldn't be recommended: I've found that doing random_split inside setup() with multiprocessing results in each GPU worker getting a different validation dataset (given by the random split).

So when random splits are done in setup() without a seed, the model is trained on some (or all) of the validation data.
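
For illustration, a minimal sketch of the pattern I mean (the dataset, sizes, and class name are placeholders, not the tutorial's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
import lightning.pytorch as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
        # No generator/seed here: every DDP process draws its own permutation,
        # so each GPU worker ends up with a different train/val split.
        self.train_set, self.val_set = random_split(full, [800, 200])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=32)
```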

cc @Borda @carmocca @awaelchli

profPlum added the docs (Documentation related) and needs triage (Waiting to be triaged by maintainers) labels on Sep 26, 2023
@awaelchli
Contributor

IMO, if users find this unclear, the example should be updated to pass a generator with a fixed seed to the random_split function.

However, I would like to defend the example in that it is a code snippet that tries to highlight the following point:

The equivalent DataModule just organizes the same exact code, but makes it reusable across projects.

It doesn't try to give you a full end-to-end training example, but rather just to get this message across so you can build your own.
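
For example, a minimal sketch of the fix I have in mind (the dataset and split sizes are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.arange(1000.0).unsqueeze(1))

# A fixed generator seed makes the split identical in every process, so no
# validation sample can end up in another rank's training set.
train_set, val_set = random_split(
    full, [800, 200], generator=torch.Generator().manual_seed(42)
)
```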

awaelchli added the lightningdatamodule (pl.LightningDataModule) label and removed the needs triage (Waiting to be triaged by maintainers) label on Sep 26, 2023
@profPlum
Author

profPlum commented Sep 26, 2023

@awaelchli I don't think it's a good idea to encourage, and even demonstrate, code smells that can easily lead to silent and quite serious bugs later on. Moreover, maybe we shouldn't convey that you can just copy and paste when the reality is different for multi-GPU training?

What if someone started their project based on a docs example, then scaled to multiple GPUs and went to publish their results without reading the finer points of setup() vs. prepare_data()? I think it's quite possible they'd go find a new backend.

Also, there are many other places in the docs that encourage splits in setup() (and random is of course the most common type). I think all of these places should heavily emphasize this gotcha and the recommended best practice (e.g. deterministic splits).

@awaelchli
Contributor

awaelchli commented Sep 26, 2023

Has my last reply left the wrong impression, @profPlum? I'm sorry if it has frustrated you, but I never said no to any of your comments, and I haven't objected to anything you said. I simply stated my opinion in addition to this ticket and made that clear in my message.

You are proposing that we don't recommend doing the split in setup(), but you haven't said where it should be done instead. Could you please clarify? Would my suggestion of setting the random seed on the random_split function not work, and if not, why?

@profPlum
Author

profPlum commented Sep 26, 2023

I'm sorry too; perhaps I conveyed the wrong tone. And you're right, I'm not 100% sure of the perfect solution either...

I'm just worried that your response was too narrow in focusing on changing one line of code, whereas the problem is more widespread; it would be better to change every similar part of the documentation and clearly emphasize the gotcha plus the recommended best practice.

Honestly it still scares me that, even after a docs change, it will be so easy to make such a serious mistake that could easily go undetected...
Ideally, things would be designed without gotchas, so that users would simply be unable to make this kind of mistake (e.g. as in Keras with validation_split=0.XX). But I'm not sure how to do that here.

@profPlum
Author

@awaelchli As for doing it in setup() vs. somewhere else: I was thinking it would be nice if it were easy to do shuffling/splitting in prepare_data(). But I'm not sure how to do that in a way that is safe when training multiple models at once on a shared file system (e.g. Slurm/cloud), without directly coupling the code to Slurm or something else.

@awaelchli
Contributor

awaelchli commented Sep 27, 2023

Sure, we can make that clearer in the docs.

I'm struggling to understand, though, how and why Lightning should enforce such behavior. It doesn't fit well with the core philosophy.

prepare_data() is definitely not the place to create the split, at least not when it's done online. However, if you do the split once and then save it to disk in prepare_data(), that's okay, and you can then load it in setup() or anywhere else. You see, this depends heavily on the user's dataset and use case, so we need to leave the user the flexibility to choose which hook is more suitable. A rough sketch of that pattern is below.

I really don't think Lightning should change the default behavior of torch functions that the user calls, nor should it silently set the seed or random state if the user doesn't explicitly ask for it. Lightning's philosophy is to organize PyTorch code and provide good default configuration, but it does not in any way intend to override the behavior of torch at the functional level.
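
For illustration, a rough sketch of the save-to-disk pattern (the file name, dataset, class name, and sizes are placeholders):

```python
import torch
from torch.utils.data import Subset, TensorDataset
import lightning.pytorch as pl


class DiskSplitDataModule(pl.LightningDataModule):
    split_file = "splits.pt"  # placeholder path

    def prepare_data(self):
        # Called on one process only: create the split once and persist it.
        indices = torch.randperm(1000).tolist()
        torch.save({"train": indices[:800], "val": indices[800:]}, self.split_file)

    def setup(self, stage=None):
        # Called on every process: all ranks load the same indices from disk.
        full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
        splits = torch.load(self.split_file)
        self.train_set = Subset(full, splits["train"])
        self.val_set = Subset(full, splits["val"])
```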

@profPlum
Author

profPlum commented Sep 28, 2023

@awaelchli Like I said, I don't know of a perfect solution, and I'm not recommending that you change the behavior of PyTorch functions. But the reason I think it would be desirable to find a solution to this problem is that (as you say) pytorch-lightning is about making best practices the default (e.g. proper splitting) and letting users avoid engineering issues like this one.

Out of curiosity, how is this problem dealt with in raw PyTorch in a multiprocessing setting? There must be some kind of best practice associated with random_split, right?

P.S. Yes, I was considering saving the splits to disk in prepare_data() and loading them in setup().

@awaelchli
Contributor

awaelchli commented Sep 28, 2023

Got it, yes, so we're generally aligned.

Out of curiosity, how is this problem dealt with in raw PyTorch in a multiprocessing setting? There must be some kind of best practice associated with random_split, right?

In PyTorch, the best practice is actually to always set the seed for the random_split function, regardless of multiprocessing. You'd normally want to fix the splitting completely so that different experiments are fully comparable (e.g. when comparing validation metrics). Multiprocessing doesn't really play into this much, except that when new processes are created, they all start with a different random state. That's why the best practice in PyTorch is to set the global random seed. So in summary, the best practice is to:

  1. Call torch.manual_seed(...) (also numpy etc. if necessary)
  2. Set the generator seed in torch.utils.data.random_split()

This amounts to the same practice in Lightning, except that for (1) we have the convenient seed_everything utility.
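
A minimal sketch of both steps together (the seed value, dataset, and sizes are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, random_split
from lightning.pytorch import seed_everything

# (1) Seed the global RNGs in every process (Lightning's convenience utility).
seed_everything(42, workers=True)

# (2) Additionally pin the split itself, so it stays the same even if other
#     random calls happen before it.
full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
train_set, val_set = random_split(
    full, [800, 200], generator=torch.Generator().manual_seed(42)
)
```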

@profPlum
Author

Oh interesting, yeah, I was getting that impression since Keras also does deterministic splitting. I guess encouraging seeding is probably our best bet for now.

@profPlum
Author

profPlum commented Oct 23, 2023

@awaelchli Hi again, it just occurred to me that this problem goes away with the CLI arg --seed_everything=$RANDOM. But I also checked the default value for this arg, and it's --seed_everything=True. I honestly can't tell why these two settings, --seed_everything=$RANDOM and --seed_everything=True (the default), would behave differently; it seems like a bug.

My understanding, then, would be that (given the current defaults) the problem I mentioned shouldn't occur. But it does; why is that?

tl;dr
If it turns out that --seed_everything=$RANDOM is the correct default behavior, it would be the perfect solution to the random_split seeding problem!

@awaelchli
Contributor

@profPlum In addition, could you explain how you are launching the script? If we're talking about the LightningCLI, --seed_everything=True means a random number is chosen for the seed. The best practice, however, is to always pass in a fixed number, so that each process has the same seed from the beginning.
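
For example, a minimal LightningCLI entry point that pins a default seed (module and class names are placeholders; this assumes the seed_everything_default argument of LightningCLI):

```python
# train.py -- a minimal LightningCLI entry point.
from lightning.pytorch.cli import LightningCLI

from my_project import MyModel, MyDataModule  # placeholder imports


def main():
    # Pin the seed by default; it can still be overridden on the command line,
    # e.g. `python train.py fit --seed_everything=123`.
    LightningCLI(MyModel, MyDataModule, seed_everything_default=42)


if __name__ == "__main__":
    main()
```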

@awaelchli
Contributor

@profPlum Let me know what you think about my last reply and the PR I sent.

@profPlum
Author

profPlum commented Dec 26, 2023

@profPlum Let me know what you think about my last reply and the PR I sent.

Perfect! I was in fact using Slurm/srun.
