
Tutorial DataModule leads to Validation data leakage in Multiprocessing #18638

Closed
profPlum opened this issue Sep 26, 2023 · 13 comments · Fixed by #18846
Labels
docs (Documentation related) · lightningdatamodule (pl.LightningDataModule)

Comments

@profPlum

profPlum commented Sep 26, 2023

📚 Documentation

The docs recommend doing splits in the setup() method, and the example given performs a random split in setup().

Perhaps random splits in setup() shouldn't be recommended: I've found that doing random_split inside setup() with multiprocessing results in each GPU worker getting a different validation dataset (given by the random split).

So when random splits are done in setup() without a seed, the model is trained on some (or all) of the validation data.
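
For illustration, a minimal sketch of the pattern I mean (the dataset, sizes, and class name are placeholders, not the tutorial's code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
import lightning.pytorch as pl


class MyDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
        # No generator/seed here: every DDP process draws its own permutation,
        # so each GPU worker ends up with a different train/val split.
        self.train_set, self.val_set = random_split(full, [800, 200])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=32)
```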

cc @Borda @carmocca @awaelchli

profPlum added the docs (Documentation related) and needs triage (Waiting to be triaged by maintainers) labels on Sep 26, 2023
@awaelchli
Contributor

IMO, if users find this unclear, the example should be updated to pass a generator with a fixed seed to the random_split function.

However, I would like to defend the example in that it is a code snippet that tries to highlight the following point:

The equivalent DataModule just organizes the same exact code, but makes it reusable across projects.

It doesn't try to give you a full end-to-end training example, but rather just to get this message across so you can build your own.
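
For example, a minimal sketch of the fix I have in mind (the dataset and split sizes are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, random_split

full = TensorDataset(torch.arange(1000.0).unsqueeze(1))

# A fixed generator seed makes the split identical in every process, so no
# validation sample can end up in another rank's training set.
train_set, val_set = random_split(
    full, [800, 200], generator=torch.Generator().manual_seed(42)
)
```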

awaelchli added the lightningdatamodule (pl.LightningDataModule) label and removed the needs triage (Waiting to be triaged by maintainers) label on Sep 26, 2023
@profPlum
Author

profPlum commented Sep 26, 2023

@awaelchli I don't think it's a good idea to encourage, and even demonstrate, code smells that can easily lead to silent and quite serious bugs later on. Moreover, maybe we shouldn't convey that you can just copy and paste when the reality is different for multi-GPU training?

What if someone started their project based on a docs example, then scaled to multiple GPUs and went to publish their results without reading the finer points of setup() vs. prepare_data()? I think it's quite possible they'd go find a new backend.

Also, there are many other places in the docs that encourage splits in setup() (and random is of course the most common type). I think all of these places should heavily emphasize this gotcha and the recommended best practice (e.g. deterministic splits).

@awaelchli
Contributor

awaelchli commented Sep 26, 2023

Has my last reply left the wrong impression, @profPlum? I'm sorry if it has frustrated you, but I never said no to any of your comments, and I haven't objected to anything you said. I simply stated my opinion in addition to this ticket and made that clear in my message.

You are proposing that we don't recommend doing the split in setup(), but you haven't said where it should be done instead. Could you please clarify? Would my suggestion of setting the random seed on the random_split function not work, and if not, why?

@profPlum
Author

profPlum commented Sep 26, 2023

I'm sorry too; perhaps I conveyed the wrong tone. And you're right, I'm not 100% sure of the perfect solution either...

I'm just worried that your response was too narrow in focusing on changing one line of code, whereas the problem is more widespread; it would be better to change every similar part of the documentation and clearly emphasize the gotcha plus the recommended best practice.

Honestly it still scares me that, even after a docs change, it will be so easy to make such a serious mistake that could easily go undetected...
Ideally, things would be designed without gotchas, so that users would simply be unable to make this kind of mistake (e.g. as in Keras with validation_split=0.XX). But I'm not sure how to do that here.

@profPlum
Author

@awaelchli As for doing it in setup() vs. somewhere else: I was thinking it would be nice if it were easy to do shuffling/splitting in prepare_data(). But I'm not sure how to do that in a way that is safe when training multiple models at once on a shared file system (e.g. Slurm/cloud), without directly coupling the code to Slurm or something else.

@awaelchli
Contributor

awaelchli commented Sep 27, 2023

Sure, we can make that clearer in the docs.

I'm struggling to understand, though, how and why Lightning should enforce such behavior. It doesn't fit well with the core philosophy.

prepare_data() is definitely not the place to create the split, at least not when it's done online. However, if you do the split once and then save it to disk in prepare_data(), that's okay, and you can then load it in setup() or anywhere else. You see, this depends heavily on the user's dataset and use case, so we need to leave the user the flexibility to choose which hook is more suitable. A rough sketch of that pattern is below.

I really don't think Lightning should change the default behavior of torch functions that the user calls, nor should it silently set the seed or random state if the user doesn't explicitly ask for it. Lightning's philosophy is to organize PyTorch code and provide good default configuration, but it does not in any way intend to override the behavior of torch at the functional level.
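
For illustration, a rough sketch of the save-to-disk pattern (the file name, dataset, class name, and sizes are placeholders):

```python
import torch
from torch.utils.data import Subset, TensorDataset
import lightning.pytorch as pl


class DiskSplitDataModule(pl.LightningDataModule):
    split_file = "splits.pt"  # placeholder path

    def prepare_data(self):
        # Called on one process only: create the split once and persist it.
        indices = torch.randperm(1000).tolist()
        torch.save({"train": indices[:800], "val": indices[800:]}, self.split_file)

    def setup(self, stage=None):
        # Called on every process: all ranks load the same indices from disk.
        full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
        splits = torch.load(self.split_file)
        self.train_set = Subset(full, splits["train"])
        self.val_set = Subset(full, splits["val"])
```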

@profPlum
Author

profPlum commented Sep 28, 2023

@awaelchli Like I said, I don't know of a perfect solution, and I'm not recommending that you change the behavior of PyTorch functions. But the reason I think it would be desirable to find a solution to this problem is that (as you say) pytorch-lightning is about making best practices the default (e.g. proper splitting) and letting users avoid engineering issues like this one.

Out of curiosity, how is this problem dealt with in raw PyTorch in a multiprocessing setting? There must be some kind of best practice associated with random_split, right?

P.S. Yes, I was considering saving the splits to disk in prepare_data() and loading them in setup().

@awaelchli
Contributor

awaelchli commented Sep 28, 2023

Got it, yes, so we're generally aligned.

Out of curiosity, how is this problem dealt with in raw PyTorch in a multiprocessing setting? There must be some kind of best practice associated with random_split, right?

In PyTorch, the best practice is actually to always set the seed for the random_split function, regardless of multiprocessing. You'd normally want to fix the splitting completely so that different experiments are fully comparable (e.g. when comparing validation metrics). Multiprocessing doesn't really play into this much, except that when new processes are created, they all start with a different random state. That's why the best practice in PyTorch is to set the global random seed. So in summary, the best practice is to:

  1. Call torch.manual_seed(...) (also numpy etc. if necessary)
  2. Set the generator seed in torch.utils.data.random_split()

This amounts to the same practice in Lightning, except that for (1) we have the convenient seed_everything utility.
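
A minimal sketch of both steps together (the seed value, dataset, and sizes are placeholders):

```python
import torch
from torch.utils.data import TensorDataset, random_split
from lightning.pytorch import seed_everything

# (1) Seed the global RNGs in every process (Lightning's convenience utility).
seed_everything(42, workers=True)

# (2) Additionally pin the split itself, so it stays the same even if other
#     random calls happen before it.
full = TensorDataset(torch.arange(1000.0).unsqueeze(1))
train_set, val_set = random_split(
    full, [800, 200], generator=torch.Generator().manual_seed(42)
)
```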

@profPlum
Author

Oh interesting, yeah, I was getting that impression since Keras also does deterministic splitting. I guess encouraging seeding is probably our best bet for now.

@profPlum
Author

profPlum commented Oct 23, 2023

@awaelchli Hi again, it just occurred to me that this problem goes away with the CLI arg --seed_everything=$RANDOM. But I also checked the default value for this arg, and it's --seed_everything=True. I honestly can't tell why these two settings, --seed_everything=$RANDOM and --seed_everything=True (the default), would behave differently; it seems like a bug.

My understanding, then, would be that (given the current defaults) the problem I mentioned shouldn't occur. But it does; why is that?

tl;dr
If it turns out that --seed_everything=$RANDOM is the correct default behavior, it would be the perfect solution to the random_split seeding problem!

@awaelchli
Contributor

@profPlum In addition, could you explain how you are launching the script? If we're talking about the LightningCLI, --seed_everything=True means a random number is chosen for the seed. The best practice, however, is to always pass in a fixed number, so that each process has the same seed from the beginning.
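
For example, a minimal LightningCLI entry point that pins a default seed (module and class names are placeholders; this assumes the seed_everything_default argument of LightningCLI):

```python
# train.py -- a minimal LightningCLI entry point.
from lightning.pytorch.cli import LightningCLI

from my_project import MyModel, MyDataModule  # placeholder imports


def main():
    # Pin the seed by default; it can still be overridden on the command line,
    # e.g. `python train.py fit --seed_everything=123`.
    LightningCLI(MyModel, MyDataModule, seed_everything_default=42)


if __name__ == "__main__":
    main()
```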

@awaelchli
Contributor

@profPlum Let me know what you think about my last reply and the PR I sent.

@profPlum
Author

profPlum commented Dec 26, 2023

@profPlum Let me know what you think about my last reply and the PR I sent.

Perfect! I was in fact using Slurm/srun.
